101 questions with a bioinformatician #16: Melissa Wilson Sayres

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Melissa Wilson Sayres is Assistant Professor of Genomics, Evolution, and Bioinformatics in the School of Life Sciences and The Biodesign Institute at Arizona State University. Her lab is interested in the evolution of sex chromosomes among other topics that relate to genome evolution and comparative genomics.

I applaud Melissa for clearly setting out both her expectations of people who join her lab and her responsibilities to her lab members. I wish more PIs were as communicative about this, though I would add an expectation for grad students: 'I will not leave food — especially cheese — on, in, or near my computer'.

You can find out more about Melissa by visiting her (well documented) lab page, checking out her mathbionerd blog, or by following her on twitter (@mwilsonsayres). And now, on to the 101 questions...

 

 

001. What's something that you enjoy about current bioinformatics research?

That we can use computers to collect, analyze, and learn new things about our biology and evolutionary history.

 

010. What's something that you *don't* enjoy about current bioinformatics research?

That there isn't a straightforward way to get into it. Some people come from computer science and may feel intimidated about learning the biology. Some come from biology and are intimidated by learning to program. Some (like myself) come from some other background, and learn both! Although there are a few collegiate bioinformatics programs, it is my impression that many schools do not have the kinds of background courses that students need in order to break into bioinformatics. Many of us are self-taught.

 

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Learn a programming language.

 

100. What's your all-time favorite piece of bioinformatics software, and why?

I really like Galaxy, because it has the GUI-based format for newbies, as well as the command-line option for those who prefer it, and it makes computational biology easier to reproduce.

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

Y! Because the sex chromosomes are the most interesting (and there is no 'X' nucleotide ambiguity code).

Why I twitter

I cannot sit on the fence. I like twitter and what it offers. I have learned things I never would, built genuine relationships with international people who I would have perhaps have only met over a quick coffee at a conference. And I have changed the way I speak about science.

This post by Mark Brandon sets out nine great reasons as to why he finds Twitter so useful, many of which relate to science communication.

I believe twitter is a strong positive for science, and it is a worthwhile investment of your time.

I completely agree with just about everything he has to say. It's a good list.

101 questions with a bioinformatician #15: Karyn Meltz Steinberg


This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Karyn Meltz Steinberg is a staff scientist at The Genome Institute at Washington University ('TGI' to those in the know, but it will forever be the GSC to some of us). Prior to joining, Karyn was a postdoc in Evan Eichler's lab at the University of Washington (perhaps she is destined to head here, here, or here when the time comes to move on?).

Her current position sees her work as part of the Genome Reference Consortium to improve the human reference assembly. In particular, she is involved with characterizing and resolving the particularly 'messy' regions of the genome that have complex genomic architecture.

You can find out more about Karyn by following her on twitter (@KMS_Meltzy). And now, on to the 101 questions...

 

 

001. What's something that you enjoy about current bioinformatics research?

I enjoy the collaborative community of bioinformatics. Although we are all working on our own research questions, the basic issues of how to process sequence data and report and annotate variants are the same. I've attended some workshops recently and have been impressed with how much people want to work together to solve these problems, particularly with respect to the new reference assembly and dealing with the alternative loci.

 

010. What's something that you *don't* enjoy about current bioinformatics research?

File formats. Can we please agree on something, friends? (KB: see Law's First Law!)

 

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

I actually love the fact that I was not a bio major and that I did tons of non-sciencey activities as an undergrad. My non-traditional journey has definitely shaped who I am as a researcher. I would advise my 18 year old self to not be afraid of the command line and to learn a programming language earlier.

 

100. What's your all-time favorite piece of bioinformatics software, and why?

BEDTools. Full stop.

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

I was going to say '.' because I work on filling gaps in the human genome, but Deanna Church already took that. So, I will go with 'S' as I like a GC-rich challenge and the flexibility of being either a purine or pyrimidine.
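
For readers who want to check the reasoning behind 'S' (and the earlier vote for 'Y'), here is a minimal Python sketch of the IUPAC ambiguity codes. The table is an illustration rather than a complete reference: it leaves out 'U' and the gap symbols, and the names IUPAC_CODES and describe are made up for this example.

```python
# IUPAC nucleotide codes mapped to the DNA bases each one can stand for.
# Illustrative table only: 'U' and the gap symbols are deliberately left out.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",   # purine
    "Y": "CT",   # pyrimidine
    "S": "CG",   # strong (three hydrogen bonds)
    "W": "AT",   # weak (two hydrogen bonds)
    "K": "GT",
    "M": "AC",
    "B": "CGT",  # not A
    "D": "AGT",  # not C
    "H": "ACT",  # not G
    "V": "ACG",  # not T
    "N": "ACGT", # any base
}

PURINES = set("AG")
PYRIMIDINES = set("CT")


def describe(code):
    """Report which bases a code covers and whether it spans both base classes."""
    bases = set(IUPAC_CODES[code.upper()])
    return {
        "bases": sorted(bases),
        "can_be_purine": bool(bases & PURINES),
        "can_be_pyrimidine": bool(bases & PYRIMIDINES),
    }


if __name__ == "__main__":
    for code in ("S", "Y"):
        print(code, describe(code))
```

Under this table, 'S' covers only C and G, one pyrimidine and one purine, which is the flexibility Karyn describes, whereas 'Y' is restricted to the pyrimidines.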

Data access for the 1,000 Plants (1KP) project

From the abstract of a new paper in GigaScience:

The 1,000 plants (1KP) project is an international multi-disciplinary consortium that has generated transcriptome data from over 1,000 plant species, with exemplars for all of the major lineages across the Viridiplantae (green plants) clade. Here, we describe how to access the data used in a phylogenomics analysis of the first 85 species, and how to visualize our gene and species trees.

The paper doesn't provide a link to what seems to be the actual project website. They mention directories within the iPlant Collaborative project where you can access data. The project website reveals that this project can be referred to as either '1000 plants', 'oneKP' or '1KP' (but not '1000P'?).

Being a pedantic kind of guy, I was curious about the paper's vague mention of 'over 1,000 plant species'. How many species exactly? The paper doesn't say. But if you go to one of the iPlant pages for 1KP, you will see this:

Altogether, we sequenced 1320 samples (from 1162 species)

So this project seems to have exceeded the boundaries suggested by its name. How about the '1.2KP' project?

Identical Classifications In Science: Some advice for Jonathan Eisen

Jonathan Eisen — a colleague at the UC Davis Genome Center — has a quandary. He came up with a name for one of his projects but now needs to consider renaming it. The problem is that ICIS (Innovating Communication in Scholarship) sounds a bit like…well you all know what it sounds like. So Jon has appealed for suggestions on how to rename their project.

He should take comfort that he may not be the only one facing this dilemma. After all, the International Cooperative ITP Study Group (ICIS) has been an ongoing collaboration between hematologists since 1997. I wonder whether they are considering a name change? Maybe Jon could also ask the folk at the International Conference on Information Systems (ICIS) who have been meeting since 1980. Or they could talk to the people that came up with the Intelligent Coin Identification System (ICIS), or the Intensive Care Infection Score (ICIS), or the Integrated Crate Interrogation System (ICIS), or the 20-year-old International Crop Information System (ICIS), or the people who named this gene.

These are just some of the academic uses of ICIS that I could find from a couple of quick searches. I expect that there are more out there. This is a reflection of one of the most primal desires of all scientists…the need to come up with an acronym or initialism for their project. This urge is all too commonly associated with the additional need to make the name 'fun' (particularly a desire to name things after animals). Acronyms can also backfire for other reasons, such as when you don't fully appreciate how they might sound in other countries.

The shorter your acronym, the more likely that it has been used by other people before you (even within the same field). My suggestion would be to consider the shocking alternative of not using an acronym at all! After all, sometimes people can come up with new names that seem to catch on.

Making genome assemblies in the year 2014

I often like to encourage students to explain their work without using any complex scientific vocabulary. If you can explain what you do to your parents or grandparents then this is great practice for explaining your work to other scientists from outside your field.

I also encourage students to think of analogies and metaphors for their work as these can really help others to grasp difficult concepts. Yesterday, I wrote a post called Making cakes in the year 2014 which was (hopefully) an obvious attempt to explain some of the complexities and problems inherent in the field of genome assembly.

It almost feels wrong to even attempt to convert millions of ~100 bp DNA fragments into — in the case of some species — a small number of sequences that span billions of bp. Every single step in the process is fraught with errors and difficulties. Every single step is controlled by software with numerous options that are often unexplored. Every single step has many alternative pieces of software available.

Let's just focus on one of the earliest steps in any modern sequencing pipeline: the need to remove adapter contamination from your sequenced reads. There are at least thirty-four different tools that can be used for this step and there are over 240 different threads on SEQanswers.com that contain the words 'trim' and 'adapter' (suggesting that this process is not straightforward, and that many people need help).

I had a look at some of these tools. The program Btrim has 12 different command-line options that can all affect how the program trims adapter sequences (it has 27 different command-line options in total). Skewer has 9 different command-line options that will affect the output of the program. The trimmer Concerti has 8 options that will also affect the output. Do we even have a good idea of what is the best way to remove adapter sequences? Maybe we need a 'trimmathon' to help test all of these tools! 
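
To make the point about options concrete, here is a deliberately naive Python sketch of 3' adapter trimming. It is not how Btrim, Skewer, or any other real trimmer works; the function name trim_adapter, its two parameters (min_overlap and max_mismatches), and the adapter string are all assumptions made up for this illustration.

```python
def trim_adapter(read, adapter, min_overlap=3, max_mismatches=0):
    """Cut a read at the first position where its remainder matches the
    start of the adapter (toy example, not a real tool's algorithm)."""
    for start in range(len(read)):
        # Compare the rest of the read against as much adapter as fits.
        overlap = min(len(read) - start, len(adapter))
        if overlap < min_overlap:
            break  # remaining overlaps are too short to trust
        mismatches = sum(
            1
            for a, b in zip(read[start:start + overlap], adapter[:overlap])
            if a != b
        )
        if mismatches <= max_mismatches:
            return read[:start]
    return read  # no acceptable adapter match found


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # example adapter sequence, assumed for the demo

    # A read that ends with the first 8 bases of the adapter.
    contaminated = "ACGTACGT" + adapter[:8]
    print(trim_adapter(contaminated, adapter))                  # permissive setting
    print(trim_adapter(contaminated, adapter, min_overlap=10))  # strict setting

    # A clean read that just happens to end in 'AG'.
    clean = "TTGCCATAG"
    print(trim_adapter(clean, adapter, min_overlap=2))
    print(trim_adapter(clean, adapter, min_overlap=3))
```

Even in this toy version, one option decides the outcome. With min_overlap=10 the 8-base adapter fragment is left in place, while with min_overlap=2 two genuine bases are cut from a clean read simply because it ends in 'AG'. Real tools juggle many more such choices.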

If there is a point to this post maybe it would be that genome assembly is an amazingly complex, time-consuming, and fundamentally difficult problem. But even the 'little steps' that have to be done before you even start assembling your sequences are also far from straightforward. Don't convince yourself for a moment that a single tool — with default parameters — will do all of the hard work for you.

 

 

PLOS Computational Biology: Ten Simple Rules for Writing a PLOS Ten Simple Rules Article

Is there practical advice for contributing to the Ten Simple Rules collection already available? What can we learn from the existing articles in the collection? If only there was an article with ten simple rules for writing a PLOS Ten Simple Rules article. If only that article could be peppered with insightful comments from the founder of the collection: Philip E. Bourne.

This is that article.

This is very meta. I think I will wait for the 'Ten Simple Rules for Writing a Ten Simple Rules Article about writing a PLOS Ten Simple Rules Article'.

Making cakes in the year 2014

I've been trying to make a cake. There are lots of published recipes out there for how to make this cake, but the one that I used came with only a very blurry image of what the finished cake should look like. So I really had to hope that the recipe was a good one, because I wasn't entirely sure if I would be able to tell whether it worked or not.

To get started, I used one of those online shopping services that can deliver all the ingredients to your door. Even though they claimed that they stocked everything on my shopping list, they then informed me that there were a small number of ingredients that they were not able to physically access at the moment. Frustratingly, they weren't able to tell me which ingredients would be missing when they delivered them. How odd. 

Something else that seemed unusual was that my cake recipe specified that I needed almost 100 times the amount of ingredients compared to what will end up in the finished cake. Seems a bit wasteful, but who am I to argue with the recipe?

Before I could actually start the baking process, I found that there were a few issues that I had to overcome. Lots of the ingredients had become stuck to the packaging and I had to use a tool which could separate the two. Only, some of the time it didn't get rid of all the packaging, and some of the time it ended up getting rid of not just the packaging but some of the ingredient as well. There are actually several tools on the market for doing this, but they all seem to perform slightly differently.

After I got rid of the packaging I noticed that lots of the ingredients had started to spoil and had to be thrown away, but some of them could be salvaged by cutting off the bad parts. There also seemed to be a lot of implements that you can buy to help with the cutting. It wasn't obvious which one was the best, so I used the first one that Google suggested.

At this point it was kind of frustrating to notice that a small proportion of my ingredients weren't cake ingredients at all. I had to throw them all away, but I think that some of them may have ended up in the final cake.

When it came to the actual baking, I was a bit overwhelmed by the fact that there were dozens of different manufacturers who all claimed that I could make a better cake if only I used their brand of oven. Nearly all of these ovens just let you put your raw ingredients in one slot — after you have removed packaging, the spoilt ingredients, and the non-cake ingredients — and voila, out comes your cake!

I chose one of the more popular ovens on the market and waited patiently for many hours as my cake baked happily in the oven. When the timer buzzed and I took the cake out, I was surprised to see that many of the raw ingredients were left behind in the oven's 'waste overflow unit'. The real surprise, however, was that the finished cake didn't really look anything like the — admittedly blurry — photo that came with the recipe.

The cake had many different layers, but they weren't quite all the same size and some of them seemed to have been assembled in the wrong order. The pattern on the cake decoration — yes this oven also decorates the cake — was inconsistent at best. It would mostly use one color of icing, but every now and then, it would insert a different color. The same thing happened with the fillings: it would randomly switch from one flavor to another, and then back again. It was almost like there were two different cakes which had been squished together to make a new one.

When I finally showed the cake to one of my baking friends, I was hoping that he would enjoy it. However, all he kept asking me was "How big are the layers?". When I told him, he replied "My cake has bigger layers so yours can't be very good", and then he left. How rude. I took it to another friend and she just said "Your cake is smaller than mine so mine must be better". She also left without trying it. Finally, I took it to another baking colleague. Before I could show him the cake he just said "My cake has most of the common ingredients expected in all cakes, how many does yours have?". I didn't know so he left.

Making cakes is a very strange business.

How would you pronounce the name of this bioinformatics tool?

From the latest issue of Bioinformatics we have a new tool that is an R package for the analysis of GWAS studies. Rather than name the tool, I want you all to first see it exactly as it appears in the journal:

The first character in the name of this software is a character which can often be hard to identify, particularly when certain fonts make it look like it could be the letters L or I, or even the number 1.

This is not a name that is worthy of a JABBA-award, but it does fall into my category of posts which I call almost JABBA, for software names that have various other issues. The particular issue in this case is that the name is hard to read and therefore hard to pronounce. I feel that the use of lower-case characters makes it more likely that the reader will attempt to pronounce this as a word, rather than read it as an initialism. E.g. maybe you saw this name and read it as 'Lurgpurr', or 'Ergpurr'.

The reason behind the name is not explained in the article, but when you go to the linked software page, all is revealed:

It's a bit odd that one of the five words that appear in this name ('Gaussian') doesn't get mentioned anywhere in the paper. But more importantly, why did they feel the need for using lower-case characters? 'LRGPR' would have been much easier to read and comprehend than the font-dependent 'lrgpr'.