Twitter competition: win a signed copy of Bioinformatics Data Skills by Vince Buffalo

I have an extra copy of the fantastic Bioinformatics Data Skills book by Vince Buffalo (who you should all be following on Twitter at @vsbuffalo). I've come up with a fun little competition to give someone the chance of winning this signed copy.

All you have to do is write a tweet that includes the #ACGT hashtag (so I can track all of the answers), which provides the following information:

Name a useful bioinformatics data skill

The winner will be chosen randomly — hopefully by a powerful scripted solution that Vince will help me with — in two weeks' time. I will post details of any interesting (or funny) suggestions that you come up with on this site. The full details are below.

Competition rules

  1. Tweets should name a 'useful bioinformatics data skill'
  2. Tweets must contain the hashtag #ACGT
  3. Last day to enter the competition is 25th September (midnight PST)
  4. One winner will be chosen randomly
  5. Only one entry per Twitter account
  6. Retweets of tweets that use the #ACGT hashtag will be excluded
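For what it's worth, the random draw itself only needs a couple of lines of code. A minimal Python sketch (the handles are made up, and rule 5 is handled by de-duplicating entrants first):

```python
import random

def pick_winner(entrants, seed=None):
    """Pick one winner at random from a list of Twitter handles.

    Duplicate handles are removed first (one entry per account, as per
    rule 5), and an optional seed makes the draw reproducible.
    """
    unique = sorted(set(entrants))
    return random.Random(seed).choice(unique)

# Made-up entrants; @bob tried to enter twice
entrants = ["@alice", "@bob", "@carol", "@bob"]
print(pick_winner(entrants))
```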

Microsoft's vision for bioinformatics research (caution: NSFW)

I came across this disturbing image on the web today. Warning, may cause offense:


Even more disturbing was the text that accompanied the image, text that appears on Microsoft Research's flickr account (emphasis mine):

The Microsoft Biology Initiative includes several Microsoft biology tools that enable biology and bioinformatics researchers to be more productive in making scientific discoveries. One such tool, the Microsoft Research Biology Extension for Excel, displays the contents of a FASTA file containing an Influenza A virus sequence. By importing FASTA data into Excel, researchers are better able to visualize and analyze information.

The point at which you want to import FASTA files into Excel is the point at which you should probably think about quitting bioinformatics.

Get With the Program: DIY tips for adding coding to your analysis arsenal

A new article in The Scientist magazine by Jeffrey M. Perkel shares some coding advice from Cole Trapnell, C. Titus Brown, and Vince Buffalo (I interviewed Vince in my last blog post). It is a great article, and worth a look. I particularly enjoyed this piece of advice (something that is not mentioned enough):

Treat data as "read-only"
Use an abundance of caution when working with your hard-won data, Buffalo says. For instance, “treat data as read-only.” In other words, don’t work with original copies of the data, make working copies instead. “If you have the data in an Excel spreadsheet and you make a change, that original data is gone forever,” he says.

I have seen too many students double click on FASTA, GFF, and other large bioinformatics text files and end up 'viewing' them in some inappropriate program (including Microsoft Word). If you want to view text data, use a text viewer (such as less).
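Buffalo's "treat data as read-only" advice can also be enforced at the filesystem level rather than relied on as a habit. A minimal Python sketch of the idea (the function and file names are my own, for illustration):

```python
import os
import shutil

def protect_and_copy(original, working_copy):
    """Make the original data file read-only and create a working copy.

    Further analysis should touch only the working copy; any attempt to
    overwrite the original will then fail with a permissions error.
    """
    shutil.copy2(original, working_copy)   # copy first, preserving metadata
    os.chmod(original, 0o444)              # then mark the original r--r--r--

# Example (file names are made up for illustration):
# protect_and_copy("reads_raw.fastq", "reads_working.fastq")
```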

When 'verbose' mode is maybe a little too verbose: lessons from the Trinity transcriptome assembler

The transcriptome assembler Trinity, like many other bioinformatics command-line tools, sends its principal output (a transcriptome assembly) to a named output file. It writes other information about the status of the run to standard output.

Another feature in common with other bioinformatics programs is that it provides a --verbose mode. The Trinity command-line help describes this as follows:

verbose: provide additional job status info during the run

I recently helped a colleague use Trinity to generate a primate transcriptome assembly. We did two runs, one with standard logging and one with verbose output turned on, and in both cases we used file redirection to send the output to a file. So what did we end up with?

  1. transcriptome.fasta - 60.4 MB
  2. stdout.log - 2.1 MB
  3. stdout_verbose.log - 140.7 MB

The verbose log file was roughly 70 times bigger than the standard log file, and over twice the size of the final transcriptome assembly! Converting the verbose text file to a PDF gave me a 15,385-page document. The Unix word count program (wc) tells me that this file contains over 15 million 'words', but the problem is that these are not words that you would necessarily want to read: there are thousands and thousands of pages of percentage-completion updates.

If you run Trinity without redirecting the output to a file, you will just see the percentage-completion number overwrite itself on a single line of output. This doesn't work so well, though, if someone does choose to redirect the output to a file. You could also argue that no one really needs such a high level of precision when reporting the state of completion of each step (four decimal places!).

I think this is an example where the verbose log file ends up being so big as to be largely unusable. If you wanted to search for a specific string in that file, then maybe it would be helpful. The main problem is that the Trinity developers are trying to be smart by having the program overwrite output — regarding the percentage completion status of each step — on various lines of output. However, this is only useful if the user chooses not to redirect the output to a file (something which is incredibly common in bioinformatics). I would argue that in 99% of cases it is more than sufficient for a program to print 10–20 lines of output regarding its state of completion, e.g.

Calculating stage 1 of shamrock.pl…
10% complete
20% complete
30% complete
40% complete
50% complete
60% complete
70% complete
80% complete
90% complete
100% complete
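A program can get the best of both worlds by checking whether its output stream is an interactive terminal before deciding how to report progress. Here is a minimal Python sketch of that idea (the function is my own invention, not Trinity's):

```python
import sys

def report_progress(percent, stream=sys.stdout):
    """Report completion status in a redirection-friendly way.

    When writing to an interactive terminal, overwrite a single line
    using a carriage return. When output has been redirected to a file,
    print a new line only at each 10% milestone, mirroring the sample
    output above.
    """
    if stream.isatty():
        stream.write("\r%3d%% complete" % percent)
        stream.flush()
    elif percent % 10 == 0:
        stream.write("%d%% complete\n" % percent)

# Example: simulate reporting every percentage point of a long step
for pct in range(0, 101):
    report_progress(pct)
```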

Great post by Lex Nederbragt about why we need graph-based representations of genome sequences

In a recent blog post, Lex Nederbragt explains why we all need to be moving to graph-based representations of genome sequences (and getting away from linear representations).

In this post I will provide some background, explain the reasons for moving towards graph-based representations, and indicate some challenges associated with this development.

After listing the many challenges involved in moving towards a graph-based future, he refers to the fact that current efforts have not been widely adopted:

Two file formats to represent the graph have been developed: FASTG and GFA. FASTG has limited uptake; only two assembly programs (ALLPATHS-LG and SPAdes) will output in that format. GFA parsing is currently only experimental in the ABySS assembler, and [the VG program] is able to output it.

The lack of a widely recognized and supported standard for representing the variation inherent in a genome sequence is, in my opinion, a major barrier to moving forward. Almost all bioinformatics software that works with genome sequence data expects a single sequence with no variation. It will require a whole new generation of tools to work with a variant-based format, but tool developers will be reluctant to write new code if there is no clear agreement on what the new de facto file format should be.
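To give a flavour of what tool developers would need to support, here is a minimal Python sketch that reads the segments from GFA-formatted input (the records below are made up; real GFA files also carry links, paths, and optional tags):

```python
def parse_gfa_segments(lines):
    """Collect the segments (named sequences) from GFA-formatted lines.

    GFA is a tab-separated format: 'S' records carry a segment name and
    its sequence, while 'L' records describe links (the graph's edges).
    This sketch reads only the segments and ignores everything else.
    """
    segments = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
    return segments

# A tiny made-up graph: two segments joined by one link
gfa = [
    "H\tVN:Z:1.0",           # header
    "S\ts1\tACGTACGT",       # segment s1
    "S\ts2\tTTGCA",          # segment s2
    "L\ts1\t+\ts2\t+\t4M",   # link: s1 overlaps s2 by 4 bases
]
print(parse_gfa_segments(gfa))  # → {'s1': 'ACGTACGT', 's2': 'TTGCA'}
```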

How to cite bioinformatics resources

I saw a post on Biostars today that asked how specific versions of genome assemblies should be cited. This question also applies to the more general issue of citing any bioinformatics resource which may have multiple releases/versions which are not all formally published in papers. Here is how I replied:

Citing versions of any particular bioinformatics/genomics resources can get tricky because there is often no formal publication for every release of a given dataset. Further complicating the situation is the fact that you will often come across different dates (and even names) for the same resource. E.g. the latest cow genome assembly generated by the University of Maryland is known as 'UMD 3.1.1'. However, the UCSC genome browser uses their own internal IDs for all cow genome assemblies and refers to this as 'bosTau8'. Someone new to the field might see the UCSC version and not know about the original UMD name.

Sometimes you can use the dates of files on FTP sites to approximately date sequence files, but these dates can change (files occasionally get removed and restored from backups, which alters their timestamps).

The key thing to aim for is to provide suitable information so that someone can reproduce your work. In my mind, this requires 2–3 pieces of information:

  1. The name or release number of the dataset you are downloading (provide alternate names when known)
  2. The specific URL for the website or FTP site that you used to download the data
  3. The date on which you downloaded the data

E.g. The UMD 3.1.1 version of the cow genome assembly (also known as bosTau8) was downloaded from the UCSC Genome FTP site (ftp://hgdownload.cse.ucsc.edu/bosTau8/bigZips/bosTau8.fa.gz).

When no version number is available — it is very unhelpful not to provide version numbers of sequence resources: they can, and will, change — I always refer to the date on which I downloaded the data instead.
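These details can even be captured programmatically at download time. A minimal Python sketch (the helper function is my own, and the example date is made up for illustration):

```python
import datetime

def citation_record(name, url, alternate_names=(), download_date=None):
    """Build a one-line citation for a downloaded sequence resource.

    Captures the three suggested pieces of information: the resource
    name (plus any alternate names), the exact download URL, and the
    date of download (defaults to today if not supplied).
    """
    date = download_date or datetime.date.today().isoformat()
    aka = ""
    if alternate_names:
        aka = " (also known as %s)" % ", ".join(alternate_names)
    return "%s%s, downloaded from %s on %s" % (name, aka, url, date)

print(citation_record(
    "UMD 3.1.1 cow genome assembly",
    "ftp://hgdownload.cse.ucsc.edu/bosTau8/bigZips/bosTau8.fa.gz",
    alternate_names=("bosTau8",),
    download_date="2015-04-01",   # made-up date, for illustration
))
```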

Is Amazon's new 'unlimited' cloud drive suitable for bioinformatics?

Amazon have revealed new plans for their cloud drive service. Impressively, their 'Unlimited Everything' plan offers the chance to store an unlimited number of files and documents for just $59.99 per year (after a 3-month free trial no less).

News of this new unlimited storage service caught the attention of more than one bioinformatician.

If you didn't know, bioinformatics research can generate a lot of data. It is not uncommon to see individual files of DNA sequences stored in the FASTQ format reach 15–20 GB in size (and this is just plain text). Such files are nearly always processed to remove errors and contamination, resulting in slightly smaller versions of each file. These processed files are often mapped to a genome or transcriptome, which generates even more output files, and those new files may in turn be processed with other software tools, leading to yet more output.

The raw input data should always be kept in case experiments need to be re-run with different settings, so a typical bioinformatics pipeline may end up generating terabytes of data. Compression can help, but the typical research group will always be generating more and more data, which usually means a constant struggle to store (and back up) everything.

So could Amazon's new unlimited storage offer a way of dealing with the common file-management headache which plagues bioinformaticians (and their sys admins)? Well probably not. Their Terms of Use contain an important section (emphasis mine):

3.2 Usage Restrictions and Limits. The Service is offered in the United States. We may restrict access from other locations. There may be limits on the types of content you can store and share using the Service, such as file types we don't support, and on the number or type of devices you can use to access the Service. We may impose other restrictions on use of the Service.

You may be able to get away with using this service to store large amounts of bioinformatics data, but I don't think Amazon are intending for it to be used in this manner. So it wouldn't surprise me if Amazon quietly started imposing restrictions on certain file types, or slowing bandwidth for heavy users, such that it would be impractical to rely on for day-to-day usage.

Regarding the current state of bioinformatics training

Todd Harris (@tharris) is on a bit of a roll at the moment. Last month I linked to his excellent blog post regarding community annotation, and today I find myself linking to his latest blog post.

Todd makes a convincing argument that bioinformatics education has largely failed, and he lists three reasons for this, the last of which is as follows:

Finally, the nature of much bioinformatics training is too rarefied. It doesn’t spend enough time on core skills like basic scripting and data processing. For example, algorithm development has no place in a bioinformatics overview course, more so if that is the only exposure to the field the student will have.

I particularly empathize with this point. There should be a much greater emphasis on core data processing skills in bioinformatics training, but ideally students should be getting access to some of these skills at an even earlier age. Efforts such as the Hour of Code initiative are helping raise awareness regarding the need to teach coding skills — and it's good to see the President join in with this — but it would be so much better if coding was part of the curriculum everywhere. As Steve Jobs once said:

"I think everybody in this country should learn … a computer language because it teaches you how to think … I view computer science as a liberal art. It should be something that everybody learns, takes a year in their life, one of the courses they take is learn how to program" — Steve Jobs, 1995.

Taken from 'Steve Jobs: The Lost Interview'.

Maybe this is still a pipe dream, but if we can't teach useful coding skills to everyone, we should at least be doing so for everyone who is considering any sort of career in the biological sciences. During my time at UC Davis, I've helped teach basic Unix and Perl skills to many graduate students, but frustratingly this teaching has often come at the end of their first year in grad school. By that point in their graduate training, they have usually already encountered many data management problems without having been equipped with the skills needed to deal with them.

I think that part of the problem is that we still use the label 'bioinformatics training' and this reinforces the distinction from a more generic 'biological training'. It may once have been the case that bioinformatics was its own specialized field, but today I find that bioinformatics mostly just describes a useful set of data processing skills…skills which will be needed by anybody working in the life sciences.

Maybe we need to rebrand 'bioinformatics training', and use a name which better describes the general importance of these skills ('Essential data training for biologists?'). Whatever we decide to call it, it is clear that we need it more than ever. Todd ends his post with a great piece of advice for any current graduate students in the biosciences:

You should be receiving bioinformatics training as part of your core curriculum. If you aren’t, your program is failing you and you should seek out this training independently. You should also ask your program leaders and department chairs why training in this field isn’t being made available to you.


More madness with MAPQ scores (a.k.a. why bioinformaticians hate poor and incomplete software documentation)

I have previously written about the range of mapping quality scores (MAPQ) that you might see in BAM/SAM files, as produced by popular read mapping programs. A very quick recap:

  1. Bowtie 2 generates MAPQ scores in the range 0–42
  2. BWA generates MAPQ scores in the range 0–37
  3. Neither piece of software describes the range of possible scores in their documentation
  4. The SAM specification defines the possible range of MAPQ scores as 0–255 (though 255 should indicate that mapping quality was not available)
  5. I advocated that you should always take a look at your mapped sequence data to see what ranges of scores are present before doing anything else with your BAM/SAM files
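Point 5 above, checking what MAPQ values are actually present in your data, takes only a few lines of code. Here is a minimal Python sketch; the SAM records are made up for illustration:

```python
from collections import Counter

def mapq_counts(sam_lines):
    """Tally the MAPQ values present in SAM-formatted alignment lines.

    MAPQ is the fifth tab-separated field of each alignment record;
    header lines (which start with '@') are skipped.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        counts[int(line.rstrip("\n").split("\t")[4])] += 1
    return counts

# Tiny made-up SAM records (only the first five fields matter here)
sam = [
    "@HD\tVN:1.5",
    "read1\t0\tchr1\t100\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t0\tchr1\t200\t1\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read3\t0\tchr1\t300\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
print(sorted(mapq_counts(sam).items()))  # → [(1, 1), (50, 2)]
```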

So what is my latest gripe? Well, I've recently been running TopHat (version 2.0.13) to map some RNA-Seq reads to a genome sequence. TopHat uses Bowtie (or Bowtie 2) to do the initial mapping of reads to the genome, so you might expect it to generate the same range of MAPQ scores as the standalone version of Bowtie.

But it doesn't.

From my initial testing, it seems that the BAM/SAM output file from TopHat only contains MAPQ scores of 0, 1, 3, or 50. I find this puzzling and incongruous. Why produce only four MAPQ scores (compared to >30 different values that Bowtie 2 can produce), and why change the maximum possible value to 50? I turned to the TopHat manual, but found no explanation regarding MAPQ scores.

Turning to Google, I found this useful Biostars post which suggests that five MAPQ values are possible with TopHat (you can also have a value of 2 which I didn't see in my data), and that these values correspond to the following:

  • 0 = maps to 10 or more locations
  • 1 = maps to 4–9 locations
  • 2 = maps to 3 locations
  • 3 = maps to 2 locations
  • 50 = unique mapping

The post also reveals that, confusingly, TopHat previously used a value of 255 to indicate uniquely mapped reads. However, I then found another Biostars post which says that a MAPQ score of 2 isn't possible with TopHat, and that the meanings of the scores are as follows:

  • 0 = maps to 5 or more locations
  • 1 = maps to 3–4 locations
  • 3 = maps to 2 locations
  • 255 = unique mapping

This post was in reference to an older version of TopHat (1.4.1) which probably explains the use of the 255 score rather than 50. The comments on this post reflect some of the confusion over this topic. Going back to the original Biostars post, I then noticed a recent comment suggesting that MAPQ scores of 24, 28, 41, 42, and 44 are also possible with TopHat (version 2.0.13).

As this situation shows, when there is no official explanation that fully describes how a piece of software should work, it can lead to mass speculation by others. Such speculation can sometimes be inconsistent, which can end up making things even more confusing. This is what drives bioinformaticians crazy.

I find it deeply frustrating when so much of this confusion could be removed with better documentation by the people that developed the original software. In this case the documentation needs just one paragraph added; something along the lines of…

Mapping Quality scores (MAPQ)
TopHat outputs MAPQ scores in the BAM/SAM files with possible values 0, 1, 3, or 50. The first three values indicate mappings to 5 or more, 3–4, or 2 locations respectively, whereas a value of 50 represents a unique match. Please note that older versions of TopHat used a value of 255 for unique matches. Further note that the standalone versions of Bowtie and Bowtie 2 (used by TopHat) produce a different range of MAPQ scores (0–42).

Would that be so hard?