Is Amazon's new 'unlimited' cloud drive suitable for bioinformatics?

Amazon have revealed new plans for their cloud drive service. Impressively, their 'Unlimited Everything' plan offers the chance to store an unlimited number of files and documents for just $59.99 per year (after a 3-month free trial no less).

News of this new unlimited storage service caught the attention of more than one bioinformatician:

If you didn't know, bioinformatics research can generate a lot of data. It is not uncommon to see individual files of DNA sequences stored in the FASTQ format reach 15–20 GB in size (and this is just plain text). Such files are nearly always processed to remove errors and contamination, resulting in slightly smaller versions of each file. These processed files are often mapped to a genome or transcriptome, which generates even more output files. The new output files in turn may be processed with other software tools, leading to yet more output files. The raw input data should always be kept in case experiments need to be re-run with different settings, so a typical bioinformatics pipeline may end up generating terabytes of data. Compression can help, but the typical research group will always be generating more and more data, which usually means a constant struggle to store (and back up) everything.
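
To put some rough numbers on the problem, here is a minimal sketch (my own, not taken from any particular pipeline) of how you might tally the disk footprint of the sequence files in a project directory. The directory name and the list of file extensions are placeholder assumptions:

    # Sketch: add up the disk space used by sequence-related files in a project.
    # The directory name and the extensions are hypothetical placeholders.
    import os

    PIPELINE_DIR = "my_rnaseq_project"                     # hypothetical project directory
    EXTENSIONS = (".fastq", ".fastq.gz", ".fq", ".fq.gz",  # raw and compressed reads
                  ".sam", ".bam")                          # mapped output files

    total_bytes = 0
    for root, _dirs, files in os.walk(PIPELINE_DIR):
        for name in files:
            if name.endswith(EXTENSIONS):
                total_bytes += os.path.getsize(os.path.join(root, name))

    print(f"Sequence-related files occupy {total_bytes / 1e9:.1f} GB")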

So could Amazon's new unlimited storage offer a way of dealing with the common file-management headache which plagues bioinformaticians (and their sys admins)? Well, probably not. Their Terms of Use contain an important section (emphasis mine):

3.2 Usage Restrictions and Limits. The Service is offered in the United States. We may restrict access from other locations. There may be limits on the types of content you can store and share using the Service, such as file types we don't support, and on the number or type of devices you can use to access the Service. We may impose other restrictions on use of the Service.

You may be able to get away with using this service to store large amounts of bioinformatics data, but I don't think Amazon are intending it to be used in this manner. So it wouldn't surprise me if Amazon quietly started imposing restrictions on certain file types, or throttling bandwidth for heavy users, making it impractical to rely on the service for day-to-day use.

Google and WormBase: these are not the search results you're looking for

Today I wanted to look up a particular gene in the WormBase database. Rather than go to the WormBase website, I thought I would just search Google for the word 'wormbase' followed by the gene name (rpl-22). Surely this would be enough to put the Gene Summary page for rpl-22 at the top of the results?

Sadly no. Here are the results that I was presented with:

All ten of these results include information from the WormBase database regarding the rpl-22 gene and/or link to the WormBase page for the gene. But there are no search results for wormbase.org at all.

Very odd. Is WormBase not allowing themselves to be indexed by search engines? I see a similar lack of wormbase.org results when using Bing, Ask, or DuckDuckGo. However, if I search Google for flybase rpl-22 or pombase rpl-22, I find the desired fly/yeast orthologs of the worm rpl-22 gene as the top Google result.
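
Out of curiosity, one quick (and admittedly incomplete) way to check whether a site is asking crawlers to stay away is to look at its robots.txt file. This is just my own sketch; an empty or permissive file wouldn't settle the question, since meta robots tags and other mechanisms can also block indexing:

    # Sketch: fetch wormbase.org's robots.txt to see whether crawlers are being
    # asked to stay away. This checks only one of several possible indexing blocks.
    from urllib.request import urlopen

    with urlopen("http://www.wormbase.org/robots.txt", timeout=10) as response:
        print(response.read().decode("utf-8", errors="replace"))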

Regarding the current state of bioinformatics training

Todd Harris (@tharris) is on a bit of a roll at the moment. Last month I linked to his excellent blog post regarding community annotation, and today I find myself linking to his latest blog post:

Todd makes a convincing argument that bioinformatics education has largely failed, and he lists three reasons for this, the last of which is as follows:

Finally, the nature of much bioinformatics training is too rarefied. It doesn’t spend enough time on core skills like basic scripting and data processing. For example, algorithm development has no place in a bioinformatics overview course, more so if that is the only exposure to the field the student will have.

I particularly empathize with this point. There should be a much greater emphasis on core data processing skills in bioinformatics training, but ideally students should be getting access to some of these skills at an even earlier age. Efforts such as the Hour of Code initiative are helping raise awareness regarding the need to teach coding skills — and it's good to see the President join in with this — but it would be so much better if coding was part of the curriculum everywhere. As Steve Jobs once said:

"I think everybody in this country should learn … a computer language because it teaches you how to think … I view computer science as a liberal art. It should be something that everybody learns, takes a year in their life, one of the courses they take is learn how to program" — Steve Jobs, 1995.

Taken from 'Steve Jobs: The Lost Interview'.

Maybe this is still a pipe dream, but if we can't teach useful coding skills to everyone, we should at least be doing this for everyone who is considering any sort of career in the biological sciences. During my time at UC Davis, I've helped teach some basic Unix and Perl skills to many graduate students, but frustratingly this teaching has often come at the end of their first year of grad school. By that point in their graduate training, they have often already encountered many data-management problems without having been equipped with the skills needed to deal with them.

I think that part of the problem is that we still use the label 'bioinformatics training' and this reinforces the distinction from a more generic 'biological training'. It may once have been the case that bioinformatics was its own specialized field, but today I find that bioinformatics mostly just describes a useful set of data processing skills…skills which will be needed by anybody working in the life sciences.

Maybe we need to rebrand 'bioinformatics training', and use a name which better describes the general importance of these skills ('Essential data training for biologists?'). Whatever we decide to call it, it is clear that we need it more than ever. Todd ends his post with a great piece of advice for any current graduate students in the biosciences:

You should be receiving bioinformatics training as part of your core curriculum. If you aren’t, your program is failing you and you should seek out this training independently. You should also ask your program leaders and department chairs why training in this field isn’t being made available to you.


Thinking of naming your bioinformatics software? Be afraid (of me), be very afraid (of me)

Saw this tweet today by Jessica Chong (@jxchong):

I like the idea that people might be fearful of choosing a name that could provoke me into writing a JABBA post. If my never-ending tirade about bogus bioinformatics acronyms causes some people to at least think twice about their intended name, then I will take that as a minor victory for this website!

More madness with MAPQ scores (a.k.a. why bioinformaticians hate poor and incomplete software documentation)

I have previously written about the range of mapping quality scores (MAPQ) that you might see in BAM/SAM files, as produced by popular read mapping programs. A very quick recap:

  1. Bowtie 2 generates MAPQ scores in the range 0–42
  2. BWA generates MAPQ scores in the range 0–37
  3. Neither piece of software describes the range of possible scores in its documentation
  4. The SAM specification defines the possible range of MAPQ scores as 0–255 (though 255 should indicate that mapping quality was not available)
  5. I advocated that you should always take a look at your mapped sequence data to see what range of scores is present before doing anything else with your BAM/SAM files (a quick way of doing this is sketched below)
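
As a concrete example, here is a minimal sketch of that sanity check in Python. The file name is a placeholder, and a BAM file would first need converting to SAM (for example with samtools view) or reading via a library such as pysam:

    # Sketch: tally the MAPQ values present in a SAM file before doing anything
    # else with it. The file name below is hypothetical.
    from collections import Counter

    mapq_counts = Counter()
    with open("mapped_reads.sam") as sam:
        for line in sam:
            if line.startswith("@"):           # skip SAM header lines
                continue
            fields = line.rstrip("\n").split("\t")
            mapq_counts[int(fields[4])] += 1   # column 5 of a SAM record is MAPQ

    for mapq, count in sorted(mapq_counts.items()):
        print(f"MAPQ {mapq}: {count:,} reads")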

So what is my latest gripe? Well, I've recently been running TopHat (version 2.0.13) to map some RNA-Seq reads to a genome sequence. TopHat uses Bowtie (or Bowtie 2) as the tool to do the initial mapping of reads to the genome, so you might expect it to generate the same range of MAPQ scores as the standalone version of Bowtie.

But it doesn't.

From my initial testing, it seems that the BAM/SAM output file from TopHat only contains MAPQ scores of 0, 1, 3, or 50. I find this puzzling and incongruous. Why produce only four MAPQ scores (compared to >30 different values that Bowtie 2 can produce), and why change the maximum possible value to 50? I turned to the TopHat manual, but found no explanation regarding MAPQ scores.

Turning to Google, I found this useful Biostars post which suggests that five MAPQ values are possible with TopHat (you can also have a value of 2, which I didn't see in my data), and that these values correspond to the following:

  • 0 = maps to 10 or more locations
  • 1 = maps to 4-9 locations
  • 2 = maps to 3 locations
  • 3 = maps to 2 locations
  • 50 = unique mapping

The post also reveals that, confusingly, TopHat previously used a value of 255 to indicate uniquely mapped reads. However, I then found another Biostars post which says that a MAPQ score of 2 isn't possible with TopHat, and that the meanings of the scores are as follows:

  • 0 = maps to 5 or more locations
  • 1 = maps to 3-4 locations
  • 3 = maps to 2 locations
  • 255 = unique mapping

This post was in reference to an older version of TopHat (1.4.1), which probably explains the use of the 255 score rather than 50. The comments on this post reflect some of the confusion over this topic. Going back to the original Biostars post, I then noticed a recent comment suggesting that MAPQ scores of 24, 28, 41, 42, and 44 are also possible with TopHat (version 2.0.13).

As this situation shows, when there is no official explanation that fully describes how a piece of software should work, others are left to speculate. Such speculation can sometimes be inconsistent, which can end up making things even more confusing. This is what drives bioinformaticians crazy.

I find it deeply frustrating when so much of this confusion could be removed with better documentation by the people that developed the original software. In this case the documentation needs just one paragraph added; something along the lines of…

Mapping Quality scores (MAPQ)
TopHat outputs MAPQ scores in its BAM/SAM files with possible values 0, 1, 3, or 50. The first three values indicate mappings to 5 or more, 3–4, or 2 locations respectively, whereas a value of 50 represents a unique match. Please note that older versions of TopHat used a value of 255 for unique matches. Further note that standalone versions of Bowtie and Bowtie 2 (used by TopHat) produce a different range of MAPQ scores (0–42).

Would that be so hard?

New paper provides a great overview of the current state of genome assembly

The following paper by Stephen Richards and Shwetha Murali has just appeared in the journal Current Opinion in Insect Science:

Best practices in insect genome sequencing: what works and what doesn’t

In some ways I wish they had chosen a different title, as the focus of this paper is much more on genome assembly than on genome sequencing. Furthermore, it provides a great overview of all of the current strategies in genome assembly, which should be of interest to anyone trying to work out the best way of putting a genome together, insect researcher or not. Here is part of the legend from a very informative table in the paper:

Table 1 — De novo genome assembly strategies:
Assembly software is designed for a specific sequencing and assembly strategy. Thus sequence must be generated with the assembly software and algorithm in mind, choosing a sequence strategy designed for a different assembly algorithm, or sequencing without thinking about assembly is usually a recipe for poor un-publishable assemblies. Here we survey different assembly strategies, with different sequence and library construction requirements.

Bioinformatics software names: the good, the bad, and the ugly

The Good

Given that I spend so much time criticising bad bioinformatics names, I probably should make more of an effort to flag those occasional names that I actually like! Here are a few:

RNAcentral: an international database of ncRNA sequences

A good reminder that a bioinformatics tool doesn't have to use acronyms or initialisms! The name is easy to remember and makes it fairly obvious what you might expect to find in this database.


KnotProt: a database of proteins with knots and slipknots

A simple, clever, and memorable name. And once again, no acronym!


WormBase and FlyBase

Some personal bias here — I spent four years working at WormBase — but you have to admire the simplicity and elegance of the names. 'WormBase' sort of replaced its predecessor ACeDB (A Caenorhabditis elegans DataBase). I say 'sort of' because ACeDB was the name for both the underlying software (which continued to be used by WormBase) and the specific instance of the database that contained C. elegans data. This led to the somewhat confusing situation (circa 2000) of there being many public ACeDB databases for many different species, only one of which was the actual ACeDB resource with worm data.


The Bad

These are all worthy of a JABBA award:

The human DEPhOsphorylation database DEPOD: a 2015 update

I find it amusing that they couldn't even get the acronym correctly capitalized in the title of the paper. As the abstract confirms, the second 'D' in 'DEPOD' comes from the word 'database', which should be capitalized. So it is another tenuous selection of letters to form the name of the database, but I guess at least the name is unique, and Google searches for depod database don't have any trouble finding this resource.


IMGT®, the international ImMunoGeneTics information system® 25 years on

It's a registered trademark and that little R appears at every mention of the name in the paper. This initialism is the first I've seen where all letters of the short name come from one word in the full name.


DoGSD: the dog and wolf genome SNP database

I have several issues with this:

  1. It's a poor acronym (not explicitly stated in the paper): Dog and wolf Genome Snp Database
  2. The word 'dog' contributes a 'D' to the name, but then you end up with 'DoG' in the final name. It looks odd.
  3. What did the poor wolf do to not get featured in the database name?
  4. The lower-case 'o' means that you can potentially read this as dog-ess-dee or do-gee-ess-dee.
  5. Why focus the name on just two types of canine species? What if they wanted to add SNPs from jackals or coyotes? Would they change the name of the database? They could have just called this something like 'The Canine SNP Database' and avoided all of these problems.

The Ugly

Maybe not JABBA-award winners, but they come with their own problems:

MulRF: a software package for phylogenetic analysis using multi-copy gene trees

Sometimes using the lower-case letter 'L' in any name is just asking for trouble. Depending on the font, it can look like the number 1 or even a pipe character '|'. The second issue concerns the pronounceability of this name. Mull-urf? Mull-ar-eff? It doesn't trip off the tongue.


DupliPHY-Web: a web server for DupliPHY and DupliPHY-ML

This tool is all about looking for gene duplications from a phylogenetic perspective, hence 'Dupli' + 'PHY'. I actually think this is quite a good choice of name, except for the inconsistent, and visually jarring, use of mixed case. Why not just 'Dupliphy'?


ChiTaRS 2.1—an improved database of the chimeric transcripts and RNA-seq data with novel sense–antisense chimeric RNA transcripts

It's not spelt out in detail, but one can assume that 'ChiTaRS' derives from the following letters: CHImeric Transcripts And Rna-Seq data. So it is not a bogus bioinformatics acronym in that respect. But I find it visually unappealing. Mixed capitalization like this never scans well.


DoRiNA 2.0—upgrading the doRiNA database of RNA interactions in post-transcriptional regulation

The paper doesn't explicitly state how the word 'DoRiNA' is formed other than saying:

we have built the database of RNA interactions (doRiNA)

So one can assume that those letters are derived from 'Database Of Rna INterActions'. On the plus side, it is a unique name that is easily searchable with Google. On the negative side, it seems strange to have 'RNA' as part of your database name, only with an additional letter inserted in between.

Metassembler: Merging and optimizing de novo genome assemblies

There's a great new paper on bioRxiv by Alejandro Hernandez Wences and Michael Schatz. They directly address something I wondered about as we were running the Assemblathon contests. Namely, can you combine some of the submitted assemblies to make an even better assembly? Well, the answer seems to be a resounding 'yes'.

For each of three species in the Assemblathon 2 project we applied our algorithm to the top 6 assemblies as ranked by the cumulative Z-score reported in the paper…

We evaluated the correctness and contiguity of the metassembly at each merging step using the metrics used by the Assemblathon 2 evaluation…

In all three species, the contiguity statistics are significantly improved by our metassembly algorithm

Hopefully their Metassembler tool will be useful in improving many other poor quality assemblies that are out there!

My favorite bioinformatics blogs of 2014

Yeah, so I’m a little late in getting around to writing this! The following are not presented in any particular order…


Blog: In between lines of code
Creator: Lex Nederbragt (@lexnederbragt)
Frequency of updates: maybe 1 post a month on average (but it can vary)
Recommended blog post: Developments in next generation sequencing – June 2014 edition

This blog primarily focuses on developments in modern sequencing technology and genome assembly. Required reading if you have an interest in the state of current sequencing technologies, and more importantly, where they are heading.


Blog: Bioinformatician at large
Creator: Ewan Birney (@ewanbirney)
Frequency of updates: less than 1 post a month on average
Recommended blog post: A cheat’s guide to histone modifications

Ewan doesn’t update the blog very often, but when he does post, he usually takes the time to provide us with a very detailed look at some aspect of genomics. Many of his posts explore the underlying science that people are addressing through various genomics/bioinformatics approaches.


Blog: Loman Labs Blog
Creator: Nick Loman (@pathogenomenick)
Frequency of updates: 1–2 posts a month on average
Recommended blog post: The infinite lie of easy bioinformatics

Nick covers lots of material relating to metagenomics and sequencing in general. As a self-confessed Oxford Nanopore fan boy he has some interesting thoughts and observations to share about this nascent sequencing technology, but he writes about most modern sequencing technologies. He also likes the occasional rant from time to time (don’t we all?).


Blog: Living in an Ivory Basement (Stochastic thoughts on science, testing, and programming)
Creator: C. Titus Brown (@ctitusbrown), a.k.a. Chuck Norris
Frequency of updates: several posts a week
Recommended blog post: Some myths of reproducible computational research

Titus covers a lot of different material on his blog. Many posts see him ‘thinking out loud’ on an issue, keeping people updated with developments in his training courses, or asking people for their thoughts or suggestions on a topic. There are also detailed scientific posts relating to his interests in kmer-based approaches to genome assembly. Being a keen advocate (and practitioner) of open and reproducible science, Titus also uses his blog to write on these topics.


Blog: Omics! Omics! A computational biologist’s personal views on new technologies & publications on genomics & proteomics and their impact on drug discovery
Creator: Keith Robison (@omicsomicsblog)
Frequency of updates: 1–2 posts per month (but many more during AGBT!)
Recommended blog post: A Sunset for Draft Genomes?

This blog is predominantly focused on the latest developments in sequencing technology. Keith provides great insight into future developments in the world of sequencing, and also tries to make sense of the claims and marketing hype that sometimes surround the announcements of new technologies. During the annual Advances in Genome Biology and Technology (AGBT) meeting, you can rely on Keith to provide great commentary on what is happening (and not happening) at the meeting.


Blog: Opiniomics: bioinformatics, genomes, biology etc. “I don’t mean to sound angry and cynical, but I am, so that’s how it comes across”
Creator: Mick Watson (@biomickwatson)
Frequency of updates: 1–2 posts per week
Recommended blog post: Why anonymous peer review is bad for science

Of all the blogs that I’m including here, this is probably my favorite. I greatly enjoy Mick’s writing; not so much for the detailed technical posts about sequencing technology — good though these are — but for the fantastic pieces he writes about the wider field of bioinformatics. Mick has insightful views on such topics as peer review, training of bioinformaticians, and reproducible science. I particularly like Mick’s frequently humorous — and sometimes slightly ranty — style of writing. Oh, I should also point out that Mick’s site is best viewed on an iPad.