The FASTA file format: a showcase for the best and worst of bioinformatics coding practices

The FASTA file format is one of the oldest recognized formats in bioinformatics and has become the lingua franca for storing sequence information as plain text. It is probably fair to say that far more people know FASTA (the file format) than FASTA (the program). The FASTA program, a tool for comparing DNA or protein sequences, was first described by Bill Pearson and David Lipman back in 1988. However, FASTA (the program) was itself a successor to the original FASTP program, which the same authors described three years earlier, in 1985.

The FASTA tool is still a going concern and the latest version (v36.3.5e) was released in May 2011. Thinking of FASTA as a single tool is a little misleading; it's really a package with over a dozen different programs. Part of the success of the FASTA software is perhaps due to the simplicity of the file format (which merits its own Wikipedia page). Let's see what the FASTA manual says about the format:

FASTA format files consist of a description line, beginning with a ’>’ character, followed by the sequence itself
...
All of the characters of the description line are read, and special characters can be used to indicate additional information about the sequence. In general, non-amino-acid/non-nucleotide sequences in the sequence lines are ignored.

You will note the absence of the word 'unique' in this description, and this brings us to the issue that I really want to talk about. Today, Vince Buffalo sparked an interesting conversation thread on Twitter:

"It's not controversial if I say every FASTA ID should always be treated as a primary, unique foreign key, ya?" Vince Buffalo (@vsbuffalo), June 24, 2013

The FASTA format does not specify whether the description line (sometimes known as the 'header line' or 'definition line') should be unique or not. In some cases, it may not matter. In other cases, it clearly does. I find this issue — which probably doesn't sound very complex — gets to the heart of what is good and bad about bioinformatics. This is because it reveals a lot about the programmers who write code to work with FASTA-formatted files, and about how they deal with error handling.

Let's consider the following hypothetical FASTA file (which contains some overly simplistic DNA sequences):

>AT1G01140.1 (version 1)
AACCGGTT
>AT1G01140.1 (version 2)
TTGGCCAA
>AT1G01140.2 (version 1)
CCGGAATT
>AT1G01140.2 (version 1)
CCGGAATT

The 3rd and 4th entries are clearly duplicated (both FASTA header and sequence), and the other entries have similar, but unique, headers. Of course, the person who is using this sequence file may not be aware of this (a FASTA file could contain millions of sequences). So question number 1 is:

Whose job is it to check for unique FASTA headers? The person who generated the data file, or the person who wrote the software tool that has to process it?

We frequently generate FASTA files directly from websites and databases, so I think that we, as bioinformatics users, have increasingly put too much faith in assuming that someone else (the person who wrote the FASTA-exporting code) has already tackled the problem for us.
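If you would rather not take that on faith, checking is cheap. Here is a minimal sketch, assuming the whole file fits in memory and that the entire description line (minus the '>') is the thing that should be unique. The names read_fasta and duplicate_headers are my own inventions for illustration, not part of any existing library.

```python
from collections import Counter


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def duplicate_headers(text):
    """Return {header: count} for every header seen more than once."""
    counts = Counter(header for header, _ in read_fasta(text))
    return {h: n for h, n in counts.items() if n > 1}


# The hypothetical file from earlier in this post.
EXAMPLE = """\
>AT1G01140.1 (version 1)
AACCGGTT
>AT1G01140.1 (version 2)
TTGGCCAA
>AT1G01140.2 (version 1)
CCGGAATT
>AT1G01140.2 (version 1)
CCGGAATT
"""

print(duplicate_headers(EXAMPLE))
```

On the example file this reports a single duplicated header, 'AT1G01140.2 (version 1)', seen twice.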

This problem starts to get more complex when I tell you that I have used various bioinformatics tools in the past that have done one of two things with FASTA headers:

  1. Ignore any characters beyond the nth character
  2. Ignore any characters after whitespace
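To make those two behaviours concrete, here is a small sketch that counts how many distinct keys the four headers from the hypothetical file produce under each policy. The function names are hypothetical, and the cutoff of 8 characters is chosen purely for illustration; real tools that truncate may use other limits.

```python
# Headers from the hypothetical file above (without the leading '>').
headers = [
    "AT1G01140.1 (version 1)",
    "AT1G01140.1 (version 2)",
    "AT1G01140.2 (version 1)",
    "AT1G01140.2 (version 1)",
]


def full_header(header):
    """Use the whole description line as the key."""
    return header


def up_to_whitespace(header):
    """Use only the text before the first whitespace as the key."""
    parts = header.split()
    return parts[0] if parts else ""


def first_n_chars(header, n=8):
    """Use only the first n characters as the key (n=8 is an assumption)."""
    return header[:n]


def count_unique(headers, make_key):
    """Number of distinct keys a given policy produces."""
    return len({make_key(h) for h in headers})


for name, make_key in [
    ("full header", full_header),
    ("up to whitespace", up_to_whitespace),
    ("first 8 characters", first_n_chars),
]:
    print(name, count_unique(headers, make_key))
```

The same four entries collapse to three, two, or one distinct key depending on nothing more than how the program reads the header.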

In both of these situations, a program might consider the above file to contain only two unique entries (if it ignores everything after whitespace), or even one unique entry (if it ignores everything after 8 characters). This is clearly dangerous, and if the program in question needs to use a unique FASTA header as a key in a hash, then duplicate headers may cause problems. Hopefully, a good programmer won't impose arbitrary limits like this. However, this brings me to question number 2:

How should a program deal with duplicate FASTA headers?

There are many options here:

  1. Refuse to run without telling you what the problem is (I've seen this before)
  2. Refuse to run and tell you that there are duplicate headers
  3. Refuse to run and tell you which headers are duplicated
  4. Run and simply ignore any entries with duplicate headers
  5. Run and simply ignore any entries with duplicate headers, and report the duplicates
  6. Check whether entries with duplicate headers also have duplicate sequences. This could potentially reveal whether it is a genuine duplication of the entire entry, or whether different genes/sequences just happen to share the same name/identifier (this is possible, if unlikely, with short gene names from different species).
  7. Check for duplicate headers, report on duplicates, but don't stop the program running and instead add an extra character to duplicate headers to make them unique (e.g. append .1, .2, .3 etc)
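As a rough illustration of how options 6 and 7 might be combined, here is a sketch that drops entries whose header and sequence are both duplicated, renames entries that share a header but differ in sequence, and reports everything it did. The function name resolve_duplicates, the record layout, and the geneA/geneB identifiers are all hypothetical; this is one possible policy, not a recommendation for every tool.

```python
def resolve_duplicates(records):
    """Take a list of (header, sequence) pairs.

    Returns (cleaned_records, report), where report is a list of
    human-readable messages describing every change that was made.
    """
    # Names that renamed entries must avoid: every header in the input.
    taken = {header for header, _ in records}
    kept = {}  # original header -> set of sequences already kept under it
    cleaned, report = [], []

    for header, seq in records:
        if header not in kept:
            # First time we see this header: keep it as-is.
            kept[header] = {seq}
            cleaned.append((header, seq))
        elif seq in kept[header]:
            # Same header, same sequence: a genuine duplicate entry.
            report.append(f"dropped exact duplicate of '{header}'")
        else:
            # Same header, different sequence: keep it under a new name.
            n = 1
            while f"{header}.{n}" in taken:
                n += 1
            new_header = f"{header}.{n}"
            taken.add(new_header)
            kept[header].add(seq)
            cleaned.append((new_header, seq))
            report.append(
                f"renamed '{header}' to '{new_header}' "
                "(same header, different sequence)"
            )
    return cleaned, report


# Made-up records for illustration.
records = [
    ("geneA", "AACC"),
    ("geneA", "GGTT"),  # same header, different sequence
    ("geneA", "AACC"),  # exact duplicate of the first entry
    ("geneB", "TTTT"),
]
cleaned, report = resolve_duplicates(records)
for message in report:
    print(message)
```

Note that the renaming loop also checks the new name against every header already in the input, so a renamed entry cannot collide with a real one.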

The last option is the most work for the programmer but will catch most errors and problems. But how many of us would ever take that much time to catch what might be user error, or might just as easily be a glitch in the FASTA export routine of a different program?

This whole scenario gets messier still when you consider that some institutions (I'm looking at you, NCBI) have imposed their own guidelines on FASTA description lines. In the original Twitter thread, Matt MacManes suggested that FASTA headers be no longer than 100 characters. Sounds good in theory, but then you run into entries from databases like FlyBase such as this one:

>FBgn0046814 type=gene; loc=2L:complement(9950437..9950455); ID=FBgn0046814; name=mir-87; dbxref=FlyBase:FBgn0046814,FlyBase:FBan0032981,FlyBase_Annotation_IDs:CR32981,GB:AE003626,MIR:MI0000382,flight:FBgn0046814,flymine:FBgn0046814,hdri:FBgn0046814,ihop:1478786; derived_computed_cyto=30F1-30F1%3B Limits computationally determined from genome sequence between @P{EP}CG5899<up>EP701</up>@ and @P{EP}CG4747<up>EP594</up>@%26@P{lacW}l(2)k13305<up>k13305</up>@; gbunit=AE014134; MD5=10a65ff8961e4823bad6c34e37813302; length=19; release=r5.21; species=Dmel;
TGAGCAAAATTTCAGGTGT

That's 555 characters of FASTA description for 19 characters of sequence. You would hope that this FASTA header line is unique!

Introducing JABBA: Just Another Bogus Bioinformatics Acronym

For too long I have stood on the sidelines and remained silent. For too long I have witnessed atrocities and said nothing. For too long I have watched while people have maimed, mangled, and monkeyed around with the names of bioinformatics databases and tools.

Things have gone too far and I can remain silent no longer. As of today, I vow to bring these abominable appellations to justice. There are many, many bioinformatics tools and databases out there and while I accept that they all need a name, I don't accept that every name has to be an acronym (or initialism). This is especially so when the acronym appears to be formed from the awkward assembly of letters from the full name of the software/database.

Just as Jonathan Eisen has given the world the Badomics awards, I think it is time for me to introduce the Just Another Bogus Bioinformatics Acronym award, or JABBA for short. The inaugural winner of the JABBA award goes to a bioinformatics tool that's just been described in a new publication in Nucleic Acids Research:

BeAtMuSiC: prediction of changes in protein–protein binding affinity on mutations

Just take a moment to let that name sink in. If you are like me, you might be wondering how 'prediction of changes in protein-protein binding affinity on mutations' could give rise to 'BeAtMuSiC'. But before we get to that, let's consider the three principal ways in which you can form a bad name for a bioinformatics tool, and let's see how 'BeAtMuSiC' achieves the triple-whammy:

  1. Choose a shortened name for your tool that is cute, funny, or unusual but which bears no relationship at all to bioinformatics. This gives you the added bonus of making it that much harder to find with a Google search.
  2. Introduce some capitalization into your shortened name to make it so much less pleasing on the eye.
  3. Make no effort to ensure that your shortened name is a proper acronym and instead, just pick letters seemingly at random from the full name of your bioinformatics tool or database.

The last point is worth dwelling on. The BeAtMuSiC name suggests that the full name of this tool will include the letters: B, A, M, S, and C. You might also assume that these letters in the shortened name would occur in the same order in the full name! But you would be wrong. A quick trip to the BeAtMuSiC website reveals that a) they really like the music theme and b) there is no logic to how this tool has been named.

The full name of this tool, as described on the website, is 'prediction of Binding Affinity Changes upon MutationS'. This is slightly different to the subtitle of the paper described above, but let's assume the website version is the definitive arrangement of the name. The website shows how the 'C' in BeAtMuSiC can come before the 'M' and 'S' in the full name, because they put 'upon MutationS' on a second line of text in such a way that reveals that they are only considering the horizontal spacing of characters. Genius!

[Image: screenshot of the BeAtMuSiC website, 2013-06-23]
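For anyone who wants to audit a name before inflicting it on the world, the 'same letters, same order' test is easy to automate. This is a minimal sketch of my own (the function name capitals_in_order is made up), and it assumes that the capital letters in the short name are meant to map onto the capital letters in the full name, read left to right.

```python
def capitals_in_order(short_name, full_name):
    """True if the capitals of short_name appear, in the same order,
    among the capitals of full_name (i.e. as a subsequence)."""
    wanted = [c for c in short_name if c.isupper()]
    # 'letter in available' advances the iterator until it finds a match,
    # so each capital in full_name can be used at most once, in order.
    available = iter(c for c in full_name if c.isupper())
    return all(letter in available for letter in wanted)


print(capitals_in_order(
    "BeAtMuSiC",
    "prediction of Binding Affinity Changes upon MutationS",
))
print(capitals_in_order(
    "JABBA",
    "Just Another Bogus Bioinformatics Acronym",
))
```

The first call comes out False, because the only capital 'C' in the full name appears before the 'M' and 'S' rather than after them. The second comes out True.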

Congratulations, BeAtMuSiC...you are the inaugural winner of the JABBA award! I am saddened by the knowledge that many more will follow.

Update June 24th: Mick Watson pointed out to me on Twitter that my idea of only considering the horizontal arrangement of letters still doesn't work. He's right. The 'C' ends up in the right place but you also have the 'M' before the 'A'. So in summary, nothing about this name makes sense.

Gender differences in UC Davis VetMed students (1952–2011)

Sometimes I find myself walking down corridors on campus where you can see the graduation photos for past years of the UC Davis School of Veterinary Medicine. I'm struck by how much the gender balance has changed. In 1952, the oldest year for which a photo is shown, there are no women at all. I suspect this is true of many other academic departments from this era, particularly in STEM subjects.

[Image: the 1952 graduation photo (IMG_1327.jpg)]

Fast forward almost 60 years, and the situation is very different. The latest graduation photo, from 2011, shows that 102 of the 125 students (about 82%) are women. Quite a reversal. I wonder how many other departments or schools have seen such a dramatic switch.

[Image: the 2011 graduation photo (IMG_1325.jpg)]

Here is how the gender balance has changed over time.

[Figure: gender balance of graduating classes over time (vetmed.png)]