The FASTA file format: a showcase for the best and worst of bioinformatics coding practices

June 25, 2013 by Keith Bradnam

The FASTA file format is one of the oldest recognized formats in bioinformatics and has become the lingua franca when trying to store sequence information in a plain text format. It is probably true to say that many people are much more likely to know of FASTA (the file format) than FASTA (the program). The FASTA program, a tool for comparing DNA or protein sequences, was first described by Bill Pearson and David Lipman back in 1988. However, FASTA (the program) was itself a successor to the original FASTP program that was described a few years earlier by the same authors in 1985.

The FASTA tool is still a going concern and the latest version (v36.3.5e) was released in May 2011. Thinking of FASTA as a single tool is a little misleading; it's really a package with over a dozen different programs. Part of the success of the FASTA software is perhaps due to the simplicity of the file format (which merits its own Wikipedia page). Let's look to see what the FASTA manual says about the format:

FASTA format files consist of a description line, beginning with a ’>’ character, followed by the sequence itself
...
All of the characters of the description line are read, and special characters can be used to indicate additional information about the sequence. In general, non-amino-acid/non-nucleotide sequences in the sequence lines are ignored.

You will note the absence of the word 'unique' in this description and this brings us to the issue that I really want to talk about. Today, Vince Buffalo sparked an interesting conversation thread on twitter:

It's not controversial if I say every FASTA ID should always be treated as a primary, unique foreign key, ya?— Vince Buffalo (@vsbuffalo) June 24, 2013

The FASTA format does not specify whether the description line (sometimes known as the 'header line' or 'definition line') should be unique or not. In some cases, it may not matter. In other cases, it clearly does. I find this issue — which probably doesn't sound very complex — gets to the heart of what is good and bad about bioinformatics. This is because it reveals a lot about the programmers who write code to work with FASTA-formatted files, and about how they deal with error handling.

Let's consider the following hypothetical FASTA file (which contains some overly simplistic DNA sequences):

>AT1G01140.1 (version 1)
AACCGGTT
>AT1G01140.1 (version 2)
TTGGCCAA
>AT1G01140.2 (version 1)
CCGGAATT
>AT1G01140.2 (version 1)
CCGGAATT

The 3rd and 4th entries are clearly duplicated (both FASTA header and sequence), and the other entries have similar, but unique, headers. Of course, the person who is using this sequence file may not be aware of this (a FASTA file could contain millions of sequences). So question number 1 is:

Whose job is it to check for unique FASTA headers? The person who generated the data file, or the person who wrote the software tool that has to process it?

We frequently generate FASTA files directly from websites and databases, so I think that we, as bioinformatics users, have increasingly put too much faith in assuming that someone else (the person who wrote the FASTA-exporting code) has already tackled the problem for us.

This problem starts to get more complex when I tell you that I have used various bioinformatics tools in the past that have done one of two things with FASTA headers:

Ignore any characters beyond the nth character
Ignore any characters after whitespace
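
As a rough illustration of these two behaviours, the sketch below applies each of them to the four headers from the hypothetical file above and counts how many distinct keys survive. The function names and the 8-character limit are my own assumptions for the example; they are not taken from any particular tool.

```python
# The four header lines from the hypothetical file (leading '>' removed).
headers = [
    "AT1G01140.1 (version 1)",
    "AT1G01140.1 (version 2)",
    "AT1G01140.2 (version 1)",
    "AT1G01140.2 (version 1)",
]


def truncate_at_whitespace(header):
    """Keep only the text before the first whitespace character."""
    parts = header.split(maxsplit=1)
    return parts[0] if parts else ""


def truncate_at_length(header, n):
    """Keep only the first n characters (n is an arbitrary, illustrative limit)."""
    return header[:n]


# Distinct keys a program would see under each policy.
full = set(headers)
by_whitespace = {truncate_at_whitespace(h) for h in headers}
by_length = {truncate_at_length(h, 8) for h in headers}

print("full header:       ", len(full), sorted(full))
print("up to whitespace:  ", len(by_whitespace), sorted(by_whitespace))
print("first 8 characters:", len(by_length), sorted(by_length))
```

Even with the complete header the file only has three distinct keys, because the last two entries are true duplicates. Any truncation can only make that count smaller.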

In both of these situations, a program might consider the above file to contain only two unique entries (ignores after whitespace), or even one unique entry (ignores after 8 characters). This is clearly dangerous, and if the program in question needs to use a unique FASTA header as a key in a hash, then duplicate headers may cause problems. Hopefully, a good programmer won't impose arbitrary limits like this. However, this brings me to question number 2:

How should a program deal with duplicate FASTA headers?

There are many options here:

Refuse to run without telling you what the problem is (I've seen this before)
Refuse to run and tell you that there are duplicate headers
Refuse to run and tell you which headers are duplicated
Run and simply ignore any entries with duplicate headers
Run and simply ignore any entries with duplicate headers, and report the duplicates
Check whether entries with duplicate headers also have duplicate sequences. This could potentially reveal whether it is a genuine duplication of the entire entry, or whether different genes/sequences just happen to share the same name/identifier (this is possible, if unlikely, with short gene names from different species).
Check for duplicate headers, report on duplicates, but don't stop the program running and instead add an extra character to duplicate headers to make them unique (e.g. append .1, .2, .3 etc)

The last option is the most work for the programmer but will catch most errors and problems. But how many of us would ever take that much time to catch what might be user error, but which might be a glitch in the FASTA export routine of a different program?
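
For what it's worth, that last option need not be a huge amount of code. The following is only a minimal sketch, assuming the whole file fits in memory and that a numeric suffix appended to the end of the header is acceptable; `read_fasta` and `make_headers_unique` are hypothetical helper names of my own, not functions from any existing library.

```python
import io
from collections import defaultdict


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def make_headers_unique(records):
    """Return (renamed_records, report) without stopping on duplicates.

    Every entry whose header occurs more than once gets .1, .2, ... appended.
    The report says, per repeated header, whether the sequences also matched.
    Limitation: a suffixed name could still clash with a header that already
    ends in the same suffix; this sketch does not check for that.
    """
    records = list(records)
    grouped = defaultdict(list)
    for header, seq in records:
        grouped[header].append(seq)

    seen = defaultdict(int)
    renamed = []
    for header, seq in records:
        if len(grouped[header]) > 1:
            seen[header] += 1
            renamed.append((f"{header}.{seen[header]}", seq))
        else:
            renamed.append((header, seq))

    report = {
        header: "identical sequences" if len(set(seqs)) == 1 else "different sequences"
        for header, seqs in grouped.items()
        if len(seqs) > 1
    }
    return renamed, report


example = """>AT1G01140.1 (version 1)
AACCGGTT
>AT1G01140.1 (version 2)
TTGGCCAA
>AT1G01140.2 (version 1)
CCGGAATT
>AT1G01140.2 (version 1)
CCGGAATT
"""

renamed, report = make_headers_unique(read_fasta(io.StringIO(example)))
for header, why in report.items():
    print(f"duplicate header: {header!r} ({why})")
for header, seq in renamed:
    print(f">{header}\n{seq}")
```

Note that the suffix goes on the end of the full description line, so a downstream program that ignores everything after whitespace would still see the renamed entries as duplicates. Putting the suffix on the first word instead would be a reasonable alternative.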

This whole scenario gets messier still when you consider that some institutions (I'm looking at you NCBI) have imposed their own guidelines on FASTA description lines. In the original twitter thread, Matt MacManes suggested that FASTA headers be no longer than 100 characters. Sounds good in theory, but then you run into entries from databases like FlyBase such as this one:

>FBgn0046814 type=gene; loc=2L:complement(9950437..9950455); ID=FBgn0046814; name=mir-87; dbxref=FlyBase:FBgn0046814,FlyBase:FBan0032981,FlyBase_Annotation_IDs:CR32981,GB:AE003626,MIR:MI0000382,flight:FBgn0046814,flymine:FBgn0046814,hdri:FBgn0046814,ihop:1478786; derived_computed_cyto=30F1-30F1%3B Limits computationally determined from genome sequence between @P{EP}CG5899<up>EP701</up>@ and @P{EP}CG4747<up>EP594</up>@%26@P{lacW}l(2)k13305<up>k13305</up>@; gbunit=AE014134; MD5=10a65ff8961e4823bad6c34e37813302; length=19; release=r5.21; species=Dmel;
TGAGCAAAATTTCAGGTGT

That's 555 characters of FASTA description for 19 characters of sequence. You would hope that this FASTA header line is unique!

Introducing JABBA: Just Another Bogus Bioinformatics Acronym

June 24, 2013 by Keith Bradnam

For too long I have stood on the sidelines and remained silent. For too long I have witnessed atrocities and said nothing. For too long I have watched while people have maimed, mangled, and monkeyed around with the names of bioinformatics databases and tools.

Things have gone too far and I can remain silent no longer. As of today, I vow to bring these abominable appellations to justice. There are many, many bioinformatics tools and databases out there and while I accept that they all need a name, I don't accept that every name has to be an acronym (or initialism). This is especially so when the acronym appears to be formed from the awkward assembly of letters from the full name of the software/database.

Just as Jonathan Eisen has given the world the Badomics awards, I think it is time for me to introduce the Just Another Bogus Bioinformatics Acronym award, or JABBA for short. The inaugural winner of the JABBA award goes to a bioinformatics tool that's just been described in a new publication in Nucleic Acids Research:

BeAtMuSiC: prediction of changes in protein–protein binding affinity on mutations

Just take a moment to let that name sink in. If you are like me, you might be wondering how 'prediction of changes in protein-protein binding affinity on mutations' could give rise to 'BeAtMuSiC'. But before we get to that, let's consider the three principal ways in which you can form a bad name for a bioinformatics tool, and let's see how 'BeAtMuSiC' achieves the triple-whammy:

Choose a shortened name for your tool that is cute, funny, or unusual but which bears no relationship at all to bioinformatics. This gives you the added bonus of making it that much harder to find with a Google Search
Introduce some capitalization into your shortened name to make it so much less pleasing on the eye
Make no effort to ensure that your shortened name is a proper acronym and instead, just pick letters seemingly at random from the full name of your bioinformatics tool or database.

The last point is worth dwelling on. The BeAtMuSiC name suggests that the full name of this tool will include the letters: B, A, M, S, and C. You might also assume that these letters in the shortened name would occur in the same order in the full name! But you would be wrong. A quick trip to the BeAtMuSiC website reveals that a) they really like the music theme and b) there is no logic to how this tool has been named.

The full name of this tool (as described on the website) is 'prediction of Binding Affinity Changes upon MutationS'. This is slightly different to the subtitle of the paper described above, but let's assume the website version is the definitive arrangement of the name. The website shows how the 'C' in BeAtMuSiC can come before the 'M' and 'S' in the full name, because they put 'upon MutationS' on a second line of text in such a way that reveals that they are only considering the horizontal spacing of characters. Genius!

Congratulations, BeAtMuSiC...you are the inaugural winner of the JABBA award! I am saddened by the knowledge that there will be many more that will follow.

Update June 24th: Mick Watson pointed out to me on twitter that my idea of only considering the horizontal arrangement of letters still doesn't work. He's right. The 'C' ends up in the right place but you also have the 'M' before the 'A'. So in summary, nothing about this name makes sense.
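
For anyone who wants to check the letter order for themselves, here is a small sketch that tests whether the capital letters of the short name appear, in order, somewhere in the full name as given on the website. The helper name `is_subsequence` is my own; this only checks the plain left-to-right reading, not the two-line layout on the website.

```python
def is_subsequence(letters, text):
    """Return True if the letters occur in text in the given order."""
    remaining = iter(text)
    # 'letter in remaining' consumes the iterator up to the first match,
    # so each later letter must be found after the previous one.
    return all(letter in remaining for letter in letters)


short_name = "BeAtMuSiC"
full_name = "prediction of Binding Affinity Changes upon MutationS"

# Capital letters of the short name, in order.
capitals = "".join(ch for ch in short_name if ch.isupper())

print(capitals, is_subsequence(capitals, full_name))
print(capitals.lower(), is_subsequence(capitals.lower(), full_name.lower()))
```

Both checks come out false, whether or not you respect the capitalization.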

Toilet graffiti at UC Davis

May 29, 2013 by Keith Bradnam

I came across the following piece of toilet graffiti on campus today. It says a lot about UC Davis, that even conversations penned on to the wall of a toilet are of such an academic nature.

For clarity, I have transcribed the exchange between two individuals, and added that below the photo.


Gender differences in UC Davis VetMed students (1952–2011)

May 15, 2013 by Keith Bradnam

Sometimes I find myself walking down corridors on campus where you can see the graduation photos for past years of the UC Davis School of Veterinary Medicine. I'm struck by how much the gender balance has changed. In 1952, the oldest year for which a photo is shown, there are no women at all. I'm sure this is probably true of many other academic departments from this era, particularly in STEM subjects.

Fast forward 50 years or so, and the situation is very different. The latest graduation photo, from 2011, shows that 102 of the 125 students are women. Quite a reversal. I wonder how many other departments or schools have seen such a dramatic switch.

Here is how the gender balance has changed over time.

Don't make me angry, you wouldn't like me when I'm angry

March 30, 2013 by Keith Bradnam

Ever wonder what happens if you make a Caenorhabditis elegans nematode angry? Well you shouldn’t make a nematode angry, you wouldn’t like them if they are angry. Especially if they have harnessed the power of gamma radiation:

Before & after comparison:

First nematode image taken from uu.nl. Green bean taken from my kitchen.

Some brief thoughts on developing and supporting command-line bioinformatics tools

January 25, 2013 by Keith Bradnam

A student from a class I teach emailed me recently to ask for help with a bioinformatics problem. Our email exchange prompted him to ask:

…given that so many of the command line tools are in such a difficult state to use, why would biologists start using the command line? The bioinformatics community ought to be not only supply more free easy-to-use tools, but more such tools that work via the command line. Command line tools should be installable without having to compile the program, for example

I sympathize with the frustration of everyone who’s tried, and subsequently failed, to install and/or correctly use a piece of command-line driven bioinformatics software. I especially think that lack of documentation — not to mention good documentation — is a huge issue. However, I also sympathize with the people who have to write, release, and support bioinformatics code. It can be a thankless task. This was my reply to the student:

I agree with just about everything you have said, but bear in mind:

Writing code is fun; writing documentation (let alone ‘good’ documentation) is a pain and often receives no reward or thanks (not that this is an excuse to not write documentation).
So much software becomes obsolete so quickly, it can be hard to be motivated to spend a lot of time making something easy (or easier) to use when you know that it will be supplanted by someone else’s tool in 12 months’ time
Providing ‘free’ tools is always good, but someone has to pay for it somewhere (at least in time spent working on code, writing documentation, fixing bugs etc).
Providing tools that don’t require compilation means someone would have to make software available for every different flavor of Unix and Linux (not to mention Windows). It can be hard enough to make pre-compiled code work on one distribution of Linux. Software might never be released if someone had to first compile it for multiple systems. In this sense, it is a good thing that source code can be provided so at least you can try to compile yourself.
A lot of software development doesn’t result in much in the way of publication success. People sometimes try publishing papers on major updates to their software and are rejected by journals (for lack of novelty). Without a good reward in the form of the currency that scientists need (e.g. publications) it is sometimes hard to encourage someone to spend any more time than is necessary to work on code.
People that use freely developed software are often bad at citing it properly and/or including full command-line examples of how they used it (both of which can end up hurting the original developer through lack of acknowledgement). It is great when people decide to blog about how something worked, or share a post on forums like SeqAnswers.
Many users never bother to contact the developer to see if things could be changed/improved. Some developers are working in the dark as to what people actually need from software. In this sense, improving code and documentation is a two-way street.
To a degree, the nature of the command-line means that command-line tools will always be more user-unfriendly than a nice, shiny GUI. But at the same time, the nature of command-line tools means that they will always be more powerful than a GUI.

My memories of Simon Chan

August 27, 2012 by Keith Bradnam

I still feel in a complete state of shock that Simon Chan, my friend and colleague, is no longer with us. For most of the last four years I’ve been collaborating with Simon and his lab on a number of research projects and I came to know him well. What started as a professional relationship, developed into a friendship, and I would always look forward to our regular weekly meeting with Simon. It seems so hard to believe that these meetings will never happen again.

Over the last few days I have been humbled and amazed by the realization of just how many other people have been touched by Simon’s kindness, warmth, and by his inspirational personality. At the time of writing, there have now been 146 comments left behind on this blog post, which first announced the tragic news of Simon’s passing. Without exception, every single comment reveals a story of someone whose life has been made better from having met Simon. These are not just comments from people who have known Simon for a long time. In many instances, people have shared their experience of meeting Simon just once, revealing how strong an impact he still managed to have on them. A second memorial page adds another fifty or so comments that shed more light on what an amazing person Simon was.

Much has already been said about Simon’s scientific achievements, about his sweet and kind personality, and about his infectious positive energy which always left you coming out of meetings with him in a ‘we can do this!’ frame of mind. In this blog post, I wanted to briefly touch on two of his other qualities: his love of music, and his love of food.

Music

From our last practice session together, Dec 16 2010

I didn’t know Simon very well until he was invited to join the band that I was in (a UC Davis-based band that would play infrequent gigs at work-related functions). Simon could play saxophone, ukulele, and bass guitar and was very talented at all three instruments. It was the last of these instruments, the bass, that he played in our band. I’ve played in a few other bands and you sometimes play with musicians who, when learning songs for the first time, need to see the printed music in order to read every bar of every chord. More commonly, you’ll play with people who at least need to glance at the chord changes from time to time, or at the very least have to ask what key a song is in. Simon was not one of these people. He had a fantastic ear for music and could easily pick up any song and jam along effortlessly as soon as he heard a few notes. Whenever we’d have to stop playing mid-practice because one of us had screwed things up, you can bet that it was never Simon.

This was not the only reason why it was great to play alongside Simon in the band. He really loved to play music and was comfortable playing just about any genre of music. The majority of our band’s output consisted of rock-based cover songs from the 50’s through to the 80’s. But Simon was just as happy when playing on some of our more jazz-inspired numbers as he was when playing on some of the grungier rock songs in our repertoire.

Within the band, we were all huge aficionados of the ‘mockumentary’ film This is Spinal Tap and given any chance at all, we would happily spend a lot of our practice time quoting from the film (what we would call STRs: Spinal Tap References). When Simon first joined, I knew he was a huge fan of jazz and wasn’t sure whether he would be irritated by our devotion to Spinal Tap and to the many, many STRs. I was overjoyed to discover that he loved the film too and was just as happy making people laugh with a well-timed STR.

From our last gig together, Dec 17 2010

Food

To say that Simon loved food, and loved trying new food, is something of a huge understatement. Simon lived for new culinary experiences. After being able to explore the gastronomic diversity of a city like Los Angeles (where he did his post-doc) he must have found Davis somewhat limiting. But that did not stop him from trying just about everything that Davis, and the surrounding region, had to offer.

There were many conversations between us that would begin with me asking him ‘Have you tried new restaurant ‘X’ in Davis?’ and the answer was invariably ‘yes’. Occasionally I would try a new place in Davis in the week that it had opened, and this would give me the false confidence that I could approach Simon to tell him of a place that he had not yet tried. Invariably, however, he would have already been there and would be able to offer thoughtful commentary on their menu. If you ever ate with Simon at a place that he liked (e.g. Hometown Chinese in Davis), then it was common to realize that he was on first name terms with the owners. I can only imagine that Simon has made firm friendships with restaurateurs around the world.

For my 40th birthday I organized a quiz which Simon attended. One of the questions was based on guessing how many different eateries in Davis I had frequented (researched using the Davis Wiki restaurants page). The day after my birthday, Simon sent me an email to reveal that he had eaten at 97 of the 126 Davis restaurants on the list, with the exceptions — mostly chain fast-food restaurants — being through choice. The same email went on to reveal the total number of restaurants that Simon had eaten at since he first started logging such activities (I think this may have begun while he was in Los Angeles). Simon’s list featured a jaw-dropping 1,338 different restaurants! This was a man that loved his food.

As soon as Simon heard that I had to go to Vancouver to renew a visa, his first piece of advice was to check out a particular restaurant in China Town for their braised pork in soy sauce. When Simon recommends a place, you can’t really ignore that advice. I found the restaurant, ordered the pork, and it was indeed excellent.

Simon didn’t just love trying out new restaurants, he loved trying out different and unusual ingredients. He once shared details with me of The Omnivore’s Hundred: a list of one hundred different items that “every good omnivore should have tried at least once in their life”. This list includes relatively benign items such as ‘Eggs benedict’ and ‘Polenta’, but also includes such…shall we say interesting, delicacies as Brawn, Sweetbreads, and Roadkill. As of April 2009, Simon had tried 81 things off of this list, and I imagine he ticked off a few more in the years since then.

I once remarked to Simon that my wife and I often tried to host a traditional Burns Night Supper in Davis, but that it was really hard to find Haggis here. The next time I saw Simon, he produced a tin of haggis for me which he had found for us somewhere in the Bay Area. That was exactly the sort of person Simon was: always generous, always thinking of others…and probably always thinking of food!

Goodbye

Simon has been taken from us all far too soon and it still doesn’t seem real that he won’t be around to tell us of amazing new eateries he’s discovered, or regale us with tales of strange foods from far away lands. I’d like to extend my sincerest condolences to Simon’s family. At the same time, I’d like to thank them all for helping make Simon the wonderful person that he was. Like so many others, my life has been enriched for knowing Simon and I will treasure the memories that I have of him.

Farewell Simon. In music, in food, and in life, you always went up to 11 (STR).

Simon Chan 1974-2012

Happy to announce that our Unix & Perl book is finally on sale!

July 13, 2012 by Keith Bradnam

Unix and Perl to the RESCUE! A field-guide for life scientists (and other data-rich pursuits)

At least in the UK. It will take a few weeks for US warehouses to receive stock, and we have still not received our own copies, but at least the book is out there (somewhere). Amazon is discounting the book which puts it into a more affordable price range.

It will be very strange the first time I spot it 'in the wild' by chance. If I spot someone reading it on a bus/train/plane, I wonder whether I will be tempted to say anything.

Genome sequencing projects…it’s all about the numbers

July 12, 2012 by Keith Bradnam

So there is this project called Genome 10K. They aim to sequence the genomes of (wait for it) 10,000 vertebrate species. Impressive projects like this need impressive names, which increasingly means inserting [big number of your choice] into the project name. Don’t believe me? Well, let's see what other big 'omics' sequencing projects are out there:

959 nematode genomes — because Caenorhabditis elegans has 959 cells
1000 Genomes — a deep catalog of human genetic variation
1KP — 1,000 plant (transcriptome) project
1001 Genomes — a catalog of Arabidopsis thaliana genetic variation
3K RGP — The 3,000 Rice Genomes Project
i5k — insect and other arthropod genome sequencing initiative
Genome 10K — as mentioned above, these will be vertebrate genomes
100,000 Genomes Project — human genomes from British patients
100K Pathogen Genome Project — sequencing foodborne pathogens

I wonder if there has been any confusion between the '1000 Genomes' and '1001 Genomes' projects, or the '100,000 Genomes Project' and '100K Pathogen Genome Project' ("these human genomes seem awfully small").

So where do we go after 100,000? A million of course. Although there isn’t a dedicated collaborative project for this, there is already an aim by the company Complete Genomics to sequence a million human genomes by (the end of?) 2014.

So if you want to make a big splash in genomics, then you ideally need to be thinking of at least a '10M' project to begin with. Otherwise, I guess you need to look for some other 'novelty' numbers like the '959 nematode genomes' project. How about the '42 Genomes Project', dedicated to finding the ultimate answer to life, the universe, and everything?

Updates:

2014-10-27: now includes mention of 1KP project
2014-10-28: now includes mention of 100,000 genomes project (h/t @NazeemaFatima), which also gave me a reason to reorganize a lot of the information in this post.
2014-10-30: added 3K RGP project

The slow death of bioinformatics and the eternal popularity of shoes

July 11, 2012 by Keith Bradnam

Some time ago I was playing around with Google Trends (formerly Google Insights for Search) and I randomly decided to see how the search term bioinformatics has fared since 2004 (this is as far back as you can search for a trend). This is what I found:

Initially I was quite surprised by this and so I then performed a search for genomics, only to see the same sort of trend:

According to Google, the Y-axis of these graphs reflects "how many searches have been done for a particular term, relative to the total number of searches done on Google over time" (emphasis on the word 'relative' is mine). This could just mean that the absolute number of searches for 'bioinformatics' and 'genomics' is the same, or has even grown, but has been swamped by an increase in the frequency of all other search terms. To a lesser degree, there seem to be fewer searches occurring for many different biologically-related terms, e.g. here is the graph for the word biology.

On top of the overall declining trend, I noticed that you can clearly see a dip in the middle of each year. Possibly, this corresponds to when millions of high-school kids take their long summer vacation and are therefore not searching about anything to do with school work. You can see similar annual 'wobbles' if you also search for chemistry or physics. So does this mean that all science-related searches are declining? Well, you might expect there to be growing interest in the newer fields of biology (and bioinformatics in particular) and related technologies. This does seem to be the case. Here is the graph for the search term next generation sequencing (note, I do not advocate using this phrase):

Clearly this term has exploded in popularity as everybody moved to using many of the newer sequencing technologies as opposed to the traditional Sanger method.

So clearly, some topics are becoming more popular, and more searched for. However, I still feel that the decline for the term bioinformatics might indeed represent a real decline in the whole field of bioinformatics. That is not to say that I think less bioinformatics is being done these days, or that it is less 'worthy' as a field. Rather, I think bioinformatics has moved from being a specialist field that was carried out somewhat separately from 'traditional' wet-lab research, to something which is much more integrated with many other fields of research. There are still many dedicated bioinformatics groups (the lab where I work is one such group), but I think that it is increasingly common that biologists need to, and want to, undertake some bioinformatics as part of their wider research. To me, bioinformatics has become part of mainstream biological research and that means that it no longer makes sense to think of it as a separate field as such.

Anyway, regardless of whether any particular biological term is rising or falling in popularity, I think it is more interesting to see what search terms remain eternally popular. Despite changing governments, economic turmoil, and global uncertainty what is it that we search for with any degree of constancy? My first guess seemed to be a good one. So let me end by presenting the Google Insights graph for the search term shoes: