Thinking of writing a FASTA parser? 12 scenarios that might break your code

In today's Bits and Bites Coding class, we talked about the basics of the FASTA and GFF file formats. Specifically, we discussed the problems that can arise when you write scripts to parse these files, and the kinds of content that can break (or confuse) your script.

Even though you can find code to parse FASTA files in projects such as BioPerl, it can be instructive to try writing a parser yourself when you are learning a language. Many of the problems that occur when parsing FASTA files fall into the 'I wasn't expecting the file to look like that' category. I recently wrote about how the simplicity of the FASTA format is a double-edged sword. Because almost anything is allowed, someone will — accidentally or otherwise — eventually produce a FASTA file that contains one of the following 12 scenarios. These are all things that a good FASTA parser should be able to deal with and, if necessary, warn the user about (a sketch of such a parser follows the examples):

> space_at_start_of_header_line
ACGTACGTACGTACGT

>Extra_>_in_FASTA_header
ACGTACGTACGTACGT

>Spaces_in_sequence
ACGTACGT ACGTACGT

>Spaces_in_sequence_and_between_lines
A C 

G T 

A C

A G 

A T

>Redundant_sequence_in_header_ACGTACGTACGT
ACGTACGTACGTACGT

><- missing definition line
ACGTACGTACGTACGT

>mixed_case
ACGTACGTACGTACGTgtagaggacgcaccagACGTACGTACGTACGT

>missing_sequence

>rare, but valid, IUPAC characters 
ACGTRYSWKMBDHVN

>errant sequence
Maybe I accidentally copied and pasted something
Maybe I used Microsoft Word to edit my FASTA sequence

>duplicate_FASTA_header
ACGTACGTACGTACGT
>duplicate_FASTA_header
ACGTACGTACGTACGT

>line_ending_problem^MACGTACGTACGTACGT^MACGTACGTACGTACGT^M>another_sequence_goes_here^MACGTACGTACGTACGT^MACGTACGTACGTACGT
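
To make this concrete, here is a minimal Perl sketch of a defensive FASTA reader that catches a few of the scenarios above (duplicate headers, non-IUPAC characters, missing sequence); a full parser would need to handle all twelve, and the exact checks and warning messages here are my own choices rather than any standard:

#!/usr/bin/perl
use strict;
use warnings;

my (%seen, $header, $seq);

while (my $line = <>) {
    chomp $line;
    $line =~ s/\r$//;    # strip Windows carriage returns (old Mac \r-only
                         # files arrive as one giant line and need $/ = "\r")
    if ($line =~ /^\s*>\s*(.*)/) {    # definition line, tolerating stray spaces
        check_entry($header, $seq) if defined $header;
        ($header, $seq) = ($1, '');
        warn "Duplicate header: '$header'\n" if $seen{$header}++;
    } elsif (defined $header) {
        $line =~ s/\s+//g;            # remove spaces within sequence lines
        warn "Non-IUPAC characters in '$header'\n" if $line =~ /[^ACGTRYSWKMBDHVN]/i;
        $seq .= $line;
    }
}
check_entry($header, $seq) if defined $header;

sub check_entry {
    my ($h, $s) = @_;
    warn "Missing definition line\n" if $h eq '';
    warn "No sequence for '$h'\n"    if $s eq '';
}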

Why is N50 used as an assembly metric (and what's the deal with NG50)?

This post started out as a lengthy reply on a thread at SEQanswers.com. I thought it was worth reproducing here (though I've also added some more information)...

Why N50?

Maybe it is worth remembering why we even have N50 as a statistic. The final assembly for any genome project, i.e. the one that is described in a paper or uploaded to a database, is not necessarily the biggest assembly that was generated.

This is because there will always be contigs that are too short to be of any use to anyone. In the pre-NGS era of assembly, these contigs could potentially be represented by a single Sanger read that didn't align to anything else.

However, one might consider that there is still some utility in including a long Sanger read in an assembly, even if it doesn't overlap with any other reads. After quality trimming, there are many Sanger reads that are over 1,000 bp in length and this is long enough to contain an exon or two, or even an entire gene (depending on the species). But what if a contig was 500 bp? Or 250 bp? Or 100 bp? Where do you draw the line?

Clearly it is not desirable to fill up an assembly with an abundance of overly short sequences, even if they represent accurate, and unique, biological sequences from the genome of interest. So it has always been common to remove very short contigs from an assembly. The problem is that different groups might use very different criteria for what to keep and what to exclude. Imagine a simple assembly consisting of the following contigs:

  • 5,000 x 100 bp
  • 100 x 10 Kbp
  • 10 x 1 Mbp
  • 1 x 10 Mbp

Now let's say that Distinguished Bioinformatics Center #1 decides to produce an assembly (DBC1) that includes all of these contigs. However, a second center (DBC2) makes an assembly from the same data but removes the 100 bp contigs, and a third (DBC3) removes the 10 Kbp contigs as well. What does this do to the mean contig length of each assembly?

Mean contig lengths

  • DBC1 = 4,207 bp
  • DBC2 = 189,189 bp
  • DBC3 = 1,818,182 bp
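
If you want to check these numbers, here is a minimal Perl sketch; the %assemblies structure is simply my encoding of the contig lists above:

#!/usr/bin/perl
use strict;
use warnings;

# contig lengths for each hypothetical assembly
my %assemblies = (
    DBC1 => [ (100) x 5000, (10_000) x 100, (1_000_000) x 10, 10_000_000 ],
    DBC2 => [ (10_000) x 100, (1_000_000) x 10, 10_000_000 ],
    DBC3 => [ (1_000_000) x 10, 10_000_000 ],
);

foreach my $dbc (sort keys %assemblies) {
    my @lengths = @{ $assemblies{$dbc} };
    my $total = 0;
    $total += $_ for @lengths;
    printf "%s mean contig length = %.0f bp\n", $dbc, $total / @lengths;
}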

Hopefully it becomes obvious that if you only considered mean contig length, it would be extremely easy to 'game' the system by deliberately excluding shorter contigs (no matter their utility) just to increase the average contig length. But what do we get if we instead rely on N50?

N50 contig lengths

  • DBC1 = 1 Mbp
  • DBC2 = 1 Mbp
  • DBC3 = 10 Mbp

This now reduces the overall discrepancies, and puts DBC1 and DBC2 on an equal footing. But you might still, naively, conclude that DBC3 is the better assembly, and if you were extremely wide-eyed and innocent, then maybe you would conclude that DBC3 was ten times better than the other two assemblies.

So N50 does a better, though still imperfect, job of avoiding the dangers inherent in relying on the mean length. In some ways, the actual method you use to calculate N50 does not matter too much, as long as you use the same method when comparing all assemblies. Back in the day of Sanger-sequence-derived assemblies, it was fairly common to see assembly statistics report not just N50, but everything from N10 through to N90. This gives you a much better insight into the overall variation in contig (or scaffold) lengths.
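
To make the calculation concrete, here is a minimal Perl sketch of one common N50 method: sort the lengths from longest to shortest, then report the length of the contig at which the running total first reaches half of the total assembly size.

#!/usr/bin/perl
use strict;
use warnings;

# N50: the contig length at which the running total (longest-first)
# first reaches 50% of the total assembly size
sub n50 {
    my @lengths = sort { $b <=> $a } @_;
    my $total = 0;
    $total += $_ for @lengths;
    my $sum = 0;
    foreach my $length (@lengths) {
        $sum += $length;
        return $length if $sum >= $total / 2;
    }
}

# DBC1 from the example above
my @dbc1 = ((100) x 5000, (10_000) x 100, (1_000_000) x 10, 10_000_000);
printf "DBC1 N50 = %d bp\n", n50(@dbc1);    # prints 1000000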

In the Assemblathon and Assemblathon 2 contests, we actually plotted all N(x) values to see the full distribution of contig lengths. Except we didn't use the N50 metric; we used something called NG50. This normalizes for differences in overall assembly size by asking 'at what contig/scaffold length — from a set of sorted scaffold lengths — do we see a sum length that is greater than 50% of the estimated or known genome size?'

Returning to my earlier fictional example, let's assume that the genome being assembled has an estimated size of 25 Mbp. This means we want to see what contig length takes us past 12.5 Mbp (when summing all contig lengths from longest to shortest):

NG50 contig lengths

  • DBC1 = 1 Mbp
  • DBC2 = 1 Mbp
  • DBC3 = 1 Mbp

We now see parity between all assemblies, at least when assessing their contig lengths. The actual differences between the assemblies mostly reflect variations in which short sequences are included, and these may, or may not, be of any utility to the end user of the assembly. In my mind, this gives us a much fairer way of comparing the assemblies, at least in terms of their average contig lengths.
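
In code, NG50 is only a small change from the N50 sketch above: the target becomes half of the estimated genome size rather than half of the assembly size. A minimal sketch:

#!/usr/bin/perl
use strict;
use warnings;

# NG50: as N50, but the target is 50% of the estimated genome size
sub ng50 {
    my ($genome_size, @lengths) = @_;
    my $sum = 0;
    foreach my $length (sort { $b <=> $a } @lengths) {
        $sum += $length;
        return $length if $sum >= $genome_size / 2;
    }
    return 0;    # assembly sums to less than half the genome size
}

my @dbc3 = ((1_000_000) x 10, 10_000_000);
printf "DBC3 NG50 = %d bp\n", ng50(25_000_000, @dbc3);    # prints 1000000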

NG(X) graphs

Here is an example of an NG(X) chart that borrows some results from the Assemblathon 2 contest:

[Image: ngx.png, an NG(X) plot of scaffold lengths from several Assemblathon 2 assemblies]

Some things to note:

  • Y-axis (scaffold length) is plotted on a log scale
  • Each line represents scaffold lengths from a genome assembly
  • The NG50 value is shown as a dotted line. If you only looked at this, you might conclude that the red assembly is the best and the purple the worst.
  • The first data point in each series reveals the longest scaffold length. This is why you can see horizontal lines for some entries. E.g. the longest scaffold for the red assembly is about 70 Mbp; this exceeds the value of NG1, NG2, NG3 etc.
  • From NG1 to NG50, the relative differences in the assemblies are fairly constant. The one exception: the longest scaffold in the purple assembly is longer than that of the orange assembly, but by NG5 the orange assembly has longer scaffolds.
  • Around NG75, we start to see changes. The lengths of scaffolds in the blue assembly start to drop off.
  • At NG90, the lengths of scaffolds in the green assembly plummet. The fact that this series touches the x-axis (at about NG91) tells us that the sum size of this assembly is less than the estimated genome size of the species in question (i.e. you have to have a value at NG100 to have 100% of the estimated genome size present in your assembly).
  • By NG95, the orange assembly has moved from 5th to 2nd. This just means that the assembly contains a higher proportion of longer scaffolds. However...
  • Three of the assemblies reach NG100 with positive values on the y-axis. This reveals that these assemblies are all bigger than the estimated genome size. Ideally we could continue this plot to see where they would eventually touch the x-axis. If they reached NG200, then we would know that the genome assembly is twice the estimated genome size.
  • Hopefully, this shows that you can get a lot of information out of a plot like this.

In summary

N50 is just a better, though still flawed, way of calculating an average sequence length from a set of sequences. It can still be biased, and you really should consider what the estimated/known genome size is (when comparing two or more assemblies of the same species), and/or look to see the full distribution of sequence lengths (e.g. with an NG(x) graph).

Slides from a talk on 'the art of good science writing'

I presented in our lab meeting today and used the occasion to talk about writing for science (as well as writing in general). I get to proof-read a lot of texts in the lab (scholarship applications, resumes, grad school applications, manuscript drafts etc.) and I have noticed some common issues that crop up a lot. This talk goes over some of these problem areas and offers some suggestions.

Thanks to the students for letting me share some of their writing.

Installing Circos on CentOS

Circos is a software tool that lets you make beautiful circular figures, which are an increasingly common feature of genome-related papers (here are some examples). Circos, which is free to use, is written in Perl and requires a lot of Perl modules to be installed in order to work.

This is not always as straightforward as one might imagine, though Circos helps you by providing a tool to quickly see which modules you already have and which still need to be installed. I've previously set up Circos on Mac OS X, but today I had to install it on our Linux box (running CentOS 6). These are the steps I went through in order to get Circos up and running. Hopefully, this might be useful to others in the same situation.

### Installing Circos on CentOS ###

# These instructions assume that you are installing Circos in 
# a '/Bioinformatics/Packages' directory and also that you have root access

# Preparatory steps
su
mkdir /Bioinformatics/Packages/Circos
cd /Bioinformatics/Packages/Circos


# Download the latest *.tgz release and copy to Circos directory
wget http://circos.ca/distribution/circos-0.64.tgz
tar -xvzf circos-0.64.tgz
cd circos-0.64


# To test what modules we will need to install, use Circos' built-in tool
cd bin
./test.modules


# This leads to having to install:
perl -MCPAN -e 'install Clone'
perl -MCPAN -e 'install Config::General'
perl -MCPAN -e 'install Font::TTF::Font'
perl -MCPAN -e 'install GD'

# GD module installation failed and so I had to also install the gd library:
yum install gd gd-devel


# now back to the Perl modules
perl -MCPAN -e 'install GD'
perl -MCPAN -e 'install List::MoreUtils'
perl -MCPAN -e 'install Math::Round'
perl -MCPAN -e 'install Math::VecStat'
perl -MCPAN -e 'install Params::Validate'

# Params::Validate didn’t seem to install properly, so I tried a manual approach
cd
wget http://search.cpan.org/CPAN/authors/id/D/DR/DROLSKY/Params-Validate-1.07.tar.gz
tar -xvzf Params-Validate-1.07.tar.gz
cd Params-Validate-1.07
perl Build.PL

# This failed because I needed a newer version of some other Perl modules
# even though the ./test.modules script reported everything was okay
perl -MCPAN -e 'install Module::Build'
perl -MCPAN -e 'install Module::Implementation'


# Can now hand-install Params::Validate (re-running Build.PL now that it works)
perl Build.PL
./Build
./Build test
./Build install

# And back to the Perl modules
perl -MCPAN -e 'install Readonly'
perl -MCPAN -e 'install Regexp::Common'
perl -MCPAN -e 'install Text::Format'


# Tidy up and set up symbolic links
rm -rf ~/Params-Validate-1.07
rm ~/Params-Validate-1.07.tar.gz
cd /Bioinformatics/Packages/Circos
ln -s circos-0.64 current
cd /Bioinformatics/bin
ln -s ../Packages/Circos/current/bin/circos

# Try running circos
./circos

# This still wasn’t enough!!!
# Running circos complained that I needed 2 more Perl Modules!
perl -MCPAN -e 'install Math::Bezier'
perl -MCPAN -e 'install Set::IntSpan'

# Try running circos again
./circos

# Success! And by 'success' I mean that you can now get to a point where you are
# seeing errors from the Circos program (i.e. you need to set up your Circos
# configuration file).

The FASTA file format: a showcase for the best and worst of bioinformatics coding practices

The FASTA file format is one of the oldest recognized formats in bioinformatics and has become the lingua franca when trying to store sequence information in a plain text format. It is probably true to say that many people are much more likely to know of FASTA (the file format) than FASTA (the program). The FASTA program, a tool for comparing DNA or protein sequences, was first described by Bill Pearson and David Lipman back in 1988. However, FASTA (the program) was itself a successor to the original FASTP program that was described a few years earlier by the same authors in 1985.

The FASTA tool is still a going concern and the latest version (v36.3.5e) was released in May 2011. Thinking of FASTA as a single tool is a little misleading; it's really a package with over a dozen different programs. Part of the success of the FASTA software is perhaps due to the simplicity of the file format (which merits its own Wikipedia page). Let's see what the FASTA manual says about the format:

FASTA format files consist of a description line, beginning with a ’>’ character, followed by the sequence itself
...
All of the characters of the description line are read, and special characters can be used to indicate additional information about the sequence. In general, non-amino-acid/non-nucleotide sequences in the sequence lines are ignored.

You will note the absence of the word 'unique' in this description and this brings us to the issue that I really want to talk about. Today, Vince Buffalo sparked an interesting conversation thread on twitter:

It's not controversial if I say every FASTA ID should always be treated as a primary, unique foreign key, ya?— Vince Buffalo (@vsbuffalo) June 24, 2013

The FASTA format does not specify whether the description line (sometimes known as the 'header line' or 'definition line') should be unique or not. In some cases, it may not matter. In other cases, it clearly does. I find this issue — which probably doesn't sound very complex — gets to the heart of what is good and bad about bioinformatics. This is because it reveals a lot about the programmers who write code to work with FASTA-formatted files, and about how they deal with error handling.

Let's consider the following hypothetical FASTA file (which contains some overly simplistic DNA sequences):

>AT1G01140.1 (version 1)
AACCGGTT
>AT1G01140.1 (version 2)
TTGGCCAA
>AT1G01140.2 (version 1)
CCGGAATT
>AT1G01140.2 (version 1)
CCGGAATT

The 3rd and 4th entries are clearly duplicated (both FASTA header and sequence), and the other entries have similar, but unique, headers. Of course, the person who is using this sequence file may not be aware of this (a FASTA file could contain millions of sequences). So question number 1 is:

Whose job is it to check for unique FASTA headers? The person who generated the data file, or the person who wrote the software tool that has to process it?

We frequently generate FASTA files directly from websites and databases, so I think that we — bioinformatics users — have increasingly put too much faith in assuming that someone else (the person who wrote the FASTA-exporting code) has already tackled the problem for us.

This problem starts to get more complex when I tell you that I have used various bioinformatics tools in the past that have done one of two things with FASTA headers:

  1. Ignore any characters beyond the nth character
  2. Ignore any characters after whitespace

In both of these situations, a program might consider the above file to contain only two unique entries (if it ignores everything after whitespace), or even just one unique entry (if it ignores everything after the 8th character). This is clearly dangerous, and if the program in question needs to use a unique FASTA header as a key in a hash, then duplicate headers may cause problems. Hopefully, a good programmer won't impose arbitrary limits like this. However, this brings me to question number 2:

How should a program deal with duplicate FASTA headers?

There are many options here:

  1. Refuse to run without telling you what the problem is (I've seen this before)
  2. Refuse to run and tell you that there are duplicate headers
  3. Refuse to run and tell you which headers are duplicated
  4. Run and simply ignore any entries with duplicate headers
  5. Run and simply ignore any entries with duplicate headers, and report the duplicates
  6. Check whether entries with duplicate headers also have duplicate sequences. This could potentially reveal whether it is a genuine duplication of the entire entry, or whether different genes/sequences just happen to share the same name/identifier (this is possible, if unlikely, with short gene names from different species).
  7. Check for duplicate headers, report on duplicates, but don't stop the program running and instead add an extra character to duplicate headers to make them unique (e.g. append .1, .2, .3 etc)

The last option is the most work for the programmer but will catch most errors and problems (see the sketch below). But how many of us would ever take that much time to catch what might be user error, or might equally be a glitch in the FASTA export routine of another program?
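
A minimal Perl sketch of this last approach (the suffix scheme and warning text are my own choices):

#!/usr/bin/perl
use strict;
use warnings;

# report duplicate headers but keep running, appending .1, .2, .3 etc.
# to each repeated header so that every header becomes unique
my %count;

while (my $line = <>) {
    if ($line =~ /^>(.*)/) {
        my $header = $1;
        if ($count{$header}++) {
            my $new_header = $header . '.' . ($count{$header} - 1);
            warn "Duplicate header '$header' renamed to '$new_header'\n";
            $line = ">$new_header\n";
        }
    }
    print $line;
}

A fuller version would also check that a renamed header (e.g. 'duplicate_FASTA_header.1') does not itself collide with another header in the file.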

This whole scenario gets messier still when you consider that some institutions (I'm looking at you, NCBI) have imposed their own guidelines on FASTA description lines. In the original twitter thread, Matt MacManes suggested that FASTA headers be no longer than 100 characters. Sounds good in theory, but then you run into entries from databases like FlyBase, such as this one:

>FBgn0046814 type=gene; loc=2L:complement(9950437..9950455); ID=FBgn0046814; name=mir-87; dbxref=FlyBase:FBgn0046814,FlyBase:FBan0032981,FlyBase_Annotation_IDs:CR32981,GB:AE003626,MIR:MI0000382,flight:FBgn0046814,flymine:FBgn0046814,hdri:FBgn0046814,ihop:1478786; derived_computed_cyto=30F1-30F1%3B Limits computationally determined from genome sequence between @P{EP}CG5899<up>EP701</up>@ and @P{EP}CG4747<up>EP594</up>@%26@P{lacW}l(2)k13305<up>k13305</up>@; gbunit=AE014134; MD5=10a65ff8961e4823bad6c34e37813302; length=19; release=r5.21; species=Dmel;
TGAGCAAAATTTCAGGTGT

That's 555 characters of FASTA description for 19 characters of sequence. You would hope that this FASTA header line is unique!

Introducing JABBA: Just Another Bogus Bioinformatics Acronym

For too long I have stood on the sidelines and remained silent. For too long I have witnessed atrocities and said nothing. For too long I have watched while people have maimed, mangled, and monkeyed around with the names of bioinformatics databases and tools.

Things have gone too far and I can remain silent no longer. As of today, I vow to bring these abominable appellations to justice. There are many, many bioinformatics tools and databases out there and while I accept that they all need a name, I don't accept that every name has to be an acronym (or initialism). This is especially so when the acronym appears to be formed from the awkward assembly of letters from the full name of the software/database.

Just as Jonathan Eisen has given the world the Badomics awards, I think it is time for me to introduce the Just Another Bogus Bioinformatics Acronym award, or JABBA for short. The inaugural JABBA award goes to a bioinformatics tool that's just been described in a new publication in Nucleic Acids Research:

BeAtMuSiC: prediction of changes in protein–protein binding affinity on mutations

Just take a moment to let that name sink in. If you are like me, you might be wondering how 'prediction of changes in protein-protein binding affinity on mutations' could give rise to 'BeAtMuSiC'. But before we get to that, let's consider the three principal ways in which you can form a bad name for a bioinformatics tool, and see how 'BeAtMuSiC' achieves the triple-whammy:

  1. Choose a shortened name for your tool that is cute, funny, or unusual but which bears no relationship at all to bioinformatics. This gives you the added bonus of making it that much harder to find with a Google search
  2. Introduce some capitalization into your shortened name to make it so much less pleasing on the eye
  3. Make no effort to ensure that your shortened name is a proper acronym and instead, just pick letters seemingly at random from the full name of your bioinformatics tool or database.

The latter point is worth dwelling on. The BeAtMuSiC name suggests that the full name of this tool will include the letters: B, A, M, S, and C. You might also assume that these letters in the shortened name would occur in the same order in the full name! But you would be wrong. A quick trip to the BeAtMuSiC website reveals that a) they really like the music theme and b) there is no logic to how this tool has been named.

The full name of this tool — as described on the website — is 'prediction of Binding Affinity Changes upon MutationS'. This is slightly different to the subtitle of the paper described above, but let's assume the website version is the definitive arrangement of the name. The website shows how the 'C' in BeAtMuSiC can come before the 'M' and 'S' in the full name: they put 'upon MutationS' on a second line of text, in a way that reveals they were only considering the horizontal spacing of the characters. Genius!

[Image: screenshot of the BeAtMuSiC website showing the two-line arrangement of the tool's full name]

Congratulations, BeAtMuSiC...you are the inaugural winner of the JABBA award! I am saddened by the knowledge that there will be many more that will follow.

Update June 24th: Mick Watson pointed out to me on twitter that my idea of only considering the horizontal arrangement of letters still doesn't work. He's right. The 'C' ends up in the right place but you also have the 'M' before the 'A'. So in summary, nothing about this name makes sense.

Gender differences in UC Davis VetMed students (1952–2011)

Sometimes I find myself walking down corridors on campus where you can see the graduation photos for past years of the UC Davis School of Veterinary Medicine. I'm struck by how much the gender balance has changed. In 1952, the oldest year for which a photo is shown, there are no women at all. I'm sure this is probably true of many other academic departments from this era, particularly in STEM subjects.

[Image: IMG_1327.jpg, UC Davis School of Veterinary Medicine graduation photo from 1952]

Fast forward 50 years or so, and the situation is very different. The latest graduation photo, from 2011, shows that 102 of the 125 students are women. Quite a reversal. I wonder how many other departments or schools have seen such a dramatic switch.

[Image: IMG_1325.jpg, UC Davis School of Veterinary Medicine graduation photo from 2011]

Here is how the gender balance has changed over time.

[Image: vetmed.png, chart of the gender balance of UC Davis VetMed students over time]

Some brief thoughts on developing and supporting command-line bioinformatics tools

A student from a class I teach emailed me recently to ask for help with a bioinformatics problem. Our email exchange prompted him to ask:

…given that so many of the command line tools are in such a difficult state to use, why would biologists start using the command line? The bioinformatics community ought to not only supply more free, easy-to-use tools, but more such tools that work via the command line. Command line tools should be installable without having to compile the program, for example.

I sympathize with the frustration of everyone who’s tried, and subsequently failed, to install and/or correctly use a piece of command-line driven bioinformatics software. I especially think that lack of documentation — not to mention good documentation — is a huge issue. However, I also sympathize with the people who have to write, release, and support bioinformatics code. It can be a thankless task. This was my reply to the student:

I agree with just about everything you have said, but bear in mind:

  1. Writing code is fun; writing documentation (let alone ‘good’ documentation) is a pain and often receives no reward or thanks (not that this is an excuse to not write documentation).
  2. So much software becomes obsolete so quickly that it can be hard to stay motivated to spend a lot of time making something easy (or easier) to use when you know that it will be supplanted by someone else’s tool in 12 months’ time.
  3. Providing ‘free’ tools is always good, but someone has to pay for it somewhere (at least in time spent working on code, writing documentation, fixing bugs etc.).
  4. Providing tools that don’t require compilation means someone would have to make software available for every different flavor of Unix and Linux (not to mention Windows). It can be hard enough to make pre-compiled code work on one distribution of Linux. Software might never be released if someone had to first compile it for multiple systems. In this sense, it is a good thing that source code can be provided so at least you can try to compile yourself.
  5. A lot of software development doesn’t result in much in the way of publication success. People sometimes try publishing papers on major updates to their software and are rejected by journals (for lack of novelty). Without a good reward in the form of the currency that scientists need (e.g. publications), it is sometimes hard to encourage someone to spend any more time than is necessary working on code.
  6. People who use freely developed software often fail to cite it properly and/or to include full command-line examples of how they used it (both of which can end up hurting the original developer through lack of acknowledgement). It is great when people decide to blog about how something worked, or share a post on forums like SEQanswers.
  7. Many users never bother to contact the developer to see if things could be changed/improved. Some developers are working in the dark as to what people actually need from software. In this sense, improving code and documentation is a two-way street.
  8. To a degree, the nature of the command-line means that command-line tools will always be more user-unfriendly than a nice, shiny GUI. But at the same time, the nature of command-line tools means that they will always be more powerful than a GUI.