Fun with an 'error message' from NCBI BLAST+

Consider this very simple DNA sequence in FASTA format:

>sequence1
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
ttttttagaaaaattatttttaagaatttttcattttaggaatattgtta
tttcagaaaatagctaaatgtgatttctgtaattttgcctgccaaattcg
tgaaatgcaataaaaatctaatatccctcatcagtgcgatttccgaatca
gtatatttttacgtaatagcttctttgacatcaataagtatttgcctata
tgactttagacttgaaattggctattaatgccaatttcatgatatctagc
cactttagtataattgtttttagtttttggcaaaactattgtctaaacag

If you try converting this to a BLAST database using the 'makeblastdb' command from the latest version of NCBI's BLAST+ suite, you will see the following line included in the output:

Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 100% ambiguous nucleotides (shouldn't be over 40%)

Now consider what happens if you run the same makeblastdb command on this sequence (which just switches the first two lines of sequence1):

>sequence2
ttttttagaaaaattatttttaagaatttttcattttaggaatattgtta
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
tttcagaaaatagctaaatgtgatttctgtaattttgcctgccaaattcg
tgaaatgcaataaaaatctaatatccctcatcagtgcgatttccgaatca
gtatatttttacgtaatagcttctttgacatcaataagtatttgcctata
tgactttagacttgaaattggctattaatgccaatttcatgatatctagc
cactttagtataattgtttttagtttttggcaaaactattgtctaaacag

Although this sequence has the exact same proportion of As, Cs, Gs, Ts, and Ns, it does not produce the error message. What about the following sequence?

>sequence3
nnnac
ttttttagaaaaattatttttaagaatttttcattttaggaatattgtta
tttcagaaaatagctaaatgtgatttctgtaattttgcctgccaaattcg
tgaaatgcaataaaaatctaatatccctcatcagtgcgatttccgaatca
gtatatttttacgtaatagcttctttgacatcaataagtatttgcctata
tgactttagacttgaaattggctattaatgccaatttcatgatatctagc
cactttagtataattgtttttagtttttggcaaaactattgtctaaacag

Well, surprise surprise, this sequence produces the error message again (though it now tells you that the first line 'is about 60% ambiguous nucleotides'). You will still see the same message even if you add 1 billion As, Cs, Gs, and Ts to the end of sequence3. This seems to be the code responsible for the error message (taken from this page):

In case it wasn't obvious, here is why this annoys me:

  1. The comment in the code indicates that this should be treated as a warning (less serious), but then the message starts with a prefix of 'Error' (more serious). So it's a warning and an error?
  2. It only considers the first line of sequence data. I appreciate that this is the easiest thing to check, but it is not very useful if all of your ambiguous bases are at the end of the sequence (or anywhere past the first line).
  3. What is the rationale for choosing 40% as the threshold for warning? It seems a little too arbitrary.
  4. It produces this warning if the first line is at least 40% ambiguous and also has a length greater than 3 bp! This means that it can be triggered by a line that starts 'NNNAC', as in my sequence3 example above.
  5. It considers all ambiguity codes as being equal. So if I switched my first line of sequence3 from NNNAC to RWBAC, it would still be rejected even though the sequence RWB contains much more information than NNN.
  6. The way the output text bluntly says 'shouldn't be over 40%' comes across as very matter-of-fact, as if you've transgressed some unknown law of bioinformatics.
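To make these complaints concrete, here is a minimal Python sketch of the behaviour as I have described it. It is not NCBI's code: the function names, the set of ambiguity codes, and the exact comparisons (more than 3 bp, over 40%) are my assumptions, based only on the wording of the message and the examples above.

```python
# Assumed set of IUPAC nucleotide codes that stand for more than one base.
AMBIGUOUS = set("NRYSWKMBDHV")


def first_data_line(fasta_text):
    """Return the first non-empty, non-header line of a FASTA string."""
    for line in fasta_text.splitlines():
        line = line.strip()
        if line and not line.startswith(">"):
            return line
    return ""


def ambiguous_fraction(line):
    """Fraction of characters in `line` that are ambiguity codes."""
    if not line:
        return 0.0
    return sum(base in AMBIGUOUS for base in line.upper()) / len(line)


def would_warn(fasta_text, threshold=0.40, min_length=3):
    """Hypothetical re-creation of the check: only the first line is examined."""
    line = first_data_line(fasta_text)
    return len(line) > min_length and ambiguous_fraction(line) > threshold


body = "acgt" * 75  # 300 bp of unambiguous sequence
print(would_warn(">sequence1\n" + "n" * 50 + "\n" + body))           # first line all Ns
print(would_warn(">sequence2\n" + "t" * 50 + "\n" + "n" * 50 + "\n" + body))
print(would_warn(">sequence3\nnnnac\n" + body))                      # 3 of 5 ambiguous
```

Under these assumptions, sequence1 and sequence3 trigger the check and sequence2 does not, even though sequence1 and sequence2 contain exactly the same bases.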

So here are my suggestions for an alternative (which admittedly requires some more coding):

  1. If a sequence is less than 1,000 bp check all of the sequence, otherwise check the first 1,000 bp of sequence (if not more).
  2. Report the output as a warning and not an error.
  3. Change the warning message. E.g. 'The first 1,000 bp of your sequence contains a high proportion (X%) of ambiguous bases. Such sequences may not be very useful for any downstream analysis that you perform with BLAST+.'
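The three suggestions above could be sketched in Python as follows. This is only an illustration: the function name `check_ambiguity`, the default window of 1,000 bp, and the 40% default threshold are placeholders of mine (I have not argued for any particular threshold), and the message follows the wording proposed in suggestion 3.

```python
import warnings

# Assumed set of IUPAC nucleotide codes that stand for more than one base.
AMBIGUOUS_BASES = set("NRYSWKMBDHV")


def check_ambiguity(sequence, window=1000, threshold_percent=40.0):
    """Warn if the first `window` bp are highly ambiguous.

    Sequences shorter than `window` are checked in full. Returns the
    percentage of ambiguous bases in the region that was checked.
    """
    prefix = sequence[:window].upper()
    if not prefix:
        return 0.0
    ambiguous = sum(base in AMBIGUOUS_BASES for base in prefix)
    percent = 100.0 * ambiguous / len(prefix)
    if percent > threshold_percent:
        # Issued as a warning, not an error, so processing can continue.
        warnings.warn(
            f"The first {len(prefix):,} bp of your sequence contains a high "
            f"proportion ({percent:.0f}%) of ambiguous bases. Such sequences "
            "may not be very useful for any downstream analysis that you "
            "perform with BLAST+."
        )
    return percent


# A line of 50 Ns no longer matters when the rest of the sequence is fine.
check_ambiguity("n" * 50 + "acgt" * 75)
```

With this approach, the position of a short run of Ns within the first 1,000 bp no longer changes the outcome, and a five-character first line such as 'NNNAC' cannot trigger the warning on its own.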

Thinking of writing a FASTA parser? 12 scenarios that might break your code

In today's Bits and Bites Coding class, we talked about the basics of FASTA and GFF file formats. Specifically, we were discussing the problems that can arise when you write scripts to parse these files, and the types of problems that might be present in the file which may break (or confuse) your script.

Even though you can find code to parse FASTA files from projects such as BioPerl, it can be instructive to try to do this yourself when you are learning a language. Many of the problems that occur when trying to write code to parse FASTA files will fall into the 'I wasn't expecting the file to look like that' category. I recently wrote about how the simplicity of the FASTA format is a double-edged sword. Because almost anything is allowed, it means that someone will — accidentally or otherwise — produce a FASTA file at some point that contains one of the following 12 scenarios. These are all things that a good FASTA parser should be able to deal with and, if necessary, warn the user:

> space_at_start_of_header_line
ACGTACGTACGTACGT

>Extra_>_in_FASTA_header
ACGTACGTACGTACGT

>Spaces_in_sequence
ACGTACGT ACGTACGT

>Spaces_in_sequence_and_between_lines
A C 

G T 

A C

A G 

A T

>Redundant_sequence_in_header_ACGTACGTACGT
ACGTACGTACGTACGT

><- missing definition line
ACGTACGTACGTACGT

>mixed_case
ACGTACGTACGTACGTgtagaggacgcaccagACGTACGTACGTACGT

>missing_sequence

>rare, but valid, IUPAC characters 
ACGTRYSWKMBDHVN

>errant sequence
Maybe I accidentally copied and pasted something
Maybe I used Microsoft Word to edit my FASTA sequence

>duplicate_FASTA_header
ACGTACGTACGTACGT
>duplicate_FASTA_header
ACGTACGTACGTACGT

>line_ending_problem^MACGTACGTACGTACGT^MACGTACGTACGTACGT^M>another_sequence_goes_here^MACGTACGTACGTACGT^MACGTACGTACGTACGT
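As a starting point, here is a minimal Python sketch of a parser that tolerates or reports several of the scenarios above. It is not a complete solution: the function name `parse_fasta` is hypothetical, it assumes nucleotide sequences only (so a protein file would be flagged), and it reports problems rather than deciding what to do about them.

```python
import re

# Assumed alphabet: standard bases plus IUPAC ambiguity codes.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def parse_fasta(text):
    """Return (records, problems); records is a list of (header, sequence)."""
    records, problems, seen = [], [], set()
    header, chunks = None, []

    def flush():
        # Store the record collected so far and note anything suspicious.
        if header is None:
            return
        sequence = "".join(chunks)
        if not header:
            problems.append("empty header line")
        if not sequence:
            problems.append(f"no sequence for {header!r}")
        invalid = sorted(set(sequence) - IUPAC_NUCLEOTIDES)
        if invalid:
            problems.append(
                f"non-IUPAC characters in {header!r}: {''.join(invalid)}"
            )
        if header in seen:
            problems.append(f"duplicate header {header!r}")
        seen.add(header)
        records.append((header, sequence))

    # splitlines() copes with \n, \r\n and bare \r (the ^M case) endings.
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between sequence lines are skipped
        if line.startswith(">"):
            flush()
            header, chunks = line[1:].strip(), []  # drops space after '>'
        elif header is None:
            problems.append("sequence data before first header; line skipped")
        else:
            # Remove internal spaces and normalise mixed case.
            chunks.append(re.sub(r"\s+", "", line).upper())
    flush()
    return records, problems


records, problems = parse_fasta("> spaced\nACGT acgt\n\nAC GT\r>empty\r>bad\rMaybe")
print(records)
print(problems)
```

Some scenarios are deliberately left alone: an extra '>' inside a header, or sequence text repeated in a header, is kept as-is, because whether that is an error depends on what the file is for.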

The FASTA file format: a showcase for the best and worst of bioinformatics coding practices

The FASTA file format is one of the oldest recognized formats in bioinformatics and has become the lingua franca when trying to store sequence information in a plain text format. It is probably true to say that many people are much more likely to know of FASTA (the file format) than FASTA (the program). The FASTA program, a tool for comparing DNA or protein sequences, was first described by Bill Pearson and David Lipman back in 1988. However, FASTA (the program) was itself a successor to the original FASTP program that was described a few years earlier by the same authors in 1985.

The FASTA tool is still a going concern and the latest version (v36.3.5e) was released in May 2011. Thinking of FASTA as a single tool is a little misleading; it's really a package with over a dozen different programs. Part of the success of the FASTA software is perhaps due to the simplicity of the file format (which merits its own Wikipedia page). Let's look to see what the FASTA manual says about the format:

FASTA format files consist of a description line, beginning with a ’>’ character, followed by the sequence itself
...
All of the characters of the description line are read, and special characters can be used to indicate additional information about the sequence. In general, non-amino-acid/non-nucleotide sequences in the sequence lines are ignored.

You will note the absence of the word 'unique' in this description and this brings us to the issue that I really want to talk about. Today, Vince Buffalo sparked an interesting conversation thread on Twitter:

It's not controversial if I say every FASTA ID should always be treated as a primary, unique foreign key, ya?— Vince Buffalo (@vsbuffalo) June 24, 2013

The FASTA format does not specify whether the description line (sometimes known as the 'header line' or 'definition line') should be unique or not. In some cases, it may not matter. In other cases, it clearly does. I find this issue — which probably doesn't sound very complex — gets to the heart of what is good and bad about bioinformatics. This is because it reveals a lot about the programmers who write code to work with FASTA-formatted files, and about how they deal with error handling.

Let's consider the following hypothetical FASTA file (which contains some overly simplistic DNA sequences):

>AT1G01140.1 (version 1)
AACCGGTT
>AT1G01140.1 (version 2)
TTGGCCAA
>AT1G01140.2 (version 1)
CCGGAATT
>AT1G01140.2 (version 1)
CCGGAATT

The 3rd and 4th entries are clearly duplicated (both FASTA header and sequence), and the other entries have similar, but unique, headers. Of course, the person who is using this sequence file may not be aware of this (a FASTA file could contain millions of sequences). So question number 1 is:

Whose job is it to check for unique FASTA headers? The person who generated the data file, or the person who wrote the software tool that has to process it?

We frequently generate FASTA files directly from websites and databases, so I think that we, as bioinformatics users, have increasingly put too much faith in assuming that someone else (the person who wrote the FASTA-exporting code) has already tackled the problem for us.

This problem starts to get more complex when I tell you that I have used various bioinformatics tools in the past that have done one of two things with FASTA headers:

  1. Ignore any characters beyond the nth character
  2. Ignore any characters after whitespace
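The effect of these two behaviours on the headers of the hypothetical file above can be sketched in a few lines of Python. The helper `unique_ids` and its parameters are made up for illustration; they do not correspond to any particular tool.

```python
headers = [
    "AT1G01140.1 (version 1)",
    "AT1G01140.1 (version 2)",
    "AT1G01140.2 (version 1)",
    "AT1G01140.2 (version 1)",
]


def unique_ids(headers, max_chars=None, stop_at_whitespace=False):
    """Return the set of identifiers a tool would see under each behaviour."""
    ids = set()
    for header in headers:
        identifier = header
        if stop_at_whitespace:
            parts = header.split(None, 1)  # keep only the first token
            identifier = parts[0] if parts else ""
        if max_chars is not None:
            identifier = identifier[:max_chars]
        ids.add(identifier)
    return ids


print(len(unique_ids(headers)))                           # full header line
print(len(unique_ids(headers, stop_at_whitespace=True)))  # stop at whitespace
print(len(unique_ids(headers, max_chars=8)))              # first 8 characters
```

The full header lines give three distinct identifiers, stopping at whitespace gives two, and keeping only the first 8 characters gives one.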

In both of these situations, a program might consider the above file to contain only two unique entries (if it ignores everything after whitespace), or even one unique entry (if it ignores everything after 8 characters). This is clearly dangerous, and if the program in question needs to use a unique FASTA header as a key in a hash, then duplicate headers may cause problems. Hopefully, a good programmer won't impose arbitrary limits like this. However, this brings me to question number 2:

How should a program deal with duplicate FASTA headers?

There are many options here:

  1. Refuse to run without telling you what the problem is (I've seen this before)
  2. Refuse to run and tell you that there are duplicate headers
  3. Refuse to run and tell you which headers are duplicated
  4. Run and simply ignore any entries with duplicate headers
  5. Run and simply ignore any entries with duplicate headers, and report the duplicates
  6. Check whether entries with duplicate headers also have duplicate sequences. This could potentially reveal whether it is a genuine duplication of the entire entry, or whether different genes/sequences just happen to share the same name/identifier (this is possible, if unlikely, with short gene names from different species).
  7. Check for duplicate headers, report on duplicates, but don't stop the program running and instead add an extra character to duplicate headers to make them unique (e.g. append .1, .2, .3 etc)

The last option is the most work for the programmer but will catch most errors and problems. But how many of us would ever take that much time to catch what might be user error, but which might be a glitch in the FASTA export routine of a different program?
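For what it's worth, options 6 and 7 need not be a lot of code. Here is a minimal Python sketch, assuming the records have already been parsed into (header, sequence) pairs; the function name `make_headers_unique` and the report wording are hypothetical.

```python
from collections import defaultdict


def make_headers_unique(records):
    """Rename duplicated headers and report on them.

    `records` is a list of (header, sequence) tuples. Returns
    (renamed_records, report), where report has one line per duplicated
    header saying whether the sequences were also identical.
    """
    by_header = defaultdict(list)
    for header, sequence in records:
        by_header[header].append(sequence)

    counters = defaultdict(int)
    renamed = []
    for header, sequence in records:
        if len(by_header[header]) == 1:
            renamed.append((header, sequence))
            continue
        counters[header] += 1  # append .1, .2, .3 ... to each duplicate
        renamed.append((f"{header}.{counters[header]}", sequence))

    report = []
    for header, sequences in by_header.items():
        if len(sequences) > 1:
            same = len(set(sequences)) == 1
            kind = "identical sequences" if same else "different sequences"
            report.append(f"{header!r} appears {len(sequences)} times with {kind}")
    return renamed, report


records = [("geneA", "AACC"), ("geneB", "TTGG"), ("geneA", "AACC"), ("geneB", "CCGG")]
renamed, report = make_headers_unique(records)
print(renamed)
print(report)
```

Two caveats apply to this sketch. A renamed header could collide with a header that already ends in '.1', which is not checked here. And because the suffix goes at the end of the whole header line, it will not help a downstream tool that only reads up to the first whitespace.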

This whole scenario gets messier still when you consider that some institutions (I'm looking at you, NCBI) have imposed their own guidelines on FASTA description lines. In the original Twitter thread, Matt MacManes suggested that FASTA headers be no longer than 100 characters. Sounds good in theory, but then you run into entries from databases like FlyBase such as this one:

>FBgn0046814 type=gene; loc=2L:complement(9950437..9950455); ID=FBgn0046814; name=mir-87; dbxref=FlyBase:FBgn0046814,FlyBase:FBan0032981,FlyBase_Annotation_IDs:CR32981,GB:AE003626,MIR:MI0000382,flight:FBgn0046814,flymine:FBgn0046814,hdri:FBgn0046814,ihop:1478786; derived_computed_cyto=30F1-30F1%3B Limits computationally determined from genome sequence between @P{EP}CG5899<up>EP701</up>@ and @P{EP}CG4747<up>EP594</up>@%26@P{lacW}l(2)k13305<up>k13305</up>@; gbunit=AE014134; MD5=10a65ff8961e4823bad6c34e37813302; length=19; release=r5.21; species=Dmel;
TGAGCAAAATTTCAGGTGT

That's 555 characters of FASTA description for 19 characters of sequence. You would hope that this FASTA header line is unique!