Fun with an 'error message' from NCBI BLAST+

Consider this very simple DNA sequence in FASTA format:

>sequence1
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
ttttttagaaaaattatttttaagaatttttcattttaggaatattgtta
tttcagaaaatagctaaatgtgatttctgtaattttgcctgccaaattcg
tgaaatgcaataaaaatctaatatccctcatcagtgcgatttccgaatca
gtatatttttacgtaatagcttctttgacatcaataagtatttgcctata
tgactttagacttgaaattggctattaatgccaatttcatgatatctagc
cactttagtataattgtttttagtttttggcaaaactattgtctaaacag

If you try converting this to a BLAST database using the 'makeblastdb' command from the latest version of NCBI's BLAST+ suite, you will see the following line included in the output:

Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 100% ambiguous nucleotides (shouldn't be over 40%)

Now consider what happens if you run the same makeblastdb command on this sequence (which just switches the first two lines of sequence1):

>sequence2
ttttttagaaaaattatttttaagaatttttcattttaggaatattgtta
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
tttcagaaaatagctaaatgtgatttctgtaattttgcctgccaaattcg
tgaaatgcaataaaaatctaatatccctcatcagtgcgatttccgaatca
gtatatttttacgtaatagcttctttgacatcaataagtatttgcctata
tgactttagacttgaaattggctattaatgccaatttcatgatatctagc
cactttagtataattgtttttagtttttggcaaaactattgtctaaacag

Although this sequence has the exact same proportion of As, Cs, Gs, Ts, and Ns, it does not produce the error message. What about the following sequence?

>sequence3
nnnac
ttttttagaaaaattatttttaagaatttttcattttaggaatattgtta
tttcagaaaatagctaaatgtgatttctgtaattttgcctgccaaattcg
tgaaatgcaataaaaatctaatatccctcatcagtgcgatttccgaatca
gtatatttttacgtaatagcttctttgacatcaataagtatttgcctata
tgactttagacttgaaattggctattaatgccaatttcatgatatctagc
cactttagtataattgtttttagtttttggcaaaactattgtctaaacag

Well, surprise surprise, this sequence produces the error message again (though it now tells you that the first line 'is about 60% ambiguous nucleotides'). You will still see the same message even if you added 1 billion As, Cs, Gs, and Ts on to the end of sequence 3. This seems to be the code responsible for the error message (taken from this page):

In case it wasn't obvious, here is why this annoys me:

  1. The comment in the code indicates that this should be treated as a warning (less serious), but then the message starts with a prefix of 'Error' (more serious). So it's an warning and an error?
  2. It only considers the first line of sequence data. I appreciate that this is easiest thing to check, but it is not very useful if all of your ambiguous bases are at the end of the sequence (or anywhere past the first line).
  3. What is the rationale for choosing 40% as the threshold for warning? It seems a little too arbitrary.
  4. It produces this warning if the first line at least 40% ambiguous and if it also has a length greater than 3 bp! This means that it can be triggered with a line that starts 'NNNAC' as in my sequence3 example above.
  5. It considers all ambiguity codes as being equal. So if I switched my first line of sequence3 from NNNAC to RWBAC, it would still be rejected even though the sequence RWB contains much more information than NNN.
  6. The way the output text bluntly says 'shouldn't be over 40%' comes across as very matter-of-fact, as if you've transgressed some unknown law of bioinformatics.

So here are my suggestions for an alternative (which admittedly requires some more coding):

  1. If a sequence is less than 1,000 bp check all of the sequence, otherwise check the first 1,000 bp of sequence (if not more).
  2. Report the output as a warning and not an error.
  3. Change the warning message. E.g. 'The first 1,000 bp of your sequence contains a high proportion (X%) of ambiguous bases. Such sequences may not be very useful for any downstream analysis that you perform with BLAST+.'