Mining UniProt: the Roy Chaudhuri quest to find DNA-like protein sequences

Image adapted from original image by flickr user lwr

So I recently came up with this idea for #tweetsascode: using Twitter to post tweets that each contain a functional program within the 140-character limit. I posted one such example last Friday, a Perl script which checks a FASTA file (specified on the command line) to determine whether it contains a protein or DNA sequence:

#!/usr/bin/perl
use strict; 
use warnings;

while(<>){
    next if m/(^>)|(^$)/;
    die "Protein" if (m/[EFILOPQ]/i);
    die "DNA";
}
#tweetsascode

This code skips FASTA definition lines (those starting with '>') and blank lines, and then asks: does the first line of sequence contain any of seven amino-acid characters (E, F, I, L, O, P, or Q) that are not IUPAC nucleotide characters? If so, it must be protein; otherwise the sequence is probably DNA.
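For readers who find the tweet-sized version a little dense, here is a minimal expanded sketch of the same logic. The subroutine name guess_type is hypothetical (it is not part of the original tweet), and it returns a label rather than calling die, so that it can be reused and tested.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: classify a FASTA record from its first sequence line.
# Takes a list of lines; returns 'Protein', 'DNA', or undef when there is
# no sequence line at all.
sub guess_type {
    my @lines = @_;
    for my $line (@lines) {
        # Skip definition lines (starting with '>') and blank lines.
        next if $line =~ m/^>/ || $line =~ m/^$/;
        # E, F, I, L, O, P and Q are amino-acid codes with no IUPAC
        # nucleotide meaning, so any one of them implies protein.
        return 'Protein' if $line =~ m/[EFILOPQ]/i;
        # Otherwise fall back to DNA (which can still be wrong).
        return 'DNA';
    }
    return;
}

print guess_type(">seq1\n", "ACGTTGCA\n"), "\n";    # prints DNA
print guess_type(">seq2\n", "MKLVEFQ\n"),  "\n";    # prints Protein
```

As in the tweet, only the first line of sequence is ever inspected.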

This led @RoyChaudhuri to comment:

Roy's point is that there are so many IUPAC nucleotide characters that a protein sequence containing only the 14 canonical amino acids (out of 20) whose one-letter codes double as nucleotide codes would also pass the test as a valid nucleotide sequence. Is it therefore possible to determine how many 'DNA-like' proteins there are?
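The overlap between the two alphabets can be enumerated directly. This is just a small sketch; the variable names are my own, and the nucleotide alphabet is the 16-character set quoted in the results further down.

```perl
use strict;
use warnings;

# The 20 canonical amino-acid one-letter codes.
my @amino = split //, 'ACDEFGHIKLMNPQRSTVWY';

# The 16 IUPAC nucleotide characters used elsewhere in this post.
my %is_nucleotide = map { $_ => 1 } split //, 'ACGTNUWSKMRYBDHV';

# Amino-acid codes that are also valid nucleotide codes, and those that are not.
my @shared   = grep {  $is_nucleotide{$_} } @amino;
my @distinct = grep { !$is_nucleotide{$_} } @amino;

printf "%d shared: %s\n",   scalar @shared,   join('', @shared);
printf "%d distinct: %s\n", scalar @distinct, join('', @distinct);
# prints:
# 14 shared: ACDGHKMNRSTVWY
# 6 distinct: EFILPQ
```

The seventh character in the tweet's regular expression, O, is pyrrolysine, which is not one of the 20 canonical amino acids.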

Experiment

With the help of a little Perl script, I did the following:

  1. I first downloaded the FASTA files for SWISS-PROT and TrEMBL, which collectively comprise the UniProt protein database. If you didn't know, SWISS-PROT contains manually annotated entries whereas the much larger TrEMBL database is automatically annotated.
  2. For every protein sequence in SWISS-PROT or TrEMBL, my script counted the use of various protein ambiguity characters (this was just out of curiosity). These are B (aspartic acid or asparagine), J (leucine or isoleucine), Z (glutamate or glutamine), and X (unknown amino acid).
  3. The script also counted usage of the 21st and 22nd amino acids (selenocysteine and pyrrolysine, which have the valid IUPAC characters U and O respectively).
  4. The script counted any protein sequences which only contained amino acids that have equivalent IUPAC characters for the set of four canonical nucleotides (i.e. alanine, cysteine, glycine, and threonine).
  5. Finally, the script counted any protein sequences which only contained amino acids that have equivalents among the 16 IUPAC nucleotide characters.
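The steps above could be scripted roughly as follows. This is a minimal sketch and not the script that produced the numbers below: the subroutine names (categories, tally_fasta) and the category labels are hypothetical, and the demo reads a tiny made-up FASTA string rather than the real UniProt files.

```perl
use strict;
use warnings;

# Hypothetical helper: list the categories (steps 2-5) a sequence falls into.
sub categories {
    my ($seq) = @_;
    $seq = uc $seq;
    my @cats;
    push @cats, 'classic20' if $seq =~ /^[ACDEFGHIKLMNPQRSTVWY]+$/;
    # Selenocysteine, pyrrolysine, and the ambiguity characters.
    for my $char (qw(U O B J Z X)) {
        push @cats, $char if index($seq, $char) >= 0;
    }
    if ($seq =~ /^[ACGT]+$/) {
        push @cats, 'acgt_only';      # only the four canonical bases
    }
    elsif ($seq =~ /^[ACGTNUWSKMRYBDHV]+$/) {
        push @cats, 'iupac_only';     # the "additional" DNA-like proteins
    }
    return @cats;
}

# Hypothetical helper: read FASTA records from a filehandle and tally them.
sub tally_fasta {
    my ($fh) = @_;
    my (%count, $seq);
    my $flush = sub {
        return unless defined $seq && length $seq;
        $count{total}++;
        $count{$_}++ for categories($seq);
    };
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>/) {          # a new record starts
            $flush->();
            $seq = '';
        }
        elsif (defined $seq) {        # sequence line of the current record
            $line =~ s/\s+//g;
            $seq .= $line;
        }
    }
    $flush->();                       # do not forget the last record
    return \%count;
}

# Demo on an in-memory FASTA string (made-up records, for illustration only).
my $demo = ">p1\nMKLV\nEFQ\n>p2\nAATAAT\n>p3\nGYGSRC\n>p4\nMXXU\n";
open my $fh, '<', \$demo or die "Cannot open in-memory FASTA: $!";
my $count = tally_fasta($fh);
close $fh;
printf "%-10s %d\n", $_, $count->{$_} for sort keys %$count;
```

To run this on real data, the in-memory handle would be swapped for one opened on a downloaded FASTA file.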

Results (SWISS-PROT)

Dataset = 549,008 proteins

  • 546,360 only contained the 20 'classic' amino acid characters
  • 254 contained selenocysteine characters (U)
  • 29 contained pyrrolysine characters (O)
  • 138 contained alternative amino acid character B (representing D or N)
  • 0 contained alternative amino acid character J (representing L or I)
  • 114 contained alternative amino acid character Z (representing E or Q)
  • 2,222 contained unknown amino acid characters (X)
  • Only 1 protein was composed entirely of A, C, G, and T
  • An additional 123 proteins were composed entirely of valid IUPAC nucleotide characters ([ACGTNUWSKMRYBDHV])

The sequence that contained only 'classic' DNA characters was a 31 amino acid fragment, which turned out to contain only two different amino acids (alanine and threonine):

>sp|P02732|ANP3_PAGBO Ice-structuring glycoprotein 3 (Fragments) OS=Pagothenia borchgrevinki PE=1 SV=1
AATAATAATAATAATAATAATAATAATAATA

Of the 123 proteins that used various characters from the full set of IUPAC nucleotide characters, this 128 amino acid protein was the longest:

>sp|Q925H4|KR211_MOUSE Keratin-associated protein 21-1 OS=Mus musculus GN=Krtap21-1 PE=2 SV=2
MCCNYYGNSCGGCGYGSRYGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYG
SGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSG
YGCGYGSRYGCGYGSGCCSYRKCYSSCC

This SWISS-PROT entry (accession Q925H4) is a mouse protein which is supported by experimental evidence and has an Annotation score of 5 out of 5.

Results (TrEMBL)

Dataset = 50,011,027 proteins

  • 49,373,499 only contained the 20 'classic' amino acid characters
  • 1,697 contained selenocysteine characters (U)
  • 199 contained pyrrolysine characters (O)
  • 15,314 contained alternative amino acid character B (representing D or N)
  • 0 contained alternative amino acid character J (representing L or I)
  • 5,842 contained alternative amino acid character Z (representing E or Q)
  • 632,742 contained unknown amino acid characters (X)
  • Only 2 proteins were composed entirely of A, C, G, and T
  • An additional 1,827 proteins were composed entirely of valid IUPAC nucleotide characters ([ACGTNUWSKMRYBDHV])

Even in this much larger set of proteins, we still don't see a sequence resembling 'classic DNA' that is any longer than the 31 amino acid fragment found in SWISS-PROT. Instead, the longest such sequence in TrEMBL was a 24 amino acid fragment (which only contains alanine and glycine):

>tr|U6PUI2|U6PUI2_HAECO ISE/inbred ISE genomic scaffold, scaffoldpathogensHcontortusscaffold6340 (Fragment) OS=Haemonchus contortus GN=HCOI01698300 PE=4 SV=1
GAAAAGGGGGGGGGGGGAAAAGGA

However, TrEMBL does contain a much longer sequence that uses various characters from the full set of IUPAC nucleotide characters. This uncharacterized protein is 495 amino acids long, and consists mostly of serine, arginine, and cysteine:

>tr|A0A0E9H024|A0A0E9H024_STREE Uncharacterised protein OS=Streptococcus pneumoniae GN=ERS23250802220 PE=4 SV=1
MRSRSYYTSVSRRKSSSSSSRSSSSSRSSSSCSSCRSSSSSRSSSSCRSS
SSCSSCRSSRSSRSSSSCSSSRSCRSCSSSRSCSSCRSSSSCSSCRSSRS
SRSSSSSRSSRSSSSCSSSRSSSSCRSSRSSRSSRSSRSCSSCRSSSSCS
SCRSCRSSSSCSSCRSSRSSRSSSSCRSSSSCSSCSSCRSSRSSSSCSSS
RSCRSCRSCRSSSSSSSCSSSRSSRSSSSSRSCSSCRSSSSCSSCRSSSS
CSSCRSSRSSSSCSSCSSCRSSRSSSSCSSSRSSRSSRSCRSSSSSRSCR
SCSSCRSSSSCSSCRSSRSSRSCSSCRSSRSSRSCRSSSSSRSSRSSSSS
RSSRSSSSCRSSRSSSSSRSCSSCRSSRSSSSCSSSRSSSSSRSCSSSRS
CSSCRSSSSCSSCRSSRSSRSSSSSRSSRSSSSCSSSRSSSSCRSSRSSR
SSRSSRSSSSSRSSRSSSSSRSSRSSSSCRSCRSSSSCSSCSSSR

Honorable mention to XXX-rated protein

It is worth giving a shout out to UniProt accession W4XLU5 (from the TrEMBL database). This uncharacterized protein has a length of 21,842 amino acids, 21,292 of which are unknown residues (X)! This is probably why the protein has an Annotation score of just 1 out of 5.

Conclusions

  1. To answer Roy's initial question, only about 0.004% of proteins in UniProt (1,953 / 50,560,035) fulfil the requirement of only containing amino acids that have equivalent IUPAC nucleotide characters.
  2. From a coding point of view, it is worth remembering that you can see almost 500 DNA-like characters in a row and still be looking at a protein sequence.
  3. A ~22,000 amino acid protein which contains 97% 'unknown' residues should maybe take the award for 'least-useful protein annotation'.
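One way to act on the second conclusion is to look at composition as well as alphabet. The sketch below is only an illustration under my own assumptions: the subroutine name guess_type_by_composition is hypothetical, and the 0.9 threshold is an arbitrary value that has not been tuned against UniProt or any nucleotide database.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the sequence type from composition, not just
# from the alphabet. $min_fraction (default 0.9) is an illustrative cut-off.
sub guess_type_by_composition {
    my ($seq, $min_fraction) = @_;
    $min_fraction = 0.9 unless defined $min_fraction;
    $seq =~ s/\s+//g;
    return 'Unknown' unless length $seq;

    # Same first test as the tweet: these codes never occur in nucleotides.
    return 'Protein' if $seq =~ /[EFILOPQ]/i;

    # Count the characters that dominate real nucleotide sequences.
    my $core = () = $seq =~ /[ACGTUN]/gi;
    return $core / length($seq) >= $min_fraction ? 'DNA' : 'Protein';
}

# Start of the 495 amino acid TrEMBL protein: DNA-like alphabet, but
# very few A/C/G/T/U/N characters.
print guess_type_by_composition('MRSRSYYTSVSRRKSSSSSSRSSSSSRSSSSCSSCRSSSS'), "\n";  # prints Protein

# A nucleotide string with a few ambiguity codes.
print guess_type_by_composition('ACGTACGTNNACGTRACGT'), "\n";  # prints DNA

# The ice-structuring fragment is pure A and T, so it is still called DNA.
print guess_type_by_composition('AATAATAATAATAATAATAATAATAATAATA'), "\n";  # prints DNA
```

No composition threshold can rescue a protein fragment made only of alanine and threonine, so a heuristic like this narrows the problem rather than solving it.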