If you want your bioinformatics software to have a memorable name, it helps if the name is pronounceable

Image from she is geeky blog

There is a new paper in the journal Bioinformatics:

The paper describes a new method for implementing a Principal Component Analysis (PCA) of data. That new method has a name. That name has just seven characters. How hard can it be to pronounce?

  • S4VDPCA: ess-four-vee-dee-pee-cee-ay

It doesn't exactly trip off the tongue, and having four 'ee-sounding' letters together (VDPC) doesn't make it easy to remember. When I first came across this paper, I skimmed the article, waited an hour, and then tried to remember the name. I could remember that it included '4', 'V', and 'D', but couldn't remember the order (or that it also included an 'S').

It is by no means essential that bioinformatics tools have easily pronounceable names, but this will help people remember the name of your software. In turn, this makes it easier for people to tell others about your software. I don't imagine that bioinformatics software developers ever want to overhear the following type of conversation:

Bob: "You should use that tool"

Sue: "What tool?"

Bob: "Umm, you know that PCA thingy. The S…something, something…PCA tool"

Sue: "The what?"

Bob: "Run a Google search for Bioinformatics PCA tools, it's probably the top hit."

Sue: <- facepalm ->

The Francis Crick Institute has signed the Hague Declaration

The Hague Declaration is an important manifesto that aims to provide guidelines for how to "best enable access to facts, data and ideas for knowledge discovery in the Digital Age". Although signatories to the declaration include large scientific research institutes, you can also sign the declaration as an individual. The five main principles of the declaration are summarized as follows:

  1. Intellectual property was not designed to regulate the free flow of facts, data and ideas, but has as a key objective the promotion of research activity
  2. People should have the freedom to analyse and pursue intellectual curiosity without fear of monitoring or repercussions
  3. Licenses and contract terms should not restrict individuals from using facts, data and ideas
  4. Ethics around the use of content mining techniques will need to continue to evolve in response to changing technology
  5. Innovation and commercial research based on the use of facts, data, and ideas should not be restricted by intellectual property law

These principles are obviously of huge relevance for the field of genomics, which seems to be generating tools and data at an ever-increasing rate. So I was happy to read today that the new Francis Crick Institute in London is one of the Declaration's latest signatories:

"The large amounts of data and information that are now becoming available represent an extraordinary resource for researchers. By signing the Hague Declaration the Francis Crick Institute is expressing its support for the idea that researchers should be able to mine such content freely, thereby to advance knowledge and to promote Discovery without Boundaries."

Jim Smith, Director of Research at the Francis Crick Institute

Front Line Genomics interview with Craig Venter includes a question from yours truly

Issue 4 of the Front Line Genomics magazine is now available online. It includes an interview with Craig Venter, who gave a much-anticipated talk at their recent Festival of Genomics conference in Boston. Front Line Genomics kindly allowed some of their previous interviewees (myself included) to pose some of the questions. Here's mine:

KRB: What do you see as the limits of synthetic biology? Could we assemble a functional eukaryotic genome, and what are the practical applications of such technology?

JCV: That’s a great question! The limitations will ultimately be more society limitations, ethical limitations, and standards rather than technology. I think a synthetic single eukaryotic cell would be very straightforward to do today. Various groups of scientists have been trying to build the yeast genome. It’s kind of like rebuilding a house one brick at a time, but they’re making a synthetic version of yeast. That’s not quite the same as writing the genetic code and then booting it up as we did, but that’s just because of the limitations on writing the genetic code now.

I think understanding what makes a multicellular organism, and all the regulation associated with that, are so far away from design that we’re going to have to learn a whole lot more biology before we get to that stage of deliberate design. I think about 10% of the genes in our designed synthetic bacterial cell, are of unknown function. All we know is that you can’t get life without them. That problem expands tremendously with eukaryotic cells. If you extrapolate to the challenge of interpreting the human genome, we only understand a tiny fraction of the human genome today.

Get With the Program: DIY tips for adding coding to your analysis arsenal

A new article in The Scientist magazine by Jeffrey M. Perkel shares some coding advice from Cole Trapnell, C. Titus Brown, and Vince Buffalo (I interviewed Vince in my last blog post). It is a great article, and worth a look. I particularly enjoyed this piece of advice (something that is not mentioned enough):

Treat data as "read-only"
Use an abundance of caution when working with your hard-won data, Buffalo says. For instance, “treat data as read-only.” In other words, don’t work with original copies of the data, make working copies instead. “If you have the data in an Excel spreadsheet and you make a change, that original data is gone forever,” he says.

I have seen too many students double-click on FASTA, GFF, and other large bioinformatics text files and end up 'viewing' them in some inappropriate program (including Microsoft Word). If you want to view text data, use a text viewer (such as less).
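One way to put the "read-only" advice into practice is sketched below in Python. This is only an illustration, not something from the article: the function name protect_and_copy and the idea of a separate working directory are assumptions made for the example. It removes the write permission from the original file and hands back a writable copy to work on.

```python
import os
import shutil
import stat


def protect_and_copy(original, workdir):
    """Mark `original` read-only and return the path of a writable working copy.

    Hypothetical helper for illustration; names and layout are assumptions.
    """
    # Clear every write bit on the original so accidental edits fail.
    mode = os.stat(original).st_mode
    os.chmod(original, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

    # copyfile copies the contents only (not the permission bits),
    # so the working copy is created writable as usual.
    os.makedirs(workdir, exist_ok=True)
    working = os.path.join(workdir, os.path.basename(original))
    shutil.copyfile(original, working)
    return working
```

Calling protect_and_copy("reads.fasta", "work") (file and directory names made up here) would leave the original untouched and give you work/reads.fasta to experiment with.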

101 questions with a bioinformatician #30: Vince Buffalo

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Vince is a second-year graduate student in the lab of Graham Coop at UC Davis. Before that he earned his bioinformatics 'chops' working in other groups on the UC Davis campus as a bioinformatician and statistical programmer.

I came to know Vince when he was working as part of the Genome Center's Bioinformatics Core Facility; I was immediately impressed, not only by his diverse set of computational skills, but by the way he applied those skills. Put simply, Vince does things the right way. He believes that bioinformatics should be a carefully documented, reproducible science. He also sees the strengths and advantages of using core Unix skills to organize and manage bioinformatics pipelines. These skills will provide a more useful, and lasting, toolbox than if you only ever learn how to use the latest and greatest set of published bioinformatics tools.

Impressively, Vince has recently published a book (Bioinformatics Data Skills by O'Reilly). This is something that I highly encourage people to buy, and I'm convinced that it will become an indispensable guide for everyone working in this field. In the book's introduction, he neatly states the problem that I alluded to earlier:

Many biologists starting out in bioinformatics tend to equate “learning bioinformatics” with “learning how to run bioinformatics software.” This is an unfortunate and misinformed idea of what bioinformaticians actually do. This is analogous to thinking “learning molecular biology” is just “learning pipetting.” … the approach of this book is to focus on the skills bioinformaticians use to explore and extract meaning from complex, large bioinformatics datasets.

You can find out more about Vince by visiting his 'digital notebook' website at vincebuffalo.org, or by following him on twitter @vsbuffalo. And now, on to the 101 questions...



001. What's something that you enjoy about current bioinformatics research?

Watching bioinformatics grow to tackle exciting evolutionary questions, especially with non-model organisms. While bioinformatics has clearly revolutionized the human genomics field, I think in the next decade we'll see interesting developments in bioinformatics tailored to problems in complex non-model organism genomics.

I love plants and have worked in plant genomics, and I've seen first hand that it's very hard. Many bioinformatics tools we used were designed to work with human data, not gigantic polyploid genomes. It will be exciting over the next few years to see how reads grow in length, new algorithms emerge, and how this will enable more non-model research. As a budding evolutionary biologist, I'm hopeful that these bioinformatics advances will fuel more discoveries in neat species that have traditionally been harder to work with.



010. What's something that you don't enjoy about current bioinformatics research?

A large proportion of a bioinformatician's time is spent tackling unnecessary human-made problems: data is poorly organized, file formats are both poorly specified and followed, and software is often poorly documented or isn't robust to different data. These are neither interesting scientific problems nor fun computational problems — these are frustrating social and community issues. No one wants to tackle these problems for that reason, but at some point we'll have to as a community — to avoid wasting our collective time on these annoyances.



011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Study more mathematics. I fell in love with statistics before I did math because I quickly saw the beauty in using statistics to understand data. Now I'm working backwards and trying to bolster my maths skills and seeing the beauty in other mathematical fields and really enjoying it. Darwin said "mathematics seems to endow one with something like a new sense" — I'd argue that this is especially true in biology.



100. What's your all-time favorite piece of bioinformatics software, and why?

It's a tie — SAMtools and PSMC. SAMtools is an amazing piece of engineering — from an algorithmic perspective, from a usability perspective, and from a community perspective. If you dig inside the source, everything is so cleverly written and carefully optimized (e.g. the klib library). I've learned a lot of C tricks from reading Heng Li's code.

SAMtools is also extremely well designed from the user perspective — it adopts the Unix philosophy and its subcommand interface is much like Git's. However, SAMtools is not a perfect program; there have been numerous bugs found over the years and some folks attack it for this. But these bugs are quickly patched thanks to active development and an excellent community. I don't work on SAMtools (other than one tiny bug fix) but I enjoy following along via GitHub and reading and learning from the source.



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

S — and it's a simple puzzle why this is the letter I chose.

ORCID: Over 1.5 million IDs served

I was pleased to read that there are now over 1.5 million ORCID IDs in existence. If you didn't know, ORCID provides unique identifiers to researchers. Once you have an ORCID identifier, you can start linking all of your research to that identifier. If you change name, or if you suffer from having a very common surname, ORCID makes it easier to track your contributions to science.

From their about page:

ORCID is an open, non-profit, community-driven effort to create and maintain a registry of unique researcher identifiers and a transparent method of linking research activities and outputs to these identifiers. ORCID is unique in its ability to reach across disciplines, research sectors and national boundaries. It is a hub that connects researchers and research through the embedding of ORCID identifiers in key workflows, such as research profile maintenance, manuscript submissions, grant applications, and patent applications.

I'm hopeful that ORCID will one day become the glue to tie together all scientific output. Written a blog post, or grant, or git commit message for some piece of scientific software? There is no reason why these couldn't all be 'digitally signed' with your ORCID identifier.

If you don't have an ORCID ID — and yes I appreciate that this is somewhat of a clumsy nomenclature — you really should register now.

Mining UniProt: the Roy Chaudhuri quest to find DNA-like protein sequences

Image adapted from original image by flickr user lwr

So I recently came up with this idea for #tweetsascode: using twitter to write tweets which contain functional programs within their 140 characters. I posted one such example last Friday, a Perl script which checks a FASTA file (specified on the command-line), in order to determine whether it contains a protein or DNA sequence:

#!/usr/bin/perl
use strict; 
use warnings;

while(<>){
    next if m/(^>)|(^$)/;
    die "Protein" if (m/[EFILOPQ]/i);
    die "DNA";
}
#tweetsascode

This code skips FASTA definition lines (those starting with '>') and blank lines, and then asks: does the first line of sequence contain any of the seven amino-acid characters which are not IUPAC nucleotide characters? If so, then it must be protein; otherwise the sequence is probably DNA.

This led @RoyChaudhuri to comment:

Roy's point is that there are so many IUPAC nucleotide characters that a protein sequence which only contained the 14 (out of the 20) canonical amino acids whose one-letter codes are also nucleotide codes would pass the test as a valid nucleotide sequence. Is it therefore possible to determine how many 'DNA-like' proteins there are?

Experiment

With the help of a little Perl script, I did the following:

  1. I first downloaded the FASTA files for SWISS-PROT and TrEMBL, which collectively comprise the UniProt protein database. If you didn't know, SWISS-PROT contains manually annotated entries whereas the much larger TrEMBL database is automatically annotated.
  2. For every protein sequence in SWISS-PROT or TrEMBL, my script counts the use of various protein ambiguity characters (this was just out of curiosity). These are B (aspartic acid or asparagine), J (leucine or isoleucine), Z (glutamate or glutamine), and X (unknown amino acid).
  3. The script also counts usage of the 21st and 22nd amino acids (selenocysteine and pyrrolysine, which have the valid IUPAC characters U and O respectively).
  4. The script counts any protein sequences which only contain amino acids that have equivalent IUPAC characters for the set of four canonical nucleotides (i.e. alanine, cysteine, glycine, and threonine).
  5. Finally, the script counts any protein sequences which only contain amino acids that have equivalents from any of the 16 IUPAC nucleotide characters.
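The Perl script itself is not shown, so what follows is only a Python sketch of how steps 2 to 5 might be implemented. The function names (read_fasta, tally) and the counter keys are assumptions made for this example, and the 16-character nucleotide set is the one quoted in the results. A sequence made up only of A, C, G, and T is counted once, as 'acgt_only', so that 'iupac_only' holds the additional DNA-like proteins.

```python
from collections import Counter

DNA4 = set("ACGT")
IUPAC_NT = set("ACGTNUWSKMRYBDHV")  # the 16 characters quoted in the results
TRACKED = "BJZXUO"                  # ambiguity codes, plus U (Sec) and O (Pyl)


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def tally(records):
    """Count sequences containing each tracked character and the DNA-like ones."""
    counts = Counter()
    for _header, seq in records:
        residues = set(seq)
        if not residues:
            continue                      # ignore empty records
        counts["total"] += 1
        for char in TRACKED:
            if char in residues:
                counts[char] += 1
        if residues <= DNA4:
            counts["acgt_only"] += 1      # step 4: only A, C, G, T
        elif residues <= IUPAC_NT:
            counts["iupac_only"] += 1     # step 5: the 'additional' proteins
    return counts
```

With a downloaded file (the file name here is a placeholder), usage would look like: with open("uniprot.fasta") as fh: print(tally(read_fasta(fh))).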

Results (SWISS-PROT)

Dataset = 549,008 proteins

  • 546,360 only contained the 20 'classic' amino acid characters
  • 254 contained selenocysteine characters (U)
  • 29 contained pyrrolysine characters (O)
  • 138 contained alternative amino acid character B (representing D or N)
  • 0 contained alternative amino acid character J (representing L or I)
  • 114 contained alternative amino acid character Z (representing E or Q)
  • 2,222 contained unknown amino acid characters (X)
  • Only 1 protein was comprised entirely of A, C, G, and T
  • An additional 123 proteins were comprised entirely from valid IUPAC nucleotide characters ([ACGTNUWSKMRYBDHV])

The sequence that contained only 'classic' DNA characters was a 31 amino acid fragment, which turned out to contain only two different amino acids (alanine and threonine):

>sp|P02732|ANP3_PAGBO Ice-structuring glycoprotein 3 (Fragments) OS=Pagothenia borchgrevinki PE=1 SV=1
AATAATAATAATAATAATAATAATAATAATA

Of the 123 proteins that used various characters from the full set of IUPAC nucleotide characters, this 128 amino acid protein was the longest:

>sp|Q925H4|KR211_MOUSE Keratin-associated protein 21-1 OS=Mus musculus GN=Krtap21-1 PE=2 SV=2
MCCNYYGNSCGGCGYGSRYGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYG
SGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSG
YGCGYGSRYGCGYGSGCCSYRKCYSSCC

This SWISS-PROT entry (accession Q925H4) is a mouse protein which has experimental evidence, and which has an Annotation score of 5 out of 5.

Results (TrEMBL)

Dataset = 50,011,027 proteins

  • 49,373,499 only contained the 20 'classic' amino acid characters
  • 1,697 contained selenocysteine characters (U)
  • 199 contained pyrrolysine characters (O)
  • 15,314 contained alternative amino acid character B (representing D or N)
  • 0 contained alternative amino acid character J (representing L or I)
  • 5,842 contained alternative amino acid character Z (representing E or Q)
  • 632,742 contained unknown amino acid characters (X)
  • Only 2 proteins were comprised entirely of A, C, G, and T
  • An additional 1,827 proteins were comprised entirely from valid IUPAC nucleotide characters ([ACGTNUWSKMRYBDHV])

In this much larger set of proteins we still don't see a sequence that resembles 'classic DNA' that is any longer than the 31 amino acid fragment found in SWISS-PROT. Instead, the longest sequence was a 24 amino acid fragment (which only contains alanine and glycine):

>tr|U6PUI2|U6PUI2_HAECO ISE/inbred ISE genomic scaffold, scaffoldpathogensHcontortusscaffold6340 (Fragment) OS=Haemonchus contortus GN=HCOI01698300 PE=4 SV=1
GAAAAGGGGGGGGGGGGAAAAGGA

However, in TrEMBL there was a much longer sequence which contains various characters from the full set of IUPAC nucleotide characters. This uncharacterized protein is 495 amino acids long, and contains mostly serine, arginine, and cysteine:

>tr|A0A0E9H024|A0A0E9H024_STREE Uncharacterised protein OS=Streptococcus pneumoniae GN=ERS23250802220 PE=4 SV=1
MRSRSYYTSVSRRKSSSSSSRSSSSSRSSSSCSSCRSSSSSRSSSSCRSS
SSCSSCRSSRSSRSSSSCSSSRSCRSCSSSRSCSSCRSSSSCSSCRSSRS
SRSSSSSRSSRSSSSCSSSRSSSSCRSSRSSRSSRSSRSCSSCRSSSSCS
SCRSCRSSSSCSSCRSSRSSRSSSSCRSSSSCSSCSSCRSSRSSSSCSSS
RSCRSCRSCRSSSSSSSCSSSRSSRSSSSSRSCSSCRSSSSCSSCRSSSS
CSSCRSSRSSSSCSSCSSCRSSRSSSSCSSSRSSRSSRSCRSSSSSRSCR
SCSSCRSSSSCSSCRSSRSSRSCSSCRSSRSSRSCRSSSSSRSSRSSSSS
RSSRSSSSCRSSRSSSSSRSCSSCRSSRSSSSCSSSRSSSSSRSCSSSRS
CSSCRSSSSCSSCRSSRSSRSSSSSRSSRSSSSCSSSRSSSSCRSSRSSR
SSRSSRSSSSSRSSRSSSSSRSSRSSSSCRSCRSSSSCSSCSSSR

Honorable mention to XXX-rated protein

It is worth giving a shout out to UniProt accession W4XLU5 (from the TrEMBL database). This uncharacterized protein has a length of 21,842 amino acids…21,292 of which are represented by unknown amino acids!!! This is probably why the protein has an Annotation score of just 1 out of 5.

Conclusions

  1. To answer Roy's initial question, only about 0.004% of proteins in UniProt (1,953 / 50,560,035) fulfil the requirement of only containing amino acids that have equivalent IUPAC nucleotide characters.
  2. From a coding point of view, one should possibly account for the fact that you can see almost 500 DNA-like characters in a sequence, but you still could be looking at a protein sequence.
  3. A ~22,000 amino acid protein which contains 97% 'unknown' residues, should maybe take the award for 'least-useful protein annotation'
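On the second conclusion, one possible way to make a DNA-or-protein check more robust is to look at the composition of the whole sequence rather than only asking whether every character is a legal IUPAC nucleotide code. The Python sketch below is an assumption-laden illustration, not the approach used in this post: the function name guess_type and the 0.9 cut-off are made up for the example. It relies on the observation that real nucleotide data is dominated by A, C, G, T/U, and N, whereas the DNA-like proteins above are rich in characters such as S, R, and Y.

```python
IUPAC_NT = set("ACGTNUWSKMRYBDHV")  # all 16 IUPAC nucleotide characters
CORE_NT = set("ACGTUN")             # the characters that dominate real DNA/RNA


def guess_type(seq, min_core_fraction=0.9):
    """Guess 'DNA' or 'protein' from the composition of a whole sequence.

    The 0.9 threshold is an arbitrary assumption chosen for illustration.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    if not set(seq) <= IUPAC_NT:
        # At least one character is not a nucleotide code (e.g. E, F, I, L, P, Q).
        return "protein"
    core = sum(1 for char in seq if char in CORE_NT)
    # Every character is a legal nucleotide code, so fall back on how much
    # of the sequence is made of the common ones.
    return "DNA" if core / len(seq) >= min_core_fraction else "protein"
```

Under these assumptions the keratin and Streptococcus sequences shown earlier would be reported as protein, because well under 90% of their residues are A, C, G, T, U, or N. A fragment such as AATAATAAT... would still be reported as DNA; composition alone cannot separate that case.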

Insights into the mind and practices of a bioinformatics developer

Michael Barton (who has the coolest twitter handle: @bioinformatics) is a talented bioinformatician who has a great, though infrequently updated, blog (Bioinformatics Zen).

His latest post (How I Develop) takes a look at his overall coding and working practices, reflecting on his decade-long experience in bioinformatics. I particularly like the section on 'Focus':

Getting up very early. I built nucleotid.es by getting up at 6am every morning for about 6 months. An average of 2-3 hours of time everyday allowed me to create the prototype site to prove that this could be useful. These hours in the morning feel like my most productive and usually there's no one else awake to interrupt you.

Well worth a read, as is his FAQ.

Does your bioinformatics software pass the 'elevator test'?

The name of your bioinformatics software is important. A good name should be clear, unambiguous, pronounceable, memorable, and meaningful. Sadly many (most?) names of existing tools do not satisfy all of these criteria. Here is a simple thought experiment that you can use when trying to decide on a new name for your software; it might help you avoid many of the common naming problems that can arise.

Imagine that you are in an elevator going from the 6th floor of a building to the ground floor. The elevator stops at the 5th floor and a visiting bioinformatics/genomics scholar steps in. He/she is someone that you admire and someone who you would really like to know about the latest software tool that you've been working on.

They press the button for the 2nd floor. You have maybe 30 seconds to introduce the tool and hopefully make them curious enough to check it out when they next get back to their computer. You say something like:

Hi. I'm a big fan of your work. I wanted to let you know that I've been working on a tool that you might be interested in…it's called 'X'

In this example, we will assume that you may never see this person again and that you don't know when they will have time to look up your software tool. It might be days, so the name has to be something that they will remember. The more meaningful and pronounceable the name, the more chance that it will be memorable.

Now, let's consider the names of some recently published bioinformatics tools…do these pass the elevator test? You should always consider how you might have to spell out the name of your software:

  • tmle.npvi — tee-em-el-ee-dot-en-pee-vee-aye
  • EW_dmGWAS — Ee-double-you-underscore-dee-em-gee-was
  • do_x3dna — dee-oh- (or do?) -underscore-ex-three-DNA
  • R3D-2-MSA — ar-three-dee-dash-two-dash-em-ess-ay 
  • Pse-in-One — pee-ess-ee- (or see?) -dash-in-dash-one
  • (PS)2 — open-parentheses-pee-ess-close-parentheses-superscript-two

In these examples you would probably choose to omit details of the dots, dashes, underscores, parentheses, and superscript characters that are part of the name. So you should ask yourself whether you really need to include them in the first place.
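If you want a quick, mechanical version of that question, a few lines of Python can list the characters in a candidate name that you would have to either say out loud or silently drop. This is just a sketch; the function name awkward_characters is made up for the example, and it says nothing about whether the remaining letters are pronounceable.

```python
def awkward_characters(name):
    """Return the characters in `name` that are not ASCII letters or digits."""
    return [char for char in name if not (char.isascii() and char.isalnum())]
```

For the names listed above, awkward_characters("EW_dmGWAS") gives ['_'] and awkward_characters("R3D-2-MSA") gives ['-', '-'], whereas a name like "SAMtools" gives an empty list.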

The bottom line is that it is not enough for the name of your software to be comprehensible when read from a screen or page…it should also sound good!