24 carat JABBA awards


Here is a new paper published in the journal PLOSBuzzFeed…sorry, I mean PLOS Computational Biology:

It's a good job that they mention the name of the algorithm ninety-one times in the paper, otherwise you might forget just how bogus the name is. At least DIAMOnD has that lower-case 'n' which means that no-one will confuse it with:

This second DIAMOND paper dates all the way back to November 2014. Where does this DIAMOND get its name?

Double Index AlignMent Of Next-generation sequencing Data

This DIAMOND gets a bonus point for having a website link in the paper which doesn't seem to work.

So DIAMOnD and DIAMOND are both the latest recipients of JABBA awards for giving us Just Another Bogus Bioinformatics Acronym.

101 questions with a bioinformatician #24: Sara Gosline

Sara Gosline is a postdoc in the Fraenkel Lab, in the Department of Biological Engineering at MIT. Her current work focuses on studying the impact of microRNA changes on global mRNA expression. As her postdoc comes to an end, Sara is seeking a tenure-track faculty position to further explore the broader impacts of RNA regulation and to better interpret gene expression data in a network context (contact her if interested).


Excellent blog post about coding and documentation

There was an exchange on Twitter today between several bioinformaticians regarding the need for good documentation for bioinformatics tools. I was all set to write something about my own thoughts on this topic, but Robert Davey (@froggleston) has already written an excellent post on the subject (and has probably done a better job of expressing my own views than I could):

I highly recommend reading his post as he makes some great points, including the following:

We need, as a community, usable requirements and standards for saying “this is how code should go from being available to being reusable”. How do we get our lab notebook code into that form via a number of checkpoints that both programmers and reviewers agree on?


Transcriptional noise, isoform prediction, and the utility of mass spec data in gene annotation

The human genome may be 80% functional or 8.2% functional. Maybe it's 93.7% functional or only 6.1%. I guess that all we know for sure is that it is not 0% functional (although my genome on a Monday morning may provide evidence to the contrary).

Transcript data can be used to ascribe some sort of functionality to a genome and, in an ideal world, we would sequence full-length cDNAs for every gene. But in the less-than-ideal world we often end up sequencing lots of small bits of RNA using an ever-changing set of technologies. ESTs, SAGE, CAGE, RACE, MPSS, and RNA-Seq have all been used to provide evidence for where genes are and how highly they are being expressed.

Having some transcriptional evidence is (usually) better than not having any transcriptional evidence, but it doesn't necessarily imply functionality. A protein-coding gene that is transcribed may not be translated. Transcript data is used in gene annotation to add new genes, especially in the case of a first-pass annotation of a new genome. But in established genomes, it is probably used more to annotate transcript isoforms (e.g. splice variants). This can lead to a problem for the end users of such data…how to tell if all isoforms are equally likely?

Consider the transcript data for the rpl-22 gene in Caenorhabditis elegans. This gene has two annotated splice variants and there is indeed EST evidence for both variants, but it is a little bit unbalanced:

This gene encodes a large ribosomal subunit protein…a pretty essential protein! Notice how the secondary isoform (shown on top) a) encodes a much shorter protein and b) has very little transcript evidence. In my mind, this secondary isoform is the result of 'transcriptional noise'. Maybe a couple of lucky ESTs captured the transcript in the process of heading towards destruction via nonsense-mediated decay? It seems highly unlikely that this secondary transcript gives rise to a functional protein, though someone who is new to viewing data like this might initially consider each isoform to be equally valid.

If we turn on some additional data tracks to look at protein homology to human (shown in orange) and mass spectrometry data from C. elegans (shown in red), it becomes clear that all of the evidence is really pointing towards just one functional isoform:

Indeed, mass spec data has the potential to really clean up a lot of noisy gene annotations. In light of this, I was very happy to see this new paper published in the Journal of Proteome Research (where talented up-and-coming scientists publish!):

Pooling data from eight mass spec analyses of human samples, the authors attempted to see how much protein support there was for the different annotated isoforms in the human genome. They could reliably map peptides to about two-thirds of the protein-coding genes in the GENCODE 20 gene set (Ensembl 76). What did they find?

We identified alternative splice isoforms for just 246 human genes; this clearly suggests that the vast majority of genes express a single main protein isoform.

They also found that the mass spec data was not always in agreement with the dominant isoforms that can be predicted from RNA-Seq data:

…part of the reason for this will be that more RNAseq reads map to longer sequences, it does suggest that either transcript expression is very different from protein expression for many genes or that transcript reconstruction methods may not be interpreting the RNAseq reads correctly.

The headline conclusion that mass spec evidence only supports alternate isoforms for 1.2% of human genes is thought-provoking. It suggests to me that we should be careful about relying too heavily on gene annotations that describe large numbers of isoforms mostly on the basis of transcript data. Paraphrasing George Orwell:

All isoforms are equal, but some isoforms are more equal than others

The top 10 #PLOSBuzzFeed tweets that will put a smile on your face

It all started so innocently. Nick Loman (@pathogenomenick) expressed his dissatisfaction with yet another PLOS Computational Biology article that uses the 10 Simple Rules… template:


There were two immediate responses from Kai Blin (@kaiblin) and Phil Spear (@Duke_of_neural):


I immediately saw the possibility that this could become a meme-worthy hashtag, so I simply retweeted Phil’s tweet, added the hashtag #PLOSBuzzFeed, and waited to see what would happen (as well as making some of my own contributions).

At the time of writing — about 10 hours later — there have been several hundred tweets using this hashtag. Presented in reverse order, here are the most ‘popular’ tweets from today (as judged by summing retweets and favorites):

How to cite bioinformatics resources

I saw a post on Biostars today that asked how specific versions of genome assemblies should be cited. This question also applies to the more general issue of citing any bioinformatics resource which may have multiple releases/versions which are not all formally published in papers. Here is how I replied:

Citing versions of any particular bioinformatics/genomics resources can get tricky because there is often no formal publication for every release of a given dataset. Further complicating the situation is the fact that you will often come across different dates (and even names) for the same resource. E.g. the latest cow genome assembly generated by the University of Maryland is known as 'UMD 3.1.1'. However, the UCSC genome browser uses their own internal IDs for all cow genome assemblies and refers to this as 'bosTau8'. Someone new to the field might see the UCSC version and not know about the original UMD name.

Sometimes you can use the dates of files on FTP sites to approximately date sequence files, but these dates can change (files occasionally get removed and replaced from backups, which alters their dates).

The key thing to aim for is to provide suitable information so that someone can reproduce your work. In my mind, this requires 2–3 pieces of information:

  1. The name or release number of the dataset you are downloading (provide alternate names when known)
  2. The specific URL for the website or FTP site that you used to download the data
  3. The date on which you downloaded the data

E.g. The UMD 3.1.1 version of the cow genome assembly (also known as bosTau8) was downloaded from the UCSC Genome FTP site (ftp://hgdownload.cse.ucsc.edu/bosTau8/bigZips/bosTau8.fa.gz).

When no version number is available — it is very unhelpful not to provide version numbers of sequence resources: they can, and will, change — I always refer to the date on which I downloaded the data instead.
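
To make this concrete, here is a minimal sketch, in Python, of one way to capture all three pieces of information at the moment you download a dataset (the output file name and the 'downloads.log' file are illustrative choices, not any kind of standard):

import datetime
import urllib.request

# Details taken from the example citation above
name = "UMD 3.1.1 cow genome assembly (also known as bosTau8)"
url = "ftp://hgdownload.cse.ucsc.edu/bosTau8/bigZips/bosTau8.fa.gz"

# Download the file and note today's date
urllib.request.urlretrieve(url, "bosTau8.fa.gz")
downloaded = datetime.date.today().isoformat()

# Append a one-line provenance record that can be cited later
with open("downloads.log", "a") as log:
    log.write(f"{name}\t{url}\t{downloaded}\n")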

Get shorty: the decreasing usefulness of referring to 'short read' sequences

I came across a new paper today:

When I see such papers, I always want to know 'what do you consider short?'. This particular paper makes no attempt to define what 'short' refers to, with the only mention of the 'S' word in the paper being as follows:

Finally, qAlign stores metadata for all generated BAM files, including information about alignment parameters and checksums for genome and short read sequences

There are hundreds of papers that mention 'short read' in their title and many more which refer to 'long read' sequences.

But 'short' and 'long' are tremendously unhelpful terms to refer to sequences. They mean different things to different people, and they can even mean different things to the same person at different times. I think that most people would agree that Illumina's HiSeq and MiSeq platforms are considered 'short read' technologies. The HiSeq 2500 is currently capable of generating 250 bp reads (in Rapid Run Mode), yet this is an order of magnitude greater than when Illumina/Solexa started out generating ~25 bp reads. So should we refer to these as long-short reads?

The lengths of reads generated by the first wave of new sequencing technologies (Solexa/Illumina, ABI SOLiD, and Ion Torrent) were initially compared to the 'long' (~800 bp) reads generated by Sanger sequencing methods. But these technologies have evolved steadily. The latest reagent kits for the MiSeq platform offer the possibility of 300 bp reads, and if you perform paired-end sequencing of libraries with insert sizes of ~600 bp, you can merge overlapping read pairs into single consensus reads that approach 600 bp. Thus we are already at the point where a 'short read' sequencing technology can generate some reads that are longer than some of the reads produced by the former gold-standard 'long read' technology.

But the read lengths of any of these technologies pale in comparison with the output of instruments from Pacific Biosciences and Oxford Nanopore. By their standards, even Sanger sequencing reads could be considered 'short'.

If someone currently has reads that are 500–600 bp in length, it is not clear whether any software tool that claims to work with 'short reads' is suitable for them. Just as the 'Short Read Archive' (SRA) became the more meaningfully named Sequence Read Archive, so we as a community should banish these unhelpful names. If you develop tools that are optimized to work with 'short' or 'long' read data, please provide explicit guidelines as to what you mean!
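
Better still, report the actual lengths of your reads and let people decide for themselves. Here is a minimal sketch in Python that summarizes the read lengths in a FASTQ file (the file name is just an illustration):

import gzip
import statistics

# Any gzipped FASTQ file; the name here is only an example
fastq = "reads.fastq.gz"

# In FASTQ, the sequence is the second line of every four-line record
lengths = []
with gzip.open(fastq, "rt") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 1:
            lengths.append(len(line.rstrip()))

print(f"{len(lengths)} reads")
print(f"min/median/max read length: {min(lengths)}/{statistics.median(lengths)}/{max(lengths)}")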

To conclude:

There are no 'short' or 'long' reads, there are only sequences that are shorter or longer than other sequences.

101 questions with a bioinformatician #23: Todd Harris

Todd Harris is a Bioinformatics Consultant and Project Manager at WormBase. I first came to know Todd when I was also working on the WormBase project. As part of the UK operation (based at the Sanger Institute), we would frequently refer to him as 'SuperTodd' for his amazing skills at single-handedly keeping the WormBase website updated and working smoothly.


Details of GFF version 4 have emerged


One of the most widely used file formats in bioinformatics is the General Feature Format (GFF). This venerable tab-delimited format uses 9 columns of text to help describe any set of features that can be localized to a DNA or RNA sequence.

It is most commonly used to provide a set of genome annotations that accompany a genome sequence file, and the success of this format has also spawned the similar Gene Transfer Format (GTF), which focuses on gene structural information.

GFF has been an evolving format, and the widely adopted 2nd version has largely been superseded by GFF version 3, which was developed by Lincoln Stein from around 2003 onwards.

As version 3 is now over a decade old, work has been ongoing to develop a new version, GFF 4, that is suitable for the rigors of modern-day genomics. The principal change in version 4 will be the addition of a 10th GFF column. This 'Feature ID' column is defined in the spec as follows:

Column 10: Feature ID

Format: FeatureID=<integer>

Every feature in a GFF file should be referenced by a numerical identifier which is unique to that particular feature across all GFF files in existence.

This field will store an integer in the range 1–999,999,999,999,999 (no zero-padding), and identifiers will be generated via tools available from the GFF 4 consortium. If you wish to generate a GFF 4 file, you will need to obtain officially sanctioned Feature IDs for this mandatory field.

The advantage of this new field is that all bioinformatics tools and databases will have a convenient way to uniquely reference any feature in any GFF file (as long as it is version 4 compliant).

Large institutions may wish to work with the GFF 4 consortium to reserve blocks of consecutive numeric ranges for Feature IDs.

It is intended that the GFF 4 consortium will act as a gatekeeper to all Feature IDs and that, via their APIs, you will be able to check whether any given Feature ID exists; if it does, you will be able to extract the relevant details of that feature from whatever GFF file in the world contains that specific Feature ID.

Here is an example of how GFF version 4 would describe an intron from a gene:

## gff-version 4
## sub-version 1.02
## generated: 2015-02-01
## sequence-region   chrX 1 2097228
chrX    Coding_transcript   intron  14192   14266   .   -   .   gene=Gene00071  FeatureID=125731789

In this example, the intron is the 125,731,789th feature to be registered globally with the GFF 4 consortium. The big advantage of this format is that a researcher can now guarantee that this particular Feature ID will not exist in any other GFF file anywhere in the world. The use of unique identifiers like this will be a huge leap forward for bioinformatics, as we will no longer have to worry about lines in our GFF files possibly existing in someone else's GFF files as well.
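
For what it's worth, here is a minimal sketch of how the proposed tenth column might be parsed from a line like the one above (assuming tab-delimited fields, as in earlier versions of GFF):

# The example GFF 4 feature line from above (tab-delimited)
line = ("chrX\tCoding_transcript\tintron\t14192\t14266\t.\t-\t.\t"
        "gene=Gene00071\tFeatureID=125731789")

fields = line.split("\t")
seqid, source, feature_type = fields[0], fields[1], fields[2]
start, end = int(fields[3]), int(fields[4])

# Column 10 holds the globally unique Feature ID
feature_id = int(fields[9].split("=")[1])

print(seqid, feature_type, start, end, feature_id)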

Update: check the date

Bogus bioinformatics acronyms…there's a lot of them about

Time for some new JABBA awards to recognize the ongoing series of crimes perpetrated in the name of bioinformatics. Two new examples this week…


Exhibit A (h/t @attilacsordas): from arxiv.org we have…

CoMEt derives from 'Combinations of Mutually Exclusive Alterations'. Of course, the best way of making it easy for people to find your bioinformatics tool is to give it an identical name to an existing tool that does something completely different. So don't be surprised if you search the web for 'CoMEt' only to find a bioinformatics tool called 'CoMet' from 2011 (note the lower-case 'e'!). CoMet is a web server for comparative functional profiling of metagenomes.


Exhibit B: from the journal Bioinformatics — the leading provider of bogus bioinformatics acronyms since 1998 — we have…

MUSCLE is derived from 'Multi-platform Unbiased optimization of Spectrometry via Closed-Loop Experimentation'. Multi-platform you say? What platforms would those be? From the paper:

MUSCLE is a stand-alone desktop application and has been tested on Windows XP, 7 and 8

What, no love for Windows Vista?

Of course, it should be obvious to anyone that this bioinformatics tool called MUSCLE should in no way be confused with the other (pre-existing) bioinformatics tool called MUSCLE.