How can you choose a single isoform to best represent a gene?

I asked this question a few weeks ago. There are some situations in bioinformatics where you want to look at all genes from one or more organisms, but want only one representative of each gene. This is not always a straightforward task, and I previously discussed how many people opt for overly simplistic methods such as 'choosing the longest isoform'.

Since my blog post I had some feedback and have also come across a relevant paper which I would like to share here.


Sara G said:

We use epigenetic data from ENCODE/modENCODE/Epigenetic roadmap or our own data to identify the transcription start site that has the most evidence of transcriptional activity.

From my experience with using data from fly, worm, and Arabidopsis I have often seen many genes which have multiple isoforms that all share the same transcription start site (and differ in downstream exon/intron structure).

Michael Paulini commented:

What you can easily do (if you got the data) is just pick the isoform with the highest FPKM value per gene.

The important caveat that Michael mentions is that you might not have the data, and if you are trying to do this as part of a multi-species comparison then this becomes a much more complex task.
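For anyone who does have expression data, Michael's suggestion is easy to sketch. The snippet below is only a minimal illustration: it assumes you have already parsed your quantification output into (gene, transcript, FPKM) tuples, and the function name and the example identifiers are made up for this post rather than taken from any particular tool.

```python
from typing import Dict, Iterable, Tuple


def pick_highest_fpkm(records: Iterable[Tuple[str, str, float]]) -> Dict[str, str]:
    """Return {gene_id: transcript_id}, keeping the transcript with the highest FPKM."""
    best: Dict[str, Tuple[str, float]] = {}
    for gene_id, transcript_id, fpkm in records:
        current = best.get(gene_id)
        # On ties, the first transcript seen for a gene is kept.
        if current is None or fpkm > current[1]:
            best[gene_id] = (transcript_id, fpkm)
    return {gene: tx for gene, (tx, _) in best.items()}


# Hypothetical identifiers and values, purely for illustration.
example = [
    ("geneA", "geneA.1", 12.5),
    ("geneA", "geneA.2", 0.4),
    ("geneB", "geneB.1", 3.0),
    ("geneB", "geneB.2", 7.2),
]
representatives = pick_highest_fpkm(example)
```

Note that this picks a winner from a single expression data set, so the 'representative' isoform may well change with tissue, developmental stage, or experiment.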

Richard Edwards added to the debate (I've edited his response a little bit):

My previous solution to this problem was this:

"…a dataset consisting of one protein sequence per gene was constructed from the EnsEMBL human genome…by mapping all EnsEMBL human peptides onto their genes and assessing them in the context of the external database entry used as evidence for that gene. If the external database was SwissProt and one of the peptides has the exact same sequence as the SwissProt sequence, this peptide sequence was used for that gene. In all other cases, the longest peptide (in terms of non-X amino acids) was used."

Richard also conceded that this approach "only works for well annotated proteomes".
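The selection rule that Richard describes can be sketched in a few lines. This is my own rough reading of the quoted text, not his actual code: the function names are hypothetical, and I am assuming that the peptides for a gene are held in a dictionary and that the SwissProt sequence (if there is one) has already been looked up.

```python
from typing import Dict, Optional


def non_x_length(seq: str) -> int:
    """Count residues that are not the unknown amino acid 'X'."""
    return sum(1 for aa in seq.upper() if aa != "X")


def pick_representative(peptides: Dict[str, str],
                        swissprot_seq: Optional[str] = None) -> str:
    """Prefer a peptide identical to the SwissProt entry, else the longest by non-X length."""
    if not peptides:
        raise ValueError("gene has no peptides to choose from")
    if swissprot_seq is not None:
        for peptide_id, seq in peptides.items():
            if seq == swissprot_seq:
                return peptide_id
    # Fallback: longest peptide, ignoring X residues (ties keep the first one seen).
    return max(peptides, key=lambda pid: non_x_length(peptides[pid]))


# Hypothetical peptides for one gene.
peptides = {"pepA": "MKTXXXXXXXLV", "pepB": "MKTAYLV", "pepC": "MKT"}
```

With these made-up sequences, the fallback ignores the run of X residues in pepA and so prefers pepB, whereas supplying "MKT" as the SwissProt sequence selects pepC even though it is the shortest.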

Another approach

For Arabidopsis thaliana, I have previously used data from their exon confidence ranking system which ends up producing a five-star rating for every transcript of a gene. This is based on various strands of evidence and can be useful for choosing between isoforms which have different star ratings (but not so helpful when all isoforms have the same rating). The TAIR database remains the only model organism database (that I know of) that has attempted to address this problem and provide a systematic way of ranking all genes. Anecdotally, I would say that not many people know about this system and you need to find the relevant files on their FTP site to make use of this information.

Hopefully someone will correct me if other databases have implemented similar schemes!

From the literature

If your research focuses on vertebrates, then this approach looks promising.

Here's the abstract (emphasis mine):

Here, we present APPRIS, a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform.

It seems that since this paper was published, they have expanded their dataset and the online APPRIS database now has annotations for five vertebrate genomes (human, mouse, zebrafish, rat, and pig).

The future

This APPRIS approach seems promising, yet I wish there was a standardized ranking system used by all model organism databases (as well as Ensembl, UCSC Genome Browser etc.).

There are always going to be different levels of supporting evidence available in different species. E.g. not every organism is going to be able to make use of mass spec data to assess the differential usage of transcript variants. Furthermore, there are going to be tissue-specific — not to mention time-specific — patterns of differential isoform expression for some genes.

However, what I would love to see is each database using whatever information they have available to them to assign relative 'weights' to each annotated isoform. Initially, this would focus on protein-coding genes and ideally, this scheme would entail the following:

  1. A score for every annotated isoform of a gene where all scores sum to 100. In my favorite example of the Caenorhabditis elegans rpl-22 gene these scores would probably end up as 99 for the dominant isoform and 1 for the secondary isoform. The scores should reflect the relative likelihood that this transcript encodes a functional protein. To keep things simple, I would prefer integer values with values of zero allowed for annotated isoforms that represent pseudogenic transcripts, targets of nonsense-mediated decay etc.
  2. Evidence for the score. This would be something akin to the Gene Ontology Evidence codes. Each gene may have one or more codes attached to them and these would reflect the source of the experimental evidence (transcript data, mass spec, functional assay etc.)
  3. Spatial/temporal information. If some of the supporting evidence is from a particular cell-line, tissue, or developmental time point then it should be possible to annotate this.
  4. Historical information. Such scores should be expected to change over time as new evidence emerges. All transcripts should keep a time-stamped history of previous scores to allow researchers to see whether isoforms have increased or decreased in their relative probability of usage over time.
  5. Data to be added to GTF/GFF files: The score of each transcript should be embedded as a note in the attributes column of these files, maybe as an 'RF' score (Relative Functionality).
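To make points 1 and 5 a bit more concrete, here is a rough sketch of how integer scores that sum to 100 might be produced and written out. Everything in it is an assumption on my part: the evidence 'weights' are invented numbers, the function names are hypothetical, the rounding uses a simple largest-remainder rule, and 'RF' is just the tag name proposed above rather than anything defined by an existing specification.

```python
from typing import Dict


def relative_functionality(weights: Dict[str, float]) -> Dict[str, int]:
    """Scale non-negative evidence weights to integer scores that sum to 100."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one isoform needs a positive weight")
    exact = {tx: 100 * w / total for tx, w in weights.items()}
    scores = {tx: int(value) for tx, value in exact.items()}  # round down
    # Hand the leftover points to the isoforms with the largest remainders.
    leftover = 100 - sum(scores.values())
    by_remainder = sorted(exact, key=lambda tx: exact[tx] - scores[tx], reverse=True)
    for tx in by_remainder[:leftover]:
        scores[tx] += 1
    return scores


def add_rf_attribute(attributes: str, score: int) -> str:
    """Append the proposed RF tag to a GFF3-style attributes string (column 9)."""
    return f"{attributes.rstrip(';')};RF={score}"


# Invented weights for a hypothetical two-isoform gene.
scores = relative_functionality({"iso_a": 198.0, "iso_b": 2.0})
line_attributes = add_rf_attribute("ID=transcript:iso_a;Parent=gene:g1", scores["iso_a"])
```

As far as I know, the GFF3 specification reserves attribute tags that begin with an uppercase letter, so a real implementation might have to settle on a lowercase tag or get a new tag formally adopted.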

Can someone please make this happen? :-)

How do people choose a single isoform of a gene to use for bioinformatics analyses?


Update 2015-09-29: in addition to the comments at the end of the post below, also see the follow up post that I wrote which offers some more suggestions including the APPRIS database/webserver which looks very useful.


This post is somewhat of a follow-up to something that I wrote earlier this week. In bioinformatics, we often want to analyze all genes from an organism (or from multiple organisms). In many well-annotated genome databases, there is often a choice of isoforms available for each protein-coding gene, and the number of isoforms only ever seems to increase.

For example, in the latest set of human gene annotations (Ensembl 78), there are 406 protein-coding genes that have more than 25 transcripts. At one extreme, the human GPR56 gene has 77 transcripts, 61 of which are annotated as protein-coding! The length of these 61 putative protein products ranges from just 6 amino acids (!) all the way up to 693.

In Caenorhabditis elegans, sequence identifiers for genes were historically based on appending numbers to the identifier of the BAC/YAC/Cosmid clone containing that gene. E.g. B0348.1 would represent the first predicted gene on the B0348 clone, B0348.2 the second gene…and so on. When splice variants were discovered, curators appended letters for each isoform. E.g. B0348.2a and B0348.2b represent the two alternative isoforms of this gene. In the latest WS248 release of WormBase, one gene (egl-8) has 25 isoforms (all the way up to B0348.4y). I wonder what WormBase will do when a 27th isoform is discovered?

So how does one attempt to choose a single variant for use in a bioinformatics pipeline, and is this something that we should even be attempting? Historically, people have often opted for a quick-and-easy approach in order to get around this problem. Some examples from papers indexed by Google Scholar:

"In cases of alternative splicing, we chose the longest protein to represent a gene"

"In cases of multiple transcript isoforms, we chose the isoform with the longest CDS supported by transcript and protein homology in other mammalian species"

"Because of the redundancy of protein sequences, we chose only the longest isoform for every entry"

"In cases where a gene possesses more than one reference sequence, we chose the longest"

"When multiple protein entries are found for the same EntrezGene identifier, choose the longest sequence isoform"

This methodology is obviously not without problems (as others have reported on). So I'm genuinely curious as to what people do in order to choose a 'representative' isoform (whatever that means). The problem is further complicated when the reality might be that some genes consistently use different isoforms in different tissues or at different developmental time points.

Please comment below if you think you have found a good solution to this problem!

Transcriptional noise, isoform prediction, and the utility of mass spec data in gene annotation

The human genome may be 80% functional or 8.2% functional. Maybe it's 93.7% functional or only 6.1%. I guess that all we know for sure is that it is not 0% functional (although my genome on a Monday morning may provide evidence to the contrary).

Transcript data can be used to ascribe some sort of functionality to a genome and, in an ideal world, we would sequence full-length cDNAs for every gene. But in the less-than-ideal world we often end up sequencing lots of small bits of RNA using an ever-changing set of technologies. ESTs, SAGE, CAGE, RACE, MPSS, and RNA-Seq have all been used to provide evidence for where genes are and how highly they are being expressed.

Having some transcriptional evidence is (usually) better than not having any transcriptional evidence, but it doesn't necessarily imply functionality. A protein-coding gene that is transcribed may not be translated. Transcript data is used in gene annotation to add new genes, especially in the case of a first-pass annotation of a new genome. But in established genomes, it is probably used more to annotate transcript isoforms (e.g. splice variants). This can lead to a problem for the end users of such data…how to tell if all isoforms are equally likely?

Consider the transcript data for the rpl-22 gene in Caenorhabditis elegans. This gene has two annotated splice variants and there is indeed EST evidence for both variants, but it is a little bit unbalanced:

This gene encodes a large ribosomal subunit protein…a pretty essential protein! Notice how the secondary isoform (shown on top) a) encodes a much shorter protein and b) has very little transcript evidence. In my mind, this secondary isoform is the result of 'transcriptional noise'. Maybe a couple of lucky ESTs captured the transcript in the process of heading towards destruction via nonsense-mediated decay? It seems highly unlikely that this secondary transcript gives rise to a functional protein, though someone who is new to viewing data like this might initially consider each isoform as equally valid.

If we turn on some additional data tracks to look at protein homology to human (shown in orange) and mass spectrometry data from C. elegans (shown in red) it becomes clear that all of the evidence is really pointing towards just one functional isoform:

Indeed mass spec data has the potential to really clean up a lot of noisy gene annotations. In light of this I was very happy to see this new paper published in the Journal of Proteome Research (where talented up-and-coming scientists publish!):

Pooling data from 8 mass spec analyses of human data, the authors attempted to see how much protein support there was for the different annotated isoforms of the human genome. They could reliably map peptides to about two-thirds of the protein-coding genes from the GENCODE 20 gene set (Ensembl 76). What did they find?

We identified alternative splice isoforms for just 246 human genes; this clearly suggests that the vast majority of genes express a single main protein isoform.

They also found that the mass spec data was not always in agreement with the dominant isoforms that can be predicted from RNA-Seq data:

…part of the reason for this will be that more RNAseq reads map to longer sequences, it does suggest that either transcript expression is very different from protein expression for many genes or that transcript reconstruction methods may not be interpreting the RNAseq reads correctly.

The headline conclusion that mass spec evidence only supports alternate isoforms for 1.2% of human genes is thought provoking. It suggests to me that we should be careful in relying too heavily on gene annotations which describe large numbers of isoforms mostly on the basis of transcript data. Paraphrasing George Orwell:

All isoforms are equal, but some isoforms are more equal than others