Registration is now open for the UK 2015 Genome Science meeting

Registration for the 2015 Genome Science meeting is now open. This is the meeting formerly known as Genome Science: Biology, Applications and Technology, which in turn was formerly known as The UK Next Generation Sequencing Meeting. I expect that next year it will just be known as Genome.

It's a fun meeting which all the cool kids go to, and it's in Brum so you will at least be able to get a decent curry.

The scientific sessions will be as follows:

  • 20 years of bacterial genomics: Dr Nick Loman, University of Birmingham
  • Environmental genomics: Dr Holly Bik, University of Birmingham
  • Functional genomics: Associate Professor Aziz Aboobaker, University of Oxford
  • New technologies: Dr Mike Quail, Wellcome Trust Sanger Institute
  • Plant and animal genomics: Mr Mick Watson, Edinburgh Genomics
  • Novel computational methods: Professor Chris Ponting, University of Oxford
  • Single cell genomics: Professor Neil Hall, University of Liverpool

It is not altogether inconceivable that evening entertainment will be provided by The Nick & Mick Show, where Messrs. Loman and Watson might showcase their latest venture — Nanopore: the musical.

When will ‘open science’ become simply ‘science’?

A great commentary piece by Mick Watson (@BioMickWatson) in Genome Biology where he discusses the six O's (Open data, Open access, Open methodology, Open source, Open peer review, and Open education). On the first of these:

…it is no longer acceptable for scientists to hold on to data until they have extracted every last possible publication from it. The data do not belong to the scientist, they belong to the funder (quite often the taxpayer). Datasets should be freely available to those who funded them. Scientists who hoard data, far from pushing back the boundaries of human knowledge, instead act as barriers to discovery.

Amen.

101 questions with a bioinformatician #26: Kerstin Howe

Kerstin Howe is a Senior Scientific Manager, leading the Genome Reference Informatics group at the Wellcome Trust Sanger Institute. As part of the Genome Reference Consortium (GRC), Kerstin’s group is helping ensure that “the human, mouse and zebrafish reference assemblies are biologically relevant by closing gaps, fixing errors and representing complex variation”.

This important work entails the generation of ‘long range’ information (sequencing and optical mapping) for a variety of genomes and using that information to provide genome analyses, visualise assembly evaluations, and curate assemblies. You may also wish to check out 101 questions interviewee #3 (Deanna Church), another key player in the GRC.

Kerstin is not my first ‘101 Questions’ interviewee that I know from my time working on the WormBase project. Unlike interviewee #23 though, I did have the pleasure of sharing an office with Kerstin — or WBPerson3103 as she will forever be known in WormBase circles — during my time at the Sanger Institute. It was after leaving WormBase that she became a big fish (of a little fish) in the vertebrate genomics community. And now, on to the 101 questions…

Read More

How do we assess the value of biological data repositories?

For the most part, model organism databases such as WormBase, FlyBase, SGD etc. offer an invaluable and indispensable resource to their respective communities. These sites exist because funding agencies recognize their essential nature, and they are typically funded on five-year grants.

There are many other bioinformatics resources out there, most of which have probably been useful to some people at some time or other. But how should we decide what is useful enough to merit continued funding? It's a very tricky question, and is one which Todd Harris is starting to explore on his blog:

In an era of constrained funding and shifting focus, how do we effectively measure the value of online biological data repositories? Which should receive sustained funding? Which should be folded into other resources? What efficiencies can be gained over how these repositories are currently operated?

Worth a read. I've written before about the issue of whether there are too many biological databases, especially when multiple databases emerge with heavily overlapping areas of interest. I think funding agencies and journals should think carefully before supporting/publishing new resources without first really establishing:

  1. Is this really needed?
  2. Does it overlap with other existing resources?
  3. What will happen to the resource should funding and/or personnel not be available to keep it going?

Goodbye CEGMA, hello BUSCO!

Some history

CEGMA (Core Eukaryotic Genes Mapping Approach) is a bioinformatics tool that I helped develop about a decade ago. It led to a paper that outlined how you could use CEGMA to find a small number of highly conserved genes in any new genome sequence that might be devoid of annotation. Once you have even a handful of genes, you can use that subset to train a gene finder in order to annotate the rest of the genome.

It was a bit of a struggle to get the paper accepted by a journal, and so it didn't appear until 2007 (a couple of years after we had started work on the project). For CEGMA to work it needed to use a set of orthologous genes that were highly conserved across different eukaryotic species. We used the KOGs database (euKaryotic Orthologous Groups) that was first described in 2003. So by the time of our publication we were already using a dataset that was a few years old.

The original CEGMA paper did not attract much attention but we subsequently realized that the same software could be used to broadly assess how complete the 'gene space' was of any published genome assembly. To do this, we defined a subset of core genes that were the most highly conserved and which tended to be single-copy genes. The resulting 2009 paper seemed to generate a lot of interest in CEGMA and citations to the original paper have increased every year since (139 citations in 2014).

This is good news except:

  1. CEGMA can be a real pain to install due to its dependency on many other tools (though we've made things easier)
  2. CEGMA has been very hard to continue developing. The original developer left our group about 7 years ago and he was the principal software architect. I have struggled to keep CEGMA working and updated.
  3. CEGMA continues to generate a lot of support email requests (that end up being dealt with by me).

We have no time or resources to devote to CEGMA but the emails keep on coming. It's easy to envisage many ways in which CEGMA could be improved and extended; we submitted a grant proposal to do this but it was unsuccessful. One planned aspect of 'CEGMA v3' was to replace the reliance on the aging KOGs database. Another aspect of the new version of CEGMA would be to develop clade-specific sets of core genes. For those interested in plant genomes, for example, we should be able to develop a much larger set of plant-specific core genes.

And so?

Today I draw a line in the sand and say…

CEGMA is dead

CEGMA had a good life and we shall look back with fond memories, but its time has passed. Please don't grieve (and don't send flowers), but be thankful that people will no longer have to deal with the software-dependency-headache of trying to get Genewise working on Ubuntu Linux.

But what now?

The future of CEGMA has arrived and it's called BUSCO.

  • BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

This new tool (Benchmarking Universal Single-Copy Orthologs) has been developed by Felipe A. Simao, Robert Waterhouse, Panagiotis Ioannidis, Evgenia Kriventseva, and Evgeny Zdobnov. You can visit the BUSCO website, have a read of the manual, or look at the supplementary online material (this is set out like a paper…I don't think the tool has been formally published yet though). Here is the first section from that supplementary material:

Benchmarking Universal Single-Copy Orthologs (BUSCO) sets are collections of orthologous groups with near-universally-distributed single-copy genes in each species, selected from OrthoDB root-level orthology delineations across arthropods, vertebrates, metazoans, fungi, and eukaryotes (Kriventseva, et al., 2014; Waterhouse, et al., 2013). BUSCO groups were selected from each major radiation of the species phylogeny requiring genes to be present as single-copy orthologs in at least 90% of the species; in others they may be lost or duplicated, and to ensure broad phyletic distribution they cannot all be missing from one sub-clade. The species that define each major radiation were selected to include the majority of OrthoDB species, excluding only those with unusually high numbers of missing or duplicated orthologs, while retaining representation from all major sub-clades. Their widespread presence means that any BUSCO can therefore be expected to be found as a single-copy ortholog in any newly-sequenced genome from the appropriate phylogenetic clade (Waterhouse, et al., 2011). A total of 38 arthropods (3,078 BUSCO groups), 41 vertebrates (4,425 BUSCO groups), 93 metazoans (1,008 BUSCO groups), 125 fungi (1,438 BUSCO groups), and 99 eukaryotes (431 BUSCO groups), were selected from OrthoDB to make up the initial BUSCO sets which were then filtered based on uniqueness and conservation as described below to produce the final BUSCO sets for each clade, representing 2,675 genes for arthropods, 3,023 for vertebrates, 843 for metazoans, 1,438 for fungi, and 429 for eukaryotes. For bacteria, 40 universal marker genes were selected from (Mende, et al., 2013).

BUSCO seems to do everything that we wanted to include in CEGMA v3 and it is based on OrthoDB, a resource that has generated a new set of orthologs (developed by the same authors). The online material includes a comparison of BUSCO to CEGMA, and also outlines how BUSCO can be much quicker than CEGMA (depending on what set of orthologs you use).

DISCLAIMER: I have not installed, tested, or analyzed BUSCO in any way. I make no promises as to its performance, but they seem to have gone about things in the right way.

This is what happens when you name your bioinformatics tool 'JEPEG'

This is always going to be a problem if you search for 'JEPEG' alone, but if you include 'bioinformatics' in your search, then the top hit correctly points to the name of this recently described bioinformatics tool:

The name 'JEPEG' is derived from Joint Effect on Phenotype of Eqtl/functional snps associated with a Gene, a name that is worthy of a JABBA award. Also, it's not clear to me how the name should be pronounced: 'jee-peg', 'jep-egg', 'jep-peg'?

Kablammo: an interactive, web-based BLAST results visualizer

I like it when someone publishes a new tool that:

  1. Has a clever, fun name which isn't just another bogus bioinformatics acronym.
  2. Has a great looking website with a fun logo (bonus points for being in charge of the domain name so that links won't break should the Wasmuth lab move somewhere else).
  3. Has the code available on GitHub.

Congratulations to Jeff Wintersinger and James Wasmuth!

How can you choose a single isoform to best represent a gene?

I asked this question a few weeks ago. There are some situations in bioinformatics when you want to look at all genes from one or more organisms, but where you only want one representative isoform per gene. This is not always a straightforward task, and I previously discussed how many people opt for overly simplistic methods such as 'choosing the longest isoform'.

Since my blog post I had some feedback and have also come across a relevant paper which I would like to share here.

Feedback

Sara G said:

We use epigenetic data from ENCODE/modENCODE/Epigenetic roadmap or our own data to identify the transcription start site that has the most evidence of transcriptional activity.

From my experience with using data from fly, worm, and Arabidopsis I have often seen many genes which have multiple isoforms that all share the same transcription start site (and differ in downstream exon/intron structure).

Michael Paulini commented:

What you can easily do (if you got the data) is just pick the isoform with the highest FPKM value per gene.

The important caveat that Michael mentions is that you might not have the data, and if you are trying to do this as part of a multi-species comparison then it becomes a much more complex task.
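Michael's suggestion is simple enough to sketch in a few lines. This is just an illustration, assuming you have already parsed your expression data into (gene, transcript, FPKM) rows; the identifiers are made up:

```python
# Sketch: pick the isoform with the highest FPKM value per gene.
# Assumes rows of (gene_id, transcript_id, fpkm), e.g. parsed from a
# Cufflinks/StringTie-style abundance file. IDs here are illustrative.

def pick_by_fpkm(rows):
    """Return {gene_id: transcript_id}, keeping the top-FPKM isoform."""
    best = {}
    for gene_id, transcript_id, fpkm in rows:
        if gene_id not in best or fpkm > best[gene_id][1]:
            best[gene_id] = (transcript_id, fpkm)
    return {gene: tx for gene, (tx, _) in best.items()}

rows = [
    ("geneA", "geneA.1", 12.5),
    ("geneA", "geneA.2", 87.3),
    ("geneB", "geneB.1", 4.0),
]
print(pick_by_fpkm(rows))  # geneA -> geneA.2, geneB -> geneB.1
```

Of course, this inherits the caveats above: it needs expression data in the first place, and a genome-wide FPKM value hides any tissue- or stage-specific variation in isoform usage.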

Richard Edwards added to the debate (I've edited his response a little bit):

My previous solution to this problem was this:

"…a dataset consisting of one protein sequence per gene was constructed from the EnsEMBL human genome…by mapping all EnsEMBL human peptides onto their genes and assessing them in the context of the external database entry used as evidence for that gene. If the external database was SwissProt and one of the peptides has the exact same sequence as the SwissProt sequence, this peptide sequence was used for that gene. In all other cases, the longest peptide (in terms of non-X amino acids) was used."

Richard also conceded that this approach "only works for well annotated proteomes".
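The fallback step of Richard's approach (choosing the longest peptide in terms of non-X amino acids) can be sketched as follows; the SwissProt cross-check step is omitted and the sequences are invented for illustration:

```python
# Sketch of the fallback in Richard's approach: among a gene's peptides,
# choose the one with the most non-X residues, so that runs of unknown
# amino acids ('X') don't inflate a peptide's apparent length.

def longest_non_x(peptides):
    """Return the peptide with the most non-X residues (ties keep the first)."""
    return max(peptides, key=lambda p: len(p) - p.upper().count("X"))

peptides = ["MKTAYXXXXXXXXXX", "MKTAYIAKQR", "MKV"]
print(longest_non_x(peptides))  # "MKTAYIAKQR": 10 real residues beats 5
```

Counting only non-X residues is the key detail: the first sequence above is nominally the longest, but most of it is unknown sequence.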

Another approach

For Arabidopsis thaliana, I have previously used data from their exon confidence ranking system, which ends up producing a 5-star rating for every transcript of a gene. This is based on various strands of evidence and can be useful for choosing between isoforms which have different star ratings (but not so helpful when all isoforms have the same rating). The TAIR database remains the only model organism database (that I know of) that has attempted to address this problem and provide a systematic way of ranking all genes. Anecdotally, I would say that not many people know about this system, and you need to find the relevant files on their FTP site to make use of this information.

Hopefully someone will correct me if other databases have implemented similar schemes!

From the literature

If your research focuses on vertebrates, then this approach looks promising.

Here's the abstract (emphasis mine):

Here, we present APPRIS (http://appris.bioinfo.cnio.es), a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform.

It seems that since this paper was published, they have expanded their dataset and the online APPRIS database now has annotations for five vertebrate genomes (human, mouse, zebrafish, rat, and pig).

The future

This APPRIS approach seems promising, yet I wish there was a standardized ranking system used by all model organism databases (as well as Ensembl, UCSC Genome Browser etc.).

There are always going to be different levels of supporting evidence available in different species. For example, not every organism is going to be able to make use of mass spec data to assess the differential usage of transcript variants. Furthermore, there are going to be tissue-specific — not to mention time-specific — patterns of differential isoform expression for some genes.

However, what I would love to see is each database using whatever information they have available to them to assign relative 'weights' to each annotated isoform. Initially, this would focus on protein-coding genes, and ideally the scheme would entail the following:

  1. A score for every annotated isoform of a gene where all scores sum to 100. In my favorite example of the Caenorhabditis elegans rpl-22 gene these scores would probably end up as 99 for the dominant isoform and 1 for the secondary isoform. The scores should reflect the relative likelihood that this transcript encodes a functional protein. To keep things simple, I would prefer integer values with values of zero allowed for annotated isoforms that represent pseudogenic transcripts, targets of nonsense-mediated decay etc.
  2. Evidence for the score. This would be something akin to the Gene Ontology Evidence codes. Each isoform may have one or more codes attached to it, and these would reflect the source of the experimental evidence (transcript data, mass spec, functional assay etc.)
  3. Spatial/temporal information. If some of the supporting evidence is from a particular cell-line, tissue, or developmental time point then it should be possible to annotate this.
  4. Historical information. Such scores should be expected to change over time as new evidence emerges. All transcripts should keep a time-stamped history of previous scores to allow researchers to see whether isoforms have increased or decreased in their relative probability of usage over time.
  5. Data to be added to GTF/GFF files: The score of each transcript should be embedded as a note in the attribution column of these files, maybe as a 'RF' score (Relative Functionality).
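To make point 5 concrete, here is a minimal sketch of embedding such a score as an 'RF' attribute in the ninth column of a GFF3 line. The 'RF' attribute name is my own invention (not an existing standard), and the coordinates and source fields below are purely illustrative; the rpl-22 scores follow the 99/1 example above:

```python
# Hypothetical sketch: embed a Relative Functionality (RF) score as an
# attribute in the 9th (attributes) column of a GFF3 transcript line.
# 'RF' is a proposed attribute name, not an existing GFF3 convention.

def add_rf(gff_line, rf_score):
    """Append an RF=<score> attribute to a tab-delimited GFF3 line."""
    fields = gff_line.rstrip("\n").split("\t")
    fields[8] = fields[8].rstrip(";") + f";RF={rf_score}"
    return "\t".join(fields)

line = "I\tWormBase\tmRNA\t1\t1000\t.\t+\t.\tID=rpl-22.1;Parent=rpl-22"
print(add_rf(line, 99))
# ...ID=rpl-22.1;Parent=rpl-22;RF=99
```

Because GFF3 attributes are free-form key=value pairs, a scheme like this could be adopted without breaking existing parsers, which would simply ignore the extra attribute.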

Can someone please make this happen? :-)