L50 vs N50: that's another fine mess that bioinformatics got us into

N50 is a statistic that is widely used to describe genome assemblies. It describes an average length for a set of sequences, but the average is not the mean or median length. Rather, it is the length of the sequence that takes the sum length of all sequences — when summing from longest to shortest — past 50% of the total size of the assembly. The reasons for using N50, rather than the mean or median length, are something that I've written about before in detail.

The number of sequences evaluated at the point when the sum length exceeds 50% of the assembly size is sometimes referred to as the L50 number. Admittedly, this is somewhat confusing: N50 describes a sequence length whereas L50 describes a number of sequences. This oddity has led many people to invert the usage of these terms, which helps no one and just leads to confusion and debate.
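To make the distinction concrete, here is a minimal Python sketch of how both statistics can be computed from a list of sequence lengths (the function name and the example lengths are mine, purely for illustration):

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths.

    N50 is the length of the sequence that takes the running total,
    when summing from longest to shortest, past 50% of the assembly size.
    L50 is the number of sequences needed to reach that point.
    """
    sorted_lengths = sorted(lengths, reverse=True)
    half_of_total = sum(sorted_lengths) / 2
    running_total = 0
    for count, length in enumerate(sorted_lengths, start=1):
        running_total += length
        if running_total >= half_of_total:
            return length, count  # N50 is a length, L50 is a count

# Example: six sequences summing to 290 bp; 80 + 70 = 150 >= 145,
# so N50 = 70 and L50 = 2
print(n50_l50([80, 70, 50, 40, 30, 20]))  # prints (70, 2)
```

The same loop gives you N60, N70, N80 and so on if you swap the 50% threshold for the corresponding fraction of the total assembly size.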

I believe that the aforementioned definition of N50 was first used in the 2001 publication of the human genome sequence:

We used a statistic called the ‘N50 length’, defined as the largest length L such that 50% of all nucleotides are contained in contigs of size at least L.

I've since had some independent confirmation of this from Deanna Church (@deannachurch):

I also have a vague memory that other genome sequences made available by the Sanger Institute around this time also included statistics such as N60, N70, N80 etc. (at least I recall seeing these details in README files on an FTP site). Deanna also pointed out that the Celera human genome paper (published in Science, also in 2001) describes something that we might call N25 and N90, even though they didn't use these terms in the paper:

More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger

I don't know when L50 first started being used to describe lengths, but I would bet it was after 2001. If I'm wrong, please comment below and maybe we can settle this once and for all. Without evidence for an earlier use of L50 to describe lengths, I think people should stick to the 2001 definition of N50 (which I would also argue is the most common definition in use today).

Updated 2015-06-26 - Article includes new evidence from Deanna Church.

And the award for needless use of subscript in the name of a bioinformatics tool goes to…

The following paper can be found in the latest issue of Bioinformatics:

MoRFs are molecular recognition features, and the tool that the authors developed to identify them is called:

MoRFCHiBi

So the tool's name includes a subscripted version of 'CHiBi', a name which is taken from the shorthand name for the Center for High-Throughput Biology at the University of British Columbia (this is where the software was presumably developed). The website for MoRFCHiBi goes one step further by describing something called the MoRFChiBi,mc predictor. I'm glad that they felt that some italicized text was just the thing to complement the subscripted, mixed case name.

The subscript seems to serve no useful purpose and just makes the software name harder to read, particularly because it combines a lot of mixed capitalization. It also doesn't help that 'CHiBi' can be read as 'kai-bye' or 'chee-bee'. I'm curious: will CHiBi be adding their name as a subscripted suffix to all of their software, or just this one?

A great slide deck about how to put together the new format NIH Biosketch

This is a little bit off-topic, but I found it useful…

Earlier this year, Janet Gross and Gary Miller from the Rollins School of Public Health at Emory University put together a very useful guide on how to prepare the new format NIH Biosketch:

There is also a nice page of grant writing tools available. I liked how they highlight the important point that cramming as many words as possible into the five pages should not be the goal. Aesthetics and layout matter!

101 questions with a bioinformatician #27: Michael Barton

Michael Barton is a Bioinformatics Systems Analyst at the Joint Genome Institute (that makes him a JGI BSA?). His work involves developing automated methods for the quality checking of sequencing data and evaluating new bioinformatics software. He may introduce himself as Michael, but as his twitter handle suggests, he is really Mr. Bioinformatics.

His nucleotid.es website is doing amazing things in the field of genome assembly by using Docker containers to try to parcel up genome assembly pipelines. This is enabling the ‘continuous, objective and reproducible evaluation of genome assemblers using docker containers’. Related to this is the bioboxes project — a great name by the way — which may just succeed in revolutionizing how bioinformatics is done. From the bioboxes manifesto:

Software has proliferated in bioinformatics and so have the problems associated with it: missing or unobtainable code, difficult to install dependencies, unreproducible workflows, all with terrible user experiences. We believe a community standard, using software containers, has the opportunity to solve these problems and increase the standard of scientific software as a whole.

You can find out more about Michael by visiting his Bioinformatics Zen blog or by following him on twitter (@bioinformatics). And now, on to the 101 questions…

Read More

Registration is now open for the UK 2015 Genome Science meeting

Registration for the 2015 Genome Science meeting is now open. This is the meeting formerly known as Genome Science: Biology, Applications and Technology, which in turn was formerly known as The UK Next Generation Sequencing Meeting. I expect that next year it will just be known as Genome.

It's a fun meeting which all the cool kids go to, and it's in Brum so you will at least be able to get a decent curry.

The scientific sessions will be as follows:

  • 20 years of bacterial genomics: Dr Nick Loman, University of Birmingham
  • Environmental genomics: Dr Holly Bik, University of Birmingham
  • Functional genomics: Associate Professor Aziz Aboobaker, University of Oxford
  • New technologies: Dr Mike Quail, Wellcome Trust Sanger Institute
  • Plant and animal genomics: Mr Mick Watson, Edinburgh Genomics
  • Novel computational methods: Professor Chris Ponting, University of Oxford
  • Single cell genomics: Professor Neil Hall, University of Liverpool

It is not altogether inconceivable that evening entertainment will be provided by The Nick & Mick Show, where Messrs. Loman and Watson might showcase their latest venture — Nanopore: the musical.

When will ‘open science’ become simply ‘science’?

A great commentary piece by Mick Watson (@BioMickWatson) in Genome Biology where he discusses the six O's (Open data, Open access, Open methodology, Open source, Open peer review, and Open education). On the first of these, Open data:

…it is no longer acceptable for scientists to hold on to data until they have extracted every last possible publication from it. The data do not belong to the scientist, they belong to the funder (quite often the taxpayer). Datasets should be freely available to those who funded them. Scientists who hoard data, far from pushing back the boundaries of human knowledge, instead act as barriers to discovery.

Amen.

101 questions with a bioinformatician #26: Kerstin Howe

Kerstin Howe is a Senior Scientific Manager, leading the Genome Reference Informatics group at the Wellcome Trust Sanger Institute. As part of the Genome Reference Consortium (GRC), Kerstin’s group is helping ensure that “the human, mouse and zebrafish reference assemblies are biologically relevant by closing gaps, fixing errors and representing complex variation”.

This important work entails the generation of ‘long range’ information (sequencing and optical mapping) for a variety of genomes and using that information to provide genome analyses, visualise assembly evaluations, and curate assemblies. You may also wish to check out 101 questions interviewee #3 (Deanna Church), another key player in the GRC.

Kerstin is not my first ‘101 Questions’ interviewee that I know from my time working on the WormBase project. Unlike interviewee #23 though, I did have the pleasure of sharing an office with Kerstin — or WBPerson3103 as she will forever be known in WormBase circles — during my time at the Sanger Institute. It was after leaving WormBase that she became a big fish (of a little fish) in the vertebrate genomics community. And now, on to the 101 questions…

Read More

How do we assess the value of biological data repositories?

For the most part, model organism databases such as WormBase, FlyBase, SGD etc. offer an invaluable and indispensable resource to their respective communities. These sites exist because funding agencies recognize their essential nature, and they are typically funded on five-year grants.

There are many other bioinformatics resources out there, most of which have probably been useful to some people at some time or other. But how should we decide what is useful enough to merit continued funding? It's a very tricky question, and is one which Todd Harris is starting to explore on his blog:

In an era of constrained funding and shifting focus, how do we effectively measure the value of online biological data repositories? Which should receive sustained funding? Which should be folded into other resources? What efficiencies can be gained over how these repositories are currently operated?

Worth a read. I've written before about the issue of whether there are too many biological databases, especially when multiple databases emerge with heavily overlapping areas of interest. I think funding agencies and journals should think carefully before supporting/publishing new resources without first establishing:

  1. Is this really needed?
  2. Does it overlap with other existing resources?
  3. What will happen to the resource should funding and/or personnel not be available to keep it going?

Goodbye CEGMA, hello BUSCO!

Some history

CEGMA (Core Eukaryotic Genes Mapping Approach) is a bioinformatics tool that I helped develop about a decade ago. It led to a paper that outlined how you could use CEGMA to find a small number of highly conserved genes in any new genome sequence that might be devoid of annotation. Once you have even a handful of genes, you can use that subset to train a gene finder in order to annotate the rest of the genome.

It was a bit of a struggle to get the paper accepted by a journal, and so it didn't appear until 2007 (a couple of years after we had started work on the project). For CEGMA to work it needed to use a set of orthologous genes that were highly conserved across different eukaryotic species. We used the KOGs database (euKaryotic Orthologous Groups) that was first described in 2003. So by the time of our publication we were already using a dataset that was a few years old.

The original CEGMA paper did not attract much attention, but we subsequently realized that the same software could be used to broadly assess how complete the 'gene space' of any published genome assembly was. To do this, we defined a subset of core genes that were the most highly conserved and which tended to be single-copy genes. The resulting 2009 paper seemed to generate a lot of interest in CEGMA, and citations to the original paper have increased every year since (139 citations in 2014).

This is good news except:

  1. CEGMA can be a real pain to install due to its dependency on many other tools (though we've made things easier).
  2. CEGMA has been very hard to continue developing. The original developer left our group about 7 years ago and he was the principal software architect. I have struggled to keep CEGMA working and updated.
  3. CEGMA continues to generate a lot of support email requests (that end up being dealt with by me).

We have no time or resources to devote to CEGMA, but the emails keep on coming. It's easy to envisage many ways in which CEGMA could be improved and extended; we submitted a grant proposal to do this but it was unsuccessful. One planned aspect of 'CEGMA v3' was to replace the reliance on the aging KOGs database. Another was to develop clade-specific sets of core genes: if you are interested in plant genomes, for example, we should be able to develop a much larger set of plant-specific core genes.

And so?

Today I draw a line in the sand and say…

CEGMA is dead

CEGMA had a good life and we shall look back with fond memories, but its time has passed. Please don't grieve (and don't send flowers), but be thankful that people will no longer have to deal with the software-dependency headache of trying to get Genewise working on Ubuntu Linux.

But what now?

The future of CEGMA has arrived and it's called BUSCO.

  • BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

This new tool (Benchmarking Universal Single-Copy Orthologs) has been developed by Felipe A. Simao, Robert Waterhouse, Panagiotis Ioannidis, Evgenia Kriventseva, and Evgeny Zdobnov. You can visit the BUSCO website, have a read of the manual, or look at the supplementary online material (this is set out like a paper…I don't think the tool has been formally published yet though). Here is the first section from that supplementary material:

Benchmarking Universal Single-Copy Orthologs (BUSCO) sets are collections of orthologous groups with near-universally-distributed single-copy genes in each species, selected from OrthoDB root-level orthology delineations across arthropods, vertebrates, metazoans, fungi, and eukaryotes (Kriventseva, et al., 2014; Waterhouse, et al., 2013). BUSCO groups were selected from each major radiation of the species phylogeny requiring genes to be present as single-copy orthologs in at least 90% of the species; in others they may be lost or duplicated, and to ensure broad phyletic distribution they cannot all be missing from one sub-clade. The species that define each major radiation were selected to include the majority of OrthoDB species, excluding only those with unusually high numbers of missing or duplicated orthologs, while retaining representation from all major sub-clades. Their widespread presence means that any BUSCO can therefore be expected to be found as a single-copy ortholog in any newly-sequenced genome from the appropriate phylogenetic clade (Waterhouse, et al., 2011). A total of 38 arthropods (3,078 BUSCO groups), 41 vertebrates (4,425 BUSCO groups), 93 metazoans (1,008 BUSCO groups), 125 fungi (1,438 BUSCO groups), and 99 eukaryotes (431 BUSCO groups), were selected from OrthoDB to make up the initial BUSCO sets which were then filtered based on uniqueness and conservation as described below to produce the final BUSCO sets for each clade, representing 2,675 genes for arthropods, 3,023 for vertebrates, 843 for metazoans, 1,438 for fungi, and 429 for eukaryotes. For bacteria, 40 universal marker genes were selected from (Mende, et al., 2013).

BUSCO seems to do everything that we wanted to include in CEGMA v3, and it is based on OrthoDB, a resource that has generated a new set of orthologs (developed by the same authors). The online material includes a comparison of BUSCO to CEGMA, and also outlines how BUSCO can be much quicker than CEGMA (depending on what set of orthologs you use).

DISCLAIMER: I have not installed, tested, or analyzed BUSCO in any way. I make no promises as to its performance, but they seem to have gone about things in the right way.