L50 vs N50: that's another fine mess that bioinformatics got us into

N50 is a statistic that is widely used to describe genome assemblies. It describes an average length of a set of sequences, but the average is not the mean or median length. Rather, it is the length of the sequence that takes the sum length of all sequences — when summing from longest to shortest — past 50% of the total size of the assembly. The reason for using N50, rather than the mean or median length, is something that I've written about before in detail.

The number of sequences evaluated at the point when the sum length exceeds 50% of the assembly size is sometimes referred to as the L50 number. Admittedly, this is somewhat confusing: N50 describes a sequence length whereas L50 describes a number of sequences. This oddity has led many people to invert the usage of these terms, which helps no one and only leads to confusion and debate.
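The two definitions are easy to make concrete in code. Here is a minimal sketch using a made-up list of sequence lengths (not data from any real assembly):

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths.

    Sum lengths from longest to shortest; N50 is the length of the
    sequence that takes the running total past 50% of the assembly
    size, and L50 is how many sequences it took to get there.
    """
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= total / 2:
            return length, count

# toy assembly of ten sequences totalling 550 bp
lengths = [100, 90, 80, 70, 60, 50, 40, 30, 20, 10]
print(n50_l50(lengths))  # (70, 4): 100+90+80+70 = 340 bp, past the 275 bp halfway mark
```

Note that the two values answer different questions: the N50 length (70 bp here) is a length, while the L50 number (4 here) is a count of sequences.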

I believe that the aforementioned definition of N50 was first used in the 2001 publication of the human genome sequence:

We used a statistic called the ‘N50 length’, defined as the largest length L such that 50% of all nucleotides are contained in contigs of size at least L.

I've since had some independent confirmation of this from Deanna Church (@deannachurch):

I also have a vague memory that other genome sequences — that were made available by the Sanger Institute around this time — also included statistics such as N60, N70, N80 etc. (at least I recall seeing these details in README files on an FTP site). Deanna also pointed out that the Celera Human Genome paper (published in Science, also in 2001) describes something that we might call N25 and N90, even though they didn't use these terms in the paper:

More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger

I don't know when L50 first started being used to describe lengths, but I would bet it was after 2001. If I'm wrong, please comment below and maybe we can settle this once and for all. Without evidence for an earlier use of L50 to describe lengths, I think people should stick to the 2001 definition of N50 (which I would also argue is the most common definition in use today).

Updated 2015-06-26 - Article includes new evidence from Deanna Church.

Which genome assembler gives you the best genome assembly?

This is a question that I have been asked many times. I think that the opposite question should also be asked — which genome assembler gives you the worst genome assembly? — but people seem less interested in asking this. By the end of this blog post, I will try to answer both questions.

Earlier this week, I wrote about the fact that I get to look at lots of genome assemblies that people have asked me to run CEGMA on. The reason I run CEGMA is to calculate what percentage of a set of 248 Core Eukaryotic Genes (CEGs) are present in each genome (or transcriptome) assembly. This figure can be used as a proxy for the much larger set of 'all genes'.

I can also calculate the N50 length of each assembly, and if you plot both statistics for each assembly you capture a tiny slice of that nebulous characteristic known as 'assembly quality'. Here's what such a plot looks like for 32 different genome assemblies (from various species, assembled by various genome assemblers):

Figure 1. CEGMA results vs N50 length for 32 genome assemblies. Click to enlarge.

Three key observations can be made from this figure:

  1. There's a lot of variation in the percentage of CEGs present (52.4–98.8%).
  2. There's a lot of variation in N50 length (997 bp all the way to 3.9 million bp).
  3. There is almost no correlation between the two measures (r = 0.04).

Let's dwell on that last point by highlighting a couple of the extreme data points:

Figure 2: CEGMA results vs N50 length — red data points highlight assemblies with highest values of %248 CEGs or N50 length. Click to enlarge.

The highlighted point at the top of the graph represents the assembly with the best CEGMA result (245 out of 248 CEGs present). However, this assembly ranks 13th for N50 length. The data point on the far right of the graph represents a genome assembly with the highest N50 length (almost 4 million bp). But this assembly only ranks 27th for its CEGMA results. Such inconsistency was exactly what we saw in the Assemblathon 2 paper (but with even more metrics involved).

Can we shed any light on which particular genome assemblers excel in either of these statistics? Well, as I now ask people who submit CEGMA jobs to me what the principal assembly tool used was, I can overlay this information on the plot:

Figure 3: CEGMA results vs N50 length — colors refer to different assemblers used as the principal software tool in the assembly process. Click to enlarge.

It might not be clear, but there are more data points for the Velvet assembler than any other (12/32). You can see that Velvet assemblies produce a relatively wide range of CEGMA results. Assemblies made by SOAPdenovo produce an even wider range of CEGMA results (not to mention a wide range of N50 results). The truth is that there is no consistent pattern of quality in the results of any one assembler (and remember, we are only measuring 'quality' by two paltry metrics).

To answer the questions raised at the start of this post:

  1. All assemblers can be used to make terrible genome assemblies
  2. Some assemblers can (probably) be used to make great genome assemblies

There is no magic bullet in genome assembly and there are so many parameters that can affect the quality of your final assembly (repeat content of genome, sequencing technology biases, amount of heterozygosity in genome, quality of input DNA, quality of sample preparation steps, suitable mix of libraries with different insert sizes, use of most suitable assembler options for your genome of interest, amount of coffee drunk by person running the assembler, etc. etc.).

Don't even ask the question which genome assembler gives you the best genome assembly? if you are not prepared to define what you mean by 'best' (and please don't define it just on the basis of two simple metrics like %248 CEGs present and N50 length).

The long tail of the distribution: do people ever really look at their genome assemblies?

Since I (somewhat foolishly) volunteered to run CEGMA on behalf of people who were having trouble installing it, I've had the opportunity to look at a lot of genome (and transcriptome) assemblies.

Some people get a little bit obsessed by just how many (or how few) core genes are present in their assembly. They also get obsessed by how long (or short) the N50 length of their assembly is.

Earlier this week, I wrote about a new tool (N50 Booster!!!) that can increase the N50 length of any assembly. Although this was published on April 1st, the tool does actually work and it merits some further discussion. The tool achieves higher N50 lengths by doing two things:

  1. It removes the shortest 25% of all sequences
  2. It then sums the lengths of sequences that were removed and adds an equivalent length of Ns to the end of the longest sequence
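The two steps above can be sketched in a few lines. This is a hedged reconstruction of the behaviour described, not the actual N50 Booster code, and the input sequences are invented:

```python
def n50_boost(seqs):
    """Drop the shortest 25% of sequences, then pad the longest
    sequence with Ns equal to the total length removed, so the
    overall assembly size is preserved."""
    seqs = sorted(seqs, key=len, reverse=True)
    keep = len(seqs) - len(seqs) // 4           # step 1: remove shortest 25%
    removed = sum(len(s) for s in seqs[keep:])  # total length of removed sequences
    kept = seqs[:keep]
    kept[0] += "N" * removed                    # step 2: pad longest sequence with Ns
    return kept

before = ["ACGT" * 25, "ACGT" * 10, "ACGT" * 5, "ACGT" * 2]
after = n50_boost(before)
print(sum(map(len, before)) == sum(map(len, after)))  # True: assembly size preserved
```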

Both steps contribute to increasing the N50 length, though the amount of the increase really depends on the distribution of sequence lengths in the assembly. The second step is clearly bogus, and I added it just to preserve the assembly size. The first step might seem drastic, but is exactly the same sort of step that is performed by many genome assemblers (or by the people post-processing the assembly).

If you don't remove short sequences from an assembly, you could end up including a large number of reads that don't overlap with any other read. That is, the shortest contig/scaffold length in an assembly may simply equal the read length of the sequencing technology being used. If you trim reads for quality, you could potentially end up with contigs/scaffolds that are even shorter.

How useful is it to include such short sequences in an assembly, and how often do 'assemblers' (the software and/or the people running the assemblers) do this? Well, by looking at some of the assemblies that I have run CEGMA against, I can take a look.

For 34 genome assemblies (from many different species, using many different assemblers) I looked to see whether the shortest 10 sequences were all the same length or were unique:

So about half of the assemblies have unique lengths for their 10 shortest sequences. The remainder represent assemblies that probably either removed all sequences below a certain length (which seems likely with the assembly that had the shortest sequences at 2,000 bp), or which simply included all unassembled reads (six assemblies have an abundance of 100 bp sequences).

This begs the question, how useful is it to include all of these short sequences? It's always possible that a 100 bp read that doesn't overlap anything else contains something useful, but probably not. You might see an exon, possibly an intron, and very possibly an entire gene (depending on the species)...but can anyone do much with this information?

The most extreme example I came across is actually from one of the 16 assemblies which initially appeared to have sequences with unique lengths. This vertebrate genome assembly contains 72,214 sequences. If I ask what percentage of these sequences are shorter than various cutoffs, this is what I find:

This is pretty depressing. The vast majority of sequences in this assembly are likely to be too short to be of any use. The reason that this assembly counted among my 'unique' category is because the shortest ten sequences have lengths as follows: 100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp. That's right, this assembly includes a sequence that's only three base pairs in length!
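For anyone wanting to make a similar tally for their own assembly, the calculation is trivial. This sketch uses invented lengths, not the 72,214-sequence assembly described above:

```python
def percent_below(lengths, cutoffs):
    """Percentage of sequences shorter than each cutoff."""
    n = len(lengths)
    return {c: 100.0 * sum(1 for x in lengths if x < c) / n for c in cutoffs}

lengths = [3, 12, 63, 500, 950, 5000, 20000, 100000]  # made-up example
print(percent_below(lengths, [100, 1000, 10000]))
# {100: 37.5, 1000: 62.5, 10000: 75.0}
```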

There are a couple of points that I am trying to make here:

  1. You can increase the N50 length of your assembly by removing the shortest sequences.  This is not cheating — though you should clearly state what cutoff has been used — and should be considered part of the genome assembly process (getting rid of the crap).
  2. Please look at your assembly before doing anything with it. On its own, N50 is not a very useful statistic. Look at the distribution of all sequence lengths (or plot an NG(X) graph).
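On that last point, an NG(X) value differs from N(X) by comparing the running sum against an estimated genome size rather than the assembly size. A minimal sketch (the genome size here is an arbitrary assumption for illustration):

```python
def ngx(lengths, genome_size, x):
    """Length of the sequence that takes the cumulative sum (longest
    first) past x% of the estimated genome size; None if never reached."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= genome_size * x / 100:
            return length
    return None

lengths = [100, 90, 80, 70, 60]             # toy assembly (400 bp total)
print(ngx(lengths, genome_size=500, x=50))  # 80: 100+90+80 = 270 bp >= 250 bp
```

Plotting ngx() for x from 1 to 100 gives an NG(X) graph; note that an incomplete assembly may never reach high values of x, hence the None.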

It turned out that this assembly did contain a fair number of core genes, and these were almost all located in the 1.2% of sequences that were > 10,000 bp. That 3 bp sequence, though, turned out to contain no core genes at all. Shocker!