The changing landscape of sequencing platforms that underpin genome assembly

From Flickr user itsrick208. CC BY-NC 2.0


In my last blog post I looked at the amazing growth over the last two decades in publications that relate to genome assembly.

In this post, I try to see whether Google Scholar can also shed any light on which sequencing technologies have been used to help understand, and improve, genome assembly.

Here is a rough overview of the major sequencing platforms that have underpinned genome assembly over the years. I’ve focused on time points when there were sequencing instruments that people were actually using rather than when the technology was first invented or described. This is why I start Sanger sequencing at 1995 with the AB310 sequencer rather than 1977.

[Figure: timeline of the major sequencing platforms used for genome assembly]

Return to Google Scholar

So how can you find publications that concern genome assembly using these technologies? Well, here are the Google Scholar searches that I used to try to identify relevant publications.

  1. Sanger — "genome assembly"|"de novo assembly" sanger -sanger.ac.uk — I had to exclude the Sanger Institute's website address as it appears in many papers that are not necessarily about Sanger sequencing per se.
  2. Roche 454 — "genome assembly"|"de novo assembly" 454 (roche|pyrosequencing) — another tricky one, as ‘454’ alone was not a suitable search keyword.
  3. Illumina — "genome assembly"|"de novo assembly" (illumina|solexa) — this search obviously needs to include Solexa as well.
  4. ABI SOLiD — "genome assembly"|"de novo assembly" "ABI solid"
  5. Ion Torrent — "genome assembly"|"de novo assembly" "ion torrent"
  6. PacBio — "genome assembly"|"de novo assembly" ("PacBio"|"Pacific Biosciences")
  7. Oxford Nanopore Technologies — "genome assembly"|"de novo assembly" "Oxford Nanopore"

Now obviously, many of these searches are flawed and are going to miss publications or include false positives. This makes comparing the absolute numbers of publications between technologies potentially misleading. However, it should still be illuminating to look at the trends of how publications for each of these technologies have changed over time.
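As an aside, if you wanted to script these searches, a minimal Python sketch might look something like the following. It simply builds Google Scholar URLs for each query, restricted to a single publication year via Scholar's as_ylo/as_yhi date-filter parameters. The query strings mirror the list above; everything else (looping over technologies, the chosen year) is my own illustration, not the exact method used to gather the numbers below.

```python
from urllib.parse import urlencode

# Search terms mirroring the Google Scholar queries listed above
QUERIES = {
    "Sanger":          '"genome assembly"|"de novo assembly" sanger -sanger.ac.uk',
    "Roche 454":       '"genome assembly"|"de novo assembly" 454 (roche|pyrosequencing)',
    "Illumina":        '"genome assembly"|"de novo assembly" (illumina|solexa)',
    "ABI SOLiD":       '"genome assembly"|"de novo assembly" "ABI solid"',
    "Ion Torrent":     '"genome assembly"|"de novo assembly" "ion torrent"',
    "PacBio":          '"genome assembly"|"de novo assembly" ("PacBio"|"Pacific Biosciences")',
    "Oxford Nanopore": '"genome assembly"|"de novo assembly" "Oxford Nanopore"',
}

def scholar_url(query, year):
    """Build a Google Scholar search URL restricted to one publication year."""
    params = {"q": query, "as_ylo": year, "as_yhi": year}
    return "https://scholar.google.com/scholar?" + urlencode(params)

for tech, query in QUERIES.items():
    print(tech, scholar_url(query, 2017))
```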

The results

As in my last graph, I plot the number of publications on a log scale.

[Figure: number of genome assembly publications per year for each sequencing technology, plotted on a log scale]
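For anyone wanting to reproduce this kind of figure, a minimal matplotlib sketch is shown below. The counts are made-up placeholder values rather than the real Google Scholar numbers; the point of the log scale is that exponential growth appears as a roughly straight line.

```python
import matplotlib.pyplot as plt

# Placeholder publication counts per year for two technologies (not real data)
years = [2013, 2014, 2015, 2016, 2017]
counts = {
    "Illumina": [4000, 5500, 7000, 9000, 11000],
    "PacBio":   [300, 500, 900, 1500, 2200],
}

for tech, values in counts.items():
    plt.plot(years, values, marker="o", label=tech)

plt.yscale("log")  # exponential growth appears roughly linear on a log axis
plt.xlabel("Year")
plt.ylabel("Publications (log scale)")
plt.legend()
plt.show()
```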

Observations

  1. Publications about genome assembly that mention Sanger sequencing dominate the first decade of this graph before being overtaken by Illumina in 2009.
  2. The growth of publications for Sanger is starting to slow down.
  3. Publications for Roche 454 peaked in 2015 and have started to decline.
  4. Publications concerning Ion Torrent peaked a year later, in 2016.
  5. ABI SOLiD shows the clearest ‘rise and fall’ pattern, with five years now of declining publications about genome assembly.
  6. The rate of growth for PacBio publications has been pretty solid but may have just slowed a little in 2017.
  7. Oxford Nanopore, the newest kid on the block — in terms of commercially available products — has been in a period of solid exponential growth and looks set to overtake Ion Torrent (and maybe Roche 454) this year.

Are we about to reach ‘peak genome assembly’?

Sanger Peak. Image from Google Maps.


The ever-declining costs of DNA sequencing technologies — no, I’m not going to show that graph — have meant that the field of genome assembly has exploded over the last decade.

Plummeting costs are obviously not the only reason behind this. The evolving nature of sequencing technologies has meant that this year we have been pushed into the brave new era of megabase-pair read lengths!

Think of the poor budding yeast: the first eukaryotic species to have its (12 Mbp) genome sequenced. There was a time when the sequencing of individual yeast chromosomes would merit their own Nature publication! Now chromosome IV remains the only yeast chromosome whose length has not been exceeded by a single Oxford Nanopore read (but probably not for much longer!). Update 2018-09-12: a 2.2 Mbp nanopore read means that chromosome IV's length has now been eclipsed!

Looking for genome assembly publications

I turned to the font of all (academic) knowledge, Google Scholar, for answers. I wanted to know whether interest in genome assembly had reached a peak, and by ‘interest’ I mean publications or patents that specifically mention either ‘genome assembly’ or ‘de novo assembly’.

Some obvious caveats:

  1. Google Scholar is not a perfect source of publications: some papers are missing, some appear multiple times, and occasionally some are associated with the wrong year.
  2. Publications are increasing in many fields due to more scientists being around and the inexorable rise of if-you-pay-us-money-and-randomly-hit-keys-on-your-keyboard-we-will-publish-it publishing. So a rise in publications in topic 'X' does not necessarily reflect more interest in that topic.
  3. Not all publications concerning genome assembly will contain the phrases ‘genome assembly’ or ‘de novo assembly’.

Caveats aside, let’s see what Google thinks about the state of genome assembly:

[Figure: number of publications and patents per year mentioning ‘genome assembly’ or ‘de novo assembly’]

Does this tell us anything?

So there’s clearly been a pretty explosive growth in publications concerning genome assembly over the last couple of decades. Interestingly, the data from 2017 suggest that the period of exponential growth is starting to slow just a little bit. However, it would seem that we have not reached ‘peak genome assembly’ just yet.

There are, no doubt, countless hundreds (thousands?) of publications that concern technical aspects of genome assembly which have reached dead ends or which have become obsolete (pipelines for your ABI SOLiD data?).

Maybe we are starting to reach an era where the trio of leading technologies (Illumina, Pacific Biosciences, and Oxford Nanopore) are good enough to facilitate — alone, or in combination — easier (or maybe less troublesome) genome assemblies. I’ve previously pointed out how there are more ‘improved’ assemblies being published than ever before.

Maybe the field has finally moved the focus away from ‘how do we get this to work properly?’ to ‘what shall we assemble next?’. In a follow-up post, I’ll be looking at the rise and fall of different sequencing technologies throughout this era.

Update 2018-08-13: Thanks to Neil Saunders for crunching the numbers in a more rigorous manner and applying a correction for the total number of publications per year. The results are, as he notes, broadly similar.
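If you want to apply the same kind of correction yourself, the calculation is simple: divide each year's topic-specific count by the total number of publications indexed that year. A minimal sketch, using placeholder numbers rather than real data:

```python
# Placeholder values for illustration only
topic_counts = {2015: 9000, 2016: 11000, 2017: 12500}        # 'genome assembly' papers
total_counts = {2015: 2500000, 2016: 2600000, 2017: 2700000}  # all indexed papers

normalized = {year: topic_counts[year] / total_counts[year] for year in topic_counts}

for year, fraction in sorted(normalized.items()):
    print(f"{year}: {fraction:.4%} of indexed publications")
```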

Genomic makeovers: the number of ‘improved’ genome sequences is increasing

Image from flickr user londonmatt. Licensed under Creative Commons CC BY 2.0 license


Excluding viruses, the genome that can claim to have been completed before any other was that of the bacterium Haemophilus influenzae, the sequence of which was described in Science on July 28, 1995.

I still find it pretty amazing to recall that just over a year later, the world saw the publication of the first complete eukaryotic genome sequence, that of the yeast Saccharomyces cerevisiae.

The fields of genomics and genome sequencing have continued to grow at breakneck speed, and the days of a genome sequence automatically meriting a front-cover story in Nature or Science are long gone.

Complete vs Draft vs Improved

I’ve written previously about the fact that although more genomes than ever are being sequenced, fewer seem to be ‘complete’. I’ve also written a series of blog posts that address the rise of ‘draft genomes’.

Today I want to highlight another changing aspect of genome sequencing, that of the increasing number of publications that describe ‘improved’ genomes. Some recent examples:

Improving genomes is an increasing trend

To check whether there really are more ‘improved’ sequences being described, I looked in Google Scholar to see how many papers feature the terms ‘complete genome|assembly’ vs ‘draft genome|assembly’ vs ‘improved genome|assembly’ (these Google Scholar links reveal the slightly more complex query that I used). In gathering data I went back to 1995 (the date of the first published genome sequence).

As always with Google Scholar, these are not perfect search terms and they all pull in matches which are not strictly what I’m after, but it does reveal an interesting picture:

Number of publications in Google Scholar referencing complete vs draft vs improved genomes/assemblies

It is clear that the number of publications referencing ‘complete’ genomes/assemblies has been increasing at a steady rate. In contrast, publications describing ‘draft’ genomes have grown rapidly in the last decade, but the rate of increase is slowing. When it comes to ‘improved’ genomes, it looks like we are in a period where many more papers are being published that describe improved versions of existing genomes (in 2017 there was a 54% increase in such papers compared to 2016).

Why improve a genome?

I wonder how much of this growth reflects the sad truth that many genomes published in the post-Sanger, pre-nanopore era (approximately 2005–2015) were just not very good. Many people rushed to adopt the powerful new sequencing technologies provided by Illumina and others, and many genomes published using those technologies are now being given makeovers by applying newer sequencing, scaffolding, and mapping technologies.

The updated pine genome (the last publication on the list above) says as much in its abstract (emphasis mine):

The 22-gigabase genome of loblolly pine (Pinus taeda) is one of the largest ever sequenced. The draft assembly published in 2014 was built entirely from short Illumina reads, with lengths ranging from 100 to 250 base pairs (bp). The assembly was quite fragmented, containing over 11 million contigs whose weighted average (N50) size was 8206 bp. To improve this result, we generated approximately 12-fold coverage in long reads using the Single Molecule Real Time sequencing technology developed at Pacific Biosciences. We assembled the long and short reads together using the MaSuRCA mega-reads assembly algorithm, which produced a substantially better assembly, P. taeda version 2.0. The new assembly has an N50 contig size of 25 361, more than three times as large as achieved in the original assembly, and an N50 scaffold size of 107 821, 61% larger than the previous assembly.

Perhaps I’m being a bit harsh in saying that the first versions of many of these genomes that have been subsequently improved were not very good. The more important lesson to bear in mind is that, in reality, a genome is never finished and that all published sequences represent ‘works in progress’.

Chromosome-Scale Scaffolds And The State of Genome Assembly

Keith Robison has written another fantastic post on his Omics! Omics! blog which is a great read for two reasons.

First he looks at the issues regarding chromosome-size scaffolds that can now be produced with Hi-C sequencing approaches. He then goes on to provide a brilliant overview of what the latest sequencing and mapping technologies mean for the field of genome assembly:

For high quality de novo genomes, the technology options appear to be converging for the moment on five basic technologies which can be mixed-and-matched.

  • Hi-C (in vitro or in vivo)
  • Rapid Physical Maps (BioNano Genomics)
  • Linked Reads (10X, iGenomX)
  • Oxford Nanopore
  • Pacific Biosciences
  • vanilla Illumina paired end

This second section should be required reading for anyone interested in genome assembly, particularly if you've been away from the field for a while.

Read the post: Chromosome-Scale Scaffolds And The State of Genome Assembly

Brief thoughts on Karyn Meltz Steinberg's ASHG 2015 talk on genome assembly improvement

I like it when people a) share their slides online and b) share their slides online soon after they give a talk somewhere. This is particularly helpful when you want to quickly catch up on developments from a conference that you couldn't attend. Karyn Meltz Steinberg (@KMS_Meltzy on twitter) ticks both boxes because she posted her #ASHG2015 slides almost as soon as her talk finished. The title of her talk was:

Building a platinum human genome assembly from single haplotype human genomes generated from long molecule sequencing

Her slides — hosted on Slideshare — are embedded below.

What interested me from this talk is the use of sequence maps generated by the BioNano Genomics Irys platform to improve genome assemblies. This technology seems to be growing in popularity, offering an easier (and more powerful?) alternative to 'traditional' optical map solutions. This work is part of the McDonnell Genome Institute's Reference Genomes Improvement project, which includes the following — very laudable — aim:

  • We plan to identify and resolve issues (misassemblies, sequence errors, and gaps) within the current reference GRCh38.

I find it interesting that this project has also defined two levels of genome status:

Gold Genome: A high-quality, highly contiguous representation of the genome with haplotype resolution of critical regions.

Platinum Genome: A contiguous, haplotype-resolved representation of the entire genome.

Not clear from these definitions whether platinum genomes can still include short regions of unknown bases (Ns). A figure on the Reference Genomes Improvement project page also hints at a 'Silver' status, making me think it is only a matter of time before we see the addition of a credit-card-esque 'Diamond' status level: no unknown bases, full representation of tandem repeat arrays (e.g. centromeres), and priority booking for VIP tickets at major sporting events.

L50 vs N50: that's another fine mess that bioinformatics got us into

N50 is a statistic that is widely used to describe genome assemblies. It describes an average length of a set of sequences, but the average is not the mean or median length. Rather, it is the length of the sequence that takes the sum length of all sequences — when summing from longest to shortest — past 50% of the total size of the assembly. The reasons for using N50, rather than the mean or median length, are something that I've written about before in detail.

The number of sequences evaluated at the point when the sum length exceeds 50% of the assembly size is sometimes referred to as the L50 number. Admittedly, this is somewhat confusing: N50 describes a sequence length whereas L50 describes a number of sequences. This oddity has led to many people inverting the usage of these terms. This doesn't help anyone and leads to confusion and to debate.
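To make the distinction concrete, here is a minimal Python sketch (my own illustration, not code from any particular assembly tool) that computes both statistics from a list of contig lengths:

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths.

    N50: the length of the sequence at which the running total, summing
         from longest to shortest, first reaches 50% of the assembly size.
    L50: the number of sequences needed to reach that point.
    """
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= total / 2:
            return length, count
    raise ValueError("empty list of lengths")

# Toy example: ten contigs totalling 2,900 bp
contigs = [1000, 600, 400, 300, 200, 150, 100, 80, 50, 20]
print(n50_l50(contigs))  # -> (600, 2): 1000 + 600 = 1600, which exceeds 1450
```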

I believe that the aforementioned definition of N50 was first used in the 2001 publication of the human genome sequence:

We used a statistic called the ‘N50 length’, defined as the largest length L such that 50% of all nucleotides are contained in contigs of size at least L.

I've since had some independent confirmation of this from Deanna Church (@deannachurch):

I also have a vague memory that other genome sequences — that were made available by the Sanger Institute around this time — also included statistics such as N60, N70, N80 etc. (at least I recall seeing these details in README files on an FTP site). Deanna also pointed out that the Celera Human Genome paper (published in Science, also in 2001) describes something that we might call N25 and N90, even though they didn't use these terms in the paper:

More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger

I don't know when L50 first started being used to describe lengths, but I would bet it was after 2001. If I'm wrong, please comment below and maybe we can settle this once and for all. Without evidence for an earlier use of L50 to describe lengths, I think people should stick to the 2001 definition of N50 (which I would also argue is the most common definition in use today).

Updated 2015-06-26 - Article includes new evidence from Deanna Church.

Great post by Lex Nederbragt about why we need graph-based representations of genome sequences

In a recent blog post, Lex Nederbragt explains why we all need to be moving to graph-based representations of genome sequences (and getting away from linear representations).

In this post I will provide some background, explain the reasons for moving towards graph-based representations, and indicate some challenges associated with this development.

After listing the many challenges involved in moving towards a graph-based future, he refers to the fact that current efforts have not been widely adopted:

Two file formats to represent the graph have been developed: FASTG and GFA. FASTG has limited uptake; only two assembly programs (ALLPATHS_LG and SPAdes) will output in that format. GFA parsing is currently only experimental in the ABYSS assembler, and [the VG program] is able to output it.

The lack of a widely recognized and supported standard for representing variation inherent in a genome sequence is, in my opinion, a major barrier to moving forward. Almost all bioinformatics software that works with genome sequence data expects a single sequence with no variation. It will require a whole new generation of tools to work with a variant-based format, but tool developers will be reluctant to write new code if there is no clear agreement on what is the new de facto file format.
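As a concrete, purely illustrative example of what a graph-based representation can capture, here is a tiny GFA 1.0 graph containing a single variant 'bubble' (two alternative single-base segments between shared flanks), along with a few lines of Python to parse it. The graph and the parsing code are my own sketch, not taken from any of the tools mentioned above.

```python
# Tiny GFA 1.0 graph: segments 2 (A) and 3 (G) are alternative alleles
# between shared flanking segments 1 and 4.
GFA = (
    "H\tVN:Z:1.0\n"
    "S\t1\tACGTACGT\n"
    "S\t2\tA\n"
    "S\t3\tG\n"
    "S\t4\tTTGCTTGC\n"
    "L\t1\t+\t2\t+\t0M\n"
    "L\t1\t+\t3\t+\t0M\n"
    "L\t2\t+\t4\t+\t0M\n"
    "L\t3\t+\t4\t+\t0M\n"
)

segments = {}  # segment name -> sequence
links = []     # (from, from_orient, to, to_orient)

for line in GFA.splitlines():
    fields = line.split("\t")
    if fields[0] == "S":
        segments[fields[1]] = fields[2]
    elif fields[0] == "L":
        links.append(tuple(fields[1:5]))

print(segments)  # both alleles ('A' and 'G') are explicit in the graph
print(links)     # each allele links segment 1 to segment 4
```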

New paper provides a great overview of the current state of genome assembly

The following paper by Stephen Richards and Shwetha Murali has just appeared in the journal Current Opinion in Insect Science:

Best practices in insect genome sequencing: what works and what doesn’t

In some ways I wish they had chosen a different title, as the focus of this paper is much more on genome assembly than on genome sequencing. Furthermore, it provides a great overview of all of the current strategies in genome assembly, which should be of interest to any non-insect researchers interested in the best way of putting a genome together. Here is part of the legend from a very informative table in the paper:

Table 1 — De novo genome assembly strategies:
Assembly software is designed for a specific sequencing and assembly strategy. Thus sequence must be generated with the assembly software and algorithm in mind, choosing a sequence strategy designed for a different assembly algorithm, or sequencing without thinking about assembly is usually a recipe for poor un-publishable assemblies. Here we survey different assembly strategies, with different sequence and library construction requirements.

Metassembler: Merging and optimizing de novo genome assemblies

There's a great new paper in bioRxiv by Alejandro Hernandez Wences and Michael Schatz. They directly address something I wondered about as we were running the Assemblathon contests. Namely, can you combine some of the submitted assemblies to make an even better assembly? Well the answer seems to be a resounding 'yes'.

For each of three species in the Assemblathon 2 project we applied our algorithm to the top 6 assemblies as ranked by the cumulative Z-score reported in the paper…

We evaluated the correctness and contiguity of the metassembly at each merging step using the metrics used by the Assemblathon 2 evaluation…

In all three species, the contiguity statistics are significantly improved by our metassembly algorithm

Hopefully their Metassembler tool will be useful in improving many other poor quality assemblies that are out there!