The changing landscape of sequencing platforms that underpin genome assembly

From Flickr user itsrick208. CC BY-NC 2.0

In my last blog post I looked at the amazing growth over the last two decades in publications that relate to genome assembly.

In this post, I try to see whether Google Scholar can also shed any light on which sequencing technologies have been used to help understand, and improve, genome assembly.

Here is a rough overview of the major sequencing platforms that have underpinned genome assembly over the years. I’ve focused on time points when there were sequencing instruments that people were actually using rather than when the technology was first invented or described. This is why I start Sanger sequencing at 1995 with the ABI 310 sequencer rather than 1977.

Return to Google Scholar

So how can you find publications which concern genome assembly using these technologies? Here are the Google Scholar searches that I used to try to identify relevant publications.

  1. Sanger: "genome assembly"|"de novo assembly" sanger (I had to exclude the Sanger Institute’s website address, as this was used in many papers that might not be talking about Sanger sequencing per se.)
  2. Roche 454: "genome assembly"|"de novo assembly" 454 (roche|pyrosequencing) (Another tricky one, as ‘454’ alone was not a suitable keyword for searching.)
  3. Illumina: "genome assembly"|"de novo assembly" (illumina|solexa) (Obviously need to include Solexa in this search as well.)
  4. ABI SOLiD: "genome assembly"|"de novo assembly" "ABI solid"
  5. Ion Torrent: "genome assembly"|"de novo assembly" "ion torrent"
  6. PacBio: "genome assembly"|"de novo assembly" ("PacBio"|"Pacific Biosciences")
  7. Oxford Nanopore Technologies: "genome assembly"|"de novo assembly" "Oxford Nanopore"
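For anyone wanting to repeat these searches year by year, the queries in the list above can be assembled programmatically. The following is only a rough sketch, not the method actually used for this post: it assumes one search per platform per year, and it assumes that the `as_ylo` and `as_yhi` URL parameters (the ones the Google Scholar web interface appears to use for a custom date range) restrict results to a single year. The helper name `scholar_url` is made up for illustration. The Sanger exclusion term is not shown in the list, so it is left out here too.

```python
from urllib.parse import urlencode

# Shared prefix used by every search in the list above.
ASSEMBLY_TERMS = '"genome assembly"|"de novo assembly"'

# Platform-specific terms, copied from the list above.
PLATFORM_TERMS = {
    "Sanger": "sanger",
    "Roche 454": "454 (roche|pyrosequencing)",
    "Illumina": "(illumina|solexa)",
    "ABI SOLiD": '"ABI solid"',
    "Ion Torrent": '"ion torrent"',
    "PacBio": '("PacBio"|"Pacific Biosciences")',
    "Oxford Nanopore Technologies": '"Oxford Nanopore"',
}


def scholar_url(platform: str, year: int) -> str:
    """Build a Google Scholar search URL for one platform and one year.

    Hypothetical helper: the as_ylo/as_yhi date parameters are an
    assumption based on the URLs the Scholar web interface produces.
    """
    query = f"{ASSEMBLY_TERMS} {PLATFORM_TERMS[platform]}"
    params = {"q": query, "as_ylo": year, "as_yhi": year}
    return "https://scholar.google.com/scholar?" + urlencode(params)


if __name__ == "__main__":
    for name in PLATFORM_TERMS:
        print(name, scholar_url(name, 2017))
```

This only builds the URLs; the result counts would still have to be read off the search pages by hand, since Google Scholar does not offer an official API.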

Now obviously, many of these searches are flawed and are going to miss publications or include false positives. This makes comparing the absolute numbers of publications between technologies potentially misleading. However, it should still be illuminating to look at the trends of how publications for each of these technologies have changed over time.

The results

As in my last graph, I plot the number of publications on a log scale.

  1. Publications about genome assembly that mention Sanger sequencing dominate the first decade of this graph before being overtaken by Illumina in 2009.
  2. The growth of publications for Sanger is starting to slow down.
  3. Publications for Roche 454 peaked in 2015 and have started to decline.
  4. Publications concerning Ion Torrent peaked a year later, in 2016.
  5. ABI SOLiD shows the clearest ‘rise and fall’ pattern, with five years now of declining publications about genome assembly.
  6. The rate of growth for PacBio publications has been pretty solid but may have slowed a little in 2017.
  7. Oxford Nanopore, the newest kid on the block (in terms of commercially available products), has been on a solid period of exponential growth and looks set to overtake Ion Torrent (and maybe Roche 454) this year.
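Observations like "peaked in 2015" or "five years of declining publications" can be checked mechanically once the yearly counts are in hand. Here is a minimal sketch of how that might look. The function names are invented for illustration, and the numbers in the example are made-up placeholders, not the real Google Scholar counts behind the graph.

```python
from typing import Dict


def peak_year(counts: Dict[int, int]) -> int:
    """Return the year with the highest count (earliest year wins ties)."""
    return min(counts, key=lambda year: (-counts[year], year))


def years_of_decline(counts: Dict[int, int]) -> int:
    """Count consecutive year-on-year drops ending at the most recent year."""
    years = sorted(counts)
    drops = 0
    # Walk backwards through adjacent (previous, current) year pairs.
    for prev, curr in zip(reversed(years[:-1]), reversed(years[1:])):
        if counts[curr] < counts[prev]:
            drops += 1
        else:
            break
    return drops


if __name__ == "__main__":
    # Placeholder counts for illustration only; NOT real data.
    example = {2012: 10, 2013: 40, 2014: 90, 2015: 70, 2016: 50, 2017: 30}
    print(peak_year(example))         # 2014
    print(years_of_decline(example))  # 3
```

Given how noisy Google Scholar counts are, a single-year dip probably shouldn't be read as a real peak; a run of several declining years is a safer signal.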

Are we about to reach ‘peak genome assembly’?

Sanger Peak. Image from Google Maps.

The ever-declining costs of DNA sequencing technologies (no, I’m not going to show that graph) have meant that the field of genome assembly has exploded over the last decade.

Plummeting costs are obviously not the only reason behind this. The evolving nature of sequencing technologies has meant that this year has pushed us into the brave new era of megabase pair read lengths!

Think of the poor budding yeast: the first eukaryotic species to have its (12 Mbp) genome sequenced. There was a time when the sequencing of individual yeast chromosomes would merit their own Nature publication! Now only chromosome IV remains as the last yeast chromosome whose length couldn’t be exceeded by a single Oxford Nanopore read (but probably not for much longer!). Update 2018-09-12: a 2.2 Mbp Nanopore read means that chromosome IV's length has now been eclipsed!

Looking for genome assembly publications

I turned to the font of all (academic) knowledge, Google Scholar, for answers. I wanted to know whether interest in genome assembly had reached a peak, and by ‘interest’ I mean publications or patents that specifically mention either ‘genome assembly’ or ‘de novo assembly’.

Some obvious caveats:

  1. Google Scholar is not a perfect source of publications: some papers are missing, some appear multiple times, and occasionally some are associated with the wrong year.
  2. Publications are increasing in many fields due to more scientists being around and the inexorable rise of if-you-pay-us-money-and-randomly-hit-keys-on-your-keyboard-we-will-publish-it publishing. So a rise in publications in topic 'X' does not necessarily reflect more interest in that topic.
  3. Not all publications concerning genome assembly will contain the phrases ‘genome assembly’ or ‘de novo assembly’.

Caveats aside, let’s see what Google thinks about the state of genome assembly:

Does this tell us anything?

So there’s clearly been a pretty explosive growth in publications concerning genome assembly over the last couple of decades. Interestingly, the data from 2017 suggest that the period of exponential growth is starting to slow just a little bit. However, it would seem that we have not reached ‘peak genome assembly’ just yet.

There are, no doubt, countless hundreds (thousands?) of publications that concern technical aspects of genome assembly which have reached dead ends or which have become obsolete (pipelines for your ABI SOLiD data?).

Maybe we are starting to reach an era where the trio of leading technologies (Illumina, Pacific Biosciences, and Oxford Nanopore) are good enough to facilitate — alone, or in combination — easier (or maybe less troublesome) genome assemblies. I’ve previously pointed out how there are more ‘improved’ assemblies being published than ever before.

Maybe the field has finally moved the focus away from ‘how do we get this to work properly?’ to ‘what shall we assemble next?’. In a follow-up post, I’ll be looking at the rise and fall of different sequencing technologies throughout this era.

Update 2018-08-13: Thanks to Neil Saunders for crunching the numbers in a more rigorous manner and applying a correction for the total number of publications per year. The results are, as he notes, broadly similar.
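To give a flavour of what correcting for the total number of publications per year can mean, here is a small sketch. It is not Neil Saunders's actual analysis, just one simple way to do it: express each year's topic count as a rate per 10,000 publications overall. The function name and all the numbers are hypothetical placeholders.

```python
from typing import Dict


def per_10k(topic_counts: Dict[int, int], total_counts: Dict[int, int]) -> Dict[int, float]:
    """Topic publications per 10,000 total publications, for years present in both."""
    shared_years = sorted(set(topic_counts) & set(total_counts))
    return {
        year: 10_000 * topic_counts[year] / total_counts[year]
        for year in shared_years
        if total_counts[year] > 0  # skip years with no denominator
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only; NOT real data.
    topic = {2015: 100, 2016: 150, 2017: 180}
    total = {2015: 1_000_000, 2016: 1_500_000, 2017: 2_000_000}
    for year, rate in per_10k(topic, total).items():
        print(year, round(rate, 2))
```

With these made-up inputs the raw counts rise every year while the normalised rate stays flat and then dips, which is exactly the kind of effect the second caveat above warns about.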

How and why the Institute of Cancer Research are using their new Illumina NovaSeq

Image credit: The Institute of Cancer Research, London

Yesterday, The Institute of Cancer Research (disclaimer: that's where I work) published a new blog post in which they spoke to Nik Matthews, Genomics Manager in the ICR’s Tumour Profiling Unit, about the Illumina NovaSeq sequencing platform.

It's a little more technical than some of the ICR 'Science Talk' blog posts that we usually publish, which is why I thought I'd link to it here.

As someone who started their PhD around the time the yeast genome was being finished, I am still shocked by how far the world of DNA sequencing has come. This is something Nik refers to in his opening answer:

We can now produce data equivalent to the size of the original human genome project every six minutes, which is astonishing.

By comparison, the yeast genome project — an international collaboration involving many different labs — took over five years to sequence its genome…all 12 Mbp of it! Read the full blog post to find out more about how, and why, the ICR adopted the NovaSeq platform:

Illumina's new NovaSeq platform unveiled at The Institute of Cancer Research, London

Dr Nik Matthews, Genomics Manager in the ICR's Tumour Profiling Unit. Credit: ICR

It feels a bit strange to be using this blog to link to a news post at my current employer, but I'm happy to share the news that the ICR has become the first organisation in the UK to deploy Illumina's NovaSeq platform.

The ICR's Dr Chris Lord, Deputy Director of the Breast Cancer Now Research Centre, had this to say:

One key area we are keen to use the NovaSeq sequencer for is to discover new ways to select the best available treatment for each individual cancer patient’s specific disease.

If we can do this, we should be able to improve how a significant number of patients are treated. With the NovaSeq system, this kind of work is now feasible – this will be a real game-changer for a lot of the work across the ICR.

Read more in the full news article on the ICR website: