The changing landscape of sequencing platforms that underpin genome assembly

From Flickr user itsrick208. CC BY-NC 2.0

In my last blog post I looked at the amazing growth over the last two decades in publications that relate to genome assembly.

In this post, I try to see whether Google Scholar can also shed any light on which sequencing technologies have been used to help understand, and improve, genome assembly.

Here is a rough overview of the major sequencing platforms that have underpinned genome assembly over the years. I’ve focused on time points when there were sequencing instruments that people were actually using rather than when the technology was first invented or described. This is why I start Sanger sequencing at 1995 with the AB310 sequencer rather than 1977.


Return to Google Scholar

So how can you find publications which concern genome assembly using these technologies? Well, here are the Google Scholar searches that I used to try to identify relevant publications.

  1. Sanger — "genome assembly"|"de novo assembly" sanger — I had to exclude the Sanger Institute's website address, as it appeared in many papers that might not be talking about Sanger sequencing per se.
  2. Roche 454 — "genome assembly"|"de novo assembly" 454 (roche|pyrosequencing) — another tricky one, as '454' alone was not a suitable search keyword.
  3. Illumina — "genome assembly"|"de novo assembly" (illumina|solexa) — this search obviously needs to include Solexa as well.
  4. ABI SOLiD — "genome assembly"|"de novo assembly" "ABI solid"
  5. Ion Torrent — "genome assembly"|"de novo assembly" "ion torrent"
  6. PacBio — "genome assembly"|"de novo assembly" ("PacBio"|"Pacific Biosciences")
  7. Oxford Nanopore Technologies — "genome assembly"|"de novo assembly" "Oxford Nanopore"

Now obviously, many of these searches are flawed and are going to miss publications or include false positives. This makes comparing the absolute numbers of publications between technologies potentially misleading. However, it should still be illuminating to look at the trends of how publications for each of these technologies have changed over time.
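If you wanted to repeat this kind of survey year by year, the searches above can be generated programmatically. Here is a minimal sketch — Google Scholar has no official API, so this simply builds the search URLs using Scholar's `as_ylo`/`as_yhi` year-range parameters; the `scholar_url` helper and the platform table are my own illustration:

```python
from urllib.parse import quote_plus

# Extra keywords used alongside the common
# "genome assembly"|"de novo assembly" clause for each platform
PLATFORM_TERMS = {
    "Sanger": "sanger",
    "Roche 454": "454 (roche|pyrosequencing)",
    "Illumina": "(illumina|solexa)",
    "ABI SOLiD": '"ABI solid"',
    "Ion Torrent": '"ion torrent"',
    "PacBio": '("PacBio"|"Pacific Biosciences")',
    "Oxford Nanopore": '"Oxford Nanopore"',
}

def scholar_url(platform, year):
    """Build a Google Scholar search URL restricted to a single year."""
    query = f'"genome assembly"|"de novo assembly" {PLATFORM_TERMS[platform]}'
    return ("https://scholar.google.com/scholar?"
            f"q={quote_plus(query)}&as_ylo={year}&as_yhi={year}")

print(scholar_url("Illumina", 2017))
```

Counting the "About N results" figure for each URL by hand (or by scraping, at your own risk of being blocked) gives the per-year numbers plotted below.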

The results

As in my last graph, I plot the number of publications on a log scale.
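If you want to recreate this kind of graph yourself, a log-scale plot takes only a few lines of matplotlib. A sketch, with placeholder year/count values for illustration rather than my actual Scholar numbers:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Placeholder publication counts per year (illustrative values only)
years = [2005, 2008, 2011, 2014, 2017]
counts = [40, 150, 900, 3000, 7000]

fig, ax = plt.subplots()
ax.plot(years, counts, marker="o")
ax.set_yscale("log")  # exponential growth appears as a straight line
ax.set_xlabel("Year")
ax.set_ylabel("Publications (log scale)")
fig.savefig("assembly_publications.png")
```

The log scale is the important choice here: steady exponential growth shows up as a straight line, so any flattening is easy to spot.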



  1. Publications about genome assembly that mention Sanger sequencing dominate the first decade of this graph before being overtaken by Illumina in 2009.
  2. The growth of publications for Sanger is starting to slow down.
  3. Publications for Roche 454 peaked in 2015 and have started to decline.
  4. Publications concerning Ion Torrent peaked a year later, in 2016.
  5. ABI SOLiD shows the clearest 'rise and fall' pattern, with five years now of declining publications about genome assembly.
  6. The rate of growth for PacBio publications has been pretty solid but may have just slowed a little in 2017.
  7. Oxford Nanopore, the newest kid on the block — in terms of commercially available products — has been on a solid period of exponential growth and looks set to overtake Ion Torrent (and maybe Roche 454) this year.

Are we about to reach ‘peak genome assembly’?

Sanger Peak. Image from Google Maps.

The ever-declining costs of DNA sequencing technologies — no, I’m not going to show that graph — have meant that the field of genome assembly has exploded over the last decade.

Plummeting costs are obviously not the only reason behind this. The evolving nature of sequencing technologies has meant that this year has pushed us into the brave new era of megabase pair read lengths!

Think of the poor budding yeast: the first eukaryotic species to have its (12 Mbp) genome sequenced. There was a time when the sequencing of individual yeast chromosomes would merit their own Nature publication! Now chromosome IV is the only yeast chromosome whose length has not yet been exceeded by a single Oxford Nanopore read (but probably not for much longer!).

Looking for genome assembly publications

I turned to the font of all (academic) knowledge, Google Scholar, for answers. I wanted to know whether interest in genome assembly had reached a peak, and by ‘interest’ I mean publications or patents that specifically mention either ‘genome assembly’ or ‘de novo assembly’.

Some obvious caveats:

  1. Google Scholar is not a perfect source of publications: some papers are missing, some appear multiple times, and occasionally some are associated with the wrong year.
  2. Publications are increasing in many fields due to more scientists being around and the inexorable rise of if-you-pay-us-money-and-randomly-hit-keys-on-your-keyboard-we-will-publish-it publishing. So a rise in publications in topic 'X' does not necessarily reflect more interest in that topic.
  3. Not all publications concerning genome assembly will contain the phrases ‘genome assembly’ or ‘de novo assembly’.

Caveats aside, let’s see what Google thinks about the state of genome assembly:


Does this tell us anything?

So there’s clearly been a pretty explosive growth in publications concerning genome assembly over the last couple of decades. Interestingly, the data from 2017 suggest that the period of exponential growth is starting to slow just a little bit. However, it would seem that we have not reached ‘peak genome assembly’ just yet.

There are, no doubt, countless hundreds (thousands?) of publications that concern technical aspects of genome assembly which have reached dead ends or which have become obsolete (pipelines for your ABI SOLiD data?).

Maybe we are starting to reach an era where the trio of leading technologies (Illumina, Pacific Biosciences, and Oxford Nanopore) are good enough to facilitate — alone, or in combination — easier (or maybe less troublesome) genome assemblies. I’ve previously pointed out how there are more ‘improved’ assemblies being published than ever before.

Maybe the field has finally moved the focus away from ‘how do we get this to work properly?’ to ‘what shall we assemble next?’. In a follow-up post, I’ll be looking at the rise and fall of different sequencing technologies throughout this era.

Update 2018-08-13: Thanks to Neil Saunders for crunching the numbers in a more rigorous manner and applying a correction for the total number of publications published per year. The results are, as he notes, broadly similar.

Genomic makeovers: the number of ‘improved’ genome sequences is increasing

Image from flickr user londonmatt. Licensed under Creative Commons CC BY 2.0 license.

Excluding viruses, the genome that can claim to have been completed before any other was that of the bacterium Haemophilus influenzae, the sequence of which was described in Science on July 28, 1995.

I still find it pretty amazing to recall that just over a year later, the world saw the publication of the first complete eukaryotic genome sequence, that of the yeast Saccharomyces cerevisiae.

The fields of genomics and genome sequencing have continued to grow at breakneck speed, and the days of a genome sequence automatically meriting a front-cover story in Nature or Science are long gone.

Complete vs Draft vs Improved

I’ve written previously about the fact that although more genomes than ever are being sequenced, fewer seem to be ‘complete’. I’ve also written a series of blog posts that address the rise of ‘draft genomes’.

Today I want to highlight another changing aspect of genome sequencing, that of the increasing number of publications that describe ‘improved’ genomes. Some recent examples:

Improving genomes is an increasing trend

To check whether there really are more ‘improved’ sequences being described, I looked in Google Scholar to see how many papers feature the terms ‘complete genome|assembly’ vs ‘draft genome|assembly’ vs ‘improved genome|assembly’ (these Google Scholar links reveal the slightly more complex query that I used). In gathering data I went back to 1995 (the date of the first published genome sequence).

As always with Google Scholar, these are not perfect search terms and they all pull in matches which are not strictly what I’m after, but it does reveal an interesting picture:

Number of publications in Google Scholar referencing complete vs draft vs improved genomes/assemblies

It is clear that the number of publications referencing ‘complete’ genomes/assemblies has been increasing at a steady rate. In contrast, publications describing ‘draft’ genomes have grown rapidly in the last decade, but the rate of increase is slowing. When it comes to ‘improved’ genomes, it looks like we are in a period where many more papers describing improved versions of existing genomes are being published (in 2017 there was a 54% increase in such papers compared to 2016).

Why improve a genome?

I wonder how much of this growth reflects the sad truth that many genomes published in the post-Sanger, pre-nanopore era (approximately 2005–2015) were just not very good. Many people rushed to adopt the powerful new sequencing technologies provided by Illumina and others, and many genomes published using those technologies are now being given makeovers by applying newer sequencing, scaffolding, and mapping technologies.

The updated pine genome (the last publication on the list above) says as much in its abstract (emphasis mine):

The 22-gigabase genome of loblolly pine (Pinus taeda) is one of the largest ever sequenced. The draft assembly published in 2014 was built entirely from short Illumina reads, with lengths ranging from 100 to 250 base pairs (bp). The assembly was quite fragmented, containing over 11 million contigs whose weighted average (N50) size was 8206 bp. To improve this result, we generated approximately 12-fold coverage in long reads using the Single Molecule Real Time sequencing technology developed at Pacific Biosciences. We assembled the long and short reads together using the MaSuRCA mega-reads assembly algorithm, which produced a substantially better assembly, P. taeda version 2.0. The new assembly has an N50 contig size of 25 361, more than three times as large as achieved in the original assembly, and an N50 scaffold size of 107 821, 61% larger than the previous assembly.
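The N50 values quoted in that abstract are easy to compute from a list of contig (or scaffold) lengths: N50 is the length L such that contigs of length ≥ L together contain at least half of the assembled bases. A minimal sketch (the `n50` function and the toy lengths are my own illustration):

```python
def n50(lengths):
    """Return the N50 of a set of contig lengths: the length L such
    that contigs of length >= L hold at least half the total bases."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until we pass 50% of bases
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy example: 90 bases across five contigs
print(n50([10, 20, 30, 25, 5]))  # → 25
```

So a threefold jump in N50, as in the pine reassembly, means the "middle base" of the assembly now sits in a contig three times longer than before.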

Perhaps I’m being a bit harsh in saying that the first versions of many of these genomes that have been subsequently improved were not very good. The more important lesson to bear in mind is that, in reality, a genome is never finished and that all published sequences represent ‘works in progress’.

Fun at the Festival: a mini-report of the 2018 Festival of Genomics in London

 Graffiti wall at the Festival of Genomics

Graffiti wall at the Festival of Genomics

Last week I once again attended the excellent Festival of Genomics conference in London. As before, I was attending in order to produce some coverage for The Institute of Cancer Research (where I work).

I really enjoy the mixture of talks at this conference which always has a strong leaning towards medicine in general and the rapid integration of genomics in the NHS in particular. This was a topic I explored in more detail in a blog post last year for the ICR.

I like how the conference organisers, Front Line Genomics, make an effort to ensure the conference is fun and engaging. It is easy to dismiss things like 'graffiti walls' and 'recharge stations' (where you can power up your mobile phone and get a massage) as gimmicks, but I think they add to a feeling that this is a modern and vibrant conference.

NHS meets NGS

Opening the conference was a presentation from Sir Malcolm Grant, the Chairman of NHS England. He presented an update on Genomics England's 100,000 Genomes Project.

Sir Malcolm noted that consent is such an important part of this project as participants are consenting to provide information that may affect others, e.g. their children and heirs. He stressed the importance of ensuring public trust and support as the project moves forwards.

Although initial progress towards achieving those 100,000 genomes may have been slower than some would have liked, work has been accelerating. The project has taken almost five years to reach the halfway point but is now on course to reach the 100K milestone within the next 12 months.

The following day saw Genomics England's Chairman, Sir John Chisholm, take to the stage for a chat with Carl Smith (Managing Editor of Front Line Genomics). He stressed that people should think of the 100,000 genomes project as a "pilot for the main game", i.e. the routine sequencing of patients within the NHS.

Rigged for silent running

The conference has four 'stages', but as the whole area at the ExCeL centre is just one big open space, they make use of wireless headphones to create conference areas which are effectively silent to people walking past.

In addition to headphones being left on each seat, there are also many additional headsets that can be given out to people who are just standing by the sides of the 'stages' to more casually listen in to each session.

When genomics meets radiotherapy

This year the ICR was honoured with our own conference session in which four early-career researchers talked about how they used genomics data in their own areas of cancer research.

I have written a Science Talk blog post for the ICR that focuses on a presentation at the conference by Dr James Campbell, who is a Lead Bioinformatician at the ICR. He is using genome data from almost 2,000 patients who have undergone radiotherapy treatment for prostate cancer, in order to develop a model which predicts how well a new patient — given their particular set of genotypes and clinical factors — will respond to radiotherapy.

You can read the blog post here:

When genes beat cheese


Google Trends is an amazing tool that can shed light on important historical trends relating to politics, religion, and society as a whole. It can also be used to see whether 'genes' has ever been a more popular search term than 'cheese'.

Looking across the whole corpus of Google Trends data (2004–present) reveals — in the UK at least — that 'genes' came tantalisingly close to overtaking 'cheese' in popularity in 2007:

But wait! If we zoom in to that early part of 2007, we see that for a glorious week at the start of February, 'genes' was indeed a more popular search term than 'cheese'!

British genes for British people?

This historic victory for genetics seems to be a British phenomenon. Running the same search in other countries, or using the 'worldwide' dataset, doesn't reveal the same pattern. Here is what the genes vs cheese fight looks like in America:


I have described an important historic event but I am at a loss to explain why this trend emerged. The trend starts on January 30th 2007…I have searched Google for genes-related news around this time but nothing notable crops up. Any ideas?

Back from the dead…time for a new JABBA award!

jabba logo.png

I really wasn't intending to hand out any more JABBA awards. The last time I gave out an award for a bogus bioinformatics acronym was back in February 2016 and that was meant to be that.

However, I recently saw something that sent a shiver down my spine and I felt obliged to dust off the JABBA award one more time (for now anyway).

Let's get straight to the point. Published on bioRxiv is a new preprint:

Now don't get me wrong, I love burritos and I think it's kind of a fun name for a piece of software. I just happen to think that in this case it is a somewhat tenuous acronym. So how do you get to make BURRITO the name of your tool?

Browser Utility for Relating micRobiome Information on Taxonomy and functiOn

It's the inclusion of the 'O' from 'functiOn' that gets me. I guess 'BURMITF' didn't have such a good ring to it.

What else is on the menu?

Note that BURRITO shouldn't be confused with the QIIME application controller code called burrito-fillings or the electronic lab notebook software for computational researchers known as Burrito.

Also, you get bonus points if you can use BURRITO with SALSA. Of course, if BURRITO doesn't work out for you, then maybe try TACO or…er, TACO.

How and why the Institute of Cancer Research are using their new Illumina NovaSeq

 Image credit: The Institute of Cancer Research, London

Image credit: The Institute of Cancer Research, London

Yesterday, The Institute of Cancer Research — disclaimer: that's where I work — published a new blog post where they spoke to Nik Matthews, Genomics Manager in the ICR’s Tumour Profiling Unit, about the Illumina NovaSeq sequencing platform.

It's a little more technical than some of the ICR 'Science Talk' blog posts that we usually publish, which is why I thought I'd link to it here.

As someone who started their PhD around the time the yeast genome was being finished, I am still shocked by how far the world of DNA sequencing has come. This is something Nik refers to in his opening answer:

We can now produce data equivalent to the size of the original human genome project every six minutes, which is astonishing.

By comparison, the yeast genome project — an international collaboration involving many different labs — took over five years to sequence…all 12 Mbp of it! Read the full blog post to find out more about how, and why, the ICR adopted the NovaSeq platform:

Bioinformatics blogs

I was surprised to see this blog featured in a recent list of the Top 75 Bioinformatics Blogs and Websites for Bioinformaticians.

Any list that includes this blog — which I barely ever update these days — feels a little bit dubious, especially when I'm listed above some genuinely useful blogs.

Anyway, there are many genuinely useful blogs on this list so I recommend having a look at it. I also made an attempt a few years ago at listing some of my own favourite bioinformatics blogs…a list which seems to remain relevant.

Illumina's new NovaSeq platform unveiled at The Institute of Cancer Research, London

Dr Nik Matthews, Genomics Manager in the ICR's Tumour Profiling Unit. Credit: ICR

It feels a bit strange to be using this blog to link to a news post at my current employer, but I'm happy to share the news that the ICR has become the first organisation in the UK to deploy Illumina's NovaSeq platform.

The ICR's Dr Chris Lord, Deputy Director of the Breast Cancer Now Research Centre, had this to say:

One key area we are keen to use the NovaSeq sequencer for is to discover new ways to select the best available treatment for each individual cancer patient’s specific disease.

If we can do this, we should be able to improve how a significant number of patients are treated. With the NovaSeq system, this kind of work is now feasible – this will be a real game-changer for a lot of the work across the ICR.

Read more in the full news article on the ICR website:

Chromosome-Scale Scaffolds And The State of Genome Assembly

Keith Robison has written another fantastic post on his Omics! Omics! blog which is a great read for two reasons.

First he looks at the issues regarding chromosome-size scaffolds that can now be produced with Hi-C sequencing approaches. He then goes on to provide a brilliant overview of what the latest sequencing and mapping technologies mean for the field of genome assembly:

For high quality de novo genomes, the technology options appear to be converging for the moment on five basic technologies which can be mixed-and-matched.

  • Hi-C (in vitro or in vivo)
  • Rapid Physical Maps (BioNano Genomics)
  • Linked Reads (10X, iGenomX)
  • Oxford Nanopore
  • Pacific Biosciences
  • vanilla Illumina paired end

This second section should be required reading for anyone interested in genome assembly, particularly if you've been away from the field for a while.

Read the post: Chromosome-Scale Scaffolds And The State of Genome Assembly

What did I learn at the Festival of Genomics conference?

Last week I attended the excellent Festival of Genomics conference in London, organised by Front Line Genomics. This was the first time I had been to a conference as a communications person rather than as a scientist…something that felt quite strange.

In addition to live-tweeting many talks for The Institute of Cancer Research where I work, I also recorded some videos of ICR scientists on the conference floor. All were asked to respond to the same simple question: 'Why is genomics important for cancer research?'. You can see the video responses on the ICR's YouTube channel.

I also made a very short video to highlight one unusual aspect of the conference…the talks were pretty much silent. Wireless headphones worn by all audience members meant that there was no need to amplify the speakers…and therefore no need for the four different 'lecture theatres' to actually have any walls!


My first ICR blog post!

My final task was to write a blog post about some aspect of the conference. Before the conference started, I thought I might write something that was more focused on genomics technologies. However, I was surprised by how much of the conference covered genomics as part of healthcare.

In particular, I was left with the sense that genomics is finally delivering on some of the promises made back in 2003 when the human genome sequence was published. One of the target areas that was mentioned in this 2003 NIH press release was 'New methods for the early detection of disease'.

This is something that is now possible with whole genome sequencing being deployed as part of the 100,000 genomes project (undertaken by Genomics England). The ability to screen a patient for all known genetic diseases leads to many concerns and challenges — you should see Gattaca if you haven't already done so — but it was heartening to see how much groundwork has been put in to stay on top of some of these issues.

This is my first proper blog post for the ICR, and if you are interested in finding out more, please read my post on the ICR's Science Talk blog:

We have not yet reached 'peak CEGMA': record number of citations in 2016

Over the last few weeks, I've been closely watching the number of citations to our original 2007 CEGMA paper. Despite making it very clear on the CEGMA webpage that it has been 'discontinued', and despite leaving a comment in PubMed Commons that people should consider alternative tools, citations continue to rise.

This week we passed a milestone with the paper getting more citations in 2016 than in 2015. As the paper's Google Scholar page clearly shows, the citations have increased year-on-year ever since it was published:

While it is somewhat flattering to see research that I was involved in so highly cited — I can't imagine that many papers show this pattern of citation growth over such a long period — I really hope that 2016 marks 'peak CEGMA'.

CEGMA development started in 2005, a year that pre-dates technologies such as Solexa sequencing! People should really stop using this tool and try using something like BUSCO instead.

Assembling a twitter following: people continue to be interested in genome assembly

Late in 2010, I was asked to help organise what would initially become The Assemblathon and then more formally Assemblathon 1. One of the very first things I did was to come up with the name itself — more here on naming bioinformatics projects — register the domain name, and secure the Twitter account @Assemblathon.

The original goal was to use the website and Twitter account to promote the contest and then share details of how the competition was unfolding. This is exactly what we did, all the way through to the publication of the Assemblathon 1 paper in late 2011. Around this time it seemed to make sense to also use the Twitter account to promote anything else related to the field of genome assembly and that is exactly what I did.

As well as tweeting a lot about Assemblathon 2 and a little bit about the aborted but oh-so-close-to-launching Assemblathon 3, I have found time to tweet (and retweet) several thousand links to many relevant publications and software tools.

It seems that people are finding this useful as the account keeps gaining a steady trickle of followers. The graph below shows data from when I started tracking the follower growth in early 2014:

All of which leaves me to make two concluding remarks:

  1. There can be tremendous utility in having an outlet — such as a Twitter account — to focus on a very niche subject (maybe some would say that genome assembly is no longer a niche field?).
  2. Although I am no longer working on the Assemblathon projects — I'm not even a researcher any more — I'm happy to keep posting to this account as long as people find it useful.

101 questions with a bioinformatician #38: Gene Myers

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

Gene Myers is a Director at the Max-Planck Institute for Molecular Cell Biology and Genetics (MPI-CBG) and the Klaus Tschira Chair of the Center for Systems Biology Dresden (CSBD).

Maybe you've heard of Gene for his pivotal role in developing the Celera genome assembler, which led to genome assemblies for mouse, human, and Drosophila (the first whole-genome shotgun assembly of a multicellular organism). You may also know Gene from his work in helping develop a fairly obscure bioinformatics tool that no-one uses (just the 58,000 citations in Google Scholar).

His current research focuses on developing new methods for microscopy and image analysis; from his research page:

"The overarching goal of our group is to build optical devices, collect molecular reagents, and develop analysis software to monitor in as much detail as possible the concentration and localization of proteins, transcripts, and other entities of interest within a developing cohort of cells for the purpose of [developing] a biophysical understanding of development at the level of cell communication and force generation."

You can find out more about Gene by visiting his research page on the MPI-CBG website or by following him on Twitter (@TheGeneMyers). Finally, if you are interested in genome assembly then you may also want to check out his dazzlerblog ('The Dresden AZZembLER for long read DNA projects'). And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

The underlying technology is always changing and presenting new challenges, and the field is still evolving and becoming more "sophisticated". That is, there are still cool unsolved problems to explore despite the fact that some core aspects of the field, now in its middle-age in my view, are "overworked".

010. What's something that you don't enjoy about current bioinformatics research?

I'm really bored with networks and -omics. Stamp-collecting large parts lists seems to have become the norm despite the fact that it rarely leads to much mechanistic insight. Without an understanding of spatial organization and soft-matter physics, most important biological phenomena cannot be explained (e.g. AP axis orientation at the outset of worm embryogenesis).

Additionally, I was disgusted with the short-read DNA sequencers that, while cheap, produce truly miserable reconstructions of novel genomes. Good only for resequencing and digital gene expression/transcriptomics. Thank God for the recent emergence of the long-read machines.

011. If you could go back in time and visit yourself as a 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

At age 18 it's not so much about career specifics but one's general approach to education. For myself, I would have said, "go to class, knucklehead, and learn something from all the great researchers that are your teachers (instead of hanging out in your dorm room reading textbooks)", and for general advice to all at that stage I would say: learn mathematics and programming now while your mind is young and supple; you can acquire a large corpus of knowledge about biological processes later.

100. What's your all-time favorite piece of bioinformatics software, and why?

I don't use bioinformatics software, I make it :-) My favorite problem, not yet fully solved in my opinion, is DNA sequence assembly -- it is a combinatorially very rich string problem.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

N — as it encompasses all the rest :-)

How would you describe genomics without using any scientific jargon?

Yesterday was the Annual Student Conference at The Institute of Cancer Research, London. As part of the ICR's Communications team, we helped run a session about the myriad ways that science can (and should) be communicated more effectively.

During this session my colleague Rob Dawson (@BioSciFan on Twitter) introduced a fun tool called the Up-Goer Five Text Editor. This tool lets you edit text…but only by using the 1,000 most common words in the English language.

It was inspired by an XKCD comic which used the same approach to try to explain how an Apollo moon rocket works. Using this tool really makes you appreciate that just about every scientific word you might use is not on the list. So it is a good way of making you think about how to communicate science to a lay audience, completely free of jargon.

I thought I would have a go at explaining genomics. I couldn't even use the words 'science', 'machine', or 'blueprint' (let alone 'gene', 'DNA', or 'molecule'). Here is my attempt:

In every cell of our bodies, there is a written plan that explains how that cell should make all of the things that it needs to make. A cell that grows hair is very different to a cell that is in your heart or brain. However, all cells still have the same plan but different parts of the plan are turned on in different cells.

We first understood what the full plan looks like for humans in 2003. We can use computers to make sense of the plan and to learn more about how many parts are needed to make a human (about 20,000). The better we understand the plan, the more we might be able to make human lives better.

You can edit my version online, but I encourage people to try explaining their own field of work using this tool.