Genomic makeovers: the number of ‘improved’ genome sequences is increasing

Image from flickr user londonmatt. Licensed under Creative Commons CC BY 2.0 license

Excluding viruses, the genome that can claim to have been completed before any others was that of the bacterium Haemophilus influenzae, the sequence of which was described in Science on July 28 1995.

I still find it pretty amazing to recall that just over a year later, the world saw the publication of the first complete eukaryotic genome sequence, that of the yeast Saccharomyces cerevisiae.

The fields of genomics and genome sequencing have continued to grow at breakneck speed, and the days of a genome sequence automatically meriting a front cover story in Nature or Science are long gone.

Complete vs Draft vs Improved

I’ve written previously about the fact that although more genomes than ever are being sequenced, fewer seem to be ‘complete’. I’ve also written a series of blog posts that address the rise of ‘draft genomes’.

Today I want to highlight another changing aspect of genome sequencing, that of the increasing number of publications that describe ‘improved’ genomes. Some recent examples:

Improving genomes is an increasing trend

To check whether there really are more ‘improved’ sequences being described, I looked in Google Scholar to see how many papers feature the terms ‘complete genome|assembly’ vs ‘draft genome|assembly’ vs ‘improved genome|assembly’ (these Google Scholar links reveal the slightly more complex query that I used). In gathering data I went back to 1995 (the date of the first published genome sequence).

As always with Google Scholar, these are not perfect search terms and they all pull in matches which are not strictly what I’m after, but the results do reveal an interesting picture:

Number of publications in Google Scholar referencing complete vs draft vs improved genomes/assemblies

It is clear that the number of publications referencing ‘complete’ genomes/assemblies has been increasing at a steady rate. In contrast, publications describing ‘draft’ genomes have grown rapidly in the last decade but the rate of increase is slowing. When it comes to ‘improved’ genomes, it looks like we are in a period where many more papers are being published that describe improved versions of existing genomes (in 2017 there was a 54% increase in such papers compared to 2016).

Why improve a genome?

I wonder how much of this growth reflects the sad truth that many genomes that were published in the post-Sanger, pre-nanopore era (approximately 2005–2015) were just not very good. Many people rushed to adopt the powerful new sequencing technologies provided by Illumina and others, and many of the genomes published using those technologies are now being given makeovers with newer sequencing, scaffolding, and mapping technologies.

The updated pine genome (the last publication on the list above) says as much in its abstract (emphasis mine):

The 22-gigabase genome of loblolly pine (Pinus taeda) is one of the largest ever sequenced. The draft assembly published in 2014 was built entirely from short Illumina reads, with lengths ranging from 100 to 250 base pairs (bp). The assembly was quite fragmented, containing over 11 million contigs whose weighted average (N50) size was 8206 bp. To improve this result, we generated approximately 12-fold coverage in long reads using the Single Molecule Real Time sequencing technology developed at Pacific Biosciences. We assembled the long and short reads together using the MaSuRCA mega-reads assembly algorithm, which produced a substantially better assembly, P. taeda version 2.0. The new assembly has an N50 contig size of 25 361, more than three times as large as achieved in the original assembly, and an N50 scaffold size of 107 821, 61% larger than the previous assembly.
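For anyone unfamiliar with the N50 statistic quoted in that abstract, here is a minimal sketch of how it is usually calculated: sort the contig lengths from longest to shortest, then report the length of the contig at which the running total first reaches half of the total assembly size. The function name and the toy contig lengths are my own illustration and are not taken from the paper; only the two N50 contig values come from the abstract.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all the bases in the assembly.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy assembly (made-up lengths, purely for illustration).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_contigs)

# Fold improvement in N50 contig size, using the two values
# quoted in the abstract (8206 bp in v1, 25361 bp in v2.0).
fold_improvement = 25361 / 8206

print(toy_n50)
print(round(fold_improvement, 2))
```

The ratio of the two quoted N50 contig sizes comes out at a little over three, which matches the abstract's "more than three times as large".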

Perhaps I’m being a bit harsh in saying that the first versions of many of these genomes that have been subsequently improved were not very good. The more important lesson to bear in mind is that, in reality, a genome is never finished and that all published sequences represent ‘works in progress’.

Fun at the Festival: a mini-report of the 2018 Festival of Genomics in London

Graffiti wall at the Festival of Genomics

Last week I once again attended the excellent Festival of Genomics conference in London. As before, I was attending in order to produce some coverage for The Institute of Cancer Research (where I work).

I really enjoy the mixture of talks at this conference which always has a strong leaning towards medicine in general and the rapid integration of genomics in the NHS in particular. This was a topic I explored in more detail in a blog post last year for the ICR.

I like how the conference organisers, Front Line Genomics, make an effort to ensure the conference is fun and engaging. It is easy to dismiss things like 'graffiti walls' and 'recharge stations' (where you can power up your mobile phone and get a massage) as gimmicks, but I think they add to a feeling that this is a modern and vibrant conference.

NHS meets NGS

Opening the conference was a presentation from Sir Malcolm Grant, the Chairman of NHS England. He presented an update on Genomics England's 100,000 Genomes Project.

Sir Malcolm noted that consent is such an important part of this project as participants are consenting to provide information that may affect others, e.g. their children and heirs. He stressed the importance of ensuring public trust and support as the project moves forwards.

Although initial progress towards achieving those 100,000 genomes may have been slower than some would have liked, work has been accelerating. The project has taken almost five years to reach the halfway point but is now on course to reach the 100K milestone within the next 12 months.

The following day saw Genomics England's Chairman, Sir John Chisholm, take to the stage for a chat with Carl Smith (Managing Editor of Front Line Genomics). He stressed that people should think of the 100,000 Genomes Project as a "pilot for the main game", i.e. the routine sequencing of patients within the NHS.

Rigged for silent running

The conference has four 'stages', but as the whole area at ExCeL London is just one big open space, the organisers use wireless headphones so that the conference areas are effectively silent to people walking past.

In addition to headphones being left on each seat, there are also many additional headsets that can be given out to people who are just standing by the sides of the 'stages' to more casually listen in to each session.

When genomics meets radiotherapy

This year the ICR was honoured with our own conference session in which four early-career researchers talked about how they used genomics data in their own areas of cancer research.

I have written a Science Talk blog post for the ICR that focuses on a presentation at the conference by Dr James Campbell, who is a Lead Bioinformatician at the ICR. He is using genome data from almost 2,000 patients who have undergone radiotherapy treatment for prostate cancer, in order to develop a model which predicts how well a new patient (given their particular set of genotypes and clinical factors) will respond to radiotherapy.

You can read the blog post here: