It is clear that the number of publications referencing ‘complete’ genomes or assemblies has been increasing at a steady rate. In contrast, publications describing ‘draft’ genomes grew rapidly over the last decade, but that rate of increase is now slowing. As for ‘improved’ genomes, we appear to be in a period where many more papers are describing improved versions of existing genomes (in 2017 there was a 54% increase in such papers compared to 2016).
Why improve a genome?
I wonder how much of this growth reflects the sad truth that many genomes published in the post-Sanger, pre-nanopore era (approximately 2005–2015) were just not very good. Many people rushed to adopt the powerful new sequencing technologies provided by Illumina and others, and many of the genomes published with those technologies are now being given makeovers using newer sequencing, scaffolding, and mapping technologies.
The updated pine genome (the last publication on the list above) says as much in its abstract (emphasis mine):
The 22-gigabase genome of loblolly pine (Pinus taeda) is one of the largest ever sequenced. The draft assembly published in 2014 was built entirely from short Illumina reads, with lengths ranging from 100 to 250 base pairs (bp). The assembly was quite fragmented, containing over 11 million contigs whose weighted average (N50) size was 8206 bp. To improve this result, we generated approximately 12-fold coverage in long reads using the Single Molecule Real Time sequencing technology developed at Pacific Biosciences. We assembled the long and short reads together using the MaSuRCA mega-reads assembly algorithm, which produced a substantially better assembly, P. taeda version 2.0. The new assembly has an N50 contig size of 25 361, more than three times as large as achieved in the original assembly, and an N50 scaffold size of 107 821, 61% larger than the previous assembly.
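Since the abstract leans so heavily on N50, here is a rough sketch of how that statistic is usually computed: sort the contig lengths from longest to shortest, then walk down the list until the running total reaches half of the total assembly length. The `n50` function and the toy lengths are my own illustration, not code or data from the pine paper. The last two lines simply check the ‘more than three times as large’ claim against the two contig N50 values quoted above.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("need at least one contig length")


# Made-up toy lengths for illustration only (not from any real assembly).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))  # 8

# Contig N50 values quoted in the abstract: 8206 bp (2014 draft), 25 361 bp (v2.0).
fold_change = 25361 / 8206
print(round(fold_change, 2))  # 3.09
```

Note that N50 says nothing about whether the contigs are correct. It only describes how contiguous the assembly is, which is one reason a bigger N50 does not automatically mean a better genome.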
Perhaps I’m being a bit harsh in saying that the first versions of many of these subsequently improved genomes were not very good. The more important lesson to bear in mind is that a genome is never really finished, and that all published sequences represent ‘works in progress’.