Brief thoughts on Karyn Meltz Steinberg's ASHG 2015 talk on genome assembly improvement

I like it when people a) share their slides online and b) share their slides online soon after they give a talk somewhere. This is particularly helpful when you want to quickly catch up on developments from a conference that you couldn't attend. Karyn Meltz Steinberg (@KMS_Meltzy on twitter) ticks both boxes because she posted her #ASHG2015 slides almost as soon as her talk finished. The title of her talk was:

Building a platinum human genome assembly from single haplotype human genomes generated from long molecule sequencing

Her slides — hosted on Slideshare — are embedded below.

What interested me from this talk is the use of sequence maps generated by the BioNano Genomics Irys platform to improve genome assemblies. This technology seems to be growing in popularity, offering an easier (and more powerful?) alternative to 'traditional' optical map solutions. This work is part of the McDonnell Genome Institute's Reference Genomes Improvement project, which includes the following — very laudable — aim:

  • We plan to identify and resolve issues (misassemblies, sequence errors, and gaps) within the current reference GRCh38.

I find it interesting that this project has also defined two levels of genome status:

Gold Genome: A high-quality, highly contiguous representation of the genome with haplotype resolution of critical regions.

Platinum Genome: A contiguous, haplotype-resolved representation of the entire genome.

Not clear from these definitions whether platinum genomes can still include short regions of unknown bases (Ns). A figure on the Reference Genomes Improvement project page also hints at a 'Silver' status, making me think it is only a matter of time before we see the addition of a credit-card-esque 'diamond' status level: no unknown bases, with full representation of tandem repeat arrays, e.g. centromeres, and priority booking for VIP tickets at major sporting events.

This JABBA-award winning software wants a shot at redemption

A new tool has been described in the journal Bioinformatics:

All of the words that contribute to the name of this acronym are right there in the article's title. But as this is a JABBA-award-worthy name, we don't expect each word to contribute its first letter (or only one letter):

REDEMPTION: REduced Dimension Ensemble Modeling and Parameter estimaTION
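To see exactly which letters each word donates, here is a minimal Python sketch that pulls the capitalised letters out of the full name. The function name `acronym_from_capitals` is just something I made up for illustration; it is not part of the tool itself.

```python
def acronym_from_capitals(phrase: str) -> str:
    """Join every uppercase letter in the phrase, in the order it appears."""
    return "".join(ch for ch in phrase if ch.isupper())


full_name = "REduced Dimension Ensemble Modeling and Parameter estimaTION"
acronym = acronym_from_capitals(full_name)

# Count how many letters each word contributes to the acronym
letters_per_word = {
    word: sum(1 for ch in word if ch.isupper()) for word in full_name.split()
}

print(acronym)
print(letters_per_word)
```

The per-word counts make the uneven contributions obvious: one word gives nothing, and the last word supplies four letters from its tail end.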

This is certainly far from being the most bogus bioinformatics acronym that I have seen and — as far as I can tell — the name is unique (within the context of bioinformatics). However, I am particularly wary of tools that use a short name which a) has no obvious connection to what the software actually does and b) has potentially emotive associations in other contexts, e.g. religion and/or politics.

Front Line Genomics interview with Illumina CEO Jay Flatley includes a question from me

Issue 5 of the Front Line Genomics magazine is now online (in PDF format). The latest issue includes a fascinating interview with Jay Flatley, CEO of a little sequencing company you may have heard of (Illumina).

Front Line Genomics are continuing their trend of allowing former interviewees to ask questions, and I was lucky enough to have one of my questions chosen for the interview. Here is my (somewhat lengthy) question along with Jay's response:

KB: When Apple introduced the original iPod in 2001, it was an expensive luxury ($400) that went on to change an entire industry. Remarkably, by 2006 the iPod product line was Apple’s largest source of revenue. Today, you can buy a cheap iPod knock-off for less than $20 and iPods now account for <1% of Apple’s revenue. Smart phones have made dedicated music players largely irrelevant.

So here’s my question...is the Illumina of 2015 like the Apple of 2006? What does Illumina do when, in five to ten years’ time, everyone will be getting his or her genomes sequenced and analyzed in an automated manner for less than $100? If the HiSeq platform is Illumina’s iPod, what’s going to be your iPhone?


JF: That’s a great question! We certainly do believe in the 5-10 year time frame that the ability to sequence a genome will be available to everyone because the economics will be there and the clinical utility will be there. That will be an enormous market opportunity.

The first thing I would say is, unlike the iPod which became a commodity because the actual technology could be replicated by other companies – especially the physical interface, the headphone jack and storage inside the iPod. Sequencing is quite challenging by comparison. It requires the intersection of a dramatically larger number of technologies which all have to work together in quite a complicated and sophisticated way. Having said that, we think that our sequencers need to become easier to use, need to become faster, need to become cheaper.

These are all things we’re working on. We obviously can’t lay out our roadmap for people today, but there will be technologies beyond the HiSeq, and those technologies will ultimately enable people to sequence their genomes at much lower prices than $1,000. The trick for Illumina, of course, is to be the company that introduces that technology, and brings the equivalent of the iPhone that largely replaced the iPod to market and that we don’t let someone else do that before we do it.

The full interview starts on page 20 of the PDF, and you can access previous issues of Front Line Genomics magazine here.

Ewan Birney reflects on the use of twitter and blogging for science communication [Link]

Worth reading. Ewan includes some comments regarding the growing use of pre-print platforms:

Blogging is nice, because it is accessible to a broader audience and allows for a more chatty, 'natural language' style – but if the main purpose is to communicate with scientists, pre-publication servers are a better way to go

Ewan singles out arXiv, bioRxiv, and F1000Research, but I think PeerJ are also worth a mention here. They also have their own pre-print server and they also encourage open peer review.

Additionally, I think figshare is another outlet that can be used for dissemination of science material that may not be suitable for a peer reviewed publication. One cool thing about using figshare for posting preliminary data or commentary pieces is that articles are allocated a DataCite DOI and can therefore be cited.

BioNano Genomics are holding a webinar on October 12th about 'Hybrid Scaffolds' [Link]


The Irys platform by BioNano Genomics seems to be a useful tool by which to help assess the completeness and contiguity of genome assemblies. I noticed today that they are holding a webinar which may be of interest to some of the readers of this blog. From the webinar registration page:

Please join us for our Hybrid Scaffold webinar. We will discuss the applications of the software, which include the building of longer contigs, validation of existing contigs, and acquisition of novel map level information. Experimental design considerations will be reviewed for ideal data integration. We will walk through each stage of Hybrid Scaffold troubleshooting data optimization and pointing out files of interest.

 

Financial disclaimer: I do not own shares in any biotechnology company.

What a difference a day makes: markets react to PacBio's new sequencing platform

Daily change in share price at close of trade on October 1st, 2015:

Figures from Google Finance

Update: Earlier version of this figure incorrectly cited a 19% drop in Illumina's share price. My bad: -18.6 was the price change in dollars, not the percentage change.
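For anyone else who might trip over the same thing, here is a small Python sketch of the difference between a dollar change and a percentage change. The -18.6 is the dollar figure mentioned above, but the previous closing price of 200.0 is purely hypothetical and chosen only to make the arithmetic easy to follow; it is not Illumina's actual share price.

```python
def percent_change(previous_close: float, dollar_change: float) -> float:
    """Convert an absolute price change into a percentage of the previous close."""
    return 100.0 * dollar_change / previous_close


DOLLAR_CHANGE = -18.6                # price change in dollars (from the update above)
HYPOTHETICAL_PREVIOUS_CLOSE = 200.0  # made-up value, for illustration only

pct = percent_change(HYPOTHETICAL_PREVIOUS_CLOSE, DOLLAR_CHANGE)
print(f"Dollar change: {DOLLAR_CHANGE:+.1f}")
print(f"Percentage change: {pct:+.1f}%")
```

The two numbers only coincide when the previous close happens to be exactly $100, which is why reading one as the other is an easy mistake to make.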

 

Financial disclaimer: I do not own shares in any biotechnology company. 

How to sequence and assemble a large eukaryote genome with long reads in 2015 [Link]

If you have any interest in the latest methods of DNA sequencing and/or genome assembly, you really owe it to yourself to be following Lex Nederbragt's excellent In between lines of code blog. Today's post offers some useful advice:

Main advice: bite the bullet and get the budget to get 100x coverage in long PacBio reads. 50-60x is really the minimum.
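To put those coverage figures into perspective, here is a rough Python sketch of the usual back-of-the-envelope calculation, where coverage is simply the total number of sequenced bases divided by the genome size. The 1 Gb genome size is a hypothetical example of mine and does not come from Lex's post.

```python
def bases_needed(genome_size_bp: float, coverage: float) -> float:
    """Total bases of sequence required to reach a target average coverage."""
    return genome_size_bp * coverage


def coverage_from_yield(total_bases: float, genome_size_bp: float) -> float:
    """Average coverage achieved by a given amount of sequence data."""
    return total_bases / genome_size_bp


HYPOTHETICAL_GENOME_SIZE_BP = 1_000_000_000  # assumed 1 Gb genome, for illustration

for target in (50, 60, 100):
    gigabases = bases_needed(HYPOTHETICAL_GENOME_SIZE_BP, target) / 1e9
    print(f"{target}x coverage needs about {gigabases:.0f} Gb of long reads")
```

This ignores real-world details such as read length distribution and uneven coverage, so treat it as a budgeting starting point and nothing more.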

Who is saying what about the new PacBio Sequel system?

The big news from the world of DNA sequencing this week was that Pacific Biosciences has launched a new sequencing platform. The successor to their RS II platform has been named The Sequel System and it will be on display at the upcoming American Society of Human Genetics meeting. The new system promises to sequence a human genome (at 10x coverage) for $3,000.

The early buzz already seems pretty positive, and hopefully this sequel will turn out to be more like The Empire Strikes Back than, say, Highlander II. What follows is a fairly comprehensive roundup of what people have been saying about this new platform. Note that this story has been updated several times since I first wrote it (details of these updates are included at the end of this post):

From PacBio

From science news websites

'Traditional' news outlets

From blogs

From discussion forums

From the world of finance

I guess the question that everyone is asking now concerns the possibility of someone making a genome assembly from sequence data using this platform, and then using this tool to produce a better version of the assembly. In this case, would it be a sequel Sequel SEQuel genome assembly?

 

Questions from the conference call

There were a lot of questions asked in the hour-long conference call. I've transcribed some of them and indicated the time point that you can jump to if you are interested in hearing PacBio's answers to specific questions:

  • 7:40: "Can you give us some thoughts on turnaround time and cost per genome?"
  • 11:20: "Can you talk about the use case beyond your current customer base? How this expands the number of applications?"
  • 15:17: "Can you help us think about some of the major changes that went into the system? Is there still a manifold that moves in three dimensions?"
  • 19:20: "From a user standpoint, are there any changes to site preparations that you would have to make from Sequel vs RS II; any limitations on things like putting it on 2nd/3rd/4th floor?"
  • 22:25: "You've introduced a number of kits with various applications for the RS II, will the Sequel be able to run all of the applications from the beginning, or will it take time to introduce certain applications to the system?"
  • 24:34: "Are there specific customer types that you think are positioned to be more on the earlier side of adoption, such as human sequencing, or microbiology, plant, animal etc.?"
  • 33:20: "Can you give a perspective on what the scalability of this platform looks like comparatively (to the RS II)?"
  • 35:08: "In terms of the metrics you gave around price per human genome, can you help us think about that relative to Illumina? If you take a 30x coverage genome on Illumina, what is the equivalent coverage you would need on the Sequel to get something similar…and how long would that take you to do?"
  • 38:29: "Recognising a lot has been achieved with this launch: different computer architecture, different form factor, new optical systems, higher density, with a smaller footprint. I just want to make sure, there's no compromise in raw accuracy expected relative to the RS II?"
  • 47:46: "Could you describe in layman's terms the benefits of methylation detection for your system?"
  • 50:50: "With your technology relative to other platforms, can you help us understand — if you have these larger pieces of the puzzle if you will — how advantageous that could be after you're done generating data, when you get down to assembling the genome?"
  • 53:16: "I'm curious what percentage of potential customers that looked at the RS II passed given the high price tag? What is the incremental buyer opportunity at the price point of $350,000?"
  • 57:35: "Still trying to understand what percentage of competitive platforms you think you can swap out with the Sequel?"
 

Updates

2015-10-01 13.46: Added some more sources of news, including questions asked in conference call
2015-10-01 20.04: Added in more conference call details, with time points of different questions.
2015-10-01 20.39: Added Keith Robison's blog post
2015-10-02 06.34: Changed link for Bio-IT World's piece
2015-10-02 09.08: Added more links about PacBio's presentation at ASHG 2015
2015-10-02 09.41: Added link to CoreGenomics post and added disclaimer
2015-10-02 11.54: Added links to Sequel-related discussions on SEQanswers and reddit
2015-10-02 13.28: Added Biomusings and Checkmate Scientist blog posts, and split main part of article into different sections
2015-10-12 09.52: Addition of NBC Bay Area News piece
2015-10-14 16.57: Addition of 2nd GenomeWeb story
2015-10-23 20.02: Addition of 3rd GenomeWeb story

 

Financial disclaimer: I do not own shares in any biotechnology company.

Question: when is a GitHub repository not a GitHub repository?

Answer: when it doesn't contain any useful code.

Update 2015-10-02 08.58: this post was updated to reflect the addition of code to the metaPORE repository.

A discussion on twitter today revealed something which I find very disappointing:

A new paper by Greninger et al. (Rapid metagenomic identification of viral pathogens in clinical samples by real-time nanopore sequencing analysis) has been published in the journal Genome Medicine. The Methods contains the following line:

We developed a custom bioinformatics pipeline for real-time pathogen identification and visualization from nanopore sequencing data (MetaPORE) (Fig. 1b), available under license from UCSF at [23].

Reference #23 takes you to the metaPORE GitHub repository. At the time I initially wrote this post — and as the screen grab below shows — it contained zero code. Thankfully this has been changed and a set of Python and shell scripts are now available.

Maybe this was just some sort of error in scheduling the release of the paper and the code. However, journals and authors should understand that if a paper (or a pre-print) appears online and points to a code repository (or any other website), the expectation is that people should be able to visit the site in question and download code.