101 questions with a bioinformatician #35: Aaron Darling

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Aaron Darling is an Associate Professor at the ithree institute — where capital letters are in short supply? — which is part of UTS (University of Technology Sydney). His research focuses on developing computational and molecular techniques to characterize the hidden world of microbes. He helped develop the Mauve multiple genome alignment tool and continues to work on this and other software tools. Aaron also has a long-standing interest in poop:

Of course this interest is all part of an ongoing research project, one that is seeking to understand the development of the infant gut microbiome.

You can find out more about Aaron by visiting his lab's website, or by following him on twitter (@koadman). And now, on to the 101 questions...



001. What's something that you enjoy about current bioinformatics research?

The growing interplay between informatics, molecular biology, and experimental design is very exciting. In the past 10 years many problems that could only have been solved through decades of experimental work have been transformed from experimental problems to data analysis problems. I think this trend will only accelerate as our technology to interface digital computational systems with biological systems continues to improve. And data analysis feeds back to inspire new experimental designs in a feedback loop that's getting ever-shorter. As an informatician I find it especially fun to discover new ways of designing the lab work that solves long-standing data analysis problems.



010. What's something that you don't enjoy about current bioinformatics research?

Data wrangling and data mangling. This is almost certainly a cliché by now, but inconsistently implemented file formats are the bane of bioinformatics. This was apparent to me within weeks of starting in the field, as my first assigned task was to write a sequence file format parsing library for the E. coli genome project team. I often wonder why I didn't run as fast as I could in the opposite direction.



011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Early on I benefited from a nugget of wisdom in Dan Gusfield's sequence analysis book which emphasized the importance of solving biological data analysis problems that are core to the biology, not the technology platform used to measure the biology. For example the general sequence alignment problem vs. short read alignment. Those are the contributions that are going to matter over the long term. I wish I had also appreciated early on that the elegance and simplicity of the solution, and especially the code implementing it, matters just as much.



100. What's your all-time favorite piece of bioinformatics software, and why?

Probably BEAST, because I learned so much about phylogenetic models, MCMC, and software design from using it and coding up modules for it.



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

H, because as a teenager I always wanted to be a G but in reality was everything but.
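
For readers who don't have the ambiguity codes memorised, the joke can be checked with a small lookup table. This is only a sketch: the mapping below covers the standard IUPAC base and ambiguity letters (it leaves out U and any gap symbols), and the helper names `bases_for` and `excluded_bases` are made up for illustration.

```python
# Standard IUPAC nucleotide letters mapped to the DNA bases they can represent.
# U and gap symbols are deliberately left out of this sketch.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def bases_for(code):
    """Return the set of bases that an IUPAC code can stand for (hypothetical helper)."""
    return set(IUPAC_CODES[code.upper()])


def excluded_bases(code):
    """Return the bases that an IUPAC code can never be (hypothetical helper)."""
    return set("ACGT") - bases_for(code)


if __name__ == "__main__":
    print("H can be:", sorted(bases_for("H")))
    print("H is never:", sorted(excluded_bases("H")))
```

In other words, H stands for A, C, or T: anything except G.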

We have not yet reached peak CEGMA

I was alerted to some disturbing news this weekend: CEGMA won't die!

CEGMA is a tool that I helped develop back in 2005. The first formal publication that describes CEGMA came out in 2007, and since then it has seen year-on-year growth in the number of citations to this paper.

I keep on thinking that this trend must end soon, and I was therefore hopeful that 2014 might have been the year of peak CEGMA. There were three reasons why I thought this might happen:

  1. CEGMA is no longer being developed or supported
  2. I have used the PubMed page for the CEGMA paper to advocate that people should no longer use this tool
  3. CEGMA is heavily reliant on an — increasingly out-of-date — database of orthologs that was published in 2003

However, despite my best wishes, Google Scholar has revealed to me that 2015 has now seen more citations to the CEGMA paper than in any previous year:

CEGMA citation details from Google Scholar

I'm hopeful that the development of the BUSCO software by Felipe Simão et al. will mean that 2015 will definitely be the year of peak CEGMA!

The next BioNano Genomics Webinar is about improving genome assemblies with gEVAL

The next BioNano Genomics webinar will be October 27th, 2015 at 8:00 am (PST). The title of the webinar is:

gEVAL- A Genome Evaluation Browser for Improving Genome Assemblies

The gEVAL browser is managed by the Genome Reference Informatics Team (GRIT) at the Wellcome Trust Sanger Institute. William Chow, the lead developer of gEVAL, will be leading the webinar.

You can register for the event here.

 

Financial disclaimer: I do not own shares in any biotechnology company.

Limited tickets still available for 'Bio in Docker' Symposium (November 2015)

This free symposium, organised by King's College London and the Biomedical Research Centre (BRC), will bring together people interested in using Docker images in the field of bioinformatics, and will include a 'mini-hackday' session.

Event description

Docker is now establishing itself as the de facto solution for containerization across a wide range of domains. The advantages are attractive, from reproducible research to simplifying deployment of complex code.

This event will bring together some notable cases to discuss how the advantages of this new technology can best be achieved within the Bioinformatics space.

Event details

  • November 9–10, 2015
  • London, UK at the Wellcome Collection
  • Register through Eventbrite

Talk details

  • Peter Belmann (Bioboxes): i) Evaluating and ranking bioinformatics software using docker containers. ii) Overview of the BioBoxes project
  • Nebojsa Tijanic (SB Genomics): Portable workflow and tool descriptions with Common Workflow Language and Rabix
  • Paolo Di Tommaso (Nextflow / Notredame Lab, CRG): Manage reproducibility in genomics pipelines with Nextflow and Docker containers
  • Amos Folarin & Stephen Newhouse (NGSeasy/KCL): Next generation sequencing pipelines in Docker
  • Tim Hubbard (Genomics England / KCL): Pipelines to analyse data from the 100,000 genomes project as part of the Genomics England Clinical Interpretation Partnership (GeCIP)
  • Fabien Campagne (Campagne lab): MetaR and the Nextflow Workbench: application of Docker and language workbench technology to simplify bioinformatics training and data analysis.
  • Elijah Charles (Intel): Bioinformatics and the packaging melee
  • Brad Chapman (Blue Collar Bioinformatics): Improving support and distribution of validated analysis tools using Docker
  • Michael Ferranti / Kai Davenport (ClusterHQ): Data, Volumes and portability with Flocker
  • Ilya Dmitrichenko (Weave): Application-oriented networking with Weave
  • Aanand Prasad (Docker): Orchestrating Containers with Docker Compose

The most haplotyped place on Earth: 'DNA Land' is open for business!

DNA Land has opened! If you are curious what DNA Land is, well here is the concise description offered by the website:

DNA Land is a place where you can learn more about your genome while enabling scientists to make new genetic discoveries for the benefit of humanity. Our goal is to help members to interpret their data and to enable their contribution to research.

At the time I captured the above screenshot, the site boasted '2,483 genomes and counting'. At the time I started writing this piece it had already risen to '2,501' genomes. Erika Check Hayden gives a good overview of DNA Land in a Nature news item: Scientists hope to attract millions to 'DNA.LAND'.

So DNA Land is a place to learn more about your genome, which aims to attract millions of visitors, and where you can also earn badges. Hmm, makes me wonder whether I should wait for DNA World to open…especially if the lines are long.

'The amount of foil needed to wrap five breakfast sandwiches': a new metric for genomics?

Photo by Robyn Lee for seriouseats.com

The journal Genome Research is celebrating its 20th anniversary and has marked the occasion by issuing a number of 'perspective' articles. One of these — A vision for ubiquitous sequencing — includes one of the strangest comparisons that I've ever seen in the field of genomics (or really any field):

Back in 1990, sequencing 1 million nucleotides cost the equivalent of 15 tons of gold (adjusted to 1990 price). At that time, this amount of material was equivalent to the output of all United States gold mines combined over two weeks. Fast-forwarding to the present, sequencing 1 million nucleotides is equivalent to the value of ∼30 g of aluminum. This is approximately the amount of material needed to wrap five breakfast sandwiches at a New York City food cart.

Most people will understand the point that is being made here. Sequencing used to be really expensive whereas now it is very cheap. But is there really a need to explain what 30 grams of aluminum foil amounts to in a more 'human-friendly' unit? And even if such a comparison is deemed necessary, is the use of 'breakfast sandwiches' from New York City food carts the most suitable choice?

Brief thoughts on Karyn Meltz Steinberg's ASHG 2015 talk on genome assembly improvement

I like it when people a) share their slides online and b) share their slides online soon after they give a talk somewhere. This is particularly helpful when you want to quickly catch up on developments from a conference that you couldn't attend. Karyn Meltz Steinberg (@KMS_Meltzy on twitter) ticks both boxes because she posted her #ASHG2015 slides almost as soon as her talk finished. The title of her talk was:

Building a platinum human genome assembly from single haplotype human genomes generated from long molecule sequencing

Her slides — hosted on Slideshare — are embedded below.

What interested me from this talk is the use of sequence maps generated by the BioNano Genomics Irys platform to improve genome assemblies. This technology seems to be growing in popularity, offering an easier (and more powerful?) alternative to 'traditional' optical map solutions. This work is part of the McDonnell Genome Institute's Reference Genomes Improvement project, which includes the following — very laudable — aim:

  • We plan to identify and resolve issues (misassemblies, sequence errors, and gaps) within the current reference GRCh38.

I find it interesting that this project has also defined two levels of genome status:

Gold Genome: A high-quality, highly contiguous representation of the genome with haplotype resolution of critical regions.

Platinum Genome: A contiguous, haplotype-resolved representation of the entire genome.

It is not clear from these definitions whether platinum genomes can still include short regions of unknown bases (Ns). A figure on the Reference Genomes Improvement project page also hints at a 'Silver' status, making me think it is only a matter of time before we see the addition of a credit-card-esque 'diamond' status level: no unknown bases, with full representation of tandem repeat arrays, e.g. centromeres, and priority booking for VIP tickets at major sporting events.

This JABBA-award winning software wants a shot at redemption

A new tool has been described in the journal Bioinformatics:

All of the words that contribute to the name of this acronym are right there in the article's title. But as this is a JABBA-award-worthy name, we don't expect each word to contribute its first letter (or only one letter):

REDEMPTION: REduced Dimension Ensemble Modeling and Parameter estimaTION

This is certainly far from being the most bogus bioinformatics acronym that I have seen and — as far as I can tell — the name is unique (within the context of bioinformatics). However, I am particularly wary of tools that use a short name which a) has no obvious connection to what the software actually does and b) has potentially emotive associations in other contexts, e.g. religion and/or politics.

Front Line Genomics interview with Illumina CEO Jay Flatley includes a question from me

Issue 5 of the Front Line Genomics magazine is now online (in PDF format). The latest issue includes a fascinating interview with Jay Flatley, CEO of a little sequencing company you may have heard of (Illumina).

Continuing their trend of allowing former interviewees to ask questions, I was lucky enough to have one of my questions chosen for the interview. Here is my (somewhat lengthy) question along with Jay's response:

KB: When Apple introduced the original iPod in 2001, it was an expensive luxury ($400) that went on to change an entire industry. Remarkably, by 2006 the iPod product line was Apple’s largest source of revenue. Today, you can buy a cheap iPod knock-off for less than $20 and iPods now account for <1% of Apple’s revenue. Smart phones have made dedicated music players largely irrelevant.

So here’s my question...is the Illumina of 2015 like the Apple of 2006? What does Illumina do when, in five to ten years’ time, everyone will be getting his or her genomes sequenced and analyzed in an automated manner for less than $100? If the HiSeq platform is Illumina’s iPod, what’s going to be your iPhone?


JF: That’s a great question! We certainly do believe in the 5-10 year time frame that the ability to sequence a genome will be available to everyone because the economics will be there and the clinical utility will be there. That will be an enormous market opportunity.

The first thing I would say is, unlike the iPod which became a commodity because the actual technology could be replicated by other companies – especially the physical interface, the headphone jack and storage inside the iPod. Sequencing is quite challenging by comparison. It requires the intersection of a dramatically larger number of technologies which all have to work together in quite a complicated and sophisticated way. Having said that, we think that our sequencers need to become easier to use, need to become faster, need to become cheaper.

These are all things we’re working on. We obviously can’t lay out our roadmap for people today, but there will be technologies beyond the HiSeq, and those technologies will ultimately enable people to sequence their genomes at much lower prices than $1,000. The trick for Illumina, of course, is to be the company that introduces that technology, and brings the equivalent of the iPhone that largely replaced the iPod to market and that we don’t let someone else do that before we do it.

The full interview starts on page 20 of the PDF, and you can access previous issues of Front Line Genomics magazine here.

Ewan Birney reflects on the use of twitter and blogging for science communication [Link]

Worth reading. Ewan includes some comments regarding the growing use of pre-print platforms:

Blogging is nice, because it is accessible to a broader audience and allows for a more chatty, 'natural language' style – but if the main purpose is to communicate with scientists, pre-publication servers are a better way to go

Ewan singles out arXiv, bioRxiv, and F1000Research, but I think PeerJ are also worth a mention here. They also have their own pre-print server and they also encourage open peer review.

Additionally, I think figshare is another outlet that can be used for dissemination of science material that may not be suitable for a peer reviewed publication. One cool thing about using figshare for posting preliminary data or commentary pieces is that articles are allocated a DataCite DOI and can therefore be cited.