Making code available: lab websites vs GitHub vs Zenodo

Our 2007 CEGMA paper was received by the journal Bioinformatics on December 7, 2006, which means that, about nine years ago, we needed to host the software on a website somewhere with a URL that we could include in the manuscript. This is what ended up in the paper:

Even though we don't want people to use CEGMA anymore, I'm at least happy that the link still works! I thought it would be prudent to make the paper link to a generic top-level page of our website (/Datasets) rather than to a specific page. This is because experience has taught me that websites often get reorganized, and pages can move.

If we were resubmitting the paper today and if we were still linking to our academic website, then I would probably suggest linking to the main website page (korflab.ucdavis.edu) and ensuring that the link to CEGMA (or 'Software') was easy to find. This would avoid any problems with moving/renaming pages.

However, even better would be to use a service like Zenodo to permanently archive the code; this would also give us a DOI for the software. Posting code to repositories such as GitHub is (possibly) better than using your lab's website, but people can delete GitHub repositories! Zenodo can take a GitHub repository and make it (almost) permanent.

The Zenodo policies page makes it clear that even in the 'exceptional' event that a research object is removed, they still keep the DOI URL working and this page will state what happened to the code:

Withdrawal
If the uploaded research object must later be withdrawn, the reason for the withdrawal will be indicated on a tombstone page, which will henceforth be served in its place. Withdrawal is considered an exceptional action, which normally should be requested and fully justified by the original uploader. In any other circumstance reasonable attempts will be made to contact the original uploader to obtain consent. The DOI and the URL of the original object are retained.

There is a great guide on GitHub for how you can make your code citable and archive a repository with Zenodo.
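For anyone following that guide, the basic recipe is to connect your GitHub account to Zenodo and then create a tagged release of your repository; Zenodo archives a snapshot of that release and mints a DOI for it. You can also add a .zenodo.json file to the top level of the repository to control the metadata of the deposit. The sketch below is illustrative only; the field names follow Zenodo's documented metadata schema, but every value is made up, so check Zenodo's current documentation before copying it verbatim.

    {
      "title": "MyTool: a hypothetical genome analysis tool",
      "description": "Software described in a (hypothetical) accompanying paper",
      "upload_type": "software",
      "license": "MIT",
      "creators": [
        {"name": "Researcher, A.", "affiliation": "Example University"}
      ],
      "keywords": ["bioinformatics", "genome annotation"]
    }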

The Bioboxes paper is now out of the box! [Link]

The Bioboxes project now has its first formal publication, with the software being described today in the journal GigaScience:

I love the concise abstract:

Software is now both central and essential to modern biology, yet lack of availability, difficult installations, and complex user interfaces make software hard to obtain and use. Containerisation, as exemplified by the Docker platform, has the potential to solve the problems associated with sharing software. We propose bioboxes: containers with standardised interfaces to make bioinformatics software interchangeable.
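To give a flavour of what 'standardised interfaces' means in practice, here is a purely hypothetical invocation of a containerised assembler. The image names, mount points, and task name below are my own illustrative placeholders, not the actual bioboxes specification; the point is simply that any tool packaged behind the same interface could be swapped in by changing nothing but the image name.

    # hypothetical example: reads live in ./input, results appear in ./output
    docker run \
      --volume="$(pwd)/input:/input:ro" \
      --volume="$(pwd)/output:/output:rw" \
      example/some-assembler default

    # a different assembler with the same interface would be a drop-in replacement:
    #   docker run ... example/another-assembler default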

Congratulations to Michael Barton, Peter Belmann, and the rest of the Bioboxes team!

 

Updated 2015-10-15 18.18: Added specific acknowledgement of Peter Belmann.

Should reviewers of bioinformatics software insist that some form of documentation is always included alongside the code?

Yesterday I gave out some JABBA awards and one recipient was a tool called HEALER. I found it disappointing that the webpage that hosts the HEALER software contains nothing but the raw C++ files (I also found it strange that none of the filenames contain the word 'HEALER'). This is what you would see if you go to the download page:

Today, Mick Watson alerted me to a piece of software called ScaffoldScaffolder. It's a somewhat unusual name, but I guess it at least avoids any ambiguity about what it does. Out of curiosity I went to the website to look at the software and this is what I found:

Ah, but maybe there is some documentation inside that tar.gz file? Nope.

At the very least, I think it is good practice to include a README file alongside any software. Developers should remember that some people will end up on these software pages, not from reading the paper, but by following a link somewhere else. The landing page for your software should make the following things clear:

  1. What is this software for?
  2. Who made it?
  3. How do I install it or get it running?
  4. What license is the software distributed under?
  5. What is the version of this software?

The last item can be important for enabling reproducible science. Give your software a version number (to its credit, ScaffoldScaffolder did include a version number in the file name) or, at the very least, include a release date. Ideally, the landing page for your software should contain even more information:

  1. Where to go for more help, e.g. a supplied PDF/text file, link to online documentation, or instructions about activating help from within the software
  2. Contact email address(es)
  3. Change log
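Pulling the points above together, even a tiny README file answers most of these questions. A minimal sketch, with an entirely made-up tool, authors, and URLs:

    ScaffoldThing version 1.2.3 (released 2015-10-14)

    What it does:  joins assembled contigs into scaffolds using read-pair information
    Authors:       A. Researcher & B. Researcher (Example University)
    Contact:       a.researcher@example.edu
    Installation:  run 'make' in this directory (requires a C++ compiler)
    Usage:         ./scaffoldthing --help prints options and a worked example
    License:       MIT (see LICENSE)
    More help:     docs/manual.pdf or http://example.org/scaffoldthing
    Changes:       see CHANGELOG.txt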

This is something that I feel reviewers of software-based manuscripts need to be thinking about. In turn, this means that it is something that the relevant journals may wish to start including in the guidelines for their reviewers.

In need of some bogus bioinformatics acronyms? Try these on for size…

Three new JABBA awards for you to enjoy (not that anyone should be enjoying these crimes against acronyms). Hold on to your hats, because this might get ugly.

 

1: From the journal Scientific Reports

The paper helpfully points out that the name of this tool is pronounced 'Cadbury'. I'm not sure if they are saying this to invite a trademark complaint, but I generally feel that it is never a good sign when someone has to tell you how to pronounce something. The acronym CADBURE is derived as follows:

Comparing Alignment results of user Data Based on the relative reliability of Uniquely aligned REads

On the plus side, CADBURE mostly uses the initial letters of words. On the negative side, only six out of the fifteen words contribute to the acronym and this is why it earns a JABBA award.

 

2: From the journal BMC Bioinformatics

SPINGO is generated as follows:

SPecies-level IdentificatioN of metaGenOmic amplicons

So only two words contribute their initial letters; 'identification' donates its second N (but not its first); and 'amplicons' gives us nothing at all. Very JABBA-award worthy.

 

3: From the journal Bioinformatics (h/t to James Wasmuth)

It is very common for people to wait until the end of the Introduction before they reveal how the acronym/initialism in question came to be, but in this paper they don't waste any time at all…it's right there in the title.

It is the mark of a tenuously derived acronym when the initial letter of a word isn't used, but the same letter from a different position in the same word is used, e.g. the second 'R' of 'regression'.

While the code for HEALER is available online, none of the five C++ files contain the word HEALER in their name or anywhere in their code. Nor is there any form of README or accompanying documentation. This is all that you see…

Congratulations to our three worthy JABBA award winners!

101 questions with a bioinformatician #35: Aaron Darling

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Aaron Darling is an Associate Professor at the ithree institute — where capital letters are evidently in short supply — which is part of UTS (University of Technology Sydney). His research focuses on developing computational and molecular techniques to characterize the hidden world of microbes. He helped develop the Mauve multiple genome alignment tool and continues to work on this and other software tools. Aaron also has a long-standing interest in poop:

Of course this interest is all part of an ongoing research project, one that is seeking to understand the development of the infant gut microbiome.

You can find out more about Aaron by visiting his lab's website, or by following him on Twitter (@koadman). And now, on to the 101 questions...



001. What's something that you enjoy about current bioinformatics research?

The growing interplay between informatics, molecular biology, and experimental design is very exciting. In the past 10 years many problems that could only have been solved through decades of experimental work have been transformed from experimental problems to data analysis problems. I think this trend will only accelerate as our technology to interface digital computational systems with biological systems continues to improve. And data analysis feeds back to inspire new experimental designs in a feedback loop that's getting ever-shorter. As an informatician I find it especially fun to discover new ways of designing the lab work that solves long-standing data analysis problems.



010. What's something that you don't enjoy about current bioinformatics research?

Data wrangling and data mangling. This is almost certainly a cliché by now, but inconsistently implemented file formats are the bane of bioinformatics. This was apparent to me within weeks of starting in the field, as my first assigned task was to write a sequence file format parsing library for the E. coli genome project team. I often wonder why I didn't run as fast as I could in the opposite direction.



011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Early on I benefited from a nugget of wisdom in Dan Gusfield's sequence analysis book, which emphasized the importance of solving biological data analysis problems that are core to the biology, not to the technology platform used to measure the biology: for example, the general sequence alignment problem vs. short-read alignment. Those are the contributions that are going to matter over the long term. I wish I had also appreciated early on that the elegance and simplicity of the solution, and especially the code implementing it, matters just as much.



100. What's your all-time favorite piece of bioinformatics software, and why?

Probably BEAST, because I learned so much about phylogenetic models, MCMC, and software design from using it and coding up modules for it.



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

H, because as a teenager I always wanted to be a G but in reality was everything but.

We have not yet reached peak CEGMA

I was alerted to some disturbing news this weekend: CEGMA won't die!

CEGMA is a tool that I helped develop back in 2005. The first formal publication describing CEGMA came out in 2007, and since then the paper has seen year-on-year growth in citations.

I keep on thinking that this trend must end soon, and I was therefore hopeful that 2014 might have been the year of peak CEGMA. There were three reasons why I thought this might happen:

  1. CEGMA is no longer being developed or supported
  2. I have used the PubMed page for the CEGMA paper to advocate that people should no longer use this tool
  3. CEGMA relies heavily on an increasingly out-of-date database of orthologs that was published in 2003

However, despite my best wishes, Google Scholar has revealed to me that 2015 has now seen more citations to the CEGMA paper than in any previous year:

CEGMA citation details from Google Scholar

I'm hopeful that the development of the BUSCO software by Felipe Simão et al. will mean that 2015 really will turn out to be the year of peak CEGMA!

The next BioNano Genomics Webinar is about improving genome assemblies with gEVAL

The next BioNano Genomics webinar will take place on October 27th, 2015, at 8:00 am (PST). The title of the webinar is:

gEVAL - A Genome Evaluation Browser for Improving Genome Assemblies

The gEVAL browser is managed by the Genome Reference Informatics Team (GRIT) at the Wellcome Trust Sanger Institute. William Chow, the lead developer of gEVAL, will present the webinar.

You can register for the event here.

 

Financial disclaimer: I do not own shares in any biotechnology company.

Limited tickets still available for 'Bio in Docker' Symposium (November 2015)

This free symposium — organised by King's College London and the Biomedical Research Centre (BRC) — will bring together people interested in using Docker images in the field of bioinformatics, and will include a 'mini-hackday' session.

Event description

Docker is now establishing itself as the de facto solution for containerization across a wide range of domains. The advantages are attractive, from reproducible research to simplifying deployment of complex code.

This event will bring together some notable use cases to discuss how best to take advantage of this new technology within the bioinformatics space.
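For anyone new to Docker, the attraction for bioinformatics is that a tool and all of its dependencies can be described in a short Dockerfile and then rebuilt in the same way on any machine that runs Docker. A minimal, hypothetical sketch follows; the tool name, URL, and version are placeholders:

    # build an image containing one specific version of a (hypothetical) tool
    FROM ubuntu:14.04
    RUN apt-get update && apt-get install -y build-essential wget
    RUN wget http://example.org/sometool-1.0.tar.gz && \
        tar -xzf sometool-1.0.tar.gz && \
        make -C sometool-1.0 && \
        cp sometool-1.0/sometool /usr/local/bin/
    ENTRYPOINT ["sometool"]

    # anyone can then recreate and run the same environment:
    #   docker build -t sometool:1.0 .
    #   docker run sometool:1.0 --help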

Event details

  • November 9–10, 2015
  • London, UK at the Wellcome Collection
  • Register through Eventbrite

Talk details

  • Peter Belmann (Bioboxes): i) Evaluating and ranking bioinformatics software using docker containers. ii) Overview of the BioBoxes project
  • Nebojsa Tijanic (SB Genomics): Portable workflow and tool descriptions with Common Workflow Language and Rabix
  • Paolo Di Tommaso (Nextflow / Notredame Lab, CRG): Manage reproducibility in genomics pipelines with Nextflow and Docker containers
  • Amos Folarin & Stephen Newhouse (NGSeasy/KCL): Next generation sequencing pipelines in Docker
  • Tim Hubbard (Genomics England / KCL): Pipelines to analyse data from the 100,000 Genomes Project as part of the Genomics England Clinical Interpretation Partnership (GeCIP)
  • Fabien Campagne (Campagne lab): MetaR and the Nextflow Workbench: application of Docker and language workbench technology to simplify bioinformatics training and data analysis.
  • Elijah Charles (Intel): Bioinformatics and the packaging melee
  • Brad Chapman (Blue Collar Bioinformatics): Improving support and distribution of validated analysis tools using Docker
  • Michael Ferranti / Kai Davenport (ClusterHQ): Data, Volumes and portability with Flocker
  • Ilya Dmitrichenko (Weave): Application-oriented networking with Weave
  • Aanand Prasad (Docker): Orchestrating Containers with Docker Compose

The most haplotyped place on Earth: 'DNA Land' is open for business!

DNA Land has opened! If you are curious what DNA Land is, well here is the concise description offered by the website:

DNA Land is a place where you can learn more about your genome while enabling scientists to make new genetic discoveries for the benefit of humanity. Our goal is to help members to interpret their data and to enable their contribution to research.

At the time I captured the above screenshot, the site boasted '2,483 genomes and counting'. By the time I started writing this piece, that figure had already risen to 2,501 genomes. Erika Check Hayden gives a good overview of DNA Land in a Nature news item: Scientists hope to attract millions to 'DNA.LAND'.

So DNA Land is a place to learn more about your genome, which aims to attract millions of visitors, and where you can also earn badges. Hmm, makes me wonder whether I should wait for DNA World to open…especially if the lines are long.

'The amount of foil needed to wrap five breakfast sandwiches': a new metric for genomics?

Photo by Robyn Lee for seriouseats.com

The journal Genome Research is celebrating its 20th anniversary and has marked the occasion by issuing a number of 'perspective' articles. One of these — A vision for ubiquitous sequencing — includes one of the strangest comparisons that I've ever seen in the field of genomics (or really any field):

Back in 1990, sequencing 1 million nucleotides cost the equivalent of 15 tons of gold (adjusted to 1990 price). At that time, this amount of material was equivalent to the output of all United States gold mines combined over two weeks. Fast-forwarding to the present, sequencing 1 million nucleotides is equivalent to the value of ∼30 g of aluminum. This is approximately the amount of material needed to wrap five breakfast sandwiches at a New York City food cart.

Most people will understand the point that is being made here: sequencing used to be really expensive, whereas now it is very cheap. But is there really a need to explain what 30 grams of aluminum amounts to in a more human-friendly unit? And even if such a comparison is deemed necessary, is the use of 'breakfast sandwiches' from New York City food carts the most suitable choice?