JABBA vs Jabba: when is software not really software?

It was only a matter of time I guess. Today I was alerted to a new publication by Simon Cockell (@sjcockell). It's a book chapter titled:

From the abstract:

Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data

Now as far as I can tell, this Jabba is not an acronym, so we safely avoid the issue of presenting a JABBA award for Jabba. However, one might argue that naming any bioinformatics software 'Jabba' is going to present some problems because this is what happens when you search Google for 'Jabba bioinformatics'.

There is a bigger issue with this paper that I'd like to address though. It is extremely disappointing to read a bioinformatics software paper in the year 2015 and not find any explicit link to the software. The publication includes a link to http://www.ibcn.intec.ugent.be, but only as part of the author details. This web page is for the Internet Based Communication Networks and Services research group at the University of Gent. The page contains no mention of Jabba, nor does their 'Facilities and Tools' page, and searching their site for Jabba turns up nothing.

Initially I wondered if this paper is more about the algorithm behind Jabba (equations are provided) and not about an actual software implementation. However, the paper includes results from their Jabba tool in comparison to another piece of software (LoRDEC) and includes details of CPU time and memory requirements. This suggests that the Jabba software exists somewhere.

To me this is an example of 'closed science' and represents a failure of whoever reviewed this article. I will email the authors to find out if the software exists anywhere…it's a crazy idea but maybe they might be interested if people could, you know, use their software.

Update 2015-11-20: I heard back from the authors…the Jabba software is on GitHub.

ACGT is now AFCW (Approved for Free Cultural Works): thoughts on switching to a CC-BY license

This website, as well as my personal website and Rescued by Code, licenses material under a Creative Commons license. Specifically, I've been using the Attribution Non-Commercial license, popularly known as CC BY-NC. My joint venture with Abby Yu, The Take-Home Message web comic, has been even more restrictive and has been licensing content under the Attribution Non-Commercial Share-Alike license (CC BY-NC-SA).

This choice of licenses is something that's been on my mind for a while. I've known that I'm not being as open as I could be and maybe this has stemmed from an unwarranted (not to mention unlikely) fear that someone would take all my blog posts and somehow seek to profit from them.

Today I saw a tweet by Rogier Kievit (@rogierK) that has helped me change my mind:

I found the third link, something that is now over a decade old, particularly persuasive and accordingly I have switched all of my website licenses to CC-BY. Apparently this means that all of my writings now fall into the category of Free Cultural Works. I am grateful to Abby Yu for agreeing to this change for The Take-Home Message.

This change also means that someone can now use my blog posts to write the definitive book on JABBA-awards…just as long as they give me appropriate attribution.

10 years of Open Access at the Wellcome Trust in 10 numbers [Link]

A great summary of how the Wellcome Trust has helped drive big changes in open access publishing. Of the ten numbers that the post uses to summarise the last decade, this one surprised me the most:

20% – the volume of UK-funded research which is freely available at the time of publication
A recent study commissioned by Universities UK found that 20% of articles authored by UK researchers and published in the last two years were freely accessible upon publication. This figure increases to 24% within six months of publication, and 32% within 12 months.

If you had asked me to guess what this number would be, I think I would have been far too optimistic. Even the figure of 32% of articles being free within 12 months seems lower than I would imagine. Lots of progress still to be made!

ORCID: binding the (academic) galaxy together

Adapted from picture by flickr user Jim & Rachel McArthur

I am a supporter of ORCID's goals to help establish unique identifiers for researchers. Such identifiers can then be used to help connect a researcher with all of their inputs and outputs that surround their career. Most fundamentally, these inputs and outputs are grants and papers, but there is the potential for ORCID identifiers to link a person to much more, e.g. the organisations that they work for, manuscript reviews, code repositories, published slides, even blog posts.

For ORCID to succeed it has to be global and connect all parts of the academic network, a network that spans national boundaries. On this point, I am very impressed by the effort that ORCID makes in ensuring that their excellent outreach materials are not only available in English. As shown below, ORCID's 'Distinguish yourself' flyer is available in 9 different languages. Other material is also available in Russian, Greek, Turkish, and Danish. If your desired language is not available, they welcome volunteers to help translate their message into more languages. Email community@orcid.org if you want to help.

Question: when is a GitHub repository not a GitHub repository?

Answer: when it doesn't contain any useful code.

Update 2015-10-02 08.58: this post was updated to reflect the addition of code to the metaPORE repository.

A discussion on Twitter today revealed something which I find very disappointing:

A new paper by Greninger et al. (Rapid metagenomic identification of viral pathogens in clinical samples by real-time nanopore sequencing analysis) has been published in the journal Genome Medicine. The Methods contains the following line:

We developed a custom bioinformatics pipeline for real-time pathogen identification and visualization from nanopore sequencing data (MetaPORE) (Fig. 1b), available under license from UCSF at [23].

Reference #23 takes you to the metaPORE GitHub repository. At the time I initially wrote this post — and as the screen grab below shows — it contained zero code. Thankfully this has been changed and a set of Python and shell scripts are now available.

Maybe this was just some sort of error in scheduling the release of the paper and the code. However, journals and authors should understand that if a paper (or a pre-print) appears online and points to a code repository (or any other website), the expectation is that people should be able to visit the site in question and download code.

The Francis Crick Institute has signed the Hague Declaration

The Hague Declaration is an important manifesto that aims to provide guidelines for how to "best enable access to facts, data and ideas for knowledge discovery in the Digital Age". Although signatories to the declaration include large scientific research institutes, you can also sign the declaration as an individual. The five main principles of the declaration are summarized as follows:

  1. Intellectual property was not designed to regulate the free flow of facts, data and ideas, but has as a key objective the promotion of research activity
  2. People should have the freedom to analyse and pursue intellectual curiosity without fear of monitoring or repercussions
  3. Licenses and contract terms should not restrict individuals from using facts, data and ideas
  4. Ethics around the use of content mining techniques will need to continue to evolve in response to changing technology
  5. Innovation and commercial research based on the use of facts, data, and ideas should not be restricted by intellectual property law

These principles are obviously of huge relevance for the field of genomics which seems to be generating tools and data at an ever increasing rate. So I was happy to read today that the new Francis Crick Institute in London is one of the Declaration's latest signatories:

"The large amounts of data and information that are now becoming available represent an extraordinary resource for researchers. By signing the Hague Declaration the Francis Crick Institute is expressing its support for the idea that researchers should be able to mine such content freely, thereby to advance knowledge and to promote Discovery without Boundaries."

Jim Smith, Director of Research at the Francis Crick Institute

The Assemblathon Gives Back (a bit like The Empire Strikes Back, but with fewer lightsabers)

So we won an award for Open Data. Aside from a nice-looking slab of glass that is weighty enough to hold down all of the papers that someone with a low K-index has published, the award also comes with a cash prize.

Naturally, my first instinct was to find the nearest sculptor and request that they chisel a 20 foot recreation of my brain out of Swedish green marble. However, this prize has been — somewhat annoyingly — awarded to all of the Assemblathon 2 co-authors.

While we could split the cash prize 92 ways, this would probably only leave us with enough money to buy a packet of pork scratchings each (which is not such a bad thing if you are a fan of salty, fatty, porcine goodness).

Instead we decided — and by 'we', I'm really talking about 'me' — to give that money back to the community. Not literally of course…though the idea of throwing a wad of cash into the air at an ISMB meeting is appealing.

Rather, we have worked with the fine folks at BioMed Central (that's BMC to those of us in the know) to pay for two waivers that will cover the cost of Article Processing Charges (that's APCs to those of us in the know). We decided that these will be awarded to papers in a few select categories relating to 'omics' assembly, Assemblathon-like contests, and things to do with 'Open Data' (sadly, papers that relate to 'pork scratchings' are not eligible).

We are calling this event the Assemblathon 'Publish For Free' Contest (that's APFFC to those of us in the know), and you can read all of the boring details and contest rules on the Assemblathon website.

Winning an award that shouldn't exist: progress towards 'open data' and 'open science'

It was announced yesterday that the Assemblathon 2 paper has won the 2013 BioMed Central award for ‘Open Data’ (sponsored by Lab Archives). For more details on this see here and here.

While it is flattering to be recognized for our efforts to conduct science transparently, it still feels a little strange that we need to have awards for this kind of thing. All data that results from publicly funded science research should be open data. Although I feel there is growing support for the open science movement, much still needs to be done.

One of the things that needs to become commonplace is for scientists to put their data and code in stable, online repositories, that are hopefully citable as independent resources (i.e. with a DOI). For too long, people have used their lab websites as the end point for all of their (non-sequence[1]) related data (something that I have also been guilty of).

Part of the problem is that even when you take steps to submit data to an online repository of some kind, not all journals allow you to cite it. This tweet by Vince Buffalo from yesterday illustrated one such issue (see this Storify page for more details of the resulting discussion):


Tools like arXiv.org, bioRxiv, Figshare, Slideshare, GitHub, and GigaDB are making it easier to make our data, code, presentations, and preliminary results more available to others. I hope that we see more innovation in this area and I hope that more people take an ‘open’ approach to other aspects of science, not just the sharing of data[2]. Luckily, with people around like Jonathan Eisen and C. Titus Brown, we have some great role models for how to do this.

How will we know when we are all good practitioners of open science? When we no longer need to give out awards to people just for doing what we should all be doing.


  1. For the most part, journals require authors to submit nucleotide and protein sequences to an INSDC database, though this doesn’t always happen.  ↩

  2. I have written elsewhere about the steps that the Assemblathon 2 took to try to be open throughout the whole process of doing the science, writing the paper, and communicating the results.  ↩