BioDocker and BioBoxes: the containerization of bioinformatics

August 26, 2015 by Keith Bradnam

Thanks to a post on the BioCode's Notes blog I have discovered that there is a project called BioDocker which aims to generate lots of Docker containers to help make bioinformatics more reproducible by standardizing how bioinformatics software is packaged. From the BioDocker website:

The main purpose of this project is to spread the use of Docker on the Bioinformatics and Computational Biology areas. By using pre-configured containers with different bioinformatic softwares some critical aspects of Bioinformatics like reproducibility are minimized. Here you will find a list of containers with different bioinformatics software and how to use it.

BioDocker was created by Felipe da Veiga Leprevost in 2014, and the associated GitHub repository currently has a dozen or so containers.

When I first read about BioDocker I was confused because I know that there is also the Bioboxes project which aims to er…make bioinformatics more reproducible by standardizing how bioinformatics software is packaged. From the Bioboxes manifesto:

Software has proliferated in bioinformatics and so have the problems associated with it: missing or unobtainable code, difficult to install dependencies, unreproducible workflows, all with terrible user experiences. We believe a community standard, using software containers, has the opportunity to solve these problems and increase the standard of scientific software as a whole.

I think the aims of these two projects are similar but not identical, and Bioboxes probably has a broader remit. Both projects are aware of each other and it looks like they have had some productive exchanges.

All of this makes me feel that the bioinformatics community seems to be slowly, but steadily, embracing Docker. Any approaches to standardize how we do bioinformatics should be welcomed, but some of us with long memories will recall that we have been in this situation before. Anyone remember the promises of how CORBA and then SOAP were going to increase interoperability in bioinformatics?

The name of this bioinformatics tool merits close inspection

August 25, 2015 by Keith Bradnam

Bogus bioinformatics acronyms = mildly annoying
Names that clash with previously published tools = mildly annoying
Bogus bioinformatics acronyms that clash with previously published tools = very annoying

Step forward a new paper published in the journal Bioinformatics:

INSPEcT: a computational tool to infer mRNA synthesis, processing and degradation dynamics from RNA- and 4sU-seq time course experiments

How does INSPEcT derive its name?

INSPEcT (INference of Synthesis, Processing and dEgradation rates in Time-course analysis)

Inclusion of the 'E' from 'degradation' and omission of 'R', 'C', or 'A' (from 'Rates', 'Course', and 'Analysis') earns this tool a JABBA award. It also earns a 'Duplications' award because of:

Bioinformatics is just like bench science and should be treated as such

August 24, 2015 by Keith Bradnam

A great post by Richard Edwards on his Cabbages of Doom blog, which includes a list of 8 shocking ways that bioinformatics is just like bench science. Highly recommended reading. His conclusion bears repeating here:

Bioinformatics is science. Full stop. It is no better than other science. It is no worse than other science. People do it right. People do it wrong.

Awkward Bioinformatics Conversations #1

August 21, 2015 by Keith Bradnam

Bob: Hi Sue, it's Bob. Got a favor to ask. Could you load up the UCSC Genome Browser site in your web browser please?

Sue: Hi Bob. So just to clarify…do you want me to load the UCSC Genome Browser homepage, or the UCSC Genome Browser website Genome Browser tool?

Bob: Wait, what?

Sue: UCSC Genome Browser is both the name of the website — as identified in their HTML metadata — and also the name of a tool on that website.

Bob: Er, just go to the main website first.

Sue: Done.

Bob: So it says that it's the UCSC Genome Browser website?

Sue: Yes. And no.

Bob: Huh?

Sue: The homepage identifies itself as the UCSC Genome Bioinformatics site but it also welcomes you to the UCSC Genome Browser website.

Bob: Okay, that sounds like you're looking at the right page then. So can you now please click on the Genome Browser link at the top of the page?

Sue: Which one?

Bob: What? Er, the one in the toolbar I guess.

Sue: Which one?

Bob: I just told you which one.

Sue: No, which toolbar?

Bob: There's more than one?

Sue: There's the horizontal toolbar that mostly contains dropdown menus with links that expose most of the site's functionality…and then there's the vertical toolbar which mostly offers links to items that also exist in sections of the horizontal toolbar.

Bob: But surely there's only one toolbar link which says 'Genome Browser'?

Sue: No, there are two.

Bob: But they go to the same place, right?

Sue: No. The horizontal toolbar link for 'Genome Browser' takes you into the Genome Browser tool with data loaded for the human genome assembly. The vertical toolbar link for 'Genome Browser' takes you to an intermediate page that lets you access the 'Human Genome Browser gateway'. Which one do you want? Bob? Hello???

Shining a light on more bogus bioinformatics acronyms

August 20, 2015 by Keith Bradnam

Courtesy of an anonymous tip-off…

There is a new bioinformatics tool that was described in a recently published BMC Genomics article. Here is the full name of the tool with any capitalization removed:

automated tool for bacterial genome annotation comparison

So can you guess what acronym/name was extracted from this description?

ATBGAC?
AutoBGA?
BGAC?

No. The JABBA-award-winning name of this tool is as follows:

BEACON: automated tool for Bacterial gEnome Annotation ComparisON

This name really isn't helped by the fact that it is shown as follows in the journal article title (with the G of 'Genome' also capitalized):

BEACON: automated tool for Bacterial GEnome Annotation ComparisON

101 questions with a bioinformatician #31: Morgan Taschuk

August 19, 2015 by Keith Bradnam

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

Morgan Taschuk is a Senior Manager for Genome Sequence Informatics at the Ontario Institute for Cancer Research (OICR). She manages the production sequence analysis team, which analyses all of the sequence data generated at OICR, resulting in alignment files, variant calls, QC metrics and other bountiful amounts of sequence data for OICR researchers and collaborators.

She recently wrote a great blog post regarding the (sometimes contentious) issue of Biologists vs Bioinformaticians. Definitely worth a read. Morgan has also recently started to assemble a Twitter list of Women in Bioinformatics, now up to 179 members. I'm sure she would like to make that list even longer, so please let her know of any omissions.

You can find out more about Morgan by visiting her Modern Model Organism blog, or by following her on Twitter (@morgantaschuk). And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

There's always something more to learn. I'm spending a lot of time with our genomics lab recently and learning about how lab processes impact our data fascinates me. Bioinformatics skills are usually in demand so I also get to work with a wide variety of people with different questions and problems and have to stretch my brain to apply myself.

010. What's something that you don't enjoy about current bioinformatics research?

Often people write their own scripts or software instead of looking for something that already exists out there. Not only is it wasted effort for very similar results, it sabotages any attempt to standardize across the field. Open-source software is there for everyone to change and improve. Why not build on a foundation instead of digging the hole yourself?

011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Since nobody can tell you what bioinformatics is, it's up to you to define it. I spent a long time fighting with imposter syndrome, not just because I felt inadequate but also because I was called a bioinformatician when I didn't fit the classical model. Nobody fits the classical model these days. Thinking about this question actually inspired me to write a blog post about the difference between bioinformaticians and computational biologists. Judging from the feedback on Twitter and the blog, the problem of defining what a bioinformatician is still really sticks in people's throats.

100. What's your all-time favorite piece of bioinformatics software, and why?

SAMtools. It's an amazing piece of very stable, utilitarian, open source code that forms the backbone of most sequencing pipelines.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

I struggled the most with this question! Y, because 'pyrimidine' is a pretty word and so Y not.

~crickets~

The names of bioinformatics tools that help study evolution shouldn't feel that they also have to evolve

August 18, 2015 by Keith Bradnam

Thanks to Torsten Seemann for bringing this to my attention…

In 2003 a bioinformatics tool was published. A tool with a thoroughly sensible name and acronym:

BIBI: a Bioinformatics Bacterial Identification Tool

A simple name with a simple, and not-too-bogus, initialism. Bravo. However, a subsequent update to BIBI brought about a change to the name:

^le BIBI:

Where the 'le' refers to 'light edition'. It should be said that most references to this tool drop the superscript notation for 'le'. Let's move forward to the present day and the publication of another version of this tool:

leBIBI ^QBPP : a set of databases and a webtool for automatic phylogenetic analysis of prokaryotic sequences

The full expansion of this new name is as follows:

Light Edition Bioinformatics Bacterial Identification Tool Quick Bioinformatic Phylogeny of Prokaryotes

Quite a mouthful! Bonus points for including 'Bioinformatics' and 'Bioinformatic' as part of the same name, as well as the largely redundant inclusion of 'Bacterial' as well as 'Prokaryotes'.

Generally I find the use of superscript in software names to be largely unnecessary. It can make the tool name harder to read and it is unlikely to be reproduced verbatim by others who mention your software. Starting your software name with a lowercase letter also means that it might appear in uppercase if used to start a sentence (as happens several times in the above paper). Not a terrible problem but it reduces the strength and consistency of your 'bioinformatics brand'.

New JABBA award for Just Another Bogus Bioinformatics Acronym

August 13, 2015 by Keith Bradnam

Here's a new tool that was described recently in the journal Bioinformatics:

GREGOR: evaluating global enrichment of trait-associated variants in epigenomic features using a systematic, data-driven approach

Let's see how the letters in 'GREGOR' are derived:

G = Genomic — a solid start
R = Regulatory — fine
E = Elements — all good so far
'and' — okay, we'll allow a conjunction or two
G = Gwas — hmm, including an acronym/initialism inside another acronym is rarely a good idea
O = Overlap — that's fine, time for the big finish…
R = algoRithm — oh come on!

Congratulations GREGOR — or should I say GREGOA? — you win a JABBA award!
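For anyone who wants to run this kind of letter-by-letter audit on their own tool name, here is a minimal Python sketch. It is only an illustration: the function name `audit_acronym`, the small stopword set, and the rule that a capital letter counts as 'internal' whenever it is not the first letter of its word are all my own assumptions, not any official JABBA criteria. The expansion string is pieced together from the list above.

```python
import re

# Assumed set of minor words that an acronym can reasonably skip.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the"}


def audit_acronym(expansion):
    """Classify the capital letters of an expansion as its authors wrote it.

    Returns a dict with three lists:
      initials - capitals that start a word (the respectable ones)
      internal - capitals plucked from inside a word (the bogus ones)
      skipped  - non-stopwords that contribute no capital at all
    """
    initials, internal, skipped = [], [], []
    # Split on anything that is not a letter, so "Time-course" gives two words.
    for word in re.findall(r"[A-Za-z]+", expansion):
        capitals = [(i, ch) for i, ch in enumerate(word) if ch.isupper()]
        if not capitals:
            if word.lower() not in STOPWORDS:
                skipped.append(word)
            continue
        for i, ch in capitals:
            (initials if i == 0 else internal).append(ch)
    return {"initials": initials, "internal": internal, "skipped": skipped}


print(audit_acronym("Genomic Regulatory Elements and Gwas Overlap algoRithm"))
# {'initials': ['G', 'R', 'E', 'G', 'O'], 'internal': ['R'], 'skipped': []}
```

A non-empty 'internal' or 'skipped' list is a hint, not a verdict: the sketch only reads the capitalization that the authors chose, so it cannot tell you whether a name is memorable or clashes with an existing tool.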

If you want your bioinformatics software to have a memorable name, it helps if the name is pronounceable

August 12, 2015 by Keith Bradnam

There is a new paper in the journal Bioinformatics:

Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data

The paper describes a new method for implementing a Principal Components Analysis (PCA) of data. That new method has a name. That name has just seven characters. How hard can it be to pronounce?

S4VDPCA: ess-four-vee-dee-pee-cee-ay

It doesn't exactly trip off the tongue and having four 'ee-sounding' letters together (VDPC) doesn't make it easy to remember. When I first came across this paper, I skimmed the article, waited an hour, and then tried to remember the name. I could remember that it included '4', 'V', and 'D', but couldn't remember the order (or that it also included an 'S').

It is by no means essential that bioinformatics tools have easily pronounceable names, but this will help people remember the name of your software. In turn, this makes it easier for people to tell others about your software. I don't imagine that bioinformatics software developers ever want to overhear the following type of conversation:

Bob: "You should use that tool"

Sue: "What tool?"

Bob: "Umm, you know that PCA thingy. The S…something, something…PCA tool"

Sue: "The what?"

Bob: "Run a Google search for Bioinformatics PCA tools, it's probably the top hit."

Sue: <- facepalm ->

The Francis Crick Institute has signed the Hague Declaration

August 10, 2015 by Keith Bradnam

The Hague Declaration is an important manifesto that aims to provide guidelines for how to "best enable access to facts, data and ideas for knowledge discovery in the Digital Age". Although signatories to the declaration include large scientific research institutes, you can also sign the declaration as an individual. The five main principles of the declaration are summarized as follows:

Intellectual property was not designed to regulate the free flow of facts, data and ideas, but has as a key objective the promotion of research activity
People should have the freedom to analyse and pursue intellectual curiosity without fear of monitoring or repercussions
Licenses and contract terms should not restrict individuals from using facts, data and ideas
Ethics around the use of content mining techniques will need to continue to evolve in response to changing technology
Innovation and commercial research based on the use of facts, data, and ideas should not be restricted by intellectual property law

These principles are obviously of huge relevance for the field of genomics which seems to be generating tools and data at an ever increasing rate. So I was happy to read today that the new Francis Crick Institute in London is one of the Declaration's latest signatories:

"The large amounts of data and information that are now becoming available represent an extraordinary resource for researchers. By signing the Hague Declaration the Francis Crick Institute is expressing its support for the idea that researchers should be able to mine such content freely, thereby to advance knowledge and to promote Discovery without Boundaries."

Jim Smith, Director of Research at the Francis Crick Institute