Shining a light on more bogus bioinformatics acronyms


Courtesy of an anonymous tip off…

There is a new bioinformatics tool that was described in a recently published BMC Genomics article. Here is the full name of the tool with any capitalization removed:

  • automated tool for bacterial genome annotation comparison

So can you guess what acronym/name was extracted from this description?

  • ATBGAC?
  • AutoBGA?
  • BGAC?

No. The JABBA-award-winning name of this tool is BEACON.

This name really isn't helped by the fact that it is shown as follows in the journal article title (with the G of 'Genome' also capitalized):

  • BEACON: automated tool for Bacterial GEnome Annotation ComparisON

101 questions with a bioinformatician #31: Morgan Taschuk

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Morgan Taschuk is a Senior Manager for Genome Sequence Informatics at the Ontario Institute for Cancer Research (OICR). She manages the production sequence analysis team, which analyses all of the sequence data generated at OICR and produces alignment files, variant calls, QC metrics, and plenty of other sequence data for OICR researchers and collaborators.

She recently wrote a great blog post regarding the (sometimes contentious) issue of Biologists vs Bioinformaticians. Definitely worth a read. Morgan has also recently started to assemble a Twitter list of Women in Bioinformatics, now up to 179 members. I'm sure she would like to make that list even longer, so please let her know of any omissions.

You can find out more about Morgan by visiting her Modern Model Organism blog, or by following her on Twitter (@morgantaschuk). And now, on to the 101 questions...



001. What's something that you enjoy about current bioinformatics research?

There's always something more to learn. I'm spending a lot of time with our genomics lab recently and learning about how lab processes impact our data fascinates me. Bioinformatics skills are usually in demand so I also get to work with a wide variety of people with different questions and problems and have to stretch my brain to apply myself.



010. What's something that you don't enjoy about current bioinformatics research?

Often people write their own scripts or software instead of looking for something that already exists out there. Not only is it wasted effort for very similar results, it sabotages any attempt to standardize across the field. Open-source software is there for everyone to change and improve. Why not build on a foundation instead of digging the hole yourself?



011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Since nobody can tell you what bioinformatics is, it's up to you to define it. I spent a long time fighting with imposter syndrome, not just because I felt inadequate but also because I was called a bioinformatician when I didn't fit the classical model. Nobody fits the classical model these days. Thinking about this question actually inspired me to write a blog post about the difference between bioinformaticians and computational biologists. Judging from the feedback on Twitter and the blog, the problem of defining what a bioinformatician is still really sticks in people's throats.



100. What's your all-time favorite piece of bioinformatics software, and why?

SAMtools. It's an amazing piece of very stable, utilitarian, open source code that forms the backbone of most sequencing pipelines.



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

I struggled the most with this question! Y, because 'pyrimidine' is a pretty word and so Y not.

~crickets~

The names of bioinformatics tools that help study evolution shouldn't feel that they also have to evolve

Thanks to Torsten Seemann for bringing this to my attention…


In 2003 a bioinformatics tool was published. A tool with a thoroughly sensible name and acronym:

A simple name with a simple, and not-too-bogus, initialism. Bravo. However, a subsequent update to BIBI brought about a change to the name:

  • le BIBI:

Where the 'le' refers to 'light edition'. It should be said that most references to this tool drop the superscript notation for 'le'. Let's move forward to the present day and the publication of another version of this tool:

The full expansion of this new name is as follows:

  • Light Edition Bioinformatics Bacterial Identification Tool Quick Bioinformatic Phylogeny of Prokaryotes

Quite a mouthful! Bonus points for including 'Bioinformatics' and 'Bioinformatic' as part of the same name, as well as the largely redundant inclusion of both 'Bacterial' and 'Prokaryotes'.

Generally I find the use of superscript in software names to be largely unnecessary. It can make the tool name harder to read and it is unlikely to be reproduced verbatim by others who mention your software. Starting your software name with a lowercase letter also means that it might appear in uppercase if used to start a sentence (as happens several times in the above paper). Not a terrible problem, but it reduces the strength and consistency of your 'bioinformatics brand'.

New JABBA award for Just Another Bogus Bioinformatics Acronym

Here's a new tool that was described recently in the journal Bioinformatics:

Let's see how the letters in 'GREGOR' are derived:

  1. G = Genomic — a solid start
  2. R = Regulatory — fine
  3. E = Elements — all good so far
  4. 'and' — okay, we'll allow a conjunction or two
  5. G = Gwas — hmm, including an acronym/initialism inside another acronym is rarely a good idea
  6. O = Overlap — that's fine, time for the big finish…
  7. R = algoRithm — oh come on!

Congratulations GREGOR — or should I say GREGOA? — you win a JABBA award!
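For anyone who wants to check a name against its expansion, here is a minimal Python sketch of the 'first letter of each word' rule applied above. The function names and the list of skipped words are my own assumptions for illustration, not part of any published tool.

```python
# Small words that an acronym is allowed to skip (an illustrative choice).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the"}


def initialism(expansion: str) -> str:
    """Return the initialism built from the first letter of each word.

    Words listed in STOP_WORDS are skipped, everything else contributes
    exactly one (uppercased) letter.
    """
    words = [w for w in expansion.split() if w.lower() not in STOP_WORDS]
    return "".join(w[0].upper() for w in words)


def is_honest(name: str, expansion: str) -> bool:
    """True if the name matches the initialism of its expansion."""
    return name.upper() == initialism(expansion)


if __name__ == "__main__":
    expansion = "Genomic Regulatory Elements and Gwas Overlap algoRithm"
    print(initialism(expansion))            # GREGOA
    print(is_honest("GREGOR", expansion))   # False
```

A name only passes if every letter comes from the start of a word, which is a stricter standard than most JABBA winners would enjoy.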

If you want your bioinformatics software to have a memorable name, it helps if the name is pronounceable


There is a new paper in the journal Bioinformatics:

The paper describes a new method for implementing a Principal Component Analysis (PCA) of data. That new method has a name. That name has just seven characters. How hard can it be to pronounce?

  • S4VDPCA: ess-four-vee-dee-pee-cee-ay

It doesn't exactly trip off the tongue and having four 'ee-sounding' letters together (VDPC) doesn't make it easy to remember. When I first came across this paper, I skimmed the article, waited an hour, and then tried to remember the name. I could remember that it included '4', 'V', and 'D', but couldn't remember the order (or that it also included an 'S').

It is by no means essential that bioinformatics tools have easily pronounceable names, but this will help people remember the name of your software. In turn, this makes it easier for people to tell others about your software. I don't imagine that bioinformatics software developers ever want to overhear the following type of conversation:

Bob: "You should use that tool"

Sue: "What tool?"

Bob: "Umm, you know that PCA thingy. The S…something, something…PCA tool"

Sue: "The what?"

Bob: "Run a Google search for Bioinformatics PCA tools, it's probably the top hit."

Sue: <- facepalm ->

The Francis Crick Institute has signed the Hague Declaration

The Hague Declaration is an important manifesto that aims to provide guidelines for how to "best enable access to facts, data and ideas for knowledge discovery in the Digital Age". Although signatories to the declaration include large scientific research institutes, you can also sign the declaration as an individual. The five main principles of the declaration are summarized as follows:

  1. Intellectual property was not designed to regulate the free flow of facts, data and ideas, but has as a key objective the promotion of research activity
  2. People should have the freedom to analyse and pursue intellectual curiosity without fear of monitoring or repercussions
  3. Licenses and contract terms should not restrict individuals from using facts, data and ideas
  4. Ethics around the use of content mining techniques will need to continue to evolve in response to changing technology
  5. Innovation and commercial research based on the use of facts, data, and ideas should not be restricted by intellectual property law

These principles are obviously of huge relevance to the field of genomics, which seems to be generating tools and data at an ever-increasing rate. So I was happy to read today that the new Francis Crick Institute in London is one of the Declaration's latest signatories:

"The large amounts of data and information that are now becoming available represent an extraordinary resource for researchers. By signing the Hague Declaration the Francis Crick Institute is expressing its support for the idea that researchers should be able to mine such content freely, thereby to advance knowledge and to promote Discovery without Boundaries."

Jim Smith, Director of Research at the Francis Crick Institute

Front Line Genomics interview with Craig Venter includes a question from yours truly

Issue 4 of the Front Line Genomics magazine is now available online. It includes an interview with Craig Venter, who gave a much-anticipated talk at their recent Festival of Genomics conference in Boston. Front Line Genomics kindly allowed some of their previous interviewees (including me) to pose some of the questions. Here's mine:

KRB: What do you see as the limits of synthetic biology? Could we assemble a functional eukaryotic genome, and what are the practical applications of such technology?

JCV: That’s a great question! The limitations will ultimately be more society limitations, ethical limitations, and standards rather than technology. I think a synthetic single eukaryotic cell would be very straightforward to do today. Various groups of scientists have been trying to build the yeast genome. It’s kind of like rebuilding a house one brick at a time, but they’re making a synthetic version of yeast. That’s not quite the same as writing the genetic code and then booting it up as we did, but that’s just because of the limitations on writing the genetic code now.

I think understanding what makes a multicellular organism, and all the regulation associated with that, are so far away from design that we’re going to have to learn a whole lot more biology before we get to that stage of deliberate design. I think about 10% of the genes in our designed synthetic bacterial cell, are of unknown function. All we know is that you can’t get life without them. That problem expands tremendously with eukaryotic cells. If you extrapolate to the challenge of interpreting the human genome, we only understand a tiny fraction of the human genome today.

Get With the Program: DIY tips for adding coding to your analysis arsenal

A new article in The Scientist magazine by Jeffrey M. Perkel shares some coding advice from Cole Trapnell, C. Titus Brown, and Vince Buffalo (I interviewed Vince in my last blog post). It is a great article, and worth a look. I particularly enjoyed this piece of advice (something that is not mentioned enough):

Treat data as "read-only"
Use an abundance of caution when working with your hard-won data, Buffalo says. For instance, “treat data as read-only.” In other words, don’t work with original copies of the data, make working copies instead. “If you have the data in an Excel spreadsheet and you make a change, that original data is gone forever,” he says.

I have seen too many students double-click on FASTA, GFF, and other large bioinformatics text files and end up 'viewing' them in some inappropriate program (including Microsoft Word). If you want to view text data, use a text viewer (such as less).
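One way to put the 'read-only' advice into practice is sketched below in Python. It assumes a Unix-like filesystem, and the helper name make_working_copy and the example file names are hypothetical, chosen only for illustration. The idea is to strip write permission from the original file and do all editing on a separate copy.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def make_working_copy(original: Path, work_dir: Path) -> Path:
    """Mark `original` read-only and return a writable copy in `work_dir`."""
    # Clear the write bits for user, group, and others on the original.
    mode = original.stat().st_mode
    original.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

    # copyfile copies the contents only, so the copy keeps default
    # (writable) permissions rather than inheriting the read-only mode.
    work_dir.mkdir(parents=True, exist_ok=True)
    copy = work_dir / original.name
    shutil.copyfile(original, copy)
    return copy


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "reads.fasta"        # stand-in for real data
        raw.write_text(">seq1\nACGT\n")
        working = make_working_copy(raw, Path(tmp) / "work")
        working.write_text(">seq1\nAAAA\n")    # edits touch the copy only
        print(raw.read_text())                 # original is unchanged
```

Removing write permission will not stop a determined user (or root), but it does turn an accidental overwrite into an error message instead of lost data.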

101 questions with a bioinformatician #30: Vince Buffalo

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Vince is a second year graduate student in the lab of Graham Coop at UC Davis. Before that he earned his bioinformatics 'chops' working in other groups on the UC Davis campus as a bioinformatician and statistical programmer.

I came to know Vince when he was working as part of the Genome Center's Bioinformatics Core Facility; I was immediately impressed, not only by his diverse set of computational skills, but by the way he applied those skills. Put simply, Vince does things the right way. He believes that bioinformatics should be a carefully documented, reproducible science. He also sees the strengths and advantages of using core Unix skills to organize and manage bioinformatics pipelines. These skills will provide a more useful, and lasting, toolbox than if you only ever learn how to use the latest and greatest set of published bioinformatics tools.

Impressively, Vince has recently published a book (Bioinformatics Data Skills by O'Reilly). I highly encourage people to buy it, and I'm convinced that it will become an indispensable guide for everyone working in this field. In the book's introduction, he neatly states the problem that I alluded to earlier:

Many biologists starting out in bioinformatics tend to equate “learning bioinformatics” with “learning how to run bioinformatics software.” This is an unfortunate and misinformed idea of what bioinformaticians actually do. This is analogous to thinking “learning molecular biology” is just “learning pipetting.” … the approach of this book is to focus on the skills bioinformaticians use to explore and extract meaning from complex, large bioinformatics datasets.

You can find out more about Vince by visiting his 'digital notebook' website at vincebuffalo.org, or by following him on Twitter (@vsbuffalo). And now, on to the 101 questions...



001. What's something that you enjoy about current bioinformatics research?

Watching bioinformatics grow to tackle exciting evolutionary questions, especially with non-model organisms. While bioinformatics has clearly revolutionized the human genomics field, I think in the next decade we'll see interesting developments in bioinformatics tailored to problems in complex non-model organism genomics.

I love plants and have worked in plant genomics, and I've seen first hand that it's very hard. Many bioinformatics tools we used were designed to work with human data, not gigantic polyploid genomes. It will be exciting over the next few years to see how reads grow in length, new algorithms emerge, and how this will enable more non-model research. As a budding evolutionary biologist, I'm hopeful that these bioinformatics advances will fuel more discoveries in neat species that have traditionally been harder to work with.



010. What's something that you don't enjoy about current bioinformatics research?

A large proportion of a bioinformatician's time is spent tackling unnecessary human-made problems: data is poorly organized, file formats are both poorly specified and followed, and software is often poorly documented or isn't robust to different data. These are neither interesting scientific problems nor fun computational problems — these are frustrating social and community issues. No one wants to tackle these problems for that reason, but at some point we'll have to as a community — to avoid wasting our collective time on these annoyances.



011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Study more mathematics. I fell in love with statistics before I did math because I quickly saw the beauty in using statistics to understand data. Now I'm working backwards and trying to bolster my maths skills and seeing the beauty in other mathematical fields and really enjoying it. Darwin said "mathematics seems to endow one with something like a new sense" — I'd argue that this is especially true in biology.



100. What's your all-time favorite piece of bioinformatics software, and why?

It's a tie — SAMtools and PSMC. SAMtools is an amazing piece of engineering — from an algorithmic perspective, from a usability perspective, and from a community perspective. If you dig inside the source, everything is so cleverly written and carefully optimized (e.g. the klib library). I've learned a lot of C tricks from reading Heng Li's code.

SAMtools is also extremely well designed from the user perspective — it adopts the Unix philosophy and its subcommand interface is much like Git's. However, SAMtools is not a perfect program; there have been numerous bugs found over the years and some folks attack it for this. But these bugs are quickly patched thanks to active development and an excellent community. I don't work on SAMtools (other than one tiny bug fix) but I enjoy following along via GitHub and reading and learning from the source.



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

S — and it's a simple puzzle why this is the letter I chose.