Awkward Bioinformatics Conversations #1

Image from flickr user hades2k

Bob: Hi Sue, it's Bob. Got a favor to ask. Could you load up the UCSC Genome Browser site in your web browser please?

Sue: Hi Bob. So just to clarify…do you want me to load the UCSC Genome Browser homepage, or the UCSC Genome Browser website Genome Browser tool?

Bob: Wait, what?

Sue: UCSC Genome Browser is both the name of the website — as identified in their HTML metadata — and also the name of a tool on that website.

Bob: Er, just go to the main website first.

Sue: Done.

Bob: So it says that it's the UCSC Genome Browser website?

Sue: Yes. And no.

Bob: Huh?

Sue: The homepage identifies itself as the UCSC Genome Bioinformatics site but it also welcomes you to the UCSC Genome Browser website.

Bob: Okay, that sounds like you're looking at the right page then. So can you now please click on the Genome Browser link at the top of the page?

Sue: Which one?

Bob: What? Er, the one in the toolbar I guess.

Sue: Which one?

Bob: I just told you which one.

Sue: No, which toolbar?

Bob: There's more than one?

Sue: There's the horizontal toolbar that mostly contains dropdown menus with links that expose most of the site's functionality…and then there's the vertical toolbar which mostly offers links to items that also exist in sections of the horizontal toolbar.

Bob: But surely there's only one toolbar link which says 'Genome Browser'?

Sue: No, there are two.

Bob: But they go to the same place, right?

Sue: No. The horizontal toolbar link for 'Genome Browser' takes you into the Genome Browser tool with data loaded for the human genome assembly. The vertical toolbar link for 'Genome Browser' takes you to an intermediate page that lets you access the 'Human Genome Browser gateway'. Which one do you want? Bob? Hello???

Shining a light on more bogus bioinformatics acronyms

Courtesy of an anonymous tip-off…

There is a new bioinformatics tool that was described in a recently published BMC Genomics article. Here is the full name of the tool with any capitalization removed:

  • automated tool for bacterial genome annotation comparison

So can you guess what acronym/name was extracted from this description?

  • ATBGAC?
  • AutoBGA?
  • BGAC?

No. The JABBA-award-winning name of this tool is as follows:

  • BEACON

This name really isn't helped by the fact that it is shown as follows in the journal article title (with the G of 'Genome' also capitalized):

  • BEACON: automated tool for Bacterial GEnome Annotation ComparisON

101 questions with a bioinformatician #31: Morgan Taschuk

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Morgan Taschuk is a Senior Manager for Genome Sequence Informatics at the Ontario Institute for Cancer Research (OICR). She manages the production sequence analysis team, which analyses all of the sequence data generated at OICR and produces alignment files, variant calls, QC metrics and other bountiful amounts of sequence data for OICR researchers and collaborators.

She recently wrote a great blog post regarding the (sometimes contentious) issue of Biologists vs Bioinformaticians. Definitely worth a read. Morgan has also recently started to assemble a Twitter list of Women in Bioinformatics, now up to 179 members. I'm sure she would like to make that list even longer, so please let her know of any omissions.

You can find out more about Morgan by visiting her Modern Model Organism blog, or by following her on Twitter (@morgantaschuk). And now, on to the 101 questions...



001. What's something that you enjoy about current bioinformatics research?

There's always something more to learn. I'm spending a lot of time with our genomics lab recently and learning about how lab processes impact our data fascinates me. Bioinformatics skills are usually in demand so I also get to work with a wide variety of people with different questions and problems and have to stretch my brain to apply myself.



010. What's something that you don't enjoy about current bioinformatics research?

Often people write their own scripts or software instead of looking for something that already exists out there. Not only is it wasted effort for very similar results, it sabotages any attempt to standardize across the field. Open-source software is there for everyone to change and improve. Why not build on a foundation instead of digging the hole yourself?



011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Since nobody can tell you what bioinformatics is, it's up to you to define it. I spent a long time fighting with imposter syndrome, not just because I felt inadequate but also because I was called a bioinformatician when I didn't fit the classical model. Nobody fits the classical model these days. Thinking about this question actually inspired me to write a blog post about the difference between bioinformaticians and computational biologists. Judging from the feedback on Twitter and the blog, the problem of defining what a bioinformatician is still really sticks in people's throats.



100. What's your all-time favorite piece of bioinformatics software, and why?

SAMtools. It's an amazing piece of very stable, utilitarian, open source code that forms the backbone of most sequencing pipelines.



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

I struggled the most with this question! Y, because 'pyrimidine' is a pretty word and so Y not.

~crickets~

The names of bioinformatics tools that help study evolution shouldn't feel that they also have to evolve

Thanks to Torsten Seemann for bringing this to my attention…


In 2003 a bioinformatics tool was published. A tool with a thoroughly sensible name and acronym:

A simple name with a simple, and not-too-bogus, initialism. Bravo. However, a subsequent update to BIBI brought about a change to the name:

  • le BIBI:

Where the 'le' refers to 'light edition'. It should be said that most references to this tool drop the superscript notation for 'le'. Let's move forward to the present day and the publication of another version of this tool:

The full expansion of this new name is as follows:

  • Light Edition Bioinformatics Bacterial Identification Tool Quick Bioinformatic Phylogeny of Prokaryotes

Quite a mouthful! Bonus points for including 'Bioinformatics' and 'Bioinformatic' as part of the same name, as well as the largely redundant inclusion of 'Bacterial' as well as 'Prokaryotes'.

Generally I find the use of superscript in software names to be largely unnecessary. It can make the tool name harder to read and it is unlikely to be reproduced verbatim by others who mention your software. Starting your software name with a lowercase letter also means that it might end up capitalized when used to start a sentence (as happens several times in the above paper). Not a terrible problem but it reduces the strength and consistency of your 'bioinformatics brand'.

New JABBA award for Just Another Bogus Bioinformatics Acronym

Here's a new tool that was described recently in the journal Bioinformatics:

Let's see how the letters in 'GREGOR' are derived:

  1. G = Genomic — a solid start
  2. R = Regulatory — fine
  3. E = Elements — all good so far
  4. 'and' — okay, we'll allow a conjunction or two
  5. G = Gwas — hmm, including an acronym/initialism inside another acronym is rarely a good idea
  6. O = Overlap — that's fine, time for the big finish…
  7. R = algoRithm — oh come on!

Congratulations GREGOR — or should I say GREGOA? — you win a JABBA award!
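
For anyone who wants to check the joke, here is a minimal Python sketch of what an honest initialism would look like. The full name is simply assembled from the seven list items above, and `naive_initialism` is a hypothetical helper written for this illustration (it takes the first letter of every word and skips conjunctions); it is not anything from the GREGOR paper.

```python
def naive_initialism(name: str, skip: frozenset[str] = frozenset({"and"})) -> str:
    """Build an initialism from the first letter of each word, skipping conjunctions."""
    return "".join(
        word[0].upper() for word in name.split() if word.lower() not in skip
    )


# Full name assembled from the list above (an assumption about word order).
full_name = "Genomic Regulatory Elements and Gwas Overlap algoRithm"
print(naive_initialism(full_name))  # GREGOA
```

The only way to get from 'GREGOA' to 'GREGOR' is to reach into the middle of 'algorithm', which is exactly the kind of move that earns a JABBA award.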

If you want your bioinformatics software to have a memorable name, it helps if the name is pronounceable

Image from she is geeky blog

There is a new paper in the journal Bioinformatics:

The paper describes a new method for implementing a Principal Component Analysis (PCA) of data. That new method has a name. That name has just seven characters. How hard can it be to pronounce?

  • S4VDPCA: ess-four-vee-dee-pee-cee-ay

It doesn't exactly trip off the tongue and having four 'ee-sounding' letters together (VDPC) doesn't make it easy to remember. When I first came across this paper, I skimmed the article, waited an hour, and then tried to remember the name. I could remember that it included '4', 'V', and 'D', but couldn't remember the order (or that it also included an 'S').

It is by no means essential that bioinformatics tools have easily pronounceable names, but this will help people remember the name of your software. In turn, this makes it easier for people to tell others about your software. I don't imagine that bioinformatics software developers ever want to overhear the following type of conversation:

Bob: "You should use that tool"

Sue: "What tool?"

Bob: "Umm, you know that PCA thingy. The S…something, something…PCA tool"

Sue: "The what?"

Bob: "Run a Google search for Bioinformatics PCA tools, it's probably the top hit."

Sue: <- facepalm ->

The Francis Crick Institute has signed the Hague Declaration

The Hague Declaration is an important manifesto that aims to provide guidelines for how to "best enable access to facts, data and ideas for knowledge discovery in the Digital Age". Although signatories to the declaration include large scientific research institutes, you can also sign the declaration as an individual. The five main principles of the declaration are summarized as follows:

  1. Intellectual property was not designed to regulate the free flow of facts, data and ideas, but has as a key objective the promotion of research activity
  2. People should have the freedom to analyse and pursue intellectual curiosity without fear of monitoring or repercussions
  3. Licenses and contract terms should not restrict individuals from using facts, data and ideas
  4. Ethics around the use of content mining techniques will need to continue to evolve in response to changing technology
  5. Innovation and commercial research based on the use of facts, data, and ideas should not be restricted by intellectual property law

These principles are obviously of huge relevance for the field of genomics, which seems to be generating tools and data at an ever-increasing rate. So I was happy to read today that the new Francis Crick Institute in London is one of the Declaration's latest signatories:

"The large amounts of data and information that are now becoming available represent an extraordinary resource for researchers. By signing the Hague Declaration the Francis Crick Institute is expressing its support for the idea that researchers should be able to mine such content freely, thereby to advance knowledge and to promote Discovery without Boundaries."

Jim Smith, Director of Research at the Francis Crick Institute

Front Line Genomics interview with Craig Venter includes a question from yours truly

Issue 4 of the Front Line Genomics magazine is now available online. It includes an interview with Craig Venter, who gave a much-anticipated talk at their recent Festival of Genomics conference in Boston. Front Line Genomics kindly allowed some of their previous interviewees (myself included) to pose some of the questions. Here's mine:

KRB: What do you see as the limits of synthetic biology? Could we assemble a functional eukaryotic genome, and what are the practical applications of such technology?

JCV: That’s a great question! The limitations will ultimately be more society limitations, ethical limitations, and standards rather than technology. I think a synthetic single eukaryotic cell would be very straightforward to do today. Various groups of scientists have been trying to build the yeast genome. It’s kind of like rebuilding a house one brick at a time, but they’re making a synthetic version of yeast. That’s not quite the same as writing the genetic code and then booting it up as we did, but that’s just because of the limitations on writing the genetic code now.

I think understanding what makes a multicellular organism, and all the regulation associated with that, are so far away from design that we’re going to have to learn a whole lot more biology before we get to that stage of deliberate design. I think about 10% of the genes in our designed synthetic bacterial cell, are of unknown function. All we know is that you can’t get life without them. That problem expands tremendously with eukaryotic cells. If you extrapolate to the challenge of interpreting the human genome, we only understand a tiny fraction of the human genome today.

Get With the Program: DIY tips for adding coding to your analysis arsenal

A new article in The Scientist magazine by Jeffrey M. Perkel shares some coding advice from Cole Trapnell, C. Titus Brown, and Vince Buffalo (I interviewed Vince in my last blog post). It is a great article, and worth a look. I particularly enjoyed this piece of advice (something that is not mentioned enough):

Treat data as "read-only"
Use an abundance of caution when working with your hard-won data, Buffalo says. For instance, “treat data as read-only.” In other words, don’t work with original copies of the data, make working copies instead. “If you have the data in an Excel spreadsheet and you make a change, that original data is gone forever,” he says.

I have seen too many students double-click on FASTA, GFF, and other large bioinformatics text files and end up 'viewing' them in some inappropriate program (including Microsoft Word). If you want to view text data, use a text viewer (such as less).
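
If you would rather do this from a script, here is a minimal Python sketch of the same two habits: work on a copy while the original is locked against writing, and peek at the top of a big text file instead of opening all of it. The function names (`make_working_copy`, `peek`) and the directory layout are my own assumptions for illustration; nothing here comes from the article itself.

```python
import shutil
import stat
from itertools import islice
from pathlib import Path


def make_working_copy(original: Path, workdir: Path) -> Path:
    """Copy `original` into `workdir`, then mark the original read-only."""
    workdir.mkdir(parents=True, exist_ok=True)
    copy = workdir / original.name
    shutil.copy2(original, copy)  # copy made while the original is still writable
    # Strip all write permission bits from the original so accidental edits fail.
    no_write = ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
    original.chmod(original.stat().st_mode & no_write)
    return copy


def peek(path: Path, n: int = 10) -> list[str]:
    """Return the first `n` lines of a text file without reading the rest."""
    with path.open() as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

Something like `peek(make_working_copy(Path("raw/genes.gff"), Path("work")))` (hypothetical paths) would then show the first ten lines of the working copy, a rough stand-in for `head` or for the first screen of `less`, while the file under `raw/` stays untouched.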