101 questions with a bioinformatician #30: Vince Buffalo

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

Vince is a second-year graduate student in the lab of Graham Coop at UC Davis. Before that he earned his bioinformatics 'chops' working in other groups on the UC Davis campus as a bioinformatician and statistical programmer.

I came to know Vince when he was working as part of the Genome Center's Bioinformatics Core Facility; I was immediately impressed, not only by his diverse set of computational skills, but by the way he applied those skills. Put simply, Vince does things the right way. He believes that bioinformatics should be a carefully documented, reproducible science. He also sees the strengths and advantages of using core Unix skills to organize and manage bioinformatics pipelines. These skills will provide a more useful, and lasting, toolbox than if you only ever learn how to use the latest and greatest set of published bioinformatics tools.

Impressively, Vince has recently published a book (Bioinformatics Data Skills, published by O'Reilly). I highly encourage people to buy it, and I'm convinced that it will become an indispensable guide for everyone working in this field. In the book's introduction, he neatly states the problem that I alluded to earlier:

Many biologists starting out in bioinformatics tend to equate “learning bioinformatics” with “learning how to run bioinformatics software.” This is an unfortunate and misinformed idea of what bioinformaticians actually do. This is analogous to thinking “learning molecular biology” is just “learning pipetting.” … the approach of this book is to focus on the skills bioinformaticians use to explore and extract meaning from complex, large bioinformatics datasets.

You can find out more about Vince by visiting his 'digital notebook' website at vincebuffalo.org, or by following him on Twitter @vsbuffalo. And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

Watching bioinformatics grow to tackle exciting evolutionary questions, especially with non-model organisms. While bioinformatics has clearly revolutionized the human genomics field, I think in the next decade we'll see interesting developments in bioinformatics tailored to problems in complex non-model organism genomics.

I love plants and have worked in plant genomics, and I've seen firsthand that it's very hard. Many bioinformatics tools we used were designed to work with human data, not gigantic polyploid genomes. It will be exciting over the next few years to see how reads grow in length, how new algorithms emerge, and how this enables more non-model research. As a budding evolutionary biologist, I'm hopeful that these bioinformatics advances will fuel more discoveries in neat species that have traditionally been harder to work with.

010. What's something that you don't enjoy about current bioinformatics research?

A large proportion of a bioinformatician's time is spent tackling unnecessary human-made problems: data is poorly organized, file formats are both poorly specified and poorly followed, and software is often poorly documented or isn't robust to different data. These are neither interesting scientific problems nor fun computational problems; they are frustrating social and community issues. No one wants to tackle these problems for that reason, but at some point we'll have to as a community, to avoid wasting our collective time on these annoyances.

011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Study more mathematics. I fell in love with statistics before I did math because I quickly saw the beauty in using statistics to understand data. Now I'm working backwards and trying to bolster my maths skills and seeing the beauty in other mathematical fields and really enjoying it. Darwin said "mathematics seems to endow one with something like a new sense" — I'd argue that this is especially true in biology.

100. What's your all-time favorite piece of bioinformatics software, and why?

It's a tie — SAMtools and PSMC. SAMtools is an amazing piece of engineering — from an algorithmic perspective, from a usability perspective, and from a community perspective. If you dig inside the source, everything is so cleverly written and carefully optimized (e.g. the klib library). I've learned a lot of C tricks from reading Heng Li's code.

SAMtools is also extremely well designed from the user perspective — it adopts the Unix philosophy and its subcommand interface is much like Git's. However, SAMtools is not a perfect program; there have been numerous bugs found over the years and some folks attack it for this. But these bugs are quickly patched thanks to active development and an excellent community. I don't work on SAMtools (other than one tiny bug fix) but I enjoy following along via GitHub and reading and learning from the source.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

S — and it's a simple puzzle why this is the letter I chose.