101 questions with a bioinformatician #25: Alex Bateman

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

Alex Bateman is in charge of Protein Sequence Resources at the European Bioinformatics Institute (EBI). You might know him from his role in developing such popular bioinformatics tools as Pfam, Rfam, and TreeFam (rumors abound that their planned database of Ox genome resources had to be cancelled because they couldn't secure the desired name).

A recipient of the Benjamin Franklin Award for Open Access in the Life Sciences, Alex has also been an enthusiastic advocate of Wikipedia as a resource that scientists should utilize more. To borrow a couple of British-isms, Alex is a 'jolly nice chap' and a 'thoroughly decent bloke', and I say that as someone who once had to stand on a milk crate with him as part of a management training exercise (don't ask!). I sometimes think that he has found the elixir of life as he never seems to age. Oh, and you should also be aware that he has a black belt in origami.

You can find out more about Alex by visiting his group's website at the EBI, or by following him on Twitter (@alexbateman1). And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

Data, data, data. When I started out there wasn't really any data. Complete cellular genomes were still not available. About 50,000 protein sequences were known and that was not enough. I worked on the IGPS protein family and built a tree from 4 sequences! For most protein families we didn't have enough sequences to build accurate profile models. Now we are rapidly heading towards having 1 billion protein sequences, which I feel is more than enough to build good protein family models. Techniques that look at the correlation of positions in multiple sequence alignments to infer interacting residues have really started to deliver results.
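For readers new to the idea of correlated positions, here is a minimal sketch of one simple way to score how strongly two alignment columns co-vary: mutual information. This is an illustration only, not a method Alex describes; the contact-prediction techniques he alludes to are considerably more sophisticated. The function name `mutual_information` and the four-sequence `toy_alignment` are invented for this example.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `alignment` is a list of equal-length, already-aligned sequences.
    Illustrative sketch: no gap handling, sequence weighting, or
    small-sample correction, all of which real analyses need.
    """
    n = len(alignment)
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]

    # Observed residue counts per column, and counts of residue pairs.
    counts_i = Counter(col_i)
    counts_j = Counter(col_j)
    counts_ij = Counter(zip(col_i, col_j))

    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: columns 1 and 2 change together (K with L,
# R with I), while column 3 varies independently of column 1.
toy_alignment = [
    "AKLE",
    "AKLD",
    "ARIE",
    "ARID",
]

covarying = mutual_information(toy_alignment, 1, 2)
independent = mutual_information(toy_alignment, 1, 3)
print(f"columns 1 and 2: {covarying:.2f} bits")
print(f"columns 1 and 3: {independent:.2f} bits")
```

With only four sequences, as in the IGPS tree mentioned above, a score like this would be dominated by noise. That is why the growth in available sequences matters so much for these methods.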

010. What's something that you don't enjoy about current bioinformatics research?

I think that too many people work on the same old problems. That isn't specific to bioinformatics, but pervades all science. There is little incentive for scientists to be innovative. It's more difficult to get grants for innovative work, and it's more difficult to publish innovative scientific papers. It is of course easier to move forwards incrementally using existing tools, techniques and reagents. There are so many proteins of unknown function to work on, yet people spend vast resources studying proteins that are already well characterised. The entire CRISPR system was until a few years ago a collection of uncharacterised protein families identified by bioinformaticians.

Major jumps in science happen when the impossible is attempted. It turns out that the impossible is often much easier than expected!

011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Stop worrying so much. Do what you enjoy; there is a niche out there for you!

100. What's your all-time favorite piece of bioinformatics software, and why?

I was the editor for the NAR database issue for four years and so I looked at hundreds of online molecular biology databases.

The STRING database stood out for me because it is so well implemented. It is very easy to use, it looks brilliant and it is super fast. If you want a good model to follow for designing a web-based tool then this is it. However, it is not perfect. When picking the name of your web-based resource it's best not to give it the same name as a common object such as string. Searching Google for STRING may not give the database as the top hit.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

U - being a nucleotide unique to RNA, it reflects my strong focus on non-coding RNAs rather than DNA.