Developing CEGMA: how working on old code can drive you mad and some tips on how to avoid this

Today marks the day when the original paper that describes the CEGMA software (Core Eukaryotic Gene Mapping Approach) becomes my most cited paper (as tracked by Google Scholar):

Does this fact make me happy? Not really. In fact, you may be surprised to learn that I find working on CEGMA a little bit depressing. I say this on a day when, purely coincidentally, I am releasing a new version of CEGMA. Why the grumpy face Keith? (I hear you ask). Let's take a trip down memory lane to find out why:

  • Early 2004: A paper is published that describes the KOGs database of euKaryotic Orthologous Groups.
  • Early 2005: I become the first person to join the Korf Lab after Ian Korf moves to Davis in 2004.
  • Mid 2005: Genís Parra becomes the second person to join the lab.
  • 2005–2006: The three of us work on the idea which became CEGMA. This project was primarily driven forward by Genís; during this time our initial CEGMA manuscript was rejected by two journals.
  • Late 2006: Our CEGMA paper was accepted!
  • Early 2007: CEGMA paper is published — as an aside, the URL for CEGMA that we include in the paper still works!
  • 2007: We work on the CEGMA spin-off idea: that it can be used to assess the 'gene space' of draft genomes.
  • 2008: Write new manuscript, get rejected twice (again), finally get accepted late 2008.
  • Early 2009: The 2nd CEGMA paper gets published!
  • Mid 2010: Genís leaves the lab.

By the time Genís left Davis, our original CEGMA paper had been cited 11 times (one of which was by our second CEGMA paper). I think that we had all expected the tool to have been a little more popular, but our expectations had been dampened somewhat by the difficulties in getting the paper published. Anyway, no sooner than Genís had left the lab, then the paper started getting a lot more attention:

Growth in citations to the two CEGMA papers.

This was in no doubt due to its use as a tool in the Assemblathon 1 paper (of which I was also involved), a project that started in late 2010. However, any interest generated from the Assemblathon project probably just reflected the fact that everyone and their dog had started sequencing genomes and producing — how best to describe them? —'assemblies of questionable quality'.

This is also about the time when I started to turn into this guy:

This was because it had fallen on me to continue to deal with all CEGMA-related support requests. Until 2010, there hadn't really been any support requests because almost no-one was using CEGMA. This changed dramatically and I started to receive lots of emails that:

  • Asked questions about interpreting CEGMA output
  • Reported bugs
  • Asked for help installing CEGMA
  • Suggested new features
  • Asked me to run CEGMA for them

I started receiving lots of the latter requests because CEGMA is admittedly a bit of a pig to install (on non Mac-based Unix systems at least). In the last 6 months alone, I've run CEGMA 80 times for various researchers who (presumably) are unable to install it themselves.

After the version 2.3 release — necessary to transition to the use of NCBI BLAST+ instead of WU-BLAST — and 2.4 release — necessary to fix the bugs I introduced in v2.3! — I swore an oath never to update CEGMA again. This was mostly because we no longer have any money to work on the current version of CEGMA. However, it was also because it is not much fun to spend your days working on code that you barely understand.

It should be said that we do have plans for a completely new version of CEGMA that will — subject to our grant proposal being successful — be redeveloped from the ground up, and will include many completely new features. Perhaps most importantly — for me at least — a version 3.0 release of CEGMA will be much more maintainable.

And now we get to the main source of my ire when dealing with CEGMA. It is built on a complex web of Perl scripts and modules, which make various system calls to run BLAST, genewise, geneid, and hmmsearch (from HMMER). I still find the scripts difficult to understand — I didn't write any of the original code — and therefore I find it almost impossible to maintain. One of the reasons I had to make this v2.5 update is because the latest versions of Perl have deprecated a particular feature causing CEGMA to break for some people.

Most fundamentally, the biggest problem with CEGMA (v2.x) is that it is centered around use of the KOGs database, a resource that is now over a decade old. This wasn't an issue when we were developing the software in 2005, but it is an issue now. Our plans for CEGMA v3.0 will address this by moving to a much more modern source of orthologous group information.

In making this final update to v2.x of CEGMA, I've tried adopting some changes to bring us up to date with the modern age. Although the code remains available from our lab's website, I've also pushed the code to GitHub (which wasn't in existence when we started developing CEGMA!). In doing this, I've also taken the step to give our repository a DOI and therefore make the latest version citable in its own right. This is done through use of Zenodo.

Although I hope that this is the last thing that I ever have to write about CEGMA v2.x, it is worth reflecting on some of the ways that the process of managing and maintaining CEGMA could have been made easier:

  1. Maintain documentation for your code that is more than just an installation guide and a set of embedded comments. From time to time, I've had some help from Genís in understanding how the code is working, but the complexity of this software really requires a detailed document that explains how and why everything works the way it does. There have been times when I have been unable to help people with CEGMA-related questions because I still can't understand what some of the code is doing.
  2. Start a FAQ file from day one. This is something that, foolishly, I have only recently started. I could have probably saved myself many hours of email-related support if I had sorted this out earlier.
  3. Put your code online for others to contribute to. Although GitHub wasn't around when we started CEGMA, I could have put the code up there at some point before today!
  4. Don't assume that people will use a mailing list for support, or even contact you directly. One thing I did do many years ago, is set up a CEGMA mailing list. However, I'm still surprised that many people just report their CEGMA problems on sites like SEQanswers or BioStars. I probably should have started checking these sites earlier.
  5. Don't underestimate how much time can be spent supporting software! I probably should have started setting aside a fixed portion of time each week to deal with CEGMA-related issues, rather than trying to tackle things as and when they landed on my doorstep.
  6. Assume that you will not be the last person to manage a piece of software. There are many things you can do to start good practices very early on, including using email addresses for support which are not tied to a personal account, ensuring that your changes to the code base have meaningful (and helpful) commit messages, and making sure that more than one person has access to wherever the code is going to end up.

In some ways it is very unusual for software to have this type of popularity where people only start using it several years after it is originally developed. But as CEGMA shows, it can happen, and hopefully these notes will serve as a bit of a warning to others who are developing bioinformatics software.

Fun with an 'error message' from NCBI BLAST+

Consider this very simple DNA sequence in FASTA format:

>sequence1
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
ttttttagaaaaattatttttaagaatttttcattttaggaatattgtta
tttcagaaaatagctaaatgtgatttctgtaattttgcctgccaaattcg
tgaaatgcaataaaaatctaatatccctcatcagtgcgatttccgaatca
gtatatttttacgtaatagcttctttgacatcaataagtatttgcctata
tgactttagacttgaaattggctattaatgccaatttcatgatatctagc
cactttagtataattgtttttagtttttggcaaaactattgtctaaacag

If you try converting this to a BLAST database using the 'makeblastdb' command from the latest version of NCBI's BLAST+ suite, you will see the following line included in the output:

Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 100% ambiguous nucleotides (shouldn't be over 40%)

Now consider what happens if you run the same makeblastdb command on this sequence (which just switches the first two lines of sequence1):

>sequence2
ttttttagaaaaattatttttaagaatttttcattttaggaatattgtta
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
tttcagaaaatagctaaatgtgatttctgtaattttgcctgccaaattcg
tgaaatgcaataaaaatctaatatccctcatcagtgcgatttccgaatca
gtatatttttacgtaatagcttctttgacatcaataagtatttgcctata
tgactttagacttgaaattggctattaatgccaatttcatgatatctagc
cactttagtataattgtttttagtttttggcaaaactattgtctaaacag

Although this sequence has the exact same proportion of As, Cs, Gs, Ts, and Ns, it does not produce the error message. What about the following sequence?

>sequence3
nnnac
ttttttagaaaaattatttttaagaatttttcattttaggaatattgtta
tttcagaaaatagctaaatgtgatttctgtaattttgcctgccaaattcg
tgaaatgcaataaaaatctaatatccctcatcagtgcgatttccgaatca
gtatatttttacgtaatagcttctttgacatcaataagtatttgcctata
tgactttagacttgaaattggctattaatgccaatttcatgatatctagc
cactttagtataattgtttttagtttttggcaaaactattgtctaaacag

Well, surprise surprise, this sequence produces the error message again (though it now tells you that the first line 'is about 60% ambiguous nucleotides'). You will still see the same message even if you added 1 billion As, Cs, Gs, and Ts on to the end of sequence 3. This seems to be the code responsible for the error message (taken from this page):

In case it wasn't obvious, here is why this annoys me:

  1. The comment in the code indicates that this should be treated as a warning (less serious), but then the message starts with a prefix of 'Error' (more serious). So it's an warning and an error?
  2. It only considers the first line of sequence data. I appreciate that this is easiest thing to check, but it is not very useful if all of your ambiguous bases are at the end of the sequence (or anywhere past the first line).
  3. What is the rationale for choosing 40% as the threshold for warning? It seems a little too arbitrary.
  4. It produces this warning if the first line at least 40% ambiguous and if it also has a length greater than 3 bp! This means that it can be triggered with a line that starts 'NNNAC' as in my sequence3 example above.
  5. It considers all ambiguity codes as being equal. So if I switched my first line of sequence3 from NNNAC to RWBAC, it would still be rejected even though the sequence RWB contains much more information than NNN.
  6. The way the output text bluntly says 'shouldn't be over 40%' comes across as very matter-of-fact, as if you've transgressed some unknown law of bioinformatics.

So here are my suggestions for an alternative (which admittedly requires some more coding):

  1. If a sequence is less than 1,000 bp check all of the sequence, otherwise check the first 1,000 bp of sequence (if not more).
  2. Report the output as a warning and not an error.
  3. Change the warning message. E.g. 'The first 1,000 bp of your sequence contains a high proportion (X%) of ambiguous bases. Such sequences may not be very useful for any downstream analysis that you perform with BLAST+.'

101 questions with a bioinformatician #7: Holly Bik

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Holly Bik is currently a Postdoctoral Researcher in Jonathan Eisen’s lab at the UC Davis Genome Center…but not for much longer! Sadly (for us), Holly will soon be leaving Davis to take up a Faculty position in the School of Biosciences at the University of Birmingham, UK.

As a Birmingham Fellow in Bioinformatics, she will no longer be saying things such as "Dude, it's like totally, hella hot" (this is how we all talk in California), and will instead be referring to the weather in the correct British vernacular "I say, one finds this rain jolly bracing". As Curry Capital of Britain she will also be required (under Birmingham law) to dramatically increase her intake of onion bhajjis,  aloo gobi, and peshawari naans (something that sadly seem to have been outlawed in Northern California).

During her time at UC Davis, Holly has been working on PhyloSift, a software pipeline for the phylogenetic analysis of genomes and metagenomes. She has also been working on many other things. You can find out more about Holly by following her on twitter (@hollybik) or by visiting www.hollybik.com. And now, on to the 101 questions...

 

 

001. What's something that you enjoy about current bioinformatics research?

Given my biology background, my favorite aspect of any bioinformatics project is interacting with people from different disciplines (project personnel, and/or talking to people at meetings and on Twitter). I learn something new about computing, software, and/or hardware pretty much during every project.

I’m always astounded by how technology and computers have progressed since I got my first personal computer (way back in 1996), and how we’re now leveraging this computing power in conjunction with deep DNA sequencing technologies to address fundamental scientific questions. The power of bioinformatics is really incredible when you stop and think about it!

 

010. What's something that you *don't* enjoy about current  bioinformatics research?

The lack of documentation for a lot of software packages, and to a lesser extent, encountering uninformative error messages when trying to run command line software that’s supposedly designed for researchers. Both can be a prohibitive barrier to testing out different tools that may actually be extremely useful and informative for your own research. I think there’s a reason that QIIME has become such a powerhouse package for microbiome research — biologists have access to a suite of tutorials, test datasets, and they can boot up the software easily as a Virtual Machine or Amazon Cloud instance.

However, the easiest tools to install and use are not necessarily the best to use for your particular research questions. I read so many software papers describing exciting new software (where the authors usually all come from computer science departments), but when I visit the website I find no useable instructions or run into insurmountable errors when trying to install or execute the code. As a biologist, no one ever sat down and taught me the nitty gritty about makefiles or compiling source code; people that publish software shouldn’t assume their users have a computer science degree. Most computational biologists will make a valiant effort to overcome such problems, but at some point you have to do a cost/benefit analysis of whether persevering is worth your time. I’m only going to spend two days trying to install your software if I think its really really worth it, but in most cases I’ll probably decide that it isn't (and so no citation for you).

 

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

FILE MANAGEMENT. Read up on the best practices for data management, and start forcing yourself to develop good habits NOW. I guarantee that in 5 years' time you will not remember what data you saved in 'analyses.txt'.

 

100. What's your all-time favorite piece of bioinformatics software, and why?

I’m going to be shamelessly biased: my favorite software ever is Phinch, a data visualization framework that I’ve been developing in collaboration with Pitch Interactive — a data visualization studio in Berkeley, CA. We’re using solid software engineering and design principles to build exploratory, interactive visualization tools for scientists. And because the visualizations are built in 3D, the user interface is absolutely gorgeous! Who says you can’t create art when doing bioinformatics research?

 

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

I’m a gap (. or -), because I’m mysterious and don’t like to be classified!

Survey results: The extent of gender bias in bioinformatics

I have completed an analysis of my survey that attempted to see whether there is notable gender bias among bioinformaticians. Thank you to the 370 people that completed the survey! A few things to note:

  1. All survey responses are available on Figshare (in tab-separated value format). Anyone else can come along and play with this data, and maybe ask more intelligent questions about it than I did.
  2. My detailed analysis of these responses is also on Figshare as a separate document.
  3. The original Google survey form remains available (also see my blog post about it). If people continue to complete the survey, I will update the main data file on Figshare.

I encourage people to read the full document on Figshare. Because of the high response to this survey, I had enough data to compare gender bias at different career stages, and also between different countries (for a small number of countries).

I'll leave you with just one result from my analysis. I had asked people to identify their current career position, and  I offered 10 possible career stages as answers:

  1. Currently pursuing undergraduate degree (with focus on bioinformatics/genomics
  2. Undergraduate level position in academia or industry  (e.g. Research officer / Junior specialist)
  3. Currently pursuing postgraduate qualification (with focus on bioinformatics/genomics)
  4. Postgraduate level position (e.g. Research assistant). MSc or PhD required for role.
  5. Postdoctoral scholar / Fellow / Research Associate
  6. Lecturer / Instructor/ Senior Fellow / Project Scientist (3+ years post-PhD research experience)
  7. Assistant Professor / Reader / Senior Lecturer (5+ years post-PhD research experience)
  8. Associate or Full Professor / Team Leader (7+ years post-PhD research experience)
  9. Senior Professorial role (e.g. head of a department, 10+ years post-PhD research experience)
  10. Super Senior role (e.g. Dean of a school or CEO, 15+ years post-PhD research experience)

Because these categories are a little bit subjective, and because some of the categories (levels 1, 9, and 10) had the least number of responses, I decided to smooth the data by combining adjacent categories. I.e. 1&2, 2&3, etc.

So this is what the percentage of male and female bioinformaticians looks like with respect to progress through their scientific career:

Things start off looking quite equitable but proceed to diverge around the time that people are becoming Associate Professors. However, the situation is more complex than this (see Figure 3 in my full analysis).

Can Twitter help us find out the gender ratio of bioinformaticians?

I'm still collecting survey results to try to understand the extent of gender bias in bioinformatics. I plan to publish an analysis of these results next week and I'll also share all of the the raw survey results via Figshare (in case anyone else wants to dive deeper).

One thing that is hard to accurately know is just what the gender ratio is across everyone who identifies themselves as a bioinformatician. A survey that is trying to ask something about gender bias no doubt introduces its own bias in the types of people who would be interested in completing such a survey.

But maybe Twitter can be of use in trying to determine a 'background' gender ratio among bioinformaticians. The evidence is hardly conclusive, but there are some data that suggests that more women use twitter than men. There's also data that there are comparable numbers of male/female users. In any case, numbers of users doesn't tell the whole story. Other research shows that, on average, men have  15% more followers than women, and a tool called Twee-Q that tries to identify the likely gender of twitter users, finds that men tend to be retweeted almost twice as often as women.

Despite gender biases in how people use twitter, it might still be useful to see what the gender ratio is of people who follow bioinformatics-type accounts. This is something that twitter can show you at analytics.twitter.com. However, this only seems to be enabled on accounts that have a certain number of followers. Here is what the results looks like for the @assemblathon twitter account (click to enlarge):

So twitter identifies — presumably using some sort of gender-guessing-algorithm — that 82% of  the followers are male. I'd love to see what other results look like for other bioinformatics twitter accounts. However, I think it is a better test if the accounts in question are themselves gender-neutral. I.e. affiliated to a resource or institution. If you run a bioinformatics-related twitter account that is gender-neutral, and if you can access analytics.twitter.com, I'd love it you could share your results with me (via comments below or on twitter @kbradnam).

101 questions with a bioinformatician #6: Mario Caccamo

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Mario Caccamo is the director of The Genome Analysis Centre, a BBSRC-sponsored research institute focused on genomics and computational biology. You may know of this institute by its shortened name 'TGAC' (of course, this is not the only place where you will see this initialism). As Director, Mario's role is to "ensure TGAC is equipped with the resources and people to deliver good science".

Mario's skills as a bioinformatician are matched only by his prowess on the volleyball court. When he used to play for the Informatics volleyball team at the Wellcome Trust Sanger Institute, he deservedly earned the nickname Super Mario. You can find out more about Mario by following him on twitter (@mcaccamo).

 

001. What's something that you enjoy about current bioinformatics research?

I see bioinformatics as a branch of molecular biology. I love the elegance of molecular biology. Research in bioinformatics is about capturing the beauty of biology in abstractions that can help us to discover new knowledge. Beauty here means complexity, optimisation, economy (in terms of information content) and functionality (among other things). One of the most exciting things about molecular biology is how young it is as a science. Our colleagues working in bioinformatics are the Newtons and the Keplers of molecular biology — so much to be done and discovered. 

 

010. What's something that you *don't* enjoy about current  bioinformatics research?

The flip side is that we still don’t understand enough about the basic building blocks in molecular biology. The language we use to describe biological systems and processes is incomplete. We struggle with issues that sometimes look simple. What did we know about epigenetic modifications 10 years ago for instance? Little compared to what we know today. Bioinformaticians struggle with the incompleteness of the underlying basic knowledge and keep re-inventing the wheel leading to frustration. Perhaps these are growing pains — but pains nevertheless.

 

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

My advice would be: “This is good enough. Let it go.” Recognising when you are in the land of diminishing returns is a skill that should be taught at school. This is particularly relevant for bioinformatics. You can always close more gaps, find the missing gene or remove another false positive — you can do this for the next 10 years. My recommendation is...don't. Another perhaps more mundane recommendation is to learn either gawk, Perl one-liners, or some of the basic Unix command line tools to manipulate strings and text; they will give you the data you need for your best presentation the night before your talk. 

 

100. What's your all-time favorite piece of bioinformatics software, and why?

It has to be HMMER — a beautiful super efficient piece of software. I know that we shouldn’t use a hammer for all kind of different nails but somehow HMMER manages to prove that advice wrong. You can HMMER so many nails with this hammer.

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

I think I would take W. I like W as a strange letter (no explanation for that) — but it is A or T, alpha or omega in the nucleotide alphabet.

A new JABBA award for a particularly bogus bioinformatics acronym

JABBA is an acronym for 'Just Another Bogus Bioinformatics Acronym'. JABBA awards are given when people publish bioinformatics tools that have names which are tenuously derived acronyms (or initialisms). There have been many JABBA award winners, but this one might take some beating. This software tool was published earlier this year in the journal 'Microarrays'. Here is the title of the journal article: 

Pigeons: A Novel GUI Software for Analysing and Parsing High Density Heterologous Oligonucleotide Microarray Probe Level Data

So now the million dollar question is…what is 'Pigeons' an acronym of? Please sit down as  you might feel faint when you read this:

Photographically InteGrated En-suite for the OligoNucleotide Screening

Wow. This rates an 11 on the Bogus Scale (a scale that only goes up to 10).

What is the extent of gender bias in bioinformatics? Please help me find out.

I've been drawing up a short-list of people to interview for my 101 questions with a bioinformatician series, and I've realized that this list is skewed towards males (maybe 2:1). This partly reflects my own biases in choosing people that I know through work and from people that I follow on twitter. 

However, it probably also reflect underlying biases in the bioinformatics field as a whole. The existence of gender biases is STEM subjects is hardly a new concept (see here or here for some recent studies into this area) and anyone who follows Jonathan Eisen's blog will know that there is an all-too-common bias towards male speakers at scientific meetings. In a great blog post from last year (The Magnifying Glass Ceiling: The Plight of Women in Science), Jane Hu discusses the topic of gender bias in science. I encourage everyone to read this post, but I'll highlight one sentence here (emphasis mine):

It is true that women are underrepresented…but not because women aren’t interested in it or can’t handle the work.

Although projects like Girls Who Code and App Camp for Girls are doing a great job at increasing female participation in some STEM subjects, these projects will not help remove the discrimination against women that occurs later in their careers. Fortunately, other fantastic projects like Tools for Change: Boosting the retention of women in the STEM pipeline are helping raise awareness about these problems, and are offering solutions (e.g. encouraging more family friendly policies).

So I'm curious as to the extent of gender bias in bioinformatics. Please help me find out more by completing the really short form (below) and feel free to share this form with others (the Google form can be accessed separately via this link). I will report on the results in a future blog post. Also, I will make more effort to address any gender biases in my 101 questions with a bioinformatician series.


Ewan Birney's EBI press conference on being elected to the Royal Society

Speaker: And that concludes this EBI press conference to congratulate Ewan Birney on being elected to the Royal Society. We just have time for one or two questions. Ah okay...the first question goes to…Ewan Birney.

Ewan: Hi Ewan. Just wanted to say that this is all great and I've found your work to be really interesting. Can I just ask whether you've looked at the opportunity of widening this effort by joining other Royal Societies as well? This would allow for a much better comparative analysis of the scope and impact of Royal Society members? The Royal Statistical Society may be a good choice to begin with, or maybe the Royal Society of Marine Artists.

Ewan: Thanks Ewan, that's a really good question. It is something that I'm considering and I think there is a lot to gain from such a comparative approach. But to do this properly I think it needs to be part of a much larger effort. So I'm hopeful of trying to join every Royal Society and then see what can be learned from a cross-societal analysis of such memberships. Furthermore I'm hopeful that Her Majesty could be persuaded to start a new Royal Society for the Promotion of Questions by People Named Ewan at Academic Conferences…something that is very near and dear to my heart.

Speaker: Okay, I think we have time for just one more question. Oh, Ewan…again.

Ewan: Just to follow up Ewan, given the advanced age of many Royal Society members, have you thought about trying to assess what fraction of the Royal Society is functional?

Ewan: That's a fantastic question Ewan, very perceptive of you. This is something else that I have a strong interest in. I am currently involved in some preliminary discussions with various people to form a new pan-European working group that will investigate how much of the Royal Society is functional. This effort will hopefully be called ENCODEMBLIXIR…or something snappy like that. 

 

Jesting aside, congratulations Ewan this is great news!

101 questions with a bioinformatician #5: Laura Clarke

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Laura Clarke is the Project Coordinator for Resequencing Informatics, part of the Vertebrate Genomics team led by Paul Flicek at EMBL-EBI. Before joining the EBI, she was applying her considerable bioinformatics skills at the Wellcome Trust Sanger Institute (a move ranked #1 on the annual list of Easiest-employers-to-transition-between). 

Her role sees her help with the analysis and coordination of high throughput genomics efforts such as the 1000 Genomes project, BLUEPRINT (deciphering the epigenome of blood cells), and HipSci (the Human Induced Pluripotent Stem Cells Initiative). If you're wondering what this actually entails, I'll hand you over to Laura:

"This work boils down to making sure that data gets into and out of the sequence archives; running primary analysis and QC; and then making sure the resulting analysis makes it out to the community".

You can find out more about Laura by following her on twitter (@laurastephen), and of course you can also follow @blueprint_eu and @hipsci. And now, on to the 101 questions...

 

 

001. What's something that you enjoy about current bioinformatics research?

The possibility. With modern sequencing technologies, computation techniques have the ability to draw together these new data types and massive volumes of data, allowing us to get much closer to a proper understanding of cellular biology, which of course brings us closer to understanding organismal biology.

Add to that the diverse range of species being sequenced and what that can teach us about evolution and the forces which drive evolution.

That is of course before you consider how it might impact medicine or food security or any real world applications.

 

010. What's something that you *don't* enjoy about current  bioinformatics research?

Extracting data from people. My life would be easier if people weren't so begrudging about sharing data and describing the data they do share well. I work with many people who do share data freely and easily but there are still too many people who are too reticent or reluctant to make data publicly available from within a consortium.

 

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

For data coordination purposes we produce a lot of tab-delimited text files, cut is a wonderful Unix command for making those easier to work with and manipulate, learning about cut sooner would have at least made mucking about with various types of GFF files easier I suspect.

 

100. What's your all-time favorite piece of bioinformatics software, and why?

I have to say I did enjoy pairedends.com, very funny

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

R: this is because Adenine and Guanine are the same molecule type (purines) as both Theobromine and Caffeine, both of which are quite important to me and at least influence my personality.