101 questions with a bioinformatician #38: Gene Myers

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Gene Myers is a Director at the Max-Planck Institute for Molecular Cell Biology and Genetics (MPI-CBG) and the Klaus-Tschira Chair of the Center for Systems Biology Dresden (CSBD).

Maybe you've heard of Gene for his pivotal role in developing the Celera genome assembler which led to genome assemblies for mouse, human, and drosophila (the first whole genome shotgun assembly of a multicellular organism). You may also know Gene from his work in helping develop a fairly obscure bioinformatics tool that no-one uses (just the 58,000 citations in Google Scholar).

His current research focuses on developing new methods for microscopy and image analysis; from his research page:

"The overarching goal of our group is to build optical devices, collect molecular reagents, and develop analysis software to monitor in as much detail as possible the concentration and localization of proteins, transcripts, and other entities of interest within a developing cohort of cells for the purpose of [developing] a biophysical understanding of development at the level of cell communication and force generation."

You can find out more about Gene by visiting his research page on the MPI-CBG website or by following him on Twitter (@TheGeneMyers). Finally, if you are interested in genome assembly then you may also want to check out his dazzlerblog ('The Dresden AZZembLER for long read DNA projects'). And now, on to the 101 questions...



001. What's something that you enjoy about current bioinformatics research?

The underlying technology is always changing and presenting new challenges, and the field is still evolving and becoming more "sophisticated". That is, there are still cool unsolved problems to explore despite the fact that some core aspects of the field, now in its middle-age in my view, are "overworked".



010. What's something that you don't enjoy about current bioinformatics research?

I'm really bored with networks and -omics. Stamp collecting large parts lists seems to have become the norm despite the fact that it rarely leads to much mechanistic insight. Without an understanding of spatial organization and soft-matter physics, most important biological phenomena cannot be explained (e.g. AP axis orientation at the outset of worm embryogenesis).

Additionally, I was disgusted with the short-read DNA sequencers that, while cheap, produce truly miserable reconstructions of novel genomes. Good only for resequencing and digital gene expression/transcriptomics. Thank God for the recent emergence of the long-read machines.



011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

At age 18 it's not so much about career specifics but one's general approach to education. For myself, I would have said, "go to class, knucklehead, and learn something from all the great researchers that are your teachers (instead of hanging out in your dorm room reading textbooks)", and for general advice to all at that stage I would say: learn mathematics and programming now while your mind is young and supple; you can acquire a large corpus of knowledge about biological processes later.



100. What's your all-time favorite piece of bioinformatics software, and why?

I don't use bioinformatics software, I make it :-) My favorite problem, not yet fully solved in my opinion, is DNA sequence assembly -- it is a combinatorially very rich string problem.



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

N — as it encompasses all the rest :-)

How would you describe genomics without using any scientific jargon?

Yesterday was the Annual Student Conference at The Institute of Cancer Research, London. As part of the ICR's Communications team, we helped run a session about the myriad ways that science can (and should) be communicated more effectively.

During this session my colleague Rob Dawson (@BioSciFan on Twitter) introduced a fun tool called The Up-Goer Five Text Editor. This tool lets you edit text…but only by using the 1,000 most common words in the English language.

It was inspired by an XKCD comic which used the same approach to try to explain how an Apollo moon rocket works. Using this tool really makes you appreciate that just about every scientific word you might use is not on the list. So it is a good way of making you think about how to communicate science to a lay audience, completely free of jargon.

I thought I would have a go at explaining genomics. I couldn't even use the words 'science', 'machine', or 'blueprint' (let alone 'gene', 'DNA', or 'molecule'). Here is my attempt:

In every cell of our bodies, there is a written plan that explains how that cell should make all of the things that it needs to make. A cell that grows hair is very different to a cell that is in your heart or brain. However, all cells still have the same plan but different parts of the plan are turned on in different cells.

We first understood what the full plan looks like for humans in 2003. We can use computers to make sense of the plan and to learn more about how many parts are needed to make a human (about 20,000). The better we understand the plan, the more we might be able to make human lives better.

You can edit my version online but I encourage people to try explaining their own field of work using this tool.

101 questions with a bioinformatician #37: Keith Robison

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Keith Robison is a Senior Bioinformatics Scientist at a small biotechnology company based in Cambridge, Massachusetts. His employer has an interest in the natural products drug discovery space and as Keith puts it, his own work concerns 'Assembling and analyzing actinomycete genomes to reveal their biosynthetic moxie'.

If you didn't already know — and shame on you if that is the case — Keith writes about developments in sequencing technologies (and other topics) on his Omics! Omics! blog. This is required reading for anyone interested in trying to understand the significance of the regular announcements made by various companies that develop sequencing technologies. In particular, his analysis of news coming out from the annual AGBT conference is typically detailed and insightful.

You can find out more about Keith by reading his aforementioned blog or by following him on twitter (@OmicsOmicsBlog). A special thanks to Keith for waiting patiently on me to get this interview posted! And now, on to the 101 questions...



001. What's something that you enjoy about current bioinformatics research?

All sorts of re-thinking how to do things — productive ways to look at old problems. Look at all the exciting improvements in assembly coming from long reads, or alignment-free RNA-Seq and metagenomics. Cool stuff.



010. What's something that you don't enjoy about current bioinformatics research?

Too many papers that report a new program without adequate benchmarking or a clear description of what differentiates the program — is it really different, or just old wine in new bottles?



011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Wow. I didn't dabble in bioinformatics until I was 19. I think my advice would be to try out a new programming language every other year — I've gotten a lot of mileage out of a few languages, but even learning a new one (that I subsequently drop) productively influences my programming.



100. What's your all-time favorite piece of bioinformatics software, and why?

My favorite bioinformatics software was the original WWW interface to FlyBase — first: because I wrote it as a lark, second: FlyBase paid me to support it after I showed it off, and third: because it's one of the few programs of mine that ever had an explicit sunset!



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

M — Methionine is good at getting things started (KRB: yes I know, Methionine is not an IUPAC nucleotide character…but that was the given answer to the question).

The last ever awards for Just Another Bogus Bioinformatics Acronym (JABBA)

All good things come to an end…and, more importantly, all bad things come to an end. For that reason, I have, with some sadness, decided to bring my series of JABBA awards (Just Another Bogus Bioinformatics Acronym) to an end.

This is partly because my new job means that I am no longer a bioinformatician. It is also partly because it seems that the flood of bogus bioinformatics acronyms will never cease.

I've tried campaigning to raise awareness of why these acronyms are often awkward, tenuous, and generally unhelpful to the wider community. Hopefully, I've made some of you think about naming your software just a little bit.

I can't go without presenting you with a bumper crop of recently minted JABBA awards…

  1. Kicking us off, from BMC Bioinformatics we have a paper titled SPARTA: Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis. This is not excessively bogus, but omitting any contributions to the acronym from the words 'reference-based bacterial' is what earns this tool a JABBA award. Oh, and don't confuse this tool with the 2014 bioinformatics tool called sPARTA. Who would make that mistake?

  2. From Nucleic Acids Research we have the following paper…DIDA: A curated and annotated digenic diseases database. DIDA is derived from DIgenic diseases Database. Never a good sign when an acronym only takes letters from two of the three words. Also never a good sign when there is a completely different piece of bioinformatics software that uses the same name (although I think the software mentioned here may have been around first).

  3. From Genome Research we have a new paper: SCRaMbLE generates designed combinatorial stochastic diversity in synthetic chromosomes. SCRaMbLE is derived from Synthetic Chromosome Rearrangement and Modification By LoxP-mediated Evolution. They really could have just gone for 'SCRAMBLE' (all upper-case) as it would be a legitimate acronym. However, my dislike of this name is because it is just a little too tenuous. Oh, and the fact that there is already a bioinformatics tool called Scramble.

  4. Next up, from the journal BMC Bioinformatics we have NEAT: a framework for building fully automated NGS pipelines and analyses. NEAT derives from NExt generation Analysis Toolbox. Leaving aside the general issue of how I feel about NGS as a phrase, this name is bogus for taking two letters from 'next' and none from 'generation'. Oh, and there is also a bioinformatics tool called NeAT.

  5. From the journal Bioinformatics we have…ParTIES: a toolbox for Paramecium interspersed DNA elimination studies. Let's break that acronym down: PARamecium Toolbox for Interspersed dna Elimination Studies. As I've always said, ain't no party like a Paramecium party.

  6. Okay, are you sitting down? Next we have a paper published in the journal Bioinformatics. The title is going to make you very curious…SUPER-FOCUS: a tool for agile functional analysis of shotgun metagenomic data. Surely nobody was seriously going to try to make an acronym called SUPER-FOCUS? Oh wait they have…SUbsystems Profile by databasE Reduction using FOCUS. Any time you have the word 'database' as part of your software name and you choose to use just the last letter of this word…that's a bogus acronym, or should I say SUPER-BOGUS?

That is your lot. I reserve the right to maybe come back with one more JABBA-related post to present my top 10 JABBA awards. I'll end with a brief summary of the advice that I've tried to impart many times before:

  1. Not all software needs to have an acronym…you could choose to call your novel transcriptome validator 'Keith' rather than tenuously coming up with KEITH: Kmer-Enriched Inspection of Transcript deptH.
  2. Preferably, do not name your acronym after animals…especially when your software has no connection with that animal.
  3. Check: has anyone else used that name before? Search Google with your intended name plus the word 'bioinformatics'.
  4. Check: is your name pronounceable? Tell the name to your parents over the phone and ask them if they can spell it correctly.
  5. Check: are you using random capitalisation to be cool (or 'KeWL' even)? Will other people who reference your software likely bother to use the italicised superscript font that you unwisely chose to use for every other letter in your software name?
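
For anyone who wants to sanity-check a name before publishing, here is a minimal Python sketch of the 'initial letters only' test that most of the awards above come down to. It is only an illustration: the function name `is_plain_acronym` and the `OPTIONAL_WORDS` list are my own assumptions, hyphenated terms such as 'LoxP-mediated' are treated as a single word, and small joining words are allowed to contribute a letter or not. It says nothing about whether a name is duplicated, pronounceable, or tenuous.

```python
import re

# Small joining words whose initial letter may be used or skipped.
# This list is an assumption of the sketch, not an established rule.
OPTIONAL_WORDS = {"a", "an", "and", "by", "for", "in", "of", "the", "to"}


def is_plain_acronym(name: str, expansion: str) -> bool:
    """Return True if `name` is built only from the initial letters of the
    words in `expansion`, in order, ignoring capitalisation."""
    # Keep only the letters of the name, so 'SCRaMbLE' becomes 'SCRAMBLE'.
    letters = re.sub(r"[^A-Za-z]", "", name).upper()

    pattern = ""
    for word in expansion.split():  # whitespace split keeps hyphenated terms whole
        clean = word.strip(".,;:()'\"")
        if not clean or not clean[0].isalpha():
            continue
        initial = clean[0].upper()
        # Joining words become optional letters in the pattern.
        pattern += initial + ("?" if clean.lower() in OPTIONAL_WORDS else "")

    return re.fullmatch(pattern, letters) is not None


examples = {
    "BLAST": "Basic Local Alignment Search Tool",
    "SPARTA": "Simple Program for Automated reference-based bacterial "
              "RNA-seq Transcriptome Analysis",
    "NEAT": "NExt generation Analysis Toolbox",
    "KEITH": "Kmer-Enriched Inspection of Transcript deptH",
}

for name, expansion in examples.items():
    verdict = "plain acronym" if is_plain_acronym(name, expansion) else "JABBA candidate"
    print(f"{name}: {verdict}")
```

Under these assumptions an all-caps 'SCRAMBLE' would pass (using the initials of 'and' and 'By'), whereas SPARTA, NEAT, and the hypothetical KEITH would all be flagged.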

ANARCI in the UK: time for a new JABBA award

Time for a new JABBA award. This one comes from a group based at the Department of Statistics in Oxford, UK. The paper was published recently in the journal Bioinformatics:

I think they're trying a little too hard to make a clever acronym here:

ANARCI: Antigen receptor Numbering And Receptor ClassificatIon

It's fun but just a little too tenuous for my liking, and so it merits a JABBA award.

And the award for the most-retweeted-tweet-of-a-photo-of-a-slide-from-a-presentation-of-mine goes to…

On November 20th, on the last day of my employment at UC Davis, I gave an exit seminar. Jenna Gallegos, a PhD student at UC Davis — who works on the awesome Intron-Mediated Enhancement (IME) project under the supervision of Alan Rose — posted several tweets from my talk including this photo of one of my slides:

This tweet continued to generate interest (retweets, likes, and mentions) for most of 20th November and for many days afterwards. The latest retweet of this tweet was today: 16 days after the original tweet! I find this amazing, especially as the original slide deals with the topic of genome assembly. At the time of writing the tweet has had 369 retweets and 277 likes.

I'm pleased that people have found my jigsaw analogy useful. Some people commented that this isn't the best possible analogy and pointed out various ways that it could be more technically accurate (including suggestions of shredding copies of books and trying to piece together the original).

While I accept that this isn't the most scientific way of depicting the many problems and challenges of genome assembly, it is hopefully an accessible way of illustrating the problem. Nearly everyone has tried putting a jigsaw together, but not everyone has tried reconstituting a shredded book. My exit seminar was aimed at a very broad audience and so I pitched this slide accordingly.

People can follow Jenna on twitter (@FoodBeerScience) and should, at the very least, check out her awesome twitter bio. If you want to know more about her work, here is a recent review of IME that she wrote:

Finding bogus bioinformatics acronyms sometimes requires a laser-like focus

A new paper has been published in the journal BMC Research Notes:

This name is:

  1. Bogus — the word 'genome' doesn't contribute any letters to 'LASER' and two letters ('S' and 'R') are not derived from the initial letters of words.
  2. Duplicated — there are at least two other bioinformatics tools called LASER (see here and here).
  3. Undiscoverable — you really need to search Google for LASER genome assembly before you see this as a top result.
  4. Ambiguous — large is a very subjective term. The authors imply that LASER is suitable for human genomes. These are larger than some genomes but smaller than others.
  5. Inconsistent — the paper reveals that LASER is built on the code of QUAST (Quality Assessment Tool for Genome Assemblies). This means you end up with the somewhat bizarre documentation for how to run the program called LASER:

The example included with LASER installation can be run as:

./quast.py testdata/contigs1.fasta testdata/contigs2.fasta \
    -R testdata/reference.fasta.gz -G testdata/genes.txt \
    -O test_data/operons.txt

The output of LASER program can be viewed in file: ./quast_results/latest/report.txt

So to run LASER just type 'quast'!

Learn my Linux Bootcamp…all from within a web browser window

I awoke yesterday to see a lot of twitter notifications on my phone. Sometimes this happens when I've written a post on this blog, but I hadn't added anything for over a week. Turns out that the activity was triggered by this tweet by Richard Smith-Unna (@blahah404 on twitter):

As the screenshot below indicates, Richard has worked some amazing black magic to enable a single browser window to contain a fully interactive terminal as well as a file viewer/navigator; all alongside a (slightly modified) version of my original Linux bootcamp material.

This new interactive command-line bootcamp is a wonderful resource and means that the only barrier to learning some simple, but powerful, Linux/Unix commands is the availability of a web browser.

Richard explains a little about how he put all of this together:

The Infrastructure, including adventure-time and docker-browser-server, was built by @maxogden and @mafintosh. The setup of this app was based on the get-dat adventure.

Slides from my exit seminar

This morning I gave my last presentation at UC Davis. My highly informal exit seminar was a great opportunity to reflect on some of the many projects I've been involved with over the last decade here at Davis. Thank you to all who came, and a special thanks to Ian Korf for his kind introduction.

I include the slides below, but note that some of these slides won't make much sense without the narration (and you also get to miss out on two embedded videos). There was some video recorded via the Periscope app, but I found out today that Periscope only keeps video around for 24 hours, so unfortunately if you didn't watch the video when you had the chance it is now lost.


2015-11-21 12.34: Updated to reflect that Periscope video content is no longer available.

JABBA vs Jabba: when is software not really software?

It was only a matter of time I guess. Today Simon Cockell (@sjcockell) alerted me to a new publication, a book chapter titled:

From the abstract:

Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data

Now as far as I can tell, this Jabba is not an acronym, so we safely avoid the issue of presenting a JABBA award for Jabba. However, one might argue that naming any bioinformatics software 'Jabba' is going to present some problems because this is what happens when you search Google for 'Jabba bioinformatics'.

There is a bigger issue with this paper that I'd like to address though. It is extremely disappointing to read a bioinformatics software paper in the year 2015 and not find any explicit link to the software. The publication includes a link to http://www.ibcn.intec.ugent.be, but only as part of the author details. This web page is for the Internet Based Communication Networks and Services research group at the University of Gent. The page contains no mention of Jabba, nor does their 'Facilities and Tools' page, nor does searching their site for Jabba.

Initially I wondered if this paper is more about the algorithm behind Jabba (equations are provided) and not about an actual software implementation. However, the paper includes results from their Jabba tool in comparison to another piece of software (LoRDEC) and includes details of CPU time and memory requirements. This suggests that the Jabba software exists somewhere.

To me this is an example of 'closed science' and represents a failure of whoever reviewed this article. I will email the authors to find out if the software exists anywhere…it's a crazy idea but maybe they might be interested if people could, you know, use their software.

Update 2015-11-20: I heard back from the authors…the Jabba software is on GitHub.

ACGT is now AFCW (Approved for Free Cultural Works): thoughts on switching to a CC-BY license

This website, as well as my personal website and Rescued by Code, licenses material under a Creative Commons license. Specifically, I've been using the Attribution Non-Commercial license, popularly known as CC BY-NC. My joint venture with Abby Yu, The Take-Home Message web comic, has been even more restrictive and has been licensing content under the Attribution Non-Commercial Share-Alike license (CC BY-NC-SA).

This choice of licenses is something that's been on my mind for a while. I've known that I'm not being as open as I could be and maybe this has stemmed from an unwarranted (not to mention unlikely) fear that someone would take all my blog posts and somehow seek to profit from them.

Today I saw a tweet by Rogier Kievit (@rogierK) that has helped me change my mind:

I found the third link — something that is now over a decade old — particularly persuasive and accordingly I have switched all of my website licenses to CC-BY. Apparently this means that all of my writings now fall into the category of Free Cultural Works. I am grateful to Abby Yu for agreeing to this change for The Take-Home Message.

This change also means that someone can now use my blog posts to write the definitive book on JABBA-awards…just as long as they give me appropriate attribution.

101 questions with a bioinformatician #36: Alicia Oshlack

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Alicia Oshlack is the Head of Bioinformatics at the Murdoch Childrens Research Institute (they don't like apostrophes) in Melbourne, Australia. Her research focuses on four main project areas: methods for analysing RNA-seq data, epigenomics, clinical genomics data analysis, and cancer genomics.

Before moving into the field of genomics, Alicia had a background in astronomy and her Ph.D. work concerned the structure of radio quasars. Not many bioinformaticians can claim to have published papers on the topic of estimating the mass of black holes!

You can find out more about Alicia by reading her Wikipedia page or by following her on twitter (@AliciaOshlack). I also encourage you to check out her must-read article for fellow computational biologists: A 10-step guide to party conversation for bioinformaticians. And now, on to the 101 questions...



001. What's something that you enjoy about current bioinformatics research?

I love the pace at which things are changing in the field. There is always something new to work on and there are so many ways to contribute something useful to the research community. I also really love the balance between collaborative analysis on really interesting biological problems and doing careful methods development.



010. What's something that you don't enjoy about current bioinformatics research?

I get frustrated that I need to spend so much of my time convincing people that bioinformatics is a real scientific research discipline where we have deep scientific training and use our brains to solve scientific problems. Hopefully I will have convinced everyone in Australia soon.



011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

I did my PhD in astrophysics and I often wonder if I would have been better off doing a more relevant subject but I really appreciate the skills I learnt doing that. Within this I probably would tell myself to put a bit more focus on programming and do statistics instead of applied mathematics.



100. What's your all-time favorite piece of bioinformatics software, and why?

I think limma is amazing. Have you seen the user's guide? I think it's 145 pages long and although it was originally developed for microarray analysis more than 12 years ago it has adapted to the sequencing revolution and is used more than ever now. I believe it is the most widely used Bioconductor analysis package ever.



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

I think S = G/C because I'm always a little bit biased.

Beautiful logo redesign as part of the rebranding of Crossref

Crossref — the non-profit organization that helps make academic content easier to find, link, cite and assess — has today announced a rebranding. They will be announcing new names and new logos for all of their products, and the Crossref logo itself gains a beautiful looking new design. So we say 'goodbye' to this:

 

And 'hello' to this lovely logo:

 

The explanation for why they wanted to change the logo makes a lot of sense to me:

We needed an icon to give more flexibility across the web that a word mark cannot do alone. The icon is made up of two interlinked angle brackets familiar to those who work with metadata, and can also act as arrows depicting Metadata In and Metadata Out, two themes under which our services can generally be grouped.

As part of this rebranding, they are formalizing a change from CrossRef to Crossref (with lower case 'R'). Someone had a fun job updating their Wikipedia page:

Wikipedia edit history: CrossRef > Crossref.

Assemble a genome and evaluate the result [Link]

There is a new page on the bioboxes site (such a great name!) which details how bioboxes can be used to assemble a genome and then evaluate the results:

A common task in genomics is to assemble a FASTQ file of reads into a genome assembly and followed by evaluating the quality of this assembly. This recipe will explore using bioboxes to do this task.

A third Assemblathon contest came very close to launching earlier this year…except that it didn't — maybe this will be the subject of a future blog post! — and we planned to make biobox containers a requisite part of submitting assemblies. If Assemblathon 3 ever gets off the ground I feel happier knowing that the bioboxes team is doing so much great work that will make running such a contest easier to manage.

Time to toggle the JABBA-award status of this bioinformatics software name

Give me a B.
Give me a O.
Give me a G.
Give me a U.
Give me a S.

What have you got?

Another BOGUS bioinformatics acronym! This time it is courtesy of the journal BMC Bioinformatics:

I think you can already see why this one is going to win a JABBA award. The name 'TOGGLE' derives from TOolbox for Generic nGs anaLysEs. Using their same strategy, they could have also gone for BOGGLE, BONNY, or even BORINGLY.

How to ask for bioinformatics help online

Part two of a two-part series.

In part one I covered where to ask for bioinformatics help. Now it is time to turn to the issue of how you should go about asking for help. Hat tip to reader Venu Thatikonda (@nerd_yie) for pointing me to this 2011 PLOS Computational Biology article that covers similar ground to this blog post. Here are my five main suggestions, with the last one being further broken down into 9 different tips:

  1. Be polite. Posting a question to an online forum does not mean that you deserve to be answered. If people do answer, they are usually doing so by giving up their own free time to try to help you. Don't berate people for their answers, or insult them in any way.
  2. Be relevant. Choose the right forum in which to ask your question. Sites like SEQanswers have different forums that discuss particular topics, so don't post your PacBio question in the Ion Torrent forum.
  3. Be aware of the rules. Most online forums will have some rules, guidelines, and/or an FAQ which covers general posting etiquette and other things that you should know. It is a good idea to check this before posting on a site for the first time.
  4. Be clever. Search the forum before asking your question; there is often a good chance that your question has already been asked (and answered) by others.
  5. Be helpful. The biggest thing you can probably do in order to get a useful answer to a question is to provide as many useful details as possible. These include:
    1. Type of operating system and version number, e.g. Mac OS X 10.10.5.
    2. Version number/name of software tool(s) you are using, e.g. NCBI BLAST+ v2.2.26, Perl v5.18.2 etc. A good bioinformatics or Unix tool will have a -v, -V, or --version command-line option that will give you this information.
    3. Any error message that you saw. Report the full error message exactly as it appeared.
    4. Where possible, provide steps that would let someone else reproduce the problem (assuming it is reproducible).
    5. Outline the steps that you have tried, if any, to fix the problem. Don't wait for someone to suggest 'quit and restart your terminal' before you reply 'Already tried that'.
    6. A description of what you were expecting to happen. Some perceived errors are not actually errors at all (the software was doing exactly what was asked of it, though this may not be what the user was expecting).
    7. Any other information that could help someone troubleshoot your problem, e.g. a listing of your Unix terminal before and/or after you ran a command which caused a problem.
    8. A snippet of your data that would allow others to reproduce the problem. You may not be able to upload data to the website in question, but small data snippets could be shared via a Dropbox or Google Drive link, or on sites like GitHub Gist.
    9. Attach a screenshot that illustrates the problem. Many forum sites allow you to add image files to a post.
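
The first two tips (operating system and tool versions) are easy to automate. The following is a rough Python sketch that gathers those details into text you can paste into a forum post. The helper names `tool_version` and `environment_report` are hypothetical, and the tools and flags in the example call (`perl -v`, `blastn -version`) are just illustrations; swap in whatever software your question concerns and whichever version flag it supports.

```python
import platform
import shutil
import subprocess


def tool_version(command: str, flag: str = "--version") -> str:
    """Return the first line a tool prints for its version flag,
    or a short note if the tool cannot be found or run."""
    if shutil.which(command) is None:
        return "not found on PATH"
    try:
        result = subprocess.run(
            [command, flag], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired) as error:
        return f"could not run ({error})"
    # Some tools print version details to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "no version output"


def environment_report(tools) -> str:
    """Build a plain-text summary of the OS, Python, and tool versions.

    `tools` is a list of (command, version_flag) pairs.
    """
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {platform.python_version()}",
    ]
    for command, flag in tools:
        lines.append(f"{command}: {tool_version(command, flag)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example tools only; replace with the software you are asking about.
    print(environment_report([("perl", "-v"), ("blastn", "-version")]))
```

Pasting this output alongside the exact error message covers tips 1 to 3 in one go, and a tool reported as 'not found on PATH' is often a useful clue in itself.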

Any other suggestions?

 

Updates

2015-11-08 09.44: Added link to PLOS Computational Biology article