The Take-Home Message takes on genome size
The latest issue of our Take-Home Message comic poses an interesting question:
What if body size equalled genome size?
This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.
Gene Myers is a Director at the Max-Planck Institute for Molecular Cell Biology and Genetics (MPI-CBG) and the Klaus-Tschira Chair of the Center for Systems Biology Dresden (CSBD).
Maybe you've heard of Gene for his pivotal role in developing the Celera genome assembler, which led to genome assemblies for mouse, human, and Drosophila (the first whole genome shotgun assembly of a multicellular organism). You may also know Gene from his work in helping develop a fairly obscure bioinformatics tool that no one uses (just the 58,000 citations in Google Scholar).
His current research focuses on developing new methods for microscopy and image analysis; from his research page:
"The overarching goal of our group is to build optical devices, collect molecular reagents, and develop analysis software to monitor in as much detail as possible the concentration and localization of proteins, transcripts, and other entities of interest within a developing cohort of cells for the purpose of [developing] a biophysical understanding of development at the level of cell communication and force generation."
You can find out more about Gene by visiting his research page on the MPI-CBG website or by following him on Twitter (@TheGeneMyers). Finally, if you are interested in genome assembly then you may also want to check out his dazzlerblog ('The Dresden AZZembLER for long read DNA projects'). And now, on to the 101 questions...
001. What's something that you enjoy about current bioinformatics research?
The underlying technology is always changing and presenting new challenges, and the field is still evolving and becoming more "sophisticated". That is, there are still cool unsolved problems to explore despite the fact that some core aspects of the field, now in its middle-age in my view, are "overworked".
010. What's something that you don't enjoy about current bioinformatics research?
I'm really bored with networks and -omics. Stamp collecting large parts lists seems to have become the norm despite the fact that it rarely leads to much mechanistic insight. Without an understanding of spatial organization and soft-matter physics, most important biological phenomena cannot be explained (e.g. AP axis orientation at the outset of worm embryogenesis).
Additionally, I was disgusted with the short-read DNA sequencers that, while cheap, produce truly miserable reconstructions of novel genomes. Good only for resequencing and digital gene expression/transcriptomics. Thank God for the recent emergence of the long-read machines.
011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?
At age 18 it's not so much about career specifics but one's general approach to education. For myself, I would have said, "go to class, knucklehead, and learn something from all the great researchers that are your teachers (instead of hanging out in your dorm room reading textbooks)", and for general advice to all at that stage I would say, learn mathematics and programming now while your mind is young and supple; you can acquire a large corpus of knowledge about biological processes later.
100. What's your all-time favorite piece of bioinformatics software, and why?
I don't use bioinformatics software, I make it :-) My favorite problem, not yet fully solved in my opinion, is DNA sequence assembly -- it is a combinatorially very rich string problem.
101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?
N — as it encompasses all the rest :-)
In the latest issue of The Take-Home Message — the semi-regular web comic that discusses news and issues from the world of biology — we try to explain the concept of monotypic taxons:
This may make the aardvark one of the loneliest mammals on the planet…taxonomically speaking of course
Yesterday was the Annual Student Conference at The Institute of Cancer Research, London. As part of the ICR's Communications team, we helped run a session about the myriad ways that science can (and should) be communicated more effectively.
During this session my colleague Rob Dawson (@BioSciFan on Twitter) introduced a fun tool called The Up-Goer Five Text Editor. This tool lets you edit text…but only by using the 1,000 most common words in the English language.
It was inspired by an XKCD comic which used the same approach to try to explain how an Apollo moon rocket works. Using this tool really makes you appreciate that just about every scientific word you might use is not on the list. So it is a good way of making you think about how to communicate science to a lay audience, completely free of jargon.
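If you are curious how a check like this might work, here is a minimal Python sketch. It is my own illustration and not the tool's actual code: the word list is a tiny hypothetical stand-in for the real list of 1,000 words, and the function name `find_disallowed` is made up for this example.

```python
import re

# Hypothetical stand-in for the tool's real word list (about 1,000 entries);
# these few words are for illustration only.
ALLOWED_WORDS = {"the", "plan", "is", "written", "in", "every", "cell"}


def find_disallowed(text, allowed=ALLOWED_WORDS):
    """Return the words in text that are not in allowed, in first-seen order."""
    flagged = []
    # Lowercase the text and pull out runs of letters (and apostrophes) as words.
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in allowed and word not in flagged:
            flagged.append(word)
    return flagged


print(find_disallowed("The plan is written in DNA"))
```

With this small list, only 'dna' is flagged. A real checker would presumably also need to cope with plurals and other word forms, which this sketch ignores.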
I thought I would have a go at explaining genomics. I couldn't even use the words 'science', 'machine', or 'blueprint' (let alone 'gene', 'DNA', or 'molecule'). Here is my attempt:
In every cell of our bodies, there is a written plan that explains how that cell should make all of the things that it needs to make. A cell that grows hair is very different to a cell that is in your heart or brain. However, all cells still have the same plan but different parts of the plan are turned on in different cells.
We first understood what the full plan looks like for humans in 2003. We can use computers to make sense of the plan and to learn more about how many parts are needed to make a human (about 20,000). The better we understand the plan, the more we might be able to make human lives better.
You can edit my version online but I encourage people to try explaining their own field of work using this tool.
This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.
Keith Robison is a Senior Bioinformatics Scientist at a small biotechnology company based in Cambridge, Massachusetts. His employer has an interest in the natural products drug discovery space and as Keith puts it, his own work concerns 'Assembling and analyzing actinomycete genomes to reveal their biosynthetic moxie'.
If you didn't already know — and shame on you if that is the case — Keith writes about developments in sequencing technologies (and other topics) on his Omics! Omics! blog. This is required reading for anyone interested in trying to understand the significance of the regular announcements made by various companies that develop sequencing technologies. In particular, his analysis of news coming out from the annual AGBT conference is typically detailed and insightful.
You can find out more about Keith by reading his aforementioned blog or by following him on twitter (@OmicsOmicsBlog). A special thanks to Keith for waiting patiently on me to get this interview posted! And now, on to the 101 questions...
001. What's something that you enjoy about current bioinformatics research?
All sorts of re-thinking how to do things — productive ways to look at old problems. Look at all the exciting improvements in assembly coming from long reads, or alignment-free RNA-Seq and metagenomics. Cool stuff.
010. What's something that you don't enjoy about current bioinformatics research?
Too many papers that report a new program without adequate benchmarking or a clear description of what differentiates the program — is it really different, or just old wine in new bottles?
011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?
Wow. I didn't dabble in bioinformatics until I was 19. I think my advice would be to try out a new programming language every other year. I've gotten a lot of mileage out of a few languages, but even learning a new one (that I subsequently drop) productively influences my programming.
100. What's your all-time favorite piece of bioinformatics software, and why?
My favorite bioinformatics software was the original WWW interface to FlyBase: first, because I wrote it as a lark; second, because FlyBase paid me to support it after I showed it off; and third, because it's one of the few programs of mine that ever had an explicit sunset!
101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?
M — Methionine is good at getting things started (KRB: yes I know, Methionine is not an IUPAC nucleotide character…but that was the given answer to the question).
All good things come to an end…and, more importantly, all bad things come to an end. For that reason, I have, with some sadness, decided to bring my series of JABBA awards (Just Another Bogus Bioinformatics Acronym) to an end.
This is partly because my new job means that I am no longer a bioinformatician. It is also partly because it seems that the flood of bogus bioinformatics acronyms will never cease.
I've tried campaigning to raise awareness of why these acronyms are often awkward, tenuous, and generally unhelpful to the wider community. Hopefully, I've made some of you think about naming your software just a little bit.
I can't go without presenting you with a bumper crop of recently minted JABBA awards…
Kicking us off, from BMC Bioinformatics we have a paper titled SPARTA: Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis. This is not excessively bogus, but omitting any contributions to the acronym from the words 'reference-based bacterial' is what earns this tool a JABBA award. Oh, and don't confuse this tool with the 2014 bioinformatics tool called sPARTA. Who would make that mistake?
From Nucleic Acids Research we have the following paper…DIDA: A curated and annotated digenic diseases database. DIDA is derived from DIgenic diseases Database. Never a good sign when an acronym only takes letters from two of the three words. Also never a good sign when there is a completely different piece of bioinformatics software that uses the same name (although I think the software mentioned here may have been around first).
From Genome Research we have a new paper: SCRaMbLE generates designed combinatorial stochastic diversity in synthetic chromosomes. SCRaMbLE is derived from Synthetic Chromosome Rearrangement and Modification By LoxP-mediated Evolution. They really could have just gone for 'SCRAMBLE' (all upper-case) as it would be a legitimate acronym. However, my dislike of this name is because it is just a little too tenuous. Oh, and the fact that there is already a bioinformatics tool called Scramble.
Next up, from the journal BMC Bioinformatics we have NEAT: a framework for building fully automated NGS pipelines and analyses. NEAT derives from NExt generation Analysis Toolbox. Leaving aside the general issue of how I feel about NGS as a phrase, this name is bogus for taking two letters from 'next' and none from 'generation'. Oh, and there is also a bioinformatics tool called NeAT.
From the journal Bioinformatics we have…ParTIES: a toolbox for Paramecium interspersed DNA elimination studies. Let's break that acronym down: PARamecium Toolbox for Interspersed dna Elimination Studies. As I've always said, ain't no party like a Paramecium party.
Okay, are you sitting down? Next we have a paper published in the journal Bioinformatics. The title is going to make you very curious…SUPER-FOCUS: a tool for agile functional analysis of shotgun metagenomic data. Surely nobody was seriously going to try to make an acronym called SUPER-FOCUS? Oh wait they have…SUbsystems Profile by databasE Reduction using FOCUS. Any time you have the word 'database' as part of your software name and you choose to use just the last letter of this word…that's a bogus acronym, or should I say SUPER-BOGUS?
That is your lot. I reserve the right to maybe come back with one more JABBA-related post to present my top 10 JABBA awards. I'll end with a brief summary of the advice that I've tried to impart many times before:
Time for a new JABBA award. This one comes from a group based at the Department of Statistics in Oxford, UK. The paper was published recently in the journal Bioinformatics:
I think they're trying a little too hard to make a clever acronym here:
ANARCI: Antigen receptor Numbering And Receptor ClassificatIon
It's fun but just a little too tenuous for my liking, and so it merits a JABBA award.
On my personal blog, I finally spill the beans about my new job (which starts today)!
I'll just reiterate here that I plan for the ACGT blog to continue, but maybe it will change a little (with more posts on science communication and fewer posts overall). I'll have a better idea of what I want to do with this site in the next few weeks.
On November 20th, on the last day of my employment at UC Davis, I gave an exit seminar. Jenna Gallegos, a PhD student at UC Davis — who works on the awesome Intron-Mediated Enhancement (IME) project under the supervision of Alan Rose — posted several tweets from my talk including this photo of one of my slides:
In case anyone is wondering how difficult it is to assemble a #genome here's a great analogy by @kbradnam #kexit pic.twitter.com/Gh6qPNzVGT
— Jenna E Gallegos (@FoodBeerScience) November 20, 2015
This tweet continued to generate interest (retweets, likes, and mentions) for most of the 20th November and for many subsequent days afterwards. The latest retweet of this tweet was today: 16 days after the original tweet! I find this amazing especially as the original slide deals with the topic of genome assembly. At the time of writing the tweet has had 369 retweets and 277 likes.
I'm pleased that people have found my jigsaw analogy useful. Some people commented that this isn't the best possible analogy and pointed out various ways that it could be more technically accurate (including suggestions of shredding copies of books and trying to piece together the original).
While I accept that this isn't the most scientific way of depicting the many problems and challenges of genome assembly, it is hopefully an accessible way of illustrating the problem. Nearly everyone has tried putting a jigsaw together, but not everyone has tried reconstituting a shredded book. My exit seminar was aimed at a very broad audience and so I pitched this slide accordingly.
People can follow Jenna on twitter (@FoodBeerScience) and should, at the very least, check out her awesome twitter bio. If you want to know more about her work, here is a recent review of IME that she wrote:
A new paper has been published in the journal BMC Research Notes:
This name is:
The example included with LASER installation can be run as:
./quast.py testdata/contigs1.fasta testdata/contigs2.fasta \
    -R testdata/reference.fasta.gz -G testdata/genes.txt \
    -O test_data/operons.txt
The output of LASER program can be viewed in file: ./quast_results/latest/report.txt
So to run LASER just type 'quast'!