Kablammo: an interactive, web-based BLAST results visualizer →

December 11, 2014 by Keith Bradnam

Another great name for a piece of bioinformatics software! This tool has just been published in the journal Bioinformatics by Jeff Wintersinger and James Wasmuth. From the abstract:

Kablammo is a web-based application that produces interactive, vector-based visualizations of sequence alignments generated by BLAST. These visualizations can illustrate many features, including shared protein domains, chromosome structural modifications, and genome misassembly.

101 questions with a bioinformatician #20: Roy Chaudhuri

December 11, 2014 by Keith Bradnam

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

Roy Chaudhuri is a Lecturer in Bioinformatics in the Department of Molecular Biology and Biotechnology at the University of Sheffield, and is part of the Sheffield Bioinformatics Hub. Roy's expertise concerns the comparative genomics and phylogenetics of bacterial pathogens and in a previous life he helped set up the coliBASE and xBASE databases. In a previous-previous life he was also a pioneering website designer (I shouldn't judge: people in glass houses… and all that).

He claims that his current duties involve "research, teaching, publishing, and trying to convince people to give me money". If you would like to give Roy money (perhaps a £1 donation towards his Eccles Cake fund?), you can get in contact with him via the Sheffield Bioinformatics Hub website. You can also find out more about Roy by following him on twitter (@RoyChaudhuri)…but be warned, he is a non-stop tweeter! And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

I like that after 16 years as a bioinformatician, I'm still learning new things every day, and that there's no shortage of cool datasets and interesting problems to keep me busy. I also like how far it's possible to get by knowing a little bit of biology and a little Perl.

010. What's something that you don't enjoy about current bioinformatics research?

I worry that too much community effort has been devoted to dealing with problems that are specific to short-read data. I'd like to think that in five years' time sequencing will just work, and we will be able to devote our time to dealing with biological quirks rather than technical ones. I'm pretty sure I said the same thing five years ago, though.

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Most of my advice wouldn't be work-related, but I'd certainly mention that the clock starts ticking on potential fellowship opportunities as soon as you get your PhD. I definitely missed the starting gun on that one.

100. What's your all-time favorite piece of bioinformatics software, and why?

I'll go for Prokka, because it does an astonishingly good job at annotating bacterial genomes (better than many manual attempts...), and because Torsten wrote the book (well, blog post) on creating usable command-line bioinformatics tools. I particularly like that it checks for its dependencies at the start, rather than choking half-way through, and that it sometimes finishes with a quote from the Hitchhiker's Guide to the Galaxy.

Other than that, I'm a big fan of MUMmer, and I'm always impressed by how many different things it's possible to achieve by stringing two or three SAMtools commands together. If non-bioinformatics-specific software counts, then I'd also mention GNU Parallel, Perl and UNIX itself.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

M, because it's Not K.

Searching for sausage rolls: using Google Scholar to look at the popularity of British culinary delights

December 10, 2014 by Keith Bradnam

Sometimes it can be fun to search Google Scholar for words or phrases that you might not expect to ever appear in the title of an academic article. So last night, I conducted an important scientific study and looked at the popularity of various quintessential items of British cuisine:

Among all of the foods that I searched for, Fish and Chips proved the most popular item with 52 results. Most of these are articles talking about Fluorescent In Situ Hybridization (FISH) and Chromatin IP (ChIP) experiments.
The next most popular item was the healthy delight that is Black Pudding. This was represented by 9 results, one of which is this gem from the British Medical Journal: Controlled prospective study of faecal occult blood screening for colorectal cancer in Bury, black pudding capital of the world.
Sticking with puddings, I looked at the popularity of what is unquestionably the King of all puddings: the Yorkshire pudding. This has just 3 search results and one of these appears to be a gripping thriller: Rheological Study of Batter Dough for Yorkshire Pudding Production.
There were only 3 results for pork pies but they include the wonderful title of a PhD thesis from the University of Nottingham: Storage changes in pork pies (a real page turner!).
The beloved Cornish Pasty also merits just 3 results including a paper in the Journal of Genetics and Development that sounds bizarre: A modified cornish pasty method for ex ovo culture of the chick embryo.
There are 3 mentions for Spotted Dick, one of which seems to be a zinc-finger protein in Drosophila.
The humble Sausage roll gets only a solitary mention (a piece in New Scientist titled Sausage-roll science: the battle of the buffet).
Last, but not least, is a dish that often confuses people who are not from the UK. The delightful Toad in the hole was something that I thought would never feature at all in this list. It only merits 1 result, but what a result! The article in question is something from a 1924 issue of The Boston Medical and Surgical Journal titled: The Toad in the Hole Circumcision—A Surgical Bugbear.

Updated: 2014-12-10: includes addition of 'Spotted Dick' thanks to reader @MattBashton.

Not all bioinformatics software names have to be acronyms! →

December 10, 2014 by Keith Bradnam

From the most recent issue of the journal Bioinformatics, there is a bioinformatics tool with a delightful name that is unique, pronounceable, and fun. Most refreshingly, it is not an acronym:

MindTheGap: integrated detection and assembly of short and long insertions

B-O-G-U-S

December 09, 2014 by Keith Bradnam

There's a lot of it about at the moment…

MAGeCK: Model-based Analysis of Genome-wide CRISPR/Cas9 Knockout

Slightly bogus.

PrEMeR-CG: Probabilistic Extension of Methylated Reads at CpG resolution

Somewhat bogus.

ODIN: One-stage DIffereNtial peak caller

Bogus.

Gustaf: Generic mUlti-SpliT Alignment Finder

Definitely bogus.

iPro54-PseKNC: Predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition

I'm not sure if 'PseKNC' is meant to be pronounced 'sequence'? I'm also not sure if the 'Pro' refers to 'prokaryote' or 'promoters'. Likewise, I'm not sure if the 'i' is for 'identifying' or is just an Apple-style brand prefix. Finally, I'm not even sure if this is meant to be an acronym at all. But it looks bogus to me.

Tales of drafty genomes: part 1 — The Human Genome

December 08, 2014 by Keith Bradnam

One of my recent blog posts discussed this new paper in PLOS Computational Biology:

Extensive Error in the Number of Genes Inferred from Draft Genome Assemblies

There has also been a lot of chatter on twitter about this paper. Here is just part of an exchange that I was involved in yesterday:

The issue of what is or isn’t a draft genome — and whether this even matters — is something on which I have much to say. It’s worth mentioning that there are a lot of draft genomes out there: Google Scholar reports that there are 1,440 articles that mention the phrase ‘draft genome’ in their title [1]. In the first part of this series, I’ll take a look at one of the most well-studied genome sequences in existence…the human genome.

The most famous example of a draft genome is probably the ‘working draft’ of the human genome that was announced — with much fanfare — in July 2000 [2]. At this time, the assembly was reported as consisting of “overlapping fragments covering 97 percent of the human genome”. By the time the working draft was formally published in Nature in January 2001, the assembly was reported as covering “about 94% of the human genome” (incidentally, this Nature paper seems to be the first published use of the N50 statistic).
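For anyone who has not met N50 before: it is the length L such that sequences of length L or longer together contain at least half of all the bases in an assembly. Here is a minimal Python sketch of that definition. The function name and the contig lengths are made up for illustration and are not taken from the Nature paper.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards, accumulating bases
    # until we have covered at least half of the assembly.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical contig lengths (invented for this example).
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))
```

With these invented lengths the total is 300, and the two longest contigs (80 + 70) already reach half of that, so the sketch reports 70.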

On April 14, 2003 the National Human Genome Research Institute and the Department of Energy announced the “successful completion of the Human Genome Project” (emphasis mine). This was followed by the October 2004 Nature paper that discussed the ongoing work in finishing the euchromatic portion of the human genome[3]. Now, the genome was being referred to as ‘near-complete’ and if you focus on the euchromatic portion, it was indeed about 99% complete. However, if you look at the genome as a whole, it was still only 93.5% complete [4].

Of course the work to correctly sequence, assemble, and annotate the human genome has never stopped, and probably won’t stop for some time yet. As of October 14, 2014, the latest version of the human genome reference sequence is GRCh38.p1[5], lovingly maintained by the Genome Reference Consortium (GRC). The size of the human genome has increased just a little bit compared to the earlier publications from a decade ago[6], but there are still several things that we don’t know about this ‘complete/near-complete/finished’ genome. Unknown bases still account for 5% of the total size (that’s over 150 million bp). Furthermore, there are almost 11 million bp of unplaced scaffolds that are still waiting to be given a (chromosomal) home. Finally, there remain 875 gaps in the genome (526 spanned gaps and 349 unspanned gaps[7]).
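The numbers in the last two paragraphs are easy to sanity-check. The short Python sketch below uses only figures quoted in this post (the sizes in footnotes 4 and 6, plus the gap counts); the variable names are my own, and the 5% figure is treated as the rounded value given above rather than an exact count of unknown bases.

```python
# Figures quoted in the post (footnotes 4 and 6, plus the gap counts).
euchromatic_gb = 2.88          # euchromatic genome size, 2004 paper
whole_genome_gb = 3.08         # overall genome size, 2004 paper
assembly_bp = 3_212_670_709    # total bp in the latest assembly
spanned_gaps, unspanned_gaps = 526, 349

# Fraction of the whole genome covered by the 2004 euchromatic sequence.
fraction_2004 = euchromatic_gb / whole_genome_gb
print(f"2004 coverage of whole genome: {fraction_2004:.1%}")

# The post rounds unknown bases to 5%, so this is only an estimate.
unknown_bp_estimate = 0.05 * assembly_bp
print(f"about 5% of the assembly: {unknown_bp_estimate / 1e6:.0f} million bp")

print(f"total gaps: {spanned_gaps + unspanned_gaps}")
```

The ratio comes out at roughly 93.5%, the two gap categories sum to 875, and 5% of the assembly is comfortably over 150 million bp.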

If we leave aside other problematic issues in deciding what a reference genome actually is, and what it should contain[8], we can ask the simple question: is the current human genome a draft genome? Clearly I think everyone would say ‘no’. But what if I asked: is the current human genome complete? I’m curious how many people would say ‘yes’ and how many people would ask me to first define ‘complete’.

Here are some results for how many hits you get when Googling for the following phrases:

251,000 — finished human genome sequence
171,000 — “almost complete” human genome sequence
69,400 — “near complete” human genome sequence
26,200 — “essentially complete” human genome sequence

Scientists and journalists don’t help the situation by maybe being too eager to overhype the state of completion of the human genome[9]. In conclusion, the human genome is no longer a draft genome, but it is still just a little bit drafty. More on this topic of drafty genomes in part 2!

[1] There are 1,570 if you don’t require the words ‘draft’ and ‘genome’ to be together in the article title.
[2] The phrase ‘working draft’ had been in use since at least late 1998.
[3] There is also the entire batch of chromosome-specific papers published between 2001 and 2006.
[4] This percentage is based on the following line in the paper: “The euchromatic genome is thus ~2.88 Gb and the overall human genome is ~3.08 Gb”
[5] This is the first patch update to version 38 of the reference sequence.
[6] There are 3,212,670,709 bp in the latest assembly.
[7] The GRC defines the two categories as follows:

Spanned gaps are found within scaffolds and there is some evidence suggesting linkage between the two sequences flanking the gap. Unspanned gaps are found between scaffolds and there is no evidence of linkage.
[8] Remember, human genomes are diploid and not only vary between individuals but can also vary from cell-to-cell. The idea of a ‘reference’ sequence is therefore a nebulous one. How much known variation do you try to represent (the GRC represents many alternative loci)? How should a reference sequence represent things like ribosomal DNA arrays or other tandem repeats?
[9] Jonathan Eisen wrote a great blog post on this: Some history of hype regarding the human genome project and genomics

Recent changes to the ACGT blog

December 07, 2014 by Keith Bradnam

There are probably some published rules for how to write blog posts and I imagine that one of those rules might be don't blog about your blog. Oh well…

Over the last few months I've been making lots of tweaks to this site. Most of them have been subtle changes in order to make the pages less cluttered and more aesthetically pleasing — e.g. did you notice that I moved the 'About Blog Contact' menu bar ~20 pixels closer to the top of the page as it was just a little bit too lonely where it used to be? Aside from these minor cosmetic alterations, I've also made some more substantial changes:

You can no longer see tags for individual blog posts (but I have kept the tag cloud on the About page so you can still find all posts that share the same tag).
On the same About page, I added an option to let people subscribe to the blog via email (maximum one email per day).
Just about every post listed on the main page of this site used to have a 'Read More' link which you needed to click in order to read the entire article. I'm now restricting this only to my longer posts.
I've disabled comments from the blog; partly because I wasn't getting a lot of them but mostly because of the many excellent reasons stated on the @avoidcomments Twitter account.
I've started including occasional 'link blog' posts. These are short posts — often just a paragraph or two — that typically comment on a paper or on another blog post. The titles of link blog posts are themselves links to the source paper/blog post, and include a little arrow to indicate this. E.g. see here, here, or here.

As a final note, I'd like to thank everyone who has tweeted or otherwise spread the word about this blog. Aside from churning out more posts about bioinformatics tools with bogus names, I do want to make a bit more of an effort to write about some more substantive issues in bioinformatics. Stay tuned!

Ten recommendations for software engineering in research →

December 05, 2014 by Keith Bradnam

The list of recommendations in this new GigaScience paper by Janna Hastings et al. is not aimed at bioinformatics in particular, but many bioinformaticians would benefit from reading it. I can particularly relate to this suggestion:

Document everything
Comprehensive documentation helps other developers who may take over your code, and will also help you in the future. Use code comments for in-line documentation, especially for any technically challenging blocks, and public interface methods. However, there is no need for comments that mirror the exact detail of code line-by-line.
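As a rough illustration of that advice (my own example, not one from the GigaScience paper), here is a small Python function with a docstring on its public interface and a comment that explains a design choice rather than restating the code. The function and constant names are hypothetical.

```python
# Built once at import time so that each call can use str.translate
# instead of looking up every base inside a Python-level loop.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence.

    Only A, C, G and T (in either case) are complemented; any other
    character, such as N, is passed through unchanged.
    """
    return seq.translate(_COMPLEMENT)[::-1]


# By contrast, a comment that merely mirrors the code adds nothing:
#   i = i + 1   # add one to i

print(reverse_complement("ACGTN"))
```

The docstring tells a future reader what happens to ambiguous bases, which is exactly the kind of detail that is painful to rediscover by reading the code.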

3 word summary of new PLOS Computational Biology paper: genome assemblies suck

December 05, 2014 by Keith Bradnam

Okay, I guess a more accurate four word summary would be 'genome assemblies sometimes suck'.

The paper that I'm referring to, by Denton et al., looks at the problem of fragmentation (of genes) in many draft genome assemblies:

Extensive Error in the Number of Genes Inferred from Draft Genome Assemblies

As they note in their introduction:

Our results suggest that low-quality assemblies can result in huge numbers of both added and missing genes, and that most of the additional genes are due to genome fragmentation (“cleaved” gene models).

One section of this paper looks at the quality of different versions of the chicken genome and CEGMA is one of the tools they use in this analysis. I am a co-author of CEGMA, and reading this paper brought back some memories of when we were also looking at similar issues.

In our 2nd CEGMA paper we tried to find out why 36 core genes were not present in the v2.1 version of the chicken genome (6.6x coverage). Turns out that there were ESTs available for 29 of those genes, indicating that they are not absent from the genome, just from the genome assembly. This led us to find pieces of these missing genes in the unanchored set of sequences that were included as part of the set of genome sequences (these often appear as a 'ChrUn' FASTA file in genome sequence releases).

Something else that we reported on in our CEGMA paper is that, sometimes, a newer version of a genome assembly can actually be worse than what it replaces (at least in terms of genic content). Version 1.95 of the Ciona intestinalis genome contained several core genes that subsequently disappeared in the v2.0 release.

In conclusion — and echoing some of the findings of this new paper by Denton et al.:

Many genomes are poorly assembled
Many genomes are poorly annotated (often a result of the poor assembly)
Newer versions of genome assemblies are not always better