A new tool to boost the N50 length of your genome assembly

We all know that the most important aspect of any genome assembly is the N50 length of its contigs or scaffolds. Higher N50 lengths are clearly correlated with increases in assembly quality and any good bioinformatician should be looking to maximize the N50 length of any assembly they are making.

I am therefore pleased that I can today announce the release of a new software tool, N50 Booster!!! that can help you increase the N50 length of an existing assembly. This tool was written in C for maximum computational efficiency and then reverse engineered into Perl for maximum obfuscation.

This powerful software is available as a Perl script (n50_booster.pl) that can be downloaded from our lab's website. The only requirement for this script is the FAlite.pm Perl module (also available from our lab's website).

Before I explain how this script works to boost an assembly's N50 length, I will show a real-world example. I ran the script on release WS230 of the Caenorhabditis japonica genome assembly:

$ n50_booster.pl c_japonica.WS230.genomic.fa

Before:
==============
Total assembly size = 166256191 bp
N50 length = 94149 bp

Boosting N50...please wait

After:
==============
Total assembly size = 166256191 bp
N50 length = 104766 bp

Improvement in N50 length = 10617 bp

See file c_japonica.WS230.genomic.fa.n50 for your new (and improved) assembly

As you can see, N50 Booster!!! not only makes a substantial increase to the N50 length of the C. japonica assembly, it does so while preserving the assembly size. No other post-assembly manipulation tool boasts this feature!

The n50_booster.pl script works by creating a new FASTA file based on the original (but which includes a .n50 suffix) and ensures that the new file has an increased N50 length. The exact mechanism by which N50 Booster!!! works will be evident from an inspection of the code.

I am confident that N50 Booster!!! can give your genome assembly a much needed boost and the resultant increase in N50 length will lead to a much superior assembly which will increase your chances of a publication in a top-tier journal such as the International Journal of Genome Assembly or even the Journal of International Genome Assembly.

Update: 2014-04-08 09.44 — I wrote a follow up post to this one which goes into more detail about how N50 Booster!!! works and discusses what people could (and should) do to the shortest sequences in their genome assemblies.

Why I think buzzword phrases like 'big data' are a huge problem

There are many phrases in bioinformatics that I find annoying because they can be highly subjective:

  • Short-read mapping
  • Long-read technology
  • High-throughput sequencing

One person's definition of 'short', 'long', or 'high' may differ greatly from someone else's. Furthermore, our understanding of these phrases changes with the onwards march of technological innovation. Back in 2000 'short' meant 16–20 bp whereas in 2014, 'short' can mean 100–200 bp.

The new kid on the block, which is not specific to bioinformatics, is 'big data'. Over the last week, I've been helping with a NIH grant application entitled Courses for Skills Development in Biomedical Big Data Science. This grant mentions the phrase thirty-nine times so it must be important. Why do I dislike the phrase so much? Here is why:

  1. Even within a field like bioinformatics, it's a subjective term and may not mean the same thing to everyone.
  2. Just as the phrases 'next-generation' and 'second-generation' sequencing inspired a set of clumsy and ill-defined successors (e.g. '2.5th generation', 'next-next-next generation' etc.), I expect that 'big data' might lead to similar language atrocities being committed. Will people start talking about 'really big data' or 'extremely large data' to distinguish themselves from one another?
  3. This term might be subjective within bioinformatics, but it probably much more subjective when used across different scientific disciplines. In astronomy there are space telescopes that are producing petabytes of data. In the field of particle physics, the Data Center at the Wigner Research Centre for Physics processes one petabyte of data per day. If you work for the NSA, then you may well have exabytes of data lying around.

I joked about the issue of 'big data' on twitter:

My Genome Center colleague Jo Fass had a great comment in response to this:

This is an excellent point. When people talk about the challenges of working with 'big data', it really depends on how well your infrastructure is equipped to deal with such data. If your data is readily accessible and securely backed up, then you may only be working with 'data' and not 'big data'.

In another post, I will suggest that the issue for much of bioinformatics is not 'big data' per se but 'obese data', or even 'grotesquely obese data'. I will also suggest a sophisticated computational tool that I call Operational Heuristics for Management of Your Grotesquely Obese Data (OHMYGOD), but which you might know as rm -f.

101 questions with a bioinformatician #0: Ian Korf

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

 

Ian Korf is the Associate Director of the UC Davis Genome Center and interim manager of the Genome Center Bioinformatics Core facility. He also mentioned that "According to the cake I was given after getting tenure, my title is also 'ASS PRO' (there wasn't enough room to write Associate Professor"). He also asks to "Please just call me Ian. If you call me Dr. Korf, I'll think you're talking about my dad (the in-famous Mycologist)".

You can find out more about Ian from the Korf Lab website and from his new SuperScience and Sorcery blog. Ian is also on twitter as @iankorf. Careful though, you might find his constant tweeting a bit distracting.

And now, on to the 101 questions...

 

001. What's something that you enjoy about current bioinformatics research?

The constant innovation. There's are so many problems to solve and so much cleverness out there.


010. What's something that you *don't* enjoy about current bioinformatics research?

The constant innovation. Many people are inventing the same thing and giving it a different name. Redundancy is sometimes unavoidable (but also sometimes useful).


011. If you could go back in time and visit yourself as a 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career? 

Be nicer to everyone. Science is a social activity. When choosing between being correct and being agreeable, I generally tend towards correctness. Sometimes it's necessary to beat someone with the club of correctness (sounds like a Munchkin card), but more and more I think it's better to walk a little off the optimal path if you can walk there with someone.


100. What's your all-time favorite piece of bioinformatics software? 

BLAST, for many reasons. It has great historical importance and is still relevant today. It has a lot of educational value because it has a mixture of rigorous theory and rational heuristics.


101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

That website has only 15 (U is for RNA and gaps do not symbolize letters - see Karlin-Altschul statistics). But to answer the question, I think B, because B is a lot of things, but not A.

101 Questions: a series of interviews with notable bioinformaticians

I've noticed recently that more and more bioinformaticians are talking about what it means to be a bioinformatician, and — in an ideal world — what sort of training bioinformaticians of the future should be getting. Mick Watson's Opiniomics blog in particular has a lot of great material on these subjects. One of his blog posts that talks about the disconnect between actual bioinformatics skills a student/postdoc might have and the expectation of those skills from their trainer/supervisor makes for sober reading (especially if you see all of the comments).

So I thought it might be instructive to ask a simple series of questions to a bunch of notable bioinformaticians to assess their feelings on the current state of bioinformatics research, and maybe get any tips they have about what has been useful to their bioinformatics careers.

I will attempt to keep this post updated with links to all of the interviews that I have published:

List of interviews

Next-generation sequencing must die (part 3) — a tale of two titles

This morning I came across two new papers. Compare and contrast:

  1. De Novo Assembly and Annotation of Salvia splendens Transcriptome Using the Illumina Platform — Ge et al. PLOS ONE
  2. RepARK—de novo creation of repeat libraries from whole-genome NGS reads — Koch et al. Nucleic Acids Research

The former paper lets me know that the research is based on a specific sequencing technology whereas the latter paper is possibly suggesting that the RepARK tool might work with any 'NGS' data.

Given the wide, and sometimes inappropriate, use of the 'NGS' phrase, it is not always obvious what someone means when they refer to 'NGS reads'. This could include 25–30 bp reads from older Illumina sequencing all the way to PacBio reads that may be >15,000 bp (and which contain a high fraction of indels).

Reading the Methods section of the paper, I see that they only used simulated 101 bp reads as well as real Illumina reads (average length = 82 bp). They do point out in the discussion that "long [PacBio] reads may also provide new opportunities for de novo repeat prediction" . This is something that I have an interest in because we have previously published data that used PacBio data to find tandem repeats. 

In order to find out that they don't have any PacBio data, I had to read the title, abstract, methods, and then scan the rest of the paper. I accept that 'NGS' is a convenient term to use, but it would have been helpful (to me anyway) if at least the abstract could have pinpointed which NGS technologies the paper was using.

A very tenuously derived bioinformatics acronym...and winner of a JABBA award!

I can only imagine that some papers start off with the name of software tool that they want to use, and then work backwards to form an acronym or initialism. After exhausting valid combinations — where the letters are derived from the initial letters of each word — they switch to plucking letters at random. All that matters is coming up with a fun word or phrase for your tool, right?

And so that brings us to MATE-CLEVER, a new tool described in the latest issue of the journal Bioinformatics. The article title at least provides a hint for where the 'M' comes from:

MATE-CLEVER: Mendelian-inheritance-aware discovery and genotyping of midsize and long indels

The fact that the title of the paper includes no words beginning with 'T', 'E', 'C', 'V', or 'R' makes me a little bit afraid as to just where these letters will come from. Ready for this? MATE-CLEVER is derived from:

Mendelian-inheritance-AtTEntive CLique-Enumerating Variant findER

This is a particularly egregious use of selectively choosing the letters to fit your desired name, and for that feat this paper becomes a recipient of yet another JABBA award.

Update 2014-03-18 14.10: The more I look at this name, the more I think they missed the opportunity to name it 'MEAT-CLEAVER'.

Trying to download the cow genome: where's the beef?

Cows on the UC Davis campus, photo by Keith Bradnam

Cows on the UC Davis campus, photo by Keith Bradnam

I want to download the cow genome. A trip to bovinegenome.org seemed like a good place to start. However, the home page of this site is quick to reveal that there is not only assembly Btau_4.6 that was released by the Baylor College of Medicine Human Genome Sequencing Center, there is also an alternative UMD3.1 assembly by the Center for Bioinformatics and Computational Biology (CBCB) at the University of Maryland.

In many areas of life, choice is a good thing. But in genomics, trying to decide which of two independent genome resources to use, well that can be a bit of a headache. The bovinegenome.org site tells me that 'Btau_4.6' was from November 2011. I have to go to the CBCB site to find out that UMD3.1 was made available in December 2009.

Bovinegenome.org helpfully provides a link to the UMD3.1 assembly. Not so helpfully, the link doesn't work. At this point I decide to download the 'Btau_4.6' assembly and associated annotation. The Btau downloads page provides a link to the Assembly 4.0 dataset (in GFF3 format) under a heading of 'Datasets for Bovine Official Gene Set version 2'. So at this point, I have to assume that maybe that the '4.0' might really be referring to the latest 4.6 version of the assembly.

The resulting Assembly 4.0 page has links for each chromosome (available as GFF3 or gzipped GFF3). These links include a 'ChrUn' entry. You will see this in most modern genome assemblies as there is typically a lot of sequence (and resulting annotation) that is 'unplaced' in the genome, i.e. it cannot be mapped to a known chromosome. In the case of cow, ChrUn is the second largest annotation file.

This page also has four more download links for what bovinegenome.org refers to as:

  • 'Reference Sequences'
  • 'CDS Target'
  • 'Metazoa Swissprot Target'
  • 'EST Target'

From the Assembly 4.0 downloads page (which may or may not mean version 4.6) 

No README file or other explanation is provided for what these files contain and if you try to download the GFF3 links (rather than the gzipped versions), you will find that these links don't work. The 'CDS Target' link lets you download a file named 'incomplete_CDS_target.gff3'. I do not know whether the 'incomplete' part refers to individual CDSs (missing stop codons?) or the set of CDSs as a whole. I also do not know what the 'target' part refers to.

The chromosome-specific GFF3 files only contain coordinates of gene features, which utilize several different GFF source fields:

  • NCBI_transcript
  • annotation
  • bovine_complete_cds_gmap_perfect
  • ensembl
  • fgenesh
  • fgeneshpp
  • geneid
  • metazoa_swissprot
  • sgp

Again, no README file is provided to make sense of these multiple sources of gene information. In contrast to the chromosome-specific files, the 'reference_sequences.gff3' and 'incomplete_CDS_target.gff3' contain coordinate information and embedded sequence information in FASTA format (this is allowed in the GFF3 spec).

If you look at the coordinate information in the reference sequence file, you see multiple chromosome entries per biological chromosome. E.g.

From the reference_sequences.gff3 file

I do not know what to make of these.

If you want to get the sequences of all the genes, you have to go to yet another download page at bovinegenome.org where you can download FASTA files for CDS and proteins of either the Full Bovine Official Gene Set v2 or the Subset of Bovine Official Gene Set v2 derived from manual annotations. What exactly is the process that generated the 'manual annotations'? I do not know. And bovinegenome.org does not say.

So maybe the issue here is that I should get the information straight from the horse's — or should that be cow's? — mouth. So I head over to the Bovine Genome Project page at BCM HGSC. This reveals that the latest version of the genome assembly is 'Btau_4.6.1' from July 2012! This would seem to supercede the 2011 'Btau_4.6' assembly (which is not even listed on the BCM HGSC page), but I guess I'll have to download it to be sure. The link on the above BCM HGSC page takes me to an FTP folder which is confusingly named 'Btau20101111' (they started the 2012 assembly in 2010?).

So it seems obvious that all I need to do now is to: check whether the bovinegenome.org 'Assembly 4.0' dataset is really listing files from the 'Btau_4.6' release; find out whether those files are different from the BCM HGSC 'Btau_4.6.1' release; understand what the differences are between the nine different sets of gene annotation; decipher what the 'CDS target' (a.k.a. 'incomplete_CDS_target') file contains, understand why there are 14,195 different 'chromosome' features in the genome GFF file that describes 30 chromosomes; and finally compare and contrast all of these sequences and annotations with the alternative UMD3.1 assembly.

Bioinformatics is fun!

 

Update 1 — 2014-03-14 14.58 pm: The BCM HGSC has an extremely detailed README file that explains the (slight) differences between Btau_4.6 and Btau_4.6.1 (as well as explaining many other things about how they put the genome together). I like README files!

Update 2 — 2014-03-14 15.18 pm: Of course I could always go the Ensembl route and use their version of the cow genome sequence with their annotations. However, the Ensembl 75 version of the cow genome has been built using the older UMD3.1 assembly, whereas an older Ensembl 63 version of the cow genome that is based on the Btau_4.0 assembly is also available (life is never simple!). One advantage to working with Ensembl is that they are very consistent with their annotation and nomenclature.

Update 3 — 2014-03-14 15.35 pm: I now at least realize why there are 14,195 'chromosome' features in the main GFF file provided by bovinegenome.org. They are treating each placed scaffold as its own 'chromosome' feature. This is particularly confusing because a GFF feature field of 'chromosome' is usually taken to refer to an entire chromosome. GFF feature names should be taken from Sequence Ontology terms (e.g. SO:0000340 = chromosome), so in this case the SO term chromosome_part would be more fitting (and less confusing). 

Update 4 — 2014-03-17 11.16 am: I have received some very useful feedback since writing this blog post:

  • I was kindly pointed towards the NCBI Genome Remapping Service by @deannachurch (also see her comment below) that would let me project one annotation data from one coordinate system to another. This could be handy if I end up using different cow genomes.
  • A colleague of mine pointed me to a great 2009 article by @deannachurch and LaDeanna Hillier that talks about how the simultaneous publication of two cow genomes (from the same underlying data) raises issues about how we deal with publication and curation of such data sets.
  • Several people contacted me to give their opinion as to the relative merits of the UMD and the BCM HGSC assemblies.
  • Several people also echoed my sentiments about working with the same cow data, but by using Ensembl's architecture instead.
  • I was contacted by someone from bovinegenome.org who said that they would fix the issues that I raised. This was after I contacted them to tell them I had blogged about the problems that I had encountered. 

The last point is perhaps the most important one which I would like to dwell on for a moment longer. If you are a scientist and experience problems when trying to get hold of data from a website (or FTP site) don't sit in silence and do nothing!

If nobody tells a webmaster that there is an issue with getting hold of data, then the problem will never be fixed. Sometimes files go missing from FTP sites, or web links break, or simple typos make things unclear....the curators involved may never know unless it is pointed out to them.

Next-generation sequencing must die (part 2) — understanding the generation gap

Screen Shot 2014-03-10 at 2.58.05 PM.png

As a brief follow-up to my previous post, I'd like to clarify that next-generation sequencing may refer to technologies from Illumina, 454, SOLiD, Helicos, Ion Torrent, Complete Genomics, PacBio, or Oxford Nanopore (these links all refer to different papers).

If we want to get more specific, we need to recognize that Complete Genomics is a second generation technology...except when it is a third generation technology. In contrast, we should be clear that Oxford Nanopore is the only example of fourth generation technology...apart from when it is third generation technology. We can at least be sure that Ion Torrent is definitely a second generation platform...unless it's a third generation platform. One paper clarifies this situation by observing that Ion Torrent "sits between" the second and third generation categories.

Further illumination on this subject is provided by the confirmation that PacBio is either a second generation, third generation...or even a "2.5th" generation technology. Likewise, Helicos is also a second generation, third generation, or lies "in between the transition of next-generation sequencing to third generation" sequencing technologies.

So hopefully that's a lot clearer now.

Next-generation sequencing must die!

Screen Shot 2014-03-07 at 4.05.14 PM.png

I hate the phrase next-generation sequencing (NGS) with a passion. Here's why...

The first published use of this phrase (that I can find) is from an article in Drug Discovery Today: Technologies by Thomas Jarvie in 2005. This paper had the succinct title Next generation sequencing technologies, and while this may represent the first time this phrase made it into print, it certainly wouldn't be the last.

Illumina sequencing may be the most obvious technology that springs to mind when people think of NGS, but there is also Pyrosequencing (developed circa 1996), Massively Parallel Signature Sequencing (circa 2000), ABI SOLiD sequencing (circa 2008), and Ion semiconductor sequencing (circa 2010).

Of course we also have single molecule real time sequencing by Pacific Biosciences. They were founded in 2004 but didn't launch their PacBio RS machines until 2010. The current darling of the sequencing world is nanopore technology, something which has been in development since the mid 1990s.

So do we refer to this entire period (from development to finished technology) as the NGS era? If so, then NGS technologies have already been around for almost 20 years. It doesn't strike me as particularly helpful to keep on labeling all of these different technologies with the same name.

Some people have tried to make things clearer by introducing yet more levels of obfuscation. This has led to some of these technologies being referred to as either second generation, third generation, fourth generation, next-next-generation, and even next-next-next generation. And of course these definitions are all subjective and one person's third-generation technology is another person's fourth-generation technology.

Other alternatives to NGS such as high-throughput sequencing or long-read sequencing are equally useless because 'high' and 'long' are both relative terms. The output from a high-throughput sequencing platform of 2008 might seem like 'low-throughput' today. The weakness of length-based descriptions is the reason why the 'Short Read Archive' was (thankfully) reborn as the Sequence Read Archive.

So here is my proposed three-step solution to rid the world of this madness:

  1. Don't use any of these terms, ever again.
  2. Just refer to a technology by a name that describes the methodology (e.g. sequencing-by-synthesis) or by the name of a company that has developed a specific product (e.g. Oxford Nanopore).
  3. You could even just use the term 'current sequencing technologies'...as long as your paper/talk/blog/book has a date associated with it, then I'm confident people will know what you mean.

Update 12th March: I have added a follow-up post to this one.

Some more JABBA awards to highlight bad bioinformatics acronyms

jabba logo.png

More JABBA awards for bogus bioinformatics acronyms. These all come from a recent issue of Bioinformatics:

Honorable mention: