JABBA vs Jabba: when is software not really software?

It was only a matter of time I guess. Today I was alerted to a new publication by Simon Cockell (@sjcockell), it's a book chapter titled:

From the abstract:

Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data

Now as far as I can tell, this Jabba is not an acronym, so we safely avoid the issue of presenting a JABBA award for Jabba. However, one might argue that naming any bioinformatics software 'Jabba' is going to present some problems because this is what happens when you search Google for 'Jabba bioinformatics'.

There is a bigger issue with this paper that I'd like to address though. It is extremely disappointing to read a bioinformatics software paper in the year 2015 and not find any explicit link to the software. The publication includes a link to http://www.ibcn.intec.ugent.be, but only as part of the author details. This web page is for the Internet Based Communication Networks and Services research group at the University of Gent. The page contains no mention of Jabba, nor does their 'Facilities and Tools' page, nor does searching their site for Jabba.

Initially I wondered if this paper is more about the algorithm behind Jabba (equations are provided) and not about an actual software implementation. However, the paper includes results from their Jabba tool in comparison to another piece of software (LoRDEC) and includes details of CPU time and memory requirements. This suggests that the Jabba software exists somewhere.

To me this is an example of 'closed science' and represents a failure of whoever reviewed this article. I will email the authors to find out if the software exists anywhere…it's a crazy idea but maybe they might be interested if people could, you know, use their software.

Update 2015-11-20: I heard back from the authors…the Jabba software is on GitHub.

Get shorty: the decreasing usefulness of referring to 'short read' sequences

I came across a new paper today:

When I see such papers, I always want to know 'what do you consider short?'. This particular paper makes no attempt to define what 'short' refers to, with the only mention of the 'S' word in the paper being as follows:

Finally, qAlign stores metadata for all generated BAM files, including information about alignment parameters and checksums for genome and short read sequences

There are hundreds of papers that mention 'short read' in their title and many more which refer to 'long read' sequences.

But 'short' and 'long' are tremendously unhelpful terms to refer to sequences. They mean different things to different people, and they can even mean different things to the same person at different times. I think that most people would agree that Illumina's HiSeq and MiSeq platforms are considered 'short read' technologies. The HiSeq 2500 is currently capable of generating 250 bp reads (in Rapid Run Mode), yet this is an order of magnitude greater than when Illumina/Solexa started out generating ~25 bp reads. So should we refer to these as long-short reads?

The lengths of reads generated by the first wave of new sequencing technologies (Solexa/Illumina, ABI SOLiD, and Ion Torrent) were initially compared to the 'long' (~800 bp) reads generated by Sanger sequencing methods. But these technologies have evolved steadily. The latest reagent kits for the MiSeq platform offer the possibility of 300 bp reads. However, if you perform paired end sequencing of libraries with insert sizes of ~600 bp, then you may end up generating single consensus reads that approach this length. Thus we are already at the point where a 'short read' sequencing technology can generate some reads that are longer than some of the reads produced by the former gold-standard 'long read' technology.

But the read lengths of any of these technologies pale in comparison when we consider the output of instruments from Pacific Biosciences and Oxford Nanopore. By their standards, even Sanger sequencing reads could be considered 'short'.

If someone currently has reads that are 500-600 bp in length, it is not clear whether any software tool that proclaims to work with 'short reads' is suitable or not. Just as the 'Short Read Archive' (SRA) became the more-meaningfully-named Sequence Read Archive, so we as a community should banish these unhelpful names. If you develop tools that are optimized to work with 'short' or 'long' read data, please provide explicit guidelines as to what you mean!
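As one illustration of what 'explicit guidelines' could look like, here is a minimal Python sketch in which a tool declares the read lengths it was tested on and reports any reads that fall outside that range. The class name, the function name, and the 25–300 bp bounds are all hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadLengthRange:
    """Read lengths (in bp) that a tool claims to support, inclusive."""
    min_bp: int
    max_bp: int

    def supports(self, length_bp: int) -> bool:
        return self.min_bp <= length_bp <= self.max_bp


# Hypothetical bounds: stated in numbers rather than as 'short reads'.
SUPPORTED_RANGE = ReadLengthRange(min_bp=25, max_bp=300)


def unsupported_reads(lengths_bp):
    """Return the read lengths that fall outside the declared range."""
    return [n for n in lengths_bp if not SUPPORTED_RANGE.supports(n)]


if __name__ == "__main__":
    # Someone with 550 bp reads can see immediately that they are out of range.
    print(unsupported_reads([36, 250, 550]))
```

With a declared range like this, the person holding 500-600 bp reads no longer has to guess what 'short' means.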

To conclude:

There are no 'short' or 'long' reads, there are only sequences that are shorter or longer than other sequences.

Data access for the 1,000 Plants (1KP) project

From the abstract of a new paper in GigaScience:

The 1,000 plants (1KP) project is an international multi-disciplinary consortium that has generated transcriptome data from over 1,000 plant species, with exemplars for all of the major lineages across the Viridiplantae (green plants) clade. Here, we describe how to access the data used in a phylogenomics analysis of the first 85 species, and how to visualize our gene and species trees.

The paper doesn't provide a link to what seems to be the actual project website. The authors do mention directories within the iPlant Collaborative project where you can access data. The project website reveals that this project can be referred to as either '1000 plants', 'oneKP', or '1KP' (but not '1000P'?).

Being a pedantic kind of guy, I was curious about the paper's vague mention of 'over 1,000 plant species'. How many species exactly? The paper doesn't say. But if you go to one of the iPlant pages for 1KP, you will see this:

Altogether, we sequenced 1320 samples (from 1162 species)

So this project seems to have exceeded the boundaries suggested by its name. How about the '1.2KP' project?

Identical Classifications In Science: Some advice for Jonathan Eisen

Jonathan Eisen — a colleague at the UC Davis Genome Center — has a quandary. He came up with a name for one of his projects but now needs to consider renaming it. The problem is that ICIS (Innovating Communication in Scholarship) sounds a bit like…well you all know what it sounds like. So Jon has appealed for suggestions on how to rename their project.

He should take comfort that he may not be the only one facing this dilemma. After all, the International Cooperative ITP Study Group (ICIS) has been an ongoing collaboration between hematologists since 1997. I wonder whether they are considering a name change? Maybe Jon could also ask the folk at the International Conference on Information Systems (ICIS) who have been meeting since 1980. Or they could talk to the people that came up with the Intelligent Coin Identification System (ICIS), or the Intensive Care Infection Score (ICIS), or the Integrated Crate Interrogation System (ICIS), or the 20 year old International Crop Information System (ICIS), or the people who named this gene.

These are just some of the academic uses of ICIS that I could find from a couple of quick searches. I expect that there are more out there. This is a reflection of one of the most primal desires of all scientists…the need to come up with an acronym or initialism for their project. This urge is all too commonly associated with the additional need to make the name 'fun' (particularly a desire to name things after animals). Acronyms can also backfire for other reasons, such as when you don't fully appreciate how they might sound in other countries.

The shorter your acronym, the more likely that it has been used by other people before you (even within the same field). My suggestion would be to consider the shocking alternative of not using an acronym at all! After all, sometimes people can come up with new names that seem to catch on.

Why the UCSC Genome Browser FTP site is one of my least favorite places to visit

If you visit the Golden Path directory of the UCSC Genome Browser FTP site (ftp://hgdownload.cse.ucsc.edu//apache/htdocs/goldenPath), you will come across the following quirks:

  1. Multiple genomes for the same species are not grouped together under a parent directory for each species, so the number of items in this directory (~250) gives no indication of the number of species represented (~125).
  2. Species identifiers are ambiguous. You have to know that 'mm9' refers to Mus musculus and not Macaca mulatta.
  3. Species identifiers are also inconsistent. Some species get just two lower-case characters (e.g. 'mm' = Mus musculus, 'dm' = Drosophila melanogaster) whereas most get six characters (e.g. 'felCat' = Felis catus, 'sacCer' = Saccharomyces cerevisiae).
  4. Humans, hallowed species that we are, simply get 'hg' (presumably for 'human genome').
  5. The six-character format reverses centuries (!) of naming convention by making the genus part of the name start with a lower-case character and the specific part of the name start with an upper-case character.
  6. Some species also have date-versioned directories in addition to numerically suffixed directories. So do you want to download the 'hg7' version of the human genome or instead get the 'hg7oct2000_oo21' (don't ask me what the '_oo21' part means)?
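
To make the naming quirks in the list above concrete, here is a small Python sketch of the two abbreviation schemes as they appear from the directory names. The function names are my own, and the rules are inferred from the examples above rather than taken from any UCSC documentation.

```python
def six_char_prefix(latin_name: str) -> str:
    """First three letters of the genus (lower case) followed by the
    first three letters of the species (capitalized), e.g. 'felCat'."""
    genus, species = latin_name.split()[:2]
    return genus[:3].lower() + species[:3].capitalize()


def two_letter_prefix(latin_name: str) -> str:
    """Initials of genus and species in lower case, e.g. 'dm'."""
    genus, species = latin_name.split()[:2]
    return (genus[0] + species[0]).lower()


if __name__ == "__main__":
    for name in ("Felis catus", "Saccharomyces cerevisiae"):
        print(name, "->", six_char_prefix(name))

    # The two-letter scheme cannot tell these two species apart,
    # whereas the six-character scheme can.
    for name in ("Mus musculus", "Macaca mulatta"):
        print(name, "->", two_letter_prefix(name), "/", six_char_prefix(name))
```

Note that 'hg' cannot be derived from Homo sapiens by either rule, so any real lookup would still need a hand-maintained table of exceptions.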

If you want a challenge, try writing some bioinformatics software that goes from the Latin name for a species to the correct directory on their FTP site! I guess the UCSC team are going to hope that six characters is enough to uniquely identify any future species that end up here. So I hope they don't start sequencing too many more Drosophila species. E.g.

Compare this madness — and it is madness — to the calming orderliness of the Ensembl Genomes FTP site (e.g. ftp://ftp.ensemblgenomes.org//pub/release-23/metazoa/fasta):

A view from UCSC Genome Browser FTP site…

…compared to a view from the Ensembl Genomes FTP site

I think the key point from this story is that a lot of bioinformatics research can be hard enough without the added complexities of working with unstructured data. When you start building any new resource in bioinformatics, be it an FTP site, web site, or GitHub repository, you should plan for the future! I.e. expect things to expand, grow, and greatly increase in complexity.

Even if you intend for a resource to only ever contain information for a single species, assume that it will end up containing hundreds of species. You should also assume that people may wish to automate the querying of your data. If you plan for these things from the moment you start building your resource, you might make some bioinformaticians happy, and you certainly don't want to make us angry…you wouldn't like us when we're angry.

Some sage advice on avoiding confusing names for bioinformatics tools

SAGE is a molecular technique used to investigate the mRNA population from a chosen sample. It stands for Serial Analysis of Gene Expression and was first described back in 1995. The technique spawned spin-offs such as LongSAGE, RL-SAGE (Really Long SAGE), and SuperSAGE.

Although this technique has largely been superseded by other methods (such as RNA-Seq), it is still widely referenced (over 1,300 publications from 2013 mention this technique).

Fast-forward to the present day and I note that a new tool has just been published in the journal BMC Bioinformatics:

SAGE: String-overlap Assembly of GEnomes

As long as you query your favorite web search engine for some combination of 'SAGE' and 'genome assembly' you will probably find this tool and not end up on one of the half a million pages that talk about the other SAGE. I'm still not sure whether it is a bit risky giving a new tool the same name as such an established molecular technique.

All of this means that there is the potential for a certain company to use the aforementioned molecular technique to help annotate the output of the aforementioned computational technique, and apply both of these techniques to data from a certain plant. This could give you the world's first SAGE, SAGE, SAGE, sage genome!

ACGT...TGCA — has every possible DNA-based initialism been used by the bioinformatics/genomics community?

 

Short answer

Yes. 

Long answer…

You might work in a field that's related to biology, genetics, genomics, or bioinformatics. You might be working on a new piece of software, or a research proposal, or you need to form a committee. Maybe you have even been given the power to name a new research facility.

Suddenly you have an inspiration...why don't we name our new software, proposal, committee, or facility after a DNA-based initialism! That would be clever and make us stand out from the crowd, right? Maybe...maybe not.

What follows is a fairly exhaustive list of — presumably intentional — DNA-based initialisms that are in use (or have been used). As of 2020-07-20 the current list contains 67 names in total with all 24 possible combinations of [ACGT] being used. The additions since I first created this page are included at the end.
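
For anyone who wants to check the figure of 24, the possible orderings of the four letters (strictly speaking, permutations) can be enumerated in a few lines of Python. This is just a quick sketch using the standard library; the variable name is my own.

```python
from itertools import permutations

# Every ordering of the four bases, each base used exactly once.
orderings = ["".join(p) for p in permutations("ACGT")]

if __name__ == "__main__":
    print(len(orderings))                # 4! = 24
    print(orderings[0], orderings[-1])   # ACGT TGCA
```

Because the input letters are already in alphabetical order, the list runs from ACGT to TGCA, which is the same order used for the headings below.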

See also this related blog post by David Lawrence from 2014, which I only discovered in mid-2020. His post — which beat me to the punch by just a couple of weeks! — has provided me with a few additional examples which I hadn’t heard about and which have now been included here.

Please let me know of any errors or omissions, though note that potential names have to be initialisms and have to be somewhat related to the fields of genetics, genomics, or bioinformatics.


ACGT

  1. Advisory Committee on Genetic Testing — Committee — 1996
  2. Alliance for Cancer Gene Therapy — Research Network — 2001
  3. A Comparative Genomics Tool — Software — 2003
  4. Advancing Clinico-genomic Trials on Cancer — Research Project — 2011
  5. Algorithms in Computational Genomics at Tau — Lab web page — ???
  6. Advanced Center for Genome Technology — Research Center? — ???
  7. African Centre for Gene Technologies — Research Network — ???
  8. Applied Computational Genomics Team — Research Group — ???
  9. Amino aCids To Genome — Software — 2017
  10. Analysis of Czech Genomes for Theranostics — Research Project? — 2020?

ACTG

  1. Automatic Correspondence of Tags and Genes — Software — 2007

AGCT

  1. Applied Genomics & Cancer Therapeutics — Research Program? — ???

AGTC

  1. Applied Genomics Technology Center — Core Facility? — 1998
  2. Advanced Genome Technologies Core — Core Facility — ???
  3. University of Kentucky Advanced Genetic Technologies Center — Core Facility (now defunct?) — ???

ATCG

  1. Applied Technology in Conservation Genetics — Research Lab — ???

ATGC

  1. Arabidopsis Thaliana Genome Center — Core Facility? — 2000?
  2. Another Tool for Genome Comparison — Software — 2001
  3. Advanced Thermal Gradient dna Chip — Patent — 2002
  4. Another Tool for Genomic Comprehension — Database & web tool — 2012
  5. Alignable Tight Genomic Clusters - Database - 2009

CAGT

  1. Center for Advanced Genomic Technology — Research Facility — 2000?
  2. Center for Applied Genetics and Technology — Research Facility — 2004
  3. Center for Applied Genetic Technologies — Research Facility — ???
  4. Clustering AGgregation Tool — Software — 2012?

CATG

  1. Cross-legume Advances Through Genomics — Conference — 2004?
  2. Center for Advanced Technologies in Genomics — Research Facility — 2008

CGAT

  1. Comparative Genome Analysis Tool — Software — 2006
  2. Computational Genomics Analysis and Training — Training program — 2010
  3. Computational Genomics Analysis Toolkit — Software — 2013
  4. Centre for Gene Analysis and Technology — Research Facility — ???
  5. Canadian Genome Analysis and Technology program — Research program (now defunct) — 1992

CGTA

  1. CNS Gene therapy Translation Acceleration - Research Group - ???

CTAG

  1. Corn Transcriptome Analysis Group — Working Group — 2014
  2. Canadian Triticum Advancement Through Genomics - Research project - 2011

CTGA

  1. the Catalogue for Transmission Genetics in Arabs — Database — 2006

GACT

  1. The Center for Genetic Architecture of Complex Traits - Research Center - 2013

GATC

  1. Genetic Analysis Technology Consortium — Biotech Consortium (now defunct?) — circa 1997?

GCAT

  1. Genome Comparison & Analytic Testing — Software? — ???
  2. Genome Consortium for Active Teaching — Teaching Consortium — 2007?
  3. Gene-set Cohesion Analysis Tool — Software — 2011 (or 2007)
  4. Genotype-Conditional Association Test — Statistical method — 2015
  5. Genomics, Computational biology And Technology - study section - ???

GCTA

  1. Genome-wide Complex Trait Analysis — Software — 2011

GTAC

  1. Gene Technology Access Center — Teaching Facility — 2000
  2. Genomics Technology Access Center — Core Facility — 2009?
  3. Genome Technology Access Center — Core Facility — 2010
  4. Genomics/Transcriptomics Analysis Core — Core Facility — ???
  5. Genomes and Transcriptomes of Arctic Chromists — Research Program — 2012
  6. Gene Technology Advisory Committee — Government Committee — ???

GTCA

  1. Genomic Tetranucleotide Composition Analysis — Database — 2006
  2. Genome Transcriptome Correlation Analysis — Software — 2007

TACG

  1. Talking About Computing and Genomics — Workshop — 2013

TAGC

  1. The Applied Genomics Core — Core Facility — 1998
  2. The Ashkenazi Genome Consortium — Consortium — 2012
  3. Technological Advances for Genomics and Clinics — Research Lab/Program? — ???
  4. The Arts & Genomics Centre — An Arts/Science Center — ???
  5. The Allied Genetics Conference — Conference — 2016?
  6. Taxon-Annotated GC plots — software visualisation method/tool — 2013

TCAG

  1. The Centre for Applied Genomics — Research Facility — 2007?
  2. The Center for the Advancement of Genomics — Research Facility (superseded by this) — ???

TCGA

  1. The Centre for Genetic Anthropology — Research Facility — 1996
  2. The Tayside Centre for Genomic Analysis — Core facility — 2001 (?)
  3. The Center for Genomic Application — Core Facility — 2004
  4. The Cancer Genome Atlas — Research Program — 2006

TGAC

  1. The Genome Access Course — Training Course — 2002
  2. The Genome Analysis Center — Research Facility — 2009

TGCA

  1. The Genome Counselling App — iOS Application — 2014
 

Updates:

  • 2020-08-20 Added 5th example of ATGC, 3rd example of AGTC, 2nd example of CTAG, and 4th example of GCAT (all courtesy of David Lawrence)

  • 2020-07-18 Added 10th example of ACGT

  • 2019-07-23 Added 9th example of ACGT (thanks to Sam Lent @samanthalent)

  • 2016-09-03 Added 4th example of TCGA (thanks to @malcolmacaulay)

  • 2016-02-16 Added 6th example of TAGC

  • 2015-09-11 - Added 5th example of TAGC

  • 2015-07-06 - Added 8th example of ACGT

  • 2015-04-06 - Added 4th example of GCTA (thanks to John Didion)

  • 2014-12-12 - Added first usage of TACG (thanks to @NazeefaFatima)

  • 2014-04-25 - Added Jeff Ross-Ibarra's planned use of CTAG

  • 2014-04-25 - Included a second instance of AGTC

  • 2014-05-18 - Included a fourth example of TAGC

  • 2014-09-08 - Included first usage of CGTA, GACT, and TGCA

Why I think buzzword phrases like 'big data' are a huge problem

There are many phrases in bioinformatics that I find annoying because they can be highly subjective:

  • Short-read mapping
  • Long-read technology
  • High-throughput sequencing

One person's definition of 'short', 'long', or 'high' may differ greatly from someone else's. Furthermore, our understanding of these phrases changes with the onwards march of technological innovation. Back in 2000 'short' meant 16–20 bp whereas in 2014, 'short' can mean 100–200 bp.

The new kid on the block, which is not specific to bioinformatics, is 'big data'. Over the last week, I've been helping with an NIH grant application entitled Courses for Skills Development in Biomedical Big Data Science. This grant mentions the phrase thirty-nine times so it must be important. Why do I dislike the phrase so much? Here is why:

  1. Even within a field like bioinformatics, it's a subjective term and may not mean the same thing to everyone.
  2. Just as the phrases 'next-generation' and 'second-generation' sequencing inspired a set of clumsy and ill-defined successors (e.g. '2.5th generation', 'next-next-next generation' etc.), I expect that 'big data' might lead to similar language atrocities being committed. Will people start talking about 'really big data' or 'extremely large data' to distinguish themselves from one another?
  3. This term might be subjective within bioinformatics, but it is probably much more subjective when used across different scientific disciplines. In astronomy there are space telescopes that are producing petabytes of data. In the field of particle physics, the Data Center at the Wigner Research Centre for Physics processes one petabyte of data per day. If you work for the NSA, then you may well have exabytes of data lying around.

I joked about the issue of 'big data' on twitter:

My Genome Center colleague Jo Fass had a great comment in response to this:

This is an excellent point. When people talk about the challenges of working with 'big data', it really depends on how well your infrastructure is equipped to deal with such data. If your data is readily accessible and securely backed up, then you may only be working with 'data' and not 'big data'.

In another post, I will suggest that the issue for much of bioinformatics is not 'big data' per se but 'obese data', or even 'grotesquely obese data'. I will also suggest a sophisticated computational tool that I call Operational Heuristics for Management of Your Grotesquely Obese Data (OHMYGOD), but which you might know as rm -f.

Celebrating an unsung hero of genomics: how Albrecht Kossel saved bioinformatics from a world of hurt

The following image is from one of the first publications to ever depict a DNA sequence in a textual manner:

From Wu & Kaiser, Journal of Molecular Biology, 1968. 

This is taken from the 1968 paper by Wu and Kaiser: Structure and base sequence in the cohesive ends of bacteriophage lambda DNA. In the almost half a century since this publication, it has become the norm to simply represent the sequence of any DNA molecule as a string of characters, one character per base. But have you ever stopped to consider why these bases have the names that they do?

Most of the work to isolate and describe the purines and pyrimidines that make up nucleic acids was done by the German biochemist Albrecht Kossel. Between 1885 and 1901 he characterized the principal nucleobases of the nucleic acids: adenine, cytosine, guanine, and thymine (though guanine had first been isolated in 1844 by Heinrich Gustav Magnus). The fifth nucleobase (uracil) was discovered by Alberto Ascoli, a student of Kossel.

Kossel's work would later be recognized with the 1910 Nobel Prize in Physiology or Medicine. It should be noted that Kossel didn't just help isolate and describe these bases; he was also chiefly responsible for most of their names.

Albrecht Kossel, Image from wikimedia

Guanine had already been named based on where it had first been discovered (the excrement of seabirds known as guano). Adenine was so named by Kossel because it was isolated from the pancreas gland ('adenas' in Greek). Thymine was named because it was isolated in nucleic acids from the thymus of a calf. Cytosine, the last of the four DNA bases to be characterized, was also discovered from hydrolysis of the calf thymus. Its name comes from the original name in German ('cytosin') and simply refers to the Greek prefix for cell ('cyto').

While this last naming choice might seem a little dull, all bioinformaticians owe a huge debt of thanks to Albrecht Kossel. Thankfully, he ensured that all DNA bases have names that start with different letters. This greatly facilitates their representation in silico. Imagine if he had — in a fit of vanity — instead chosen to name these last two bases that he characterized after himself and his daughter Gertrude. If that had been the case then maybe we would today be talking about the bases adenine, albrechtine, guanine, and gertrudine. Not an insurmountable problem to represent with single characters — we already deal with the minor headache of representing purines and pyrimidines differently (using R and Y respectively) — but frankly, it would be a royal pain in the ass.
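
To see why unique first letters matter in practice, here is a minimal Python sketch. It checks whether a set of base names can be abbreviated by their initials, and it shows the existing R/Y convention for purines and pyrimidines mentioned above. The function names are hypothetical and the 'alternative' base names are, of course, the made-up ones from the previous paragraph.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def initials_are_unique(base_names) -> bool:
    """True if every name starts with a different letter."""
    initials = [name[0].upper() for name in base_names]
    return len(set(initials)) == len(initials)


def to_purine_pyrimidine(seq: str) -> str:
    """Collapse a DNA sequence to R (purine) and Y (pyrimidine)."""
    out = []
    for base in seq.upper():
        if base in PURINES:
            out.append("R")
        elif base in PYRIMIDINES:
            out.append("Y")
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return "".join(out)


if __name__ == "__main__":
    real = ["adenine", "cytosine", "guanine", "thymine"]
    imagined = ["adenine", "albrechtine", "guanine", "gertrudine"]
    print(initials_are_unique(real))       # True
    print(initials_are_unique(imagined))   # False
    print(to_purine_pyrimidine("ACGT"))    # RYRY
```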

Thank you Albrecht. The world of bioinformatics is in your debt.