New BUSCO vs (very old) CEGMA

If I’m only going to write one or two blog posts a year on this blog, then it makes sense to return to my recurring theme: don’t use CEGMA, use BUSCO!

In 2015 I was foolishly optimistic that the development of BUSCO would mean that people would stop using CEGMA — a tool that we started developing in 2005 and which used a set of orthologs published in 2003! — and that we would reach ‘peak-CEGMA’ citations that year.

That didn’t happen. At the end of 2017, I again asked the question have we reached peak-CEGMA? because we had seen ten consecutive years of increasing publications.

Well, I’m happy to announce that 2017 did indeed see citations to our 2007 CEGMA paper finally peak:

CEGMA citations by year (from Google Scholar)


Although we have definitely passed peak CEGMA, it still receives over 100 citations a year, and people really should be using tools like BUSCO instead.

This neatly leads me to mention that a recent publication in Molecular Biology and Evolution describes an update to BUSCO:

From the introduction:

With respect to v3, the last BUSCO version, v5, features: 1) a major upgrade of the underlying data sets in sync with OrthoDB v10; 2) an updated workflow for the assessment of prokaryotic and viral genomes using the gene predictor Prodigal (Hyatt et al. 2010); 3) an alternative workflow for the assessment of eukaryotic genomes using the gene predictor MetaEuk (Levy Karin et al. 2020); 4) a workflow to automatically select the most appropriate BUSCO data set, enabling the analysis of sequences of unknown origin; 5) an option to run batch analysis of multiple inputs to facilitate high-throughput assessments of large data sets and metagenomic bins; and 6) a major refactoring of the code, and maintenance of two distribution channels on Bioconda (Grüning et al. 2018) and Docker (Merkel 2014).

Please, please, please…don’t use CEGMA anymore! It is enjoying a well-earned retirement at the Sunnyvale Home for Senior Bioinformatics Tools.

Three cheers for JABBA awards


These days, I mostly think of this blog as a time capsule to my past life as a scientist. Every so often though, I’m tempted out of retirement for one more post. This time I’ve actually been asked to bring back my JABBA awards by Martin Hunt (@martibartfast)…and with good reason!

There is a new preprint in bioRxiv…

I’m almost lost for words about this one. You know that it is a tenuous attempt at an acronym or initialism when you don’t use any letters from the 2nd, 3rd, 4th, or 5th words of the full software name!

The approach here is very close to just choosing a random five-letter word. The authors could also have had:

CLAMP: hierarChical taxonomic cLassification for virAl Metagenomic data via deeP learning

HOTEL: hierarcHical taxOnomic classificaTion for viral mEtagenomic data via deep Learning

RAVEN: hieraRchical tAxonomic classification for Viral metagenomic data via dEep learNing

ALIEN: hierArchical taxonomic cLassification for vIral metagEnomic data via deep learniNg

LARVA: hierarchicaL taxonomic classificAtion for viRal metagenomic data Via deep leArning
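For anyone who wants to hunt for further alternatives, the test each of these names passes is simple: its letters appear, in order, somewhere in the full software name. Here is a minimal Python sketch of that check. The function name `spells_in_order` is my own invention for illustration, and this is only a guess at the 'method', not anything the preprint's authors describe.

```python
def spells_in_order(word: str, phrase: str) -> bool:
    """Return True if the letters of `word` occur in `phrase` in the same order."""
    letters = iter(phrase.lower())
    # `ch in letters` consumes the iterator up to and including the first match,
    # so every subsequent letter has to be found later in the phrase.
    return all(ch in letters for ch in word.lower())


# Full software name, as spelled out in the candidate names above.
FULL_NAME = (
    "hierarchical taxonomic classification for viral "
    "metagenomic data via deep learning"
)

for candidate in ["CLAMP", "HOTEL", "RAVEN", "ALIEN", "LARVA"]:
    print(candidate, spells_in_order(candidate, FULL_NAME))
```

Feed it any word list and you will find that a long enough software name can 'spell' a great many five-letter words, which is rather the point.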

Okay, as this might be my only blog post of 2020, I’ll say CHEERio!

DOGMA: a new tool for assessing the quality of proteomes and transcriptomes

A new tool, recently published in Nucleic Acids Research, caught my eye this week:

The tool, by a team from the University of Münster, uses protein domains and domain arrangements in order to assess 'completeness' of a proteome or transcriptome. From the abstract…

Even in the era of next generation sequencing, in which bioinformatics tools abound, annotating transcriptomes and proteomes remains a challenge. This can have major implications for the reliability of studies based on these datasets. Therefore, quality assessment represents a crucial step prior to downstream analyses on novel transcriptomes and proteomes. DOGMA allows such a quality assessment to be carried out. The data of interest are evaluated based on a comparison with a core set of conserved protein domains and domain arrangements. Depending on the studied species, DOGMA offers precomputed core sets for different phylogenetic clades

Unlike CEGMA and BUSCO, which run against unannotated assemblies, DOGMA first requires a set of gene annotations. The paper focuses on the web server version of DOGMA but you can also access the source code online.

It's good to see that other groups are continuing to look at new ways of assessing the quality of large genome/transcriptome/proteome datasets.

What's in a name?

Initially, I thought the name was just a word that both echoed 'CEGMA' and reinforced the central dogma of molecular biology. Hooray, I thought, a bioinformatics tool that just has a regular word as a name without relying on contrived acronyms.

Then I saw the website…

  • DOGMA: DOmain-based General Measure for transcriptome and proteome quality Assessment

This is even more tenuous than the older, unrelated, version of DOGMA:

  • DOGMA: Dual Organellar GenoMe Annotator

Beyond Generations: My Vocabulary for Sequencing Tech

Many writers have attempted to divide Next Generation Sequencing into Second Generation Sequencing and Third Generation Sequencing. Personally, I think it isn't helpful and just confuses matters. I'm not the biggest fan of Next Generation Sequencing (NGS) to start with, as like "post-modern architecture" (or heck, "modern architecture") it isn't future-proofed.

Keith Robison gives an interesting deep dive on how sequencing technologies have been named and potentially could be named.

This post reminded me of my previous takes on the confusing and inconsistent labelling of these technologies: