CEGMA is dying…just very, very slowly

This is my first post on this blog in almost three years and it is now almost nine years since I could legitimately call myself a genomics researcher or bioinformatician.

However, I feel that I need to 'come out of retirement' for one quick blog post on a topic that has spanned many others…CEGMA.

As I outlined in my last post on this blog, the CEGMA tool that I helped develop back in 2005, and which was first published in 2007, continues to be used.

This is despite many attempts to tell/remind people not to use it anymore! There are better tools out there (probably many that I'm not even aware of). Fundamentally, the weakness of using CEGMA is that it is based on an identified set of orthologs that was published over two decades ago.

And yet, every week I receive Google Scholar alerts that tell me that someone else has cited the tool again. We (myself and Ian Korf) should perhaps take some of the blame for keeping the software available on the Korf Lab website (I wonder how many other bioinformatics tools from 2007 can still be downloaded and successfully run?).

CEGMA citations (2011-2024)

When I saw that citations had peaked in 2017 and when I saw better tools come along, I thought it would be only a couple of years until the death knell tolled for CEGMA. I was wrong. It is dying…just very, very slowly. There were 119 citations last year and there have been 88 so far this year.

Academics (including former academics) obviously love to see their work cited. It is good to know that you have built tools that were actively used. But please, stop using CEGMA now! The other co-authors and I no longer need the citations to justify our existence.

Come back to this blog in another three years when I will no doubt post yet another post about CEGMA ('For the love of all that is holy why won't you just curl up and die!').

New BUSCO vs (very old) CEGMA

If I’m only going to write one or two blog posts a year on this blog, then it makes sense to return to my recurring theme of don’t use CEGMA, use BUSCO!

In 2015 I was foolishly optimistic that the development of BUSCO would mean that people would stop using CEGMA — a tool that we started developing in 2005 and which used a set of orthologs published in 2003! — and that we would reach ‘peak-CEGMA’ citations that year.

That didn’t happen. At the end of 2017, I again asked the question have we reached peak-CEGMA? because we had seen ten consecutive years of increasing publications.

Well I’m happy to announce that 2017 did indeed see citations to our 2007 CEGMA paper finally peak:

CEGMA citations by year (from Google Scholar)

Although we have definitely passed peak CEGMA, it still receives over 100 citations a year and people really should be using tools like BUSCO instead.

This neatly leads me to mention that a recent publication in Molecular Biology and Evolution describes an update to BUSCO:

From the introduction:

With respect to v3, the last BUSCO version, v5, features: 1) a major upgrade of the underlying data sets in sync with OrthoDB v10; 2) an updated workflow for the assessment of prokaryotic and viral genomes using the gene predictor Prodigal (Hyatt et al. 2010); 3) an alternative workflow for the assessment of eukaryotic genomes using the gene predictor MetaEuk (Levy Karin et al. 2020); 4) a workflow to automatically select the most appropriate BUSCO data set, enabling the analysis of sequences of unknown origin; 5) an option to run batch analysis of multiple inputs to facilitate high-throughput assessments of large data sets and metagenomic bins; and 6) a major refactoring of the code, and maintenance of two distribution channels on Bioconda (Grüning et al. 2018) and Docker (Merkel 2014).

Please, please, please…don’t use CEGMA anymore! It is enjoying a well-earned retirement at the Sunnyvale Home for Senior Bioinformatics Tools.

DOGMA: a new tool for assessing the quality of proteomes and transcriptomes

A new tool, recently published in Nucleic Acids Research, caught my eye this week:

The tool, by a team from the University of Münster, uses protein domains and domain arrangements in order to assess 'completeness' of a proteome or transcriptome. From the abstract…

Even in the era of next generation sequencing, in which bioinformatics tools abound, annotating transcriptomes and proteomes remains a challenge. This can have major implications for the reliability of studies based on these datasets. Therefore, quality assessment represents a crucial step prior to downstream analyses on novel transcriptomes and proteomes. DOGMA allows such a quality assessment to be carried out. The data of interest are evaluated based on a comparison with a core set of conserved protein domains and domain arrangements. Depending on the studied species, DOGMA offers precomputed core sets for different phylogenetic clades

Unlike CEGMA and BUSCO, which run against unannotated assemblies, DOGMA first requires a set of gene annotations. The paper focuses on the web server version of DOGMA but you can also access the source code online.

It's good to see that other groups are continuing to look at new ways of assessing the quality of large genome/transcriptome/proteome datasets.

What's in a name?

Initially, I thought the name was just a word that both echoed 'CEGMA' and reinforced the central dogma of molecular biology. Hooray I thought, a bioinformatics tool that just has a regular word as a name without relying on contrived acronyms.

Then I saw the website…

  • DOGMA: DOmain-based General Measure for transcriptome and proteome quality Assessment

This is even more tenuous than the older, unrelated, version of DOGMA:

  • DOGMA: Dual Organellar GenoMe Annotator

We have not yet reached 'peak CEGMA': record number of citations in 2016

Over the last few weeks, I've been closely watching the number of citations to our original 2007 CEGMA paper. Despite making it very clear on the CEGMA webpage that it has been 'discontinued' and despite leaving a comment in PubMed Commons that people should consider alternative tools, citations continue to rise.

This week we passed a milestone with the paper getting more citations in 2016 than in 2015. As the paper's Google Scholar page clearly shows, the citations have increased year-on-year ever since it was published:

While it is somewhat flattering to see research that I was involved in so highly cited (I can't imagine that many papers show this pattern of citation growth over such a long period), I really hope that 2016 marks 'peak CEGMA'.

CEGMA development started in 2005, a year that pre-dates technologies such as Solexa sequencing! People should really stop using this tool and try using something like BUSCO instead.

Making code available: lab websites vs GitHub vs Zenodo

Our 2007 CEGMA paper was received by the journal Bioinformatics on December 7, 2006. So this means it was about 9 years ago that we had to have the software on a website somewhere with a URL that we could use in a manuscript. This is what ended up in the paper:

Even though we don't want people to use CEGMA anymore, I'm at least happy that the link still works! I thought it would be prudent to make the paper link to a generic top-level page of our website (/Datasets) rather than to a specific page. This is because experience has taught me that websites often get reorganized, and pages can move.

If we were resubmitting the paper today and if we were still linking to our academic website, then I would probably suggest linking to the main website page (korflab.ucdavis.edu) and ensuring that the link to CEGMA (or 'Software') was easy to find. This would avoid any problems with moving/renaming pages.

However, even better would be to use a service like Zenodo to permanently archive the code; this would also give us a DOI for the software. Posting code to repositories such as GitHub is (possibly) better than using your lab's website, but people can remove GitHub repositories! Zenodo can take a GitHub repository and make it (almost) permanent.

The Zenodo policies page makes it clear that even in the 'exceptional' event that a research object is removed, they still keep the DOI URL working and this page will state what happened to the code:

Withdrawal
If the uploaded research object must later be withdrawn, the reason for the withdrawal will be indicated on a tombstone page, which will henceforth be served in its place. Withdrawal is considered an exceptional action, which normally should be requested and fully justified by the original uploader. In any other circumstance reasonable attempts will be made to contact the original uploader to obtain consent. The DOI and the URL of the original object are retained.

There is a great guide on GitHub for how you can make your code citable and archive a repository with Zenodo.
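
As a rough illustration of what the archiving step can involve, here is a minimal Python sketch that builds the kind of metadata file that Zenodo's GitHub integration can read from the top level of a repository. This rests on assumptions: the `.zenodo.json` file name and the field names (`title`, `upload_type`, `creators`, and so on) reflect my understanding of Zenodo's deposit metadata and should be checked against the Zenodo documentation. Every value shown is a hypothetical placeholder, and `render_zenodo_metadata` is just a helper name made up for this example.

```python
import json

# Hypothetical placeholder metadata; replace every value with your own details.
# Field names are assumed to follow Zenodo's deposit metadata conventions.
metadata = {
    "title": "example-tool: a hypothetical bioinformatics package",
    "description": "Source code archived alongside a manuscript.",
    "upload_type": "software",
    "creators": [
        {"name": "Author, Example", "affiliation": "Example University"},
    ],
    "keywords": ["bioinformatics", "genome assembly"],
}


def render_zenodo_metadata(record):
    """Return the metadata as pretty-printed JSON text."""
    return json.dumps(record, indent=2, sort_keys=True) + "\n"


# Print the text that would be saved as .zenodo.json in the repository root.
print(render_zenodo_metadata(metadata))
```

Saving that output in the repository before making a release means the archived copy carries a sensible title and author list rather than whatever can be guessed from the repository itself.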

We have not yet reached peak CEGMA

I was alerted to some disturbing news this weekend: CEGMA won't die!

CEGMA is a tool that I helped develop back in 2005. The first formal publication that describes CEGMA came out in 2007, and since then it has seen year-on-year growth in the number of citations to this paper.

I keep on thinking that this trend must end soon, and I was therefore hopeful that 2014 might have been the year of peak CEGMA. There were three reasons why I thought this might happen:

  1. CEGMA is no longer being developed or supported
  2. I have used the PubMed page for the CEGMA paper to advocate that people should no longer use this tool
  3. CEGMA is heavily reliant on an — increasingly out-of-date — database of orthologs that was published in 2003

However, despite my best wishes, Google Scholar has revealed to me that 2015 has now seen more citations to the CEGMA paper than in any previous year:

CEGMA citation details from Google Scholar

I'm hopeful that the development of the BUSCO software by Felipe Simão et al. will mean that 2015 will definitely be the year of peak CEGMA!

BUSCO — the tool that will hopefully replace CEGMA — now has a plant-specific dataset

With the demise of CEGMA I have previously pointed people towards BUSCO. This tool replicates most of what CEGMA did but seems to be much faster and requires fewer dependencies. Most importantly, it is also based on a much more updated set of orthologous genes (OrthoDB) compared to the aging KOGs database that CEGMA used.

The full publication of BUSCO appeared today in the journal Bioinformatics. I still haven't tried using the tool, but one critique that I have seen by others is that there are no plant-specific datasets of conserved genes to use with BUSCO. This appears to be something that the developers are aware of, because the BUSCO website now indicates that a plant dataset is available (though you have to request it).

Goodbye CEGMA, hello BUSCO!

Some history

CEGMA (Core Eukaryotic Genes Mapping Approach) is a bioinformatics tool that I helped develop about a decade ago. It led to a paper that outlined how you could use CEGMA to find a small number of highly conserved genes in any new genome sequence that might be devoid of annotation. Once you have even a handful of genes, then you can use that subset to train a gene finder in order to annotate the rest of the genome.

It was a bit of a struggle to get the paper accepted by a journal, and so it didn't appear until 2007 (a couple of years after we had started work on the project). For CEGMA to work it needed to use a set of orthologous genes that were highly conserved across different eukaryotic species. We used the KOGs database (euKaryotic Orthologous Groups) that was first described in 2003. So by the time of our publication we were already using a dataset that was a few years old.

The original CEGMA paper did not attract much attention but we subsequently realized that the same software could be used to broadly assess how complete the 'gene space' was of any published genome assembly. To do this, we defined a subset of core genes that were the most highly conserved and which tended to be single-copy genes. The resulting 2009 paper seemed to generate a lot of interest in CEGMA and citations to the original paper have increased every year since (139 citations in 2014).

This is good news except:

  1. CEGMA can be a real pain to install due to its dependency on many other tools (though we've made things easier)
  2. CEGMA has been very hard to continue developing. The original developer left our group about 7 years ago and he was the principal software architect. I have struggled to keep CEGMA working and updated.
  3. CEGMA continues to generate a lot of support email requests (that end up being dealt with by me).

We have no time or resources to devote to CEGMA but the emails keep on coming. It's easy to envisage many ways in which CEGMA could be improved and extended; we submitted a grant proposal to do this but it was unsuccessful. One planned aspect of 'CEGMA v3' was to replace the reliance on the aging KOGs database. Another aspect of the new version of CEGMA would be to develop clade-specific sets of core genes. If you are interested in plant genomes, we should be able to develop a much larger set of plant-specific core genes.

And so?

Today I draw a line in the sand and say…

CEGMA is dead

CEGMA had a good life and we shall look back with fond memories, but its time has passed. Please don't grieve (and don't send flowers), but be thankful that people will no longer have to deal with the software-dependency-headache of trying to get Genewise working on Ubuntu Linux.

But what now?

The future of CEGMA has arrived and it's called BUSCO.

  • BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

This new tool (Benchmarking Universal Single-Copy Orthologs) has been developed by Felipe A. Simão, Robert Waterhouse, Panagiotis Ioannidis, Evgenia Kriventseva, and Evgeny Zdobnov. You can visit the BUSCO website, have a read of the manual, or look at the supplementary online material (this is set out like a paper…I don't think the tool has been formally published yet though). Here is the first section from that supplementary material:

Benchmarking Universal Single-Copy Orthologs (BUSCO) sets are collections of orthologous groups with near-universally-distributed single-copy genes in each species, selected from OrthoDB root-level orthology delineations across arthropods, vertebrates, metazoans, fungi, and eukaryotes (Kriventseva, et al., 2014; Waterhouse, et al., 2013). BUSCO groups were selected from each major radiation of the species phylogeny requiring genes to be present as single-copy orthologs in at least 90% of the species; in others they may be lost or duplicated, and to ensure broad phyletic distribution they cannot all be missing from one sub-clade. The species that define each major radiation were selected to include the majority of OrthoDB species, excluding only those with unusually high numbers of missing or duplicated orthologs, while retaining representation from all major sub-clades. Their widespread presence means that any BUSCO can therefore be expected to be found as a single-copy ortholog in any newly-sequenced genome from the appropriate phylogenetic clade (Waterhouse, et al., 2011). A total of 38 arthropods (3,078 BUSCO groups), 41 vertebrates (4,425 BUSCO groups), 93 metazoans (1,008 BUSCO groups), 125 fungi (1,438 BUSCO groups), and 99 eukaryotes (431 BUSCO groups), were selected from OrthoDB to make up the initial BUSCO sets which were then filtered based on uniqueness and conservation as described below to produce the final BUSCO sets for each clade, representing 2,675 genes for arthropods, 3,023 for vertebrates, 843 for metazoans, 1,438 for fungi, and 429 for eukaryotes. For bacteria, 40 universal marker genes were selected from (Mende, et al., 2013).
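
To make the central selection rule in that passage a little more concrete, here is a minimal Python sketch of the 'single-copy in at least 90% of species' filter. This is my own illustration and not the BUSCO authors' code: the function name, the data layout, and the toy species and group identifiers are all hypothetical. It also deliberately leaves out the additional requirement that a group cannot be missing from an entire sub-clade, as well as the later filtering on uniqueness and conservation.

```python
from typing import Dict, List


def select_single_copy_groups(
    copy_counts: Dict[str, Dict[str, int]],
    species: List[str],
    min_fraction: float = 0.9,
) -> List[str]:
    """Return IDs of groups that are single-copy in at least min_fraction of species."""
    if not species:
        return []
    selected = []
    for group_id, counts in copy_counts.items():
        # A species with no entry is treated as having lost the gene (0 copies).
        n_single_copy = sum(1 for sp in species if counts.get(sp, 0) == 1)
        if n_single_copy / len(species) >= min_fraction:
            selected.append(group_id)
    return selected


# Toy data: ten hypothetical species and three hypothetical orthologous groups.
species = [f"sp{i}" for i in range(1, 11)]
copy_counts = {
    # Single-copy in all ten species.
    "OG_A": {sp: 1 for sp in species},
    # Duplicated in one species, so single-copy in nine of ten.
    "OG_B": {**{sp: 1 for sp in species[:9]}, "sp10": 2},
    # Lost in one species and triplicated in another: eight of ten.
    "OG_C": {**{sp: 1 for sp in species[:8]}, "sp9": 0, "sp10": 3},
}

selected = select_single_copy_groups(copy_counts, species)
print(selected)  # ['OG_A', 'OG_B']
```

The point of the threshold is that a group like OG_B, which is duplicated in a single species, still qualifies, whereas OG_C falls below 90% and is dropped.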

BUSCO seems to do everything that we wanted to include in CEGMA v3 and it is based on OrthoDB, a resource that has generated a new set of orthologs (developed by the same authors). The online material includes a comparison of BUSCO to CEGMA, and also outlines how BUSCO can be much quicker than CEGMA (depending on what set of orthologs you use).

DISCLAIMER: I have not installed, tested, or analyzed BUSCO in any way. I make no promises as to its performance, but they seem to have gone about things in the right way.