We have not yet reached peak CEGMA

I was alerted to some disturbing news this weekend: CEGMA won't die!

CEGMA is a tool that I helped develop back in 2005. The first formal publication describing CEGMA came out in 2007, and citations to that paper have grown year on year ever since.

I keep on thinking that this trend must end soon, and I was therefore hopeful that 2014 might have been the year of peak CEGMA. There were three reasons why I thought this might happen:

  1. CEGMA is no longer being developed or supported
  2. I have used the PubMed page for the CEGMA paper to advocate that people should no longer use this tool
  3. CEGMA is heavily reliant on an — increasingly out-of-date — database of orthologs that was published in 2003

However, despite my best wishes, Google Scholar has revealed to me that 2015 has now seen more citations to the CEGMA paper than in any previous year:

[Figure: CEGMA citation details from Google Scholar]

I'm hopeful that the development of the BUSCO software by Felipe Simão et al. will mean that 2015 will definitely be the year of peak CEGMA!

BUSCO — the tool that will hopefully replace CEGMA — now has a plant-specific dataset

With the demise of CEGMA I have previously pointed people towards BUSCO. This tool replicates most of what CEGMA did but seems to be much faster and has fewer dependencies. Most importantly, it is also based on a much more up-to-date set of orthologous genes (OrthoDB) than the aging KOGs database that CEGMA used.

The full publication of BUSCO appeared today in the journal Bioinformatics. I still haven't tried using the tool, but one critique that I have seen from others is that there are no plant-specific datasets of conserved genes to use with BUSCO. The developers appear to be aware of this, because the BUSCO website now indicates that a plant dataset is available (though you have to request it).
