Another CEGMA post: KOGs vs CEGs and 458 vs 248

I posted another answer about CEGMA on seqanswers.com last week. I thought I'd cover this in a little more detail here (note, questions edited from how they originally appeared):

Question 1: CEGMA uses a 'kogs.fa' file, containing 2,748 proteins, to compare to a user's genome sequence. These KOGs define a set of 458 core eukaryotic genes (CEGs). Some CEGMA publications report how many of the 458 CEGs are present, others list results from the 248 most highly conserved CEGs. Does anyone know why kogs.fa is the default? Does it get 'curated' down to a smaller set during a CEGMA run?

The kogs.fa file represents a subset of the published set of 4,852 KOGs (euKaryotic Orthologous Groups). The KOGs database (which is still available online) describes protein groups that are present among seven different eukaryotes (not all groups are present in all species). We excluded data from the microsporidian Encephalitozoon cuniculi, as it is a parasite and may have an atypical protein complement, and focused on the 1,788 groups that were present in all of the remaining six species. We then applied various filtering criteria (see the methods in the original paper) to reduce this to 458 KOGs, renaming this subset as CEGs in the process. We also chose just one protein per species to represent each CEG.

So that's why our kogs.fa file contains 2,748 proteins (458 x 6). CEGMA tries to determine which of these 458 CEGs are present in your input file. It's worth pointing out that the original purpose of CEGMA was to find a handful of genes in a genome that may lack gene annotations. Someone could then use this small gene set to train a gene finder, and use that gene finder to annotate the entire genome.
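If you want to check the 458 x 6 arithmetic against a FASTA file yourself, here is a minimal sketch. It assumes that each FASTA header contains its KOG identifier somewhere in the header text (matched here by the pattern KOG followed by four digits); the function name and the sample headers are made up for illustration and are not taken from the real kogs.fa file.

```python
import re
from collections import Counter

# Assumption: each header line contains an ID of the form KOG + 4 digits.
KOG_ID = re.compile(r"KOG\d{4}")

def proteins_per_kog(fasta_lines):
    """Count FASTA records per KOG ID found in the header lines."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            match = KOG_ID.search(line)
            if match:
                counts[match.group()] += 1
    return counts

# Hypothetical records, invented purely to exercise the function.
sample = [
    ">protA___KOG0002", "MSEQ",
    ">protB___KOG0002", "MSEQ",
    ">protC___KOG0003", "MSEQ",
]
counts = proteins_per_kog(sample)
print(len(counts), "groups,", sum(counts.values()), "proteins")
```

Passing an open file handle for kogs.fa instead of the sample list should, going by the numbers above, report 458 groups and 2,748 proteins, with six proteins in every group.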

After CEGMA has found which of the 458 CEGs are present, it then performs its secondary role of assessing the completeness of the gene space. To do this, it only wants to use the most conserved and least paralogous of the 458 CEGs. Paralogy is a big issue here. The original KOGs database grouped together proteins when there were often many, many paralogs for each group. For example, KOG0001 corresponds to the Ubiquitin gene family. Here is how many proteins occur in each of the seven species that represent this KOG:

  • Arabidopsis thaliana - 28
  • Caenorhabditis elegans - 12
  • Drosophila melanogaster - 3
  • Encephalitozoon cuniculi - 1
  • Homo sapiens - 17
  • Saccharomyces cerevisiae - 2
  • Schizosaccharomyces pombe - 1

The high degree of paralogy from A. thaliana is one reason why this KOG is not included in our subset of 248 CEGs. In contrast, KOG0018, Structural maintenance of chromosome protein 1 (sister chromatid cohesion complex Cohesin, subunit SMC1), is included in the 248 CEGs:

  • Arabidopsis thaliana - 1
  • Caenorhabditis elegans - 4
  • Drosophila melanogaster - 1
  • Encephalitozoon cuniculi - 1
  • Homo sapiens - 3
  • Saccharomyces cerevisiae - 1
  • Schizosaccharomyces pombe - 1
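To put the two sets of counts side by side, here is a small sketch that summarizes paralogy across the six species CEGMA uses (E. cuniculi is left out, as described earlier). The summary statistics are my own choice for illustration; they are not the filtering criteria that CEGMA applies.

```python
# Protein counts per species, copied from the two lists above.
kog_counts = {
    "KOG0001": {
        "Arabidopsis thaliana": 28, "Caenorhabditis elegans": 12,
        "Drosophila melanogaster": 3, "Encephalitozoon cuniculi": 1,
        "Homo sapiens": 17, "Saccharomyces cerevisiae": 2,
        "Schizosaccharomyces pombe": 1,
    },
    "KOG0018": {
        "Arabidopsis thaliana": 1, "Caenorhabditis elegans": 4,
        "Drosophila melanogaster": 1, "Encephalitozoon cuniculi": 1,
        "Homo sapiens": 3, "Saccharomyces cerevisiae": 1,
        "Schizosaccharomyces pombe": 1,
    },
}

EXCLUDED = "Encephalitozoon cuniculi"

def paralogy_summary(counts):
    """Summarize copy numbers over the six retained species."""
    six = [n for species, n in counts.items() if species != EXCLUDED]
    return {
        "total": sum(six),
        "max": max(six),
        "single_copy_species": sum(1 for n in six if n == 1),
    }

for kog_id, counts in kog_counts.items():
    print(kog_id, paralogy_summary(counts))
```

KOG0001 has 63 proteins across the six species with a maximum of 28 in one species, whereas KOG0018 has 11 with a maximum of 4 and is single-copy in four of the six.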

This secondary role of CEGMA uses information in the completeness_cutoff.tbl file (inside the CEGMA data directory) to narrow the results for the 458 CEGs down to a subset of 248 CEGs. Because different filtering criteria are used, a CEG may be classed as present in the 458 CEG set, but not in the 248 CEG set, even if it was on the list of 248 candidate CEGs.

Question 2: CEGMA output includes many KOG IDs but no description of what protein name/function each KOG ID represents. This makes it not so useful for annotating new genomes. Is there a lookup table somewhere?

One of the reasons why we maintained KOG identifiers in the CEGMA output was so that people could, if so inclined, look up more information in the KOGs database. If you download the 'kog' file from the KOGs database, you will see each KOG has a one-line description. For example:

[O] KOG0019 Molecular chaperone (HSP90 family)
[KC] KOG0025 Zn2+-binding dehydrogenase (nuclear receptor binding factor-1)
[ZD] KOG0028 Ca2+-binding protein (centrin/caltractin), EF-Hand superfamily protein
[C] KOG0042 Glycerol-3-phosphate dehydrogenase
[T] KOG0044 Ca2+ sensor (EF-Hand superfamily)
[K] KOG0048 Transcription factor, Myb superfamily

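The letters inside square brackets represent various functional categories annotated by the KOGs database. A KOG can belong to more than one category, in which case several letters appear together (e.g. [KC]). The categories are as follows: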

INFORMATION STORAGE AND PROCESSING
 [J] Translation, ribosomal structure and biogenesis
 [A] RNA processing and modification
 [K] Transcription
 [L] Replication, recombination and repair
 [B] Chromatin structure and dynamics

CELLULAR PROCESSES AND SIGNALING
 [D] Cell cycle control, cell division, chromosome partitioning
 [Y] Nuclear structure
 [V] Defense mechanisms
 [T] Signal transduction mechanisms
 [M] Cell wall/membrane/envelope biogenesis
 [N] Cell motility
 [Z] Cytoskeleton
 [W] Extracellular structures
 [U] Intracellular trafficking, secretion, and vesicular transport
 [O] Posttranslational modification, protein turnover, chaperones

METABOLISM
 [C] Energy production and conversion
 [G] Carbohydrate transport and metabolism
 [E] Amino acid transport and metabolism
 [F] Nucleotide transport and metabolism
 [H] Coenzyme transport and metabolism
 [I] Lipid transport and metabolism
 [P] Inorganic ion transport and metabolism
 [Q] Secondary metabolites biosynthesis, transport and catabolism

POORLY CHARACTERIZED
 [R] General function prediction only
 [S] Function unknown
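For anyone who wants an actual lookup table, here is a minimal sketch that turns description lines like the ones shown earlier into a dictionary keyed by KOG ID. It assumes each description line has the form "[LETTERS] KOGnnnn description" and skips any line that does not match; the function and variable names are my own.

```python
import re

# Category letters as listed above.
CATEGORIES = {
    "J": "Translation, ribosomal structure and biogenesis",
    "A": "RNA processing and modification",
    "K": "Transcription",
    "L": "Replication, recombination and repair",
    "B": "Chromatin structure and dynamics",
    "D": "Cell cycle control, cell division, chromosome partitioning",
    "Y": "Nuclear structure",
    "V": "Defense mechanisms",
    "T": "Signal transduction mechanisms",
    "M": "Cell wall/membrane/envelope biogenesis",
    "N": "Cell motility",
    "Z": "Cytoskeleton",
    "W": "Extracellular structures",
    "U": "Intracellular trafficking, secretion, and vesicular transport",
    "O": "Posttranslational modification, protein turnover, chaperones",
    "C": "Energy production and conversion",
    "G": "Carbohydrate transport and metabolism",
    "E": "Amino acid transport and metabolism",
    "F": "Nucleotide transport and metabolism",
    "H": "Coenzyme transport and metabolism",
    "I": "Lipid transport and metabolism",
    "P": "Inorganic ion transport and metabolism",
    "Q": "Secondary metabolites biosynthesis, transport and catabolism",
    "R": "General function prediction only",
    "S": "Function unknown",
}

# Assumed line shape: [LETTERS] KOGnnnn free-text description
KOG_LINE = re.compile(r"^\[([A-Z]+)\]\s+(KOG\d{4})\s+(.+)$")

def parse_kog_descriptions(lines):
    """Map each KOG ID to (description, list of category names)."""
    lookup = {}
    for line in lines:
        match = KOG_LINE.match(line.strip())
        if match:  # lines of any other shape are ignored
            letters, kog_id, description = match.groups()
            names = [CATEGORIES.get(c, "Unknown category") for c in letters]
            lookup[kog_id] = (description, names)
    return lookup

sample = [
    "[O] KOG0019 Molecular chaperone (HSP90 family)",
    "[KC] KOG0025 Zn2+-binding dehydrogenase (nuclear receptor binding factor-1)",
]
lookup = parse_kog_descriptions(sample)
print(lookup["KOG0025"])
```

Each letter of a multi-letter code such as [KC] is expanded separately, so KOG0025 ends up tagged with both Transcription and Energy production and conversion.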

Maybe this is useful to someone. However, I would remind people that KOGs was published over a decade ago (and presumably the work to generate the KOGs database began in 2002, if not earlier). There were probably several gene annotations that were missing in the source genomes at that time, and many annotations have presumably since been updated (I bet many genes have had minor alterations to their structure).