What’s in a (gene) name?

Screengrab from WormBase database showing exon and intron structure of a gene called PPA52350.

A chance conversation this week gave me a reason to check in on everybody’s favourite model organism database for nematodes…WormBase. Over two decades ago I spent four years of my life as a project manager for the UK arm of WormBase. Based at the Sanger Institute near Cambridge, we partnered with three groups in the USA (CalTech, CSHL and WashU) to maintain and develop the database that was used by thousands of nematode researchers around the world.

At the heart of WormBase was genetic and genomic data for the model organism Caenorhabditis elegans. This was the first animal to have its genome mapped and then sequenced. More impressively, the work to accurately fill in every last gap of the genome continued for many years after the formal publication of the genome sequence.

The 1998 Science publication described just over 19,000 protein-coding genes and a few hundred non-coding RNA genes.

I joined WormBase in 2001 and within a year or so I was tasked with developing what would frequently be referred to as simply ‘the new gene model’. Prior to any genome sequencing, there were many genes in C. elegans that had been defined by over two decades of classical genetic mapping approaches.

These genes were named in a simple, but consistent, way with three letters and then a number (this would necessarily have to be expanded to four letters in later years). E.g. unc-10 was the 10th gene named in the unc gene family (referring to UNCoordinated movements in worms with that mutation).
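The naming convention is regular enough to check mechanically. Here is a minimal Python sketch, assuming a name is three or four lowercase letters, a hyphen and a positive number; the pattern and the function name `parse_classical_name` are my own illustration and not anything WormBase actually ran.

```python
import re

# Assumed pattern: 3 or 4 lowercase letters, a hyphen, then a positive integer.
CLASSICAL_NAME = re.compile(r"^([a-z]{3,4})-([1-9][0-9]*)$")


def parse_classical_name(name):
    """Split a name such as 'unc-10' into ('unc', 10); return None if it does not fit."""
    match = CLASSICAL_NAME.match(name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_classical_name("unc-10"))   # ('unc', 10)
print(parse_classical_name("T10A3.1"))  # None: not a classical-style name
```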

Enter the world of gene and genome sequencing

As the worm genome project began, genes would gain new identifiers based on in silico gene predictions made against the sequences of the cosmids, fosmids and YACs that would comprise the genome. So the first gene prediction on cosmid clone ‘T10A3’ would be named ‘T10A3.1’, the next adjacent gene prediction would be ‘T10A3.2’ and so on. Alternative splicing further complicated this nomenclature but it was relatively easy to append ‘a’, ‘b’, ‘c’ etc. for splice variants. Luckily (for WormBase) I think 25 splice variants were the most ever discovered (see egl-8 in WormBase).

These sequence-based identifiers would sometimes become confusing when a new gene was identified between the .1 and .2 gene predictions within a cosmid, fosmid or YAC. This messed up the original co-linearity of gene identifier with genomic location. E.g., you might end up with genes F46H5.3 and F46H5.4 flanking a newly discovered gene which would gain a number such as F46H5.12 (you would use the next available number for that cosmid/fosmid/YAC).
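The ‘next available number’ rule can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `next_gene_name` and the list of existing predictions are hypothetical, and the real curation process involved people rather than a script like this.

```python
def next_gene_name(clone, existing_names):
    """Return the next unused sequence name on a clone, e.g. 'F46H5.12'."""
    prefix = clone + "."
    used = []
    for name in existing_names:
        if name.startswith(prefix):
            # Drop any trailing splice-variant letter before reading the number.
            number = name[len(prefix):].rstrip("abcdefghijklmnopqrstuvwxyz")
            used.append(int(number))
    return f"{clone}.{max(used, default=0) + 1}"


# Hypothetical state of the clone: predictions .1 to .11 already exist.
existing = [f"F46H5.{n}" for n in range(1, 12)]
print(next_gene_name("F46H5", existing))  # F46H5.12
```

Note that the new name says nothing about where the gene sits on the clone, which is exactly how the co-linearity was lost.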

When two genes go to war

The situation just continued to get more confusing. It turns out that a lot of computational gene predictions were wrong. Fixing them might mean removing a gene or (more commonly) merging two genes into one new gene structure, where the original genes might now become splice variants. So genes C55B6.4 and C55B6.5 might become C55B6.4a and C55B6.4b (I’m using some made-up examples here, but I recall it was somewhat arbitrary which of the two merged gene identifiers would survive and which would die).

As genes were originally predicted on each assembled cosmid/fosmid/YAC sequence, and as these sequences overlapped — a necessary aspect of how the genome project was completed — it was possible that the same gene might be predicted independently on two overlapping cosmids/fosmids/YACs (and hence initially gain two unrelated gene identifiers).

Then of course there was the frequent situation where a classically defined gene would be matched with its sequence-identified counterpart. Hence, from my earlier example, the unc-10 gene could also be known as T10A3.1.

Both names needed to be searchable and take you to the same entry in the database. But even more confusingly, there would be many more names that might exist. This could be because two researchers had independently mapped/published the same gene at different times, or just because some people would give the gene their own name without first referring to the literature. Generally speaking, the worm community is very good at avoiding this sort of thing but it did occasionally happen.

So unc-10 is not only the same gene as T10A3.1 but it could also be referred to as rim-1 or CELE_T10A3.1. I love that the WormBase database has preserved so many ‘curatorial remarks’ left by WormBase staff (myself included) as we tried grappling with how to merge genes and deal with other anomalies. E.g. here is a remark about the twk-18 gene:

“There are two twk-18 loci. One is CGC-approved (C24A3.6) and one is CGC-unapproved, which became an other-name of unc-58(T06H11.1a/b)”

This reflects yet another level of complexity where two different researchers or groups had used the same gene name for different genes.
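To make the problem concrete, here is a rough Python sketch of a name index in which every name points to a set of genes rather than to a single gene. The gene names come from the examples in this post; the data structure, the choice of sequence name as the key, and the function name `build_name_index` are assumptions for illustration, not how ACeDB stored things.

```python
from collections import defaultdict


def build_name_index(genes):
    """Map every name (primary key or other name) to the set of gene keys using it."""
    index = defaultdict(set)
    for key, names in genes.items():
        index[key].add(key)
        for name in names:
            index[name].add(key)
    return index


# Keyed by sequence name purely for illustration.
genes = {
    "T10A3.1": ["unc-10", "rim-1", "CELE_T10A3.1"],
    "C24A3.6": ["twk-18"],
    "T06H11.1": ["unc-58", "twk-18"],
}
index = build_name_index(genes)
print(sorted(index["rim-1"]))   # ['T10A3.1']
print(sorted(index["twk-18"]))  # ['C24A3.6', 'T06H11.1']
```

Several names resolving to one gene is harmless; one name resolving to two genes is the case that needs a curator.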

Enter ‘the new gene model’

So it became my job to work out how we could:

  1. Create a new systematic identifier for every gene (yes, this reminds me a bit of that XKCD comic)
  2. Roll this out across all of our partner groups in a way that didn’t break anything

This was a challenging problem but we eventually got there and all genes gained a new stable identifier that acted (hopefully) as a way to bring some order to how we (and others) worked with genes. These identifiers were ultimately rolled out in 2004 and an important part of ‘the new gene model’ was to allow a better way of capturing future gene births, deaths, mergers, and splits.
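As a rough illustration of what capturing births, deaths and mergers means in practice, here is a small Python sketch. The class, the event names and the identifiers are all hypothetical stand-ins; the point is only that an identifier is never reused, and that every change is recorded as a numbered, dated event on the gene.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    gene_id: str
    live: bool = True
    history: list = field(default_factory=list)  # (version, date, event, detail)

    def record(self, date, event, detail=""):
        """Append an event, numbering versions from 1."""
        version = len(self.history) + 1
        self.history.append((version, date, event, detail))


def merge_genes(survivor, absorbed, date):
    """Retire `absorbed` and note the merger on both genes."""
    absorbed.live = False
    absorbed.record(date, "Merged_into", survivor.gene_id)
    survivor.record(date, "Acquires_merge", absorbed.gene_id)


# Hypothetical identifiers and dates, for illustration only.
gene_a = Gene("GeneA")
gene_b = Gene("GeneB")
gene_a.record("2004-01-01", "Created")
gene_b.record("2004-01-01", "Created")
merge_genes(gene_a, gene_b, "2005-06-01")

print(gene_b.live)         # False
print(gene_b.history[-1])  # (2, '2005-06-01', 'Merged_into', 'GeneA')
```

The dead gene keeps its identifier and its history, so anyone holding an old identifier can still find out what happened to it.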

WormBase was based on ACeDB (A C. elegans DataBase), a bespoke database that was originally created by Richard Durbin and Jean Thierry-Mieg for the express purpose of managing C. elegans genetic data (it would go on to be used for many other organisms).

In ACeDB you can always see the underlying model for any object by switching to something called the ‘Tree Display’. Amazingly (to me anyway), you can still do this in WormBase to the present day. Buried within the Tree Display of a gene is the original version control information that we added when we first migrated everything to ‘the new gene model’:

Screengrab from Tree Display view of a gene. Text in columns starts with 'Version_change' and includes a date column, a person identifier (WBPerson1971) and then a break down of an 'Event' which explains the detail of how the first import

Screengrab from Tree Display view of a gene in WormBase showing the Version change information for a gene.

In this case ‘geneace’ was the local ACeDB database that we used at the Sanger Institute to store information regarding all of the classically mapped genes. I was (and perhaps still am?) ‘WBPerson1971’; this means I am forever indirectly associated with most of the initial gene set of C. elegans in WormBase! This now brings me to the fundamental point that I wanted to make with this very long blog post…

What was the format of the new gene identifier?

I remember wanting to borrow from the format that the Ensembl genome database had established. Gene identifiers should have a fixed width by using leading zeroes to pad out the identifier (this makes it much friendlier to computer programs that have to process such data). I also wanted there to be a simple text prefix to the identifier (Ensembl has gene IDs such as ENSG00000139618: ENSG = ENSembl Gene).

I went with ‘WBGene’ for the prefix part which just left the question of how many digits to reserve for the gene space. At the time I was working on this I think the genome of a related nematode C. briggsae had been finished. As I recall, WormBase contained that data as well as some very limited data for a few other nematodes (including some classically mapped genes as well as many sequence derived genes). So I knew that WormBase would grow beyond the 20,000 or so genes there were at the time of the C. elegans genome publication.

I opted for eight digits in the identifier, allowing for 99,999,999 genes. At the time this did seem like overkill, but I would rather err on the side of caution than give future bioinformaticians major headaches. What if we had gone for a five-digit number and then it turned out that we needed to store over 100,000 genes?

All of this meant that our unc-10 gene from before could now also be known as WBGene00006750.
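For anyone who wants to see the format spelled out, here is a minimal Python sketch of building and reading such identifiers. The function names are mine, and I am assuming numbering starts at 1; nothing here is taken from WormBase code.

```python
PREFIX = "WBGene"
WIDTH = 8  # eight zero-padded digits


def format_gene_id(number):
    """Turn 6750 into 'WBGene00006750'."""
    if not 0 < number < 10 ** WIDTH:
        raise ValueError(f"gene number out of range: {number}")
    return f"{PREFIX}{number:0{WIDTH}d}"


def parse_gene_id(gene_id):
    """Turn 'WBGene00006750' back into 6750, rejecting anything malformed."""
    digits = gene_id[len(PREFIX):]
    well_formed = (
        gene_id.startswith(PREFIX)
        and len(digits) == WIDTH
        and digits.isascii()
        and digits.isdigit()
    )
    if not well_formed:
        raise ValueError(f"not a valid gene identifier: {gene_id!r}")
    return int(digits)


print(format_gene_id(6750))             # WBGene00006750
print(parse_gene_id("WBGene00311061"))  # 311061

# Fixed width means a plain text sort is also a numeric sort.
print(sorted(["WBGene00311061", "WBGene00006750"]))
```

That last line is the practical payoff of the leading zeroes: tools that know nothing about genes still put the identifiers in the right order.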

An end to WormBase and an end to ‘the new gene model’

This week I wanted to have a look to see what the highest gene identifier was that had been added to WormBase. In taking a look at the website, I was saddened to see that WormBase came to an end in July 2025 with the 298th release of the database.

Thankfully, most of the data will live on in the Alliance of Genome Resources which is a consortium of seven model organism databases. It’s not clear to me whether the alliance will end up with yet another tier of identifier that will span all of the species that are represented. I note that unc-10 in their database currently gains a slight tweak with an extra prefix of ‘WB:’ to become WB:WBGene00006750 (presumably because they have the extra challenge of needing to know which database an identifier came from).
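Splitting such a prefixed identifier back into its source database and local identifier is straightforward. This is a small sketch under the assumption that the first colon is the separator; the function name `split_prefixed_id` is hypothetical and this is not the Alliance’s own code.

```python
def split_prefixed_id(identifier):
    """Split 'WB:WBGene00006750' into ('WB', 'WBGene00006750')."""
    source, separator, local_id = identifier.partition(":")
    if not separator or not source or not local_id:
        raise ValueError(f"expected 'SOURCE:ID', got {identifier!r}")
    return source, local_id


print(split_prefixed_id("WB:WBGene00006750"))  # ('WB', 'WBGene00006750')
```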

I wonder whether the new gene model will live on in the Alliance of Genome Resources and whether those eight digits of reserved identifier space will continue to fill up. Given how easy (and relatively cheap) it has become to sequence genomes these days, someone might decide they want to sequence the genomes of all nematodes (there are about 25,000 described species, but there could be many more).

The last gene in WormBase

And so I can finally bring this post to a close with the reveal of the last gene identifier that made it into the last release of WormBase:

WBGene00311061 (also known as PPA52350) is a protein-coding gene from the nematode Pristionchus pacificus. It is the highest-numbered gene that I could find and it reflects the fact that the gene count of WormBase has increased over 15-fold since my time there.

I bet this is due to a lot of other species being added, but also due to an explosion in RNA genes in C. elegans. I’m glad that my choice of gene identifier all those years ago has survived.

In writing this blog post, it’s been a lot of fun to take this trip down memory lane. In my research for this post, I came across a recent video (March 2026) by the Alliance of Genome Resources which explains much more about how worm and fly genes are named. Tim Schedl is the gene name curator for the WormBase data within the Alliance dataset; he took over from Jonathan Hodgkin, with whom I worked very closely during my time at WormBase.

If there are any lessons to be learned from all of this (especially for any young bioinformaticians out there), I would say:

  1. It’s hard to design databases for biological data. Biology is messy and will produce surprises that might only emerge years after you define a schema that you thought would capture all possible edge cases (biology will then laugh at your schema)
  2. If you can, try to future proof things as much as possible
  3. Do not - under any circumstances - allow asterisks or question marks to be part of a valid gene name! There were originally some genes in the geneace database that included asterisks which caused all manner of problems in ACeDB as asterisks were also used as a wildcard search operator.
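The asterisk problem in the last point is easy to reproduce. The sketch below uses Python’s `fnmatch` module as a stand-in for a wildcard search; it is not ACeDB, and the gene names are made up.

```python
import fnmatch

# Hypothetical gene names, one of which contains a literal asterisk.
names = ["abc-1", "abc-1*", "abc-12"]


def search(pattern):
    """Case-sensitive wildcard search over the hypothetical names."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


# Intended: fetch the single gene literally called 'abc-1*'.
# Actual: the asterisk acts as a wildcard and every name matches.
print(search("abc-1*"))    # ['abc-1', 'abc-1*', 'abc-12']

# Wrapping the asterisk in a character class makes fnmatch treat it literally.
print(search("abc-1[*]"))  # ['abc-1*']
```

Every piece of software that touches the name has to remember to escape it, which is why it is far simpler to ban such characters from names in the first place.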

Transcriptional noise, isoform prediction, and the utility of mass spec data in gene annotation

The human genome may be 80% functional or 8.2% functional. Maybe it's 93.7% functional or only 6.1%. I guess that all we know for sure is that it is not 0% functional (although my genome on a Monday morning may provide evidence to the contrary).

Transcript data can be used to ascribe some sort of functionality to a genome and, in an ideal world, we would sequence full-length cDNAs for every gene. But in the less-than-ideal world we often end up sequencing lots of small bits of RNA using an ever-changing set of technologies. ESTs, SAGE, CAGE, RACE, MPSS, and RNA-Seq have all been used to provide evidence for where genes are and how highly they are being expressed.

Having some transcriptional evidence is (usually) better than not having any transcriptional evidence, but it doesn't necessarily imply functionality. A protein-coding gene that is transcribed may not be translated. Transcript data is used in gene annotation to add new genes, especially in the case of a first-pass annotation of a new genome. But in established genomes, it is probably used more to annotate transcript isoforms (e.g. splice variants). This can lead to a problem for the end users of such data…how to tell if all isoforms are equally likely?

Consider the transcript data for the rpl-22 gene in Caenorhabditis elegans. This gene has two annotated splice variants and there is indeed EST evidence for both variants, but it is a little bit unbalanced:

This gene encodes the large ribosomal subunit protein…a pretty essential protein! Notice how the secondary isoform (shown on top) a) encodes a much shorter protein and b) has very little transcript evidence. In my mind, this secondary isoform is the result of 'transcriptional noise'. Maybe a couple of lucky ESTs captured the transcript in the process of heading towards destruction via nonsense-mediated decay? It seems highly unlikely that this secondary transcript gives rise to a functional protein, though someone who is new to viewing data like this might initially consider each isoform as equally valid.

If we turn on some additional data tracks to look at protein homology to human (shown in orange) and mass spectrometry data from C. elegans (shown in red) it becomes clear that all of the evidence is really pointing towards just one functional isoform:

Indeed mass spec data has the potential to really clean up a lot of noisy gene annotations. In light of this I was very happy to see this new paper published in the Journal of Proteome Research (where talented up-and-coming scientists publish!):

Pooling data from 8 mass spec analyses of human data, the authors attempted to see how much protein support there was for the different annotated isoforms of the human genome. They could reliably map peptides to about two-thirds of the protein-coding genes from the GENCODE 20 gene set (Ensembl 76). What did they find?

We identified alternative splice isoforms for just 246 human genes; this clearly suggests that the vast majority of genes express a single main protein isoform.

They also found that the mass spec data was not always in agreement with the dominant isoforms that can be predicted from RNA-Seq data:

…part of the reason for this will be that more RNAseq reads map to longer sequences, it does suggest that either transcript expression is very different from protein expression for many genes or that transcript reconstruction methods may not be interpreting the RNAseq reads correctly.

The headline conclusion that mass spec evidence only supports alternate isoforms for 1.2% of human genes is thought-provoking. It suggests to me that we should be careful about relying too heavily on gene annotations which describe large numbers of isoforms mostly on the basis of transcript data. Paraphrasing George Orwell:

All isoforms are equal, but some isoforms are more equal than others

Google and WormBase: these are not the search results you're looking for

Today I wanted to look up a particular gene in the WormBase database. Rather than go to the WormBase website, I thought I would just search Google for the word 'wormbase' followed by the gene name (rpl-22). Surely this would be enough to put the Gene Summary page for rpl-22 at the top of the results?

Sadly no. Here are the results that I was presented with:

All ten of these results include information from the WormBase database regarding the rpl-22 gene and/or link to the WormBase page for the gene. But there are no search results for wormbase.org at all.

Very odd. Is WormBase not allowing itself to be indexed by search engines? I see a similar lack of wormbase.org results when using Bing, Ask, or DuckDuckGo. However, if I search Google for flybase rpl-22 or pombase rpl-22 I find the desired fly/yeast orthologs of the worm rpl-22 gene as the top Google result.

When is a citation not a citation?

Today I received a notification from Google Scholar that one of my papers had been cited. I often have a quick look at such papers to see how our work is being referenced. The article in question was from the Proceedings of the 3rd Annual Symposium on Biological Data Visualization: Data Analysis and Redesign Contests:

FixingTIM: interactive exploration of sequence and structural data to identify functional mutations in protein families

The paper describes a tool that helps "identify protein mutations across a family of structural models and to help discover the effect of these mutations on protein function". I was a bit surprised by this because this isn't a topic that I've published on. So I looked to see what paper of mine was being cited and how it was being cited. Here is the relevant sentence from the background section of the paper:

To improve the exploration process, many efforts have been made, from folding the sequences through classification [1,2], to tools for 3D view exploration [3] and to web-based applications which present large amounts of information to the users [4].

Citation number 2 is the paper on which I am a co-author:

  • Chen N, Harris TW, Antoshechkin I, Bastiani C, Bieri T, Blasiar D, Bradnam K, Canaran P, Chan J, Chen C, Chen WJ, Cunningham F, Davis P, Kenny E, Kishore R, Lawson D, Lee R, Muller H, Nakamura C, Pai S, Ozersky P, Petcherski A, Rogers A, Sabo A, Schwarz EM, Van Auken K, Wang Q, Durbin R, Spieth J, Sternberg PW, Stein LD: Wormbase: A comprehensive data resource for Caenorhabditis biology and genomics. Nucleic Acids Res 2005, 33(1):383-389.

The cited paper simply describes the WormBase database and includes only a passing reference to the fact that WormBase contains some links to protein structures (when known), but that's about it. The WormBase paper doesn't mention 'folding' or 'classification' anywhere, which makes it seem a really odd choice of paper to be cited. It makes me wonder how many other papers end up gaining seemingly spurious citations like this one.