DOGMA: a new tool for assessing the quality of proteomes and transcriptomes

A new tool, recently published in Nucleic Acids Research, caught my eye this week:

The tool, developed by a team from the University of Münster, uses protein domains and domain arrangements to assess the 'completeness' of a proteome or transcriptome. From the abstract…

Even in the era of next generation sequencing, in which bioinformatics tools abound, annotating transcriptomes and proteomes remains a challenge. This can have major implications for the reliability of studies based on these datasets. Therefore, quality assessment represents a crucial step prior to downstream analyses on novel transcriptomes and proteomes. DOGMA allows such a quality assessment to be carried out. The data of interest are evaluated based on a comparison with a core set of conserved protein domains and domain arrangements. Depending on the studied species, DOGMA offers precomputed core sets for different phylogenetic clades

Unlike CEGMA and BUSCO, which run against unannotated assemblies, DOGMA first requires a set of gene annotations. The paper focuses on the web server version of DOGMA but you can also access the source code online.

It's good to see that other groups are continuing to look at new ways of assessing the quality of large genome/transcriptome/proteome datasets.

What's in a name?

Initially, I thought the name was just a word that both echoed 'CEGMA' and nodded to the central dogma of molecular biology. Hooray, I thought: a bioinformatics tool with a regular word as a name, one that doesn't rely on a contrived acronym.

Then I saw the website…

  • DOGMA: DOmain-based General Measure for transcriptome and proteome quality Assessment

This is even more tenuous than the older, unrelated version of DOGMA:

  • DOGMA: Dual Organellar GenoMe Annotator

Beyond Generations: My Vocabulary for Sequencing Tech

Many writers have attempted to divide Next Generation Sequencing into Second Generation Sequencing and Third Generation Sequencing. Personally, I don't think this is helpful; it just confuses matters. I'm not the biggest fan of the term Next Generation Sequencing (NGS) to start with: like "post-modern architecture" (or heck, "modern architecture"), it isn't future-proofed.

Keith Robison gives an interesting deep dive on how sequencing technologies have been named and potentially could be named.

This post reminded me of my previous takes on the confusing and inconsistent labelling of these technologies:

Reflections on the 2019 Festival of Genomics conference in London


For the third year in a row, I attended the Festival of Genomics conference in London. This year saw the conference change venue, moving from the ExCel Arena to the Business Design Centre in Islington.

The new venue was notably smaller, leading to many sessions being heavily overcrowded. There were also fewer 'fun' activities compared to previous years: no graffiti wall and no recharging stations (massage stands and power points for phones).

The opening keynote was given by Professor Mark Caulfield (Chief Scientist at Genomics England).

From 100K to 500K

Reflecting on the completion of the 100,000 Genomes Project, Professor Caulfield revealed that the 100,000th genome was completed at 2:40 am on the 2nd December.

He also shared that, at its peak, the project was completing 6,000 genomes a month, and that the total has now reached 103,311 genomes.

The next phase will see 500,000 genomes completed within the NHS over the next five years, with an 'ambition' to go on to sequence five million genomes.

Looking at the global picture of human genome sequencing, Professor Caulfield projected that there will be 60 million completed genomes by 2023.

I wrote more about the conference in a blog post for The Institute of Cancer Research:

Damn and blast…I can't think of what to name my software


As many people have pointed out on Twitter this week, there is a new preprint on bioRxiv that merits some discussion:

The full name of the test that is the subject of this article is the Bron/Lyon Attention Stability Test. You have to admit that 'BLAST' is a punchy and catchy acronym for a software tool.

It's just a shame that it is also an acronym for another piece of software that you may have come across.

It's a bold move to give your software the same name as another tool that has 'only' been cited at least 135,000 times!

This is not the first, nor will it be the last, example of duplicate names in bioinformatics software, many of which I have written about before.