Why the UCSC Genome Browser FTP site is one of my least favorite places to visit

If you visit the Golden Path directory of the UCSC Genome Browser FTP site (ftp://hgdownload.cse.ucsc.edu//apache/htdocs/goldenPath), you will come across the following quirks:

  1. Multiple genomes for the same species are not grouped together under a parent directory for each species, so the number of items in this directory (~250) gives no indication of the number of species represented (~125).
  2. Species identifiers are ambiguous. You have to know that 'mm9' refers to Mus musculus and not Macaca mulatta.
  3. Species identifiers are also inconsistent. Some species get just two lower-case characters (e.g. 'mm' = Mus musculus, 'dm' = Drosophila melanogaster) whereas most get six characters (e.g. 'felCat' = Felis catus, 'sacCer' = Saccharomyces cerevisiae).
  4. Humans, hallowed species that we are, simply get 'hg' (presumably for 'human genome').
  5. The six-character format reverses centuries (!) of naming convention by making the genus part of the name start with a lower-case character and the specific part of the name start with an upper-case character.
  6. Some species also have date-versioned directories in addition to numerical-suffixed directories. So do you want to download the 'hg7' version of the human genome or instead get the 'hg7oct2000_oo21' (don't ask me what the 'oo_21' part means)?

If you want a challenge, try writing some bioinformatics software that goes from the Latin name for a species to the correct directory on their FTP site! I guess the UCSC team are going to hope that six characters is enough to uniquely identify any future species that end up here. So I hope they don't start sequencing too many more Drosophila species.
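To make the challenge concrete, here is a minimal Python sketch of the two naming schemes described above. The function names are hypothetical, and the rules are only inferred from the examples in this post ('felCat', 'sacCer', 'mm', 'dm'); this is not UCSC's actual procedure, and no rule here could ever produce 'hg' for humans.

```python
def six_char_prefix(latin_name: str) -> str:
    """Guess a 'felCat'-style prefix: 3 genus letters (lower-case)
    followed by 3 species letters (first one upper-case)."""
    genus, species = latin_name.split()[:2]
    return genus[:3].lower() + species[:3].capitalize()


def two_char_prefix(latin_name: str) -> str:
    """Guess an 'mm'-style prefix: the initials of genus and species."""
    genus, species = latin_name.split()[:2]
    return (genus[0] + species[0]).lower()


def find_collisions(names, make_prefix):
    """Return only those prefixes that are shared by more than one species."""
    groups = {}
    for name in names:
        groups.setdefault(make_prefix(name), []).append(name)
    return {prefix: ns for prefix, ns in groups.items() if len(ns) > 1}


species = [
    "Mus musculus",
    "Macaca mulatta",
    "Felis catus",
    "Saccharomyces cerevisiae",
    "Drosophila melanogaster",
]

print(six_char_prefix("Felis catus"))
print(find_collisions(species, two_char_prefix))
print(find_collisions(species, six_char_prefix))
```

With these five species, the two-character scheme cannot tell Mus musculus from Macaca mulatta, while the six-character scheme happens to stay unique. Even then you would still need a hand-maintained table of exceptions and version suffixes to land on a real directory.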

Compare this madness — and it is madness — to the calming orderliness of the Ensembl Genomes FTP site (e.g. ftp://ftp.ensemblgenomes.org//pub/release-23/metazoa/fasta):

A view from UCSC Genome Browser FTP site…

…compared to a view from the Ensembl Genomes FTP site

I think the key point from this story is that a lot of bioinformatics research can be hard enough without the added complexities of working with unstructured data. When you start building any new resource in bioinformatics, be it an FTP site, web site, or GitHub repository, you should plan for the future! I.e. expect things to expand, grow, and greatly increase in complexity.

Even if you intend for a resource to only ever contain information for a single species, assume that it will end up containing hundreds of species. You should also assume that people may wish to automate the querying of your data. If you plan for these things from the moment you start building your resource, you might make some bioinformaticians happy, and you certainly don't want to make us angry…you wouldn't like us when we're angry.

How does the popularity of the UC Davis Genome Center vary with geographic location?

If I perform a Google search for the two words genome center, I see that the UC Davis Genome Center (henceforth UCDGC) is the top hit. But this is to be expected because Google has been personalizing search results for some time now, so this result is obviously tailored to me (if you didn't know, I work at the UCDGC).

If you are signed in to Google when you perform a search, the results will be heavily influenced by your search history and by what Google knows about you and your interests. Even if you sign out of Google, the search engine giant can track some information via cookies. Even if you disable cookies or use a private browsing mode, Google is still altering your search results because it knows your location (even if only approximately).

This explains why I will almost always see UCDGC as the top result when I search for 'genome center'. To get around this, I could use a search engine that doesn't track my activity, or I could use a private browsing mode in combination with a little-known feature of Google, that of changing your search location. It's possible to perform a search as if I was located in any major city or state in America.

So this allows me to see how often the UCDGC appears in the #1 position as I move around the country. I first performed a search for 'genome center' as if I was located in each state (e.g. set location to be 'Alabama', 'Alaska', 'Arkansas' etc.):

Ranking of UC Davis Genome Center among Google search results when searching for 'genome center' in each state

When you search for 'genome center', the UCDGC is the top search result in every state! One caveat to this approach is that it may not be all that meaningful to set your location to be an entire state. So I repeated the approach but this time I set my location to be the most populous city in each state:

Ranking of UC Davis Genome Center among Google search results when searching for 'genome center' in the most populous city of each state (as indicated by position of marker within each state). 

This shows that UCDGC is the #1 search result for cities in 36/50 states. The places where UCDGC is not #1 are all cities that have a notable genome center of their own (or are located close to one). A few notes relating to this:

  1. The New York Genome Center dominates results not only in New York City (NY), but also in Newark (NJ), Bridgeport (CT), and Philadelphia (PA)
  2. The #1 result in Baltimore (MD) is for the Institute of Genome Sciences at the University of Maryland
  3. St. Louis (MO) sees The Genome Institute at Washington University take the top spot
  4. In the north west, a search from Seattle gives the Seattle Structural Genomics Center for Infectious Disease as the most popular result. But if you head to Spokane (Washington's 2nd city), then the UCDGC becomes the #1 result again
  5. In Texas, the Department of Genomic Medicine at the Houston Methodist Research Institute pushes UCDGC to 4th place. However, move to San Antonio or Dallas and the UCDGC regains first place
  6. Chicago (IL) has the Institute for Genomics and Systems Biology at #1
  7. In Minneapolis (MN) it is the University of Minnesota Genomics Center who is the top dog
  8. The home of the King (Memphis, TN) is also home to the W. Harry Feinstone Center for Genomic Research which takes the #1 position. Once again, if you move to this state's second city (Nashville), the UCDGC regains the top spot in the search results.
  9. Las Vegas, NV is home to the University of Nevada Las Vegas Genomics Core Facility. Moving to Nevada's second city (Henderson) puts UCDGC back on top.
  10. In Salt Lake City (UT) you can find the Utah Genome Depot at the University of Utah dominating the rankings.
  11. Finally, in Atlanta (GA), it is the Emory University Integrated Genomics Core which denies the UCDGC the #1 position

The UC Davis Genome Center is not only the top hit when you search for 'genome center' in various locations in the USA. If you use the Google location option to go truly global, you will see that we rank as the top search result for 'genome center' in London, Paris, Berlin, Moscow, Delhi, Seoul, Cairo, Buenos Aires, Bogota, Rio de Janeiro, Cape Town, Kuala Lumpur, and Sydney!

While this could all be the result of UC Davis spending millions of dollars to adopt search engine optimization strategies to unduly influence our position in the search results, I prefer to believe that it reflects our reputation for world-class genomics research and training.

Real bioinformaticians and old bioinformaticians

A passing mention of the phrase 'real bioinformaticians' by Michael Hoffman (@michaelhoffman) yesterday prompted me to elevate the concept to be worthy of its own hashtag. This is what happened next:

You will notice that Sara G's response (@sargoshoe) humorously introduced the concept of #oldbioinformaticians, and this in turn spawned an even longer set of tweets (see below). I think that many of the more — how shall we put this — wise and distinguished members of the bioinformatics community, enjoyed the chance for a trip down memory lane.

Musical encores in bioinformatics and other sciences

I've previously flagged a few examples of independently developed bioinformatics software tools that share the same name. My recent post about the JABBA-award winning software called MUSIC prompted some people to let me know that this is another name that has been used repeatedly by different groups.

So thanks to Nicolas Robine and commenter LMikeF, we can see that MUSIC is a very popular name for bioinformatics tools:

  1. MuSiC: a tool for multiple sequence alignment with constraints (2004)
  2. RE-MuSiC: a tool for multiple sequence alignment with regular expression constraints (2007)
  3. MuSiC: identifying mutational significance in cancer genomes (2012)
  4. MUSIC: Identification of Enriched Regions in ChIP-Seq Experiments using a Mappability-Corrected Multiscale Signal Processing Framework (2014)
  5. MUSiCC: Towards an accurate estimation of average genomic copy-numbers in the human microbiome (2014)

The first two publications sadly suffer from link rot and the provided URLs no longer work. These two publications are also by the same group, which raises the question: what would they call a third iteration of their software (RE-RE-MuSiC?).

A little bit of additional searching reveals that MUSIC is a popular name in other scientific endeavors as well:

  1. MUSIC: MUltiScale Initial Conditions — software to generate initial conditions for cosmological simulations
  2. MUSIC: MUltiScale SImulation Code — fluid dynamics software: warning this website will make you nauseous!
  3. MUSIC: Muerte Subita en Insuficiencia Cardiaca — a longitudinal study to assess risk predictors of death in patients with heart failure
  4. MUSIC: MUtation-based SQL Injection vulnerabilities Checking tool — a tool to help check for vulnerabilities in web based applications

I guess people like the name MUSIC and will go to almost any lengths to make an acronym/initialism for it. 

The Graphical Fragment Assembly (GFA) format

Shaun Jackman added a comment to my previous post about the ongoing development of a new format by which to represent genome assemblies. I thought I would reproduce this in a separate blog post in order to bring this issue to more attention.

But first, a quick reminder that currently nearly all genome assemblies are ultimately stored as DNA sequences in FASTA format. This format was developed over 25 years ago and is not best suited to representing a genome assembly.

One obvious reason for this is that we commonly sequence the genomes of diploid individuals who have two genomes present in every cell (one derived from each parent). We often know that a particular region of the genome should be represented as sequence X or sequence Y, but the FASTA format requires you to choose one or the other.

There has already been one effort to develop a new file format to best represent the variation present in an assembly, and a final specification was formalized. However, this FASTG format has seemingly not been widely adopted by the community (at least, not that I know of).

At this point, I will simply reproduce Shaun's comment from the earlier post (minor edits made to restructure some of the links and layout):

There have been three fantastic blog posts in the past three months on the topic of devising a common file format for a sequence overlap graph to enable modular assembly pipelines.

Heng Li (@lh3lh3) has proposed a Graphical Fragment Assembly (GFA) file format. An implementation will be included in the next release of ABySS. Jared Simpson (@jaredtsimpson) is working on an implementation for String Graph Assembler (SGA). I hope that other implementations will follow.

  1. Dear assemblers, we need to talk … together by Páll Melsted (@pmelsted) and Michael R. Crusoe (@biocrusoe). tl;dr we need a common file format for contig graphs and other stuff too
  2. A proposal of the Graphical Fragment Assembly format by Heng Li and…
  3. First update on GFA by Heng Li

Please add your comments to this posting with your thoughts on the GFA file format.

There are a lot of comments on the two blog posts by Heng and I tweeted my (minor) concerns regarding how this format proposal has developed. This led to some further discussion on twitter, some of which I have storified:

I hope that Heng takes up Shaun's suggestion to move the spec to GitHub. The FASTG proposal used a mailing list to help focus some of the discussion and I feel that something similar needs to happen to ensure that any future debate about the GFA format is productive.

101 questions with a bioinformatician #14: Shaun Jackman

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Shaun Jackman is a PhD student working on various problems relating to genome assembly at the University of British Columbia. Specifically, Shaun works under the supervision of  İnanç Birol in the Bioinformatics Technology Lab at the BC Cancer Agency's Genome Sciences Centre in Vancouver. You may know him for his work in writing and directing the 1989 smash hit The Abyss, which was later developed into a popular genome assembler.

In addition to being a talented bioinformatician who has contributed to lots of useful software, he is also a very patient guy. I say this because he has been waiting for me to publish this interview for over 3 months (my sincere apologies for the delay, I will try to make this series a regular feature once again).

You can find out more about Shaun from his website or by following him on twitter (@sjackman). And now, on to the 101 questions...

 

 

001. What's something that you enjoy about current bioinformatics research?

I’m excited to see the increasing popularity of enabling reproducible research using tools such as R Markdown and IPython Notebook. After reading a paper, it should be straightforward to download the raw data, install the necessary software, reproduce the results and regenerate the figures. I’m really hoping that we get to that point.

I'm also happy to see more interaction between developers and users using revision control web sites, such as GitHub.

 

010. What's something that you *don't* enjoy about current bioinformatics research?

Most genome sequence assembly tools are structured as a pipeline: for example, count the k-mers of a set of reads, construct a de Bruijn graph of those k-mers, remove k-mers caused by sequencing errors, identify heterozygous sequences and finally assemble contigs.

It should be possible to mix and match these individual components from different assemblers to create new assembly pipelines that are hybrids of existing tools. Not only could it create a better overall assembler, but it could identify which of the individual components of the various assemblers are strongest. It should be encouraged to improve on an individual component without having to reinvent an entire assembly pipeline.
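As a toy illustration of the mix-and-match idea, the Python sketch below writes three of the stages named above (k-mer counting, error filtering, de Bruijn graph construction) as independent functions that only share plain data structures. All function names, the `min_count` threshold and the example reads are assumptions made for this sketch; it does not reflect how ABySS or any other real assembler is implemented.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Stage 1: count every k-mer in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_rare_kmers(counts, min_count=2):
    """Stage 2: discard k-mers seen fewer than min_count times
    (a crude stand-in for sequencing-error removal)."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


def build_de_bruijn_graph(kmers):
    """Stage 3: nodes are (k-1)-mers; each k-mer adds an edge
    from its prefix to its suffix."""
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def run_pipeline(reads, k, stages):
    """Chain any list of stages, so one stage can be swapped for another."""
    data = count_kmers(reads, k)
    for stage in stages:
        data = stage(data)
    return data


reads = ["ACGTAC", "ACGTAC", "CGTACG", "ACGTTT"]
graph = run_pipeline(reads, 4, [drop_rare_kmers, build_de_bruijn_graph])
print(graph)
```

Because each stage takes and returns ordinary dictionaries, a different error filter could be dropped into the `stages` list without touching the other steps. Between separate programs, a shared file format would play the same role.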

 

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Learn to use R and RStudio to visualize your data. I wasted a lot of time making ugly figures with inferior tools. Use Make to automate every analysis pipeline. No pipeline is too small or too large. A one-off analysis never is.

 

100. What's your all-time favorite piece of bioinformatics software, and why?

  • I use Make and R nearly every day.
  • I like Heng Li’s tools because they stick to the principle of doing one thing well.
  • I’m fond of ABySS, for one because it was the first bioinformatics tool that I helped to develop, but primarily because it’s designed as a pipeline of reusable modular tools that use standard (when possible) file formats, all bound together by a Makefile.

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

I’m an N, because it leaves all options open.

If music be the food of bioinformatics, play on: time for a new JABBA award

It's been a while, but it is time for another JABBA award. I occasionally hand out these awards whenever I see Just Another Bogus Bioinformatics Acronym (though the awards also apply to initialisms). You can see details of many previous JABBA award winners elsewhere on this site.

The latest recipient of a JABBA award is this new paper published in Genome Biology:

As soon as I saw the title, I assumed that MUSIC would be an acronym or initialism but the paper itself does not explain what the name means. My hopes were raised that maybe this was just a simple, pleasant-sounding, name for a piece of software — a name which didn't try to clumsily form itself from a desperate grab of various letters from a longer description of the software. I was just about to thank them for the MUSIC when I thought I would check the software page that is linked to from the paper. That's when I saw this:

They chose not to mention this in the paper, but MUSIC is derived from Multiscale Enrichment Calling! The use of the letter 'i' from the word 'Enrichment' is what really escalates this to the status of a bogus acronym. By this logic they could have called this tool many other things (including 'MENTAL'). I'm a bit surprised that they didn't go for the slightly longer 'MUSICAL' (or even 'MUSICALL' if they wanted a genuinely unique name).

Although this software tool will be hard to find with a Google search (unless people specifically search for 'ChIP-seq MUSIC'), we should at least be thankful that no-one else has ever published a bioinformatics tool called MUSIC. Oh wait, they have.

Update 2014-10-10: Nicolas Robine (@notsojunkDNA) has alerted me to the fact that there is also some bioinformatics software called Genome MuSiC.

Academic link rot seems to be getting faster: should a published URL last more than 100 days?

Consider this paper that was recently published in the journal Bioinformatics, and which showed up today in my RSS feed:

Presumably it is a typo when the journal says that it was received on November 14th 2014:

I'll assume that this is meant to be 2013! The paper first appeared online on June 13th 2014, just 103 days ago. The text of this paper links to some software that should be available at http://ww2.cs.mu.oz.au/~gwong/LICRE. Except that this URL doesn't work. Neither does http://ww2.cs.mu.oz.au/~gwong/. Only when I visit http://ww2.cs.mu.oz.au/ do I discover the following:

The new website for the merged departments says that the merger happened in 2012, and this is confirmed by the redirection page which has a date of 18th January 2012. It is also confirmed by looking at the Internet Archive's Wayback Machine which shows that the redirection page has been in place since at least February 2012. 

All of which suggests that the software link in the paper may not even have worked properly at the time they submitted the manuscript. I'm sure there are other similar examples of speedy link rot, but this seems particularly striking. Especially since a search for 'LICRE' on the new website doesn't return any hits (nor can I find any mention of it on Google or various search engine caches).
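The timeline above is easy to check with a few lines of Python. This is just a sketch of the date arithmetic: it assumes this post was written on 2014-09-24 (the date of the update note at the end of the post) and that the manuscript was really received in November 2013 rather than 2014.

```python
from datetime import date

first_online = date(2014, 6, 13)       # paper first appeared online
post_written = date(2014, 9, 24)       # assumption: the date of this post
redirect_page = date(2012, 1, 18)      # date shown on the redirection page
assumed_received = date(2013, 11, 14)  # assumption: '2014' is a typo for 2013

# How long the paper had been online when the link was found to be dead
days_online = (post_written - first_online).days

# How long the redirect had already been in place at (assumed) submission
days_redirected_before_submission = (assumed_received - redirect_page).days

print(days_online)
print(days_redirected_before_submission)
```

Under those assumptions the paper had been online for 103 days, and the redirection page had already been in place for well over a year by the time the manuscript was received.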

I will contact the lead author to let him know about the disappearance of the software. In the meantime, I'll remind people of this previous post of mine:

Update 2014-09-24 19.52:  I heard back from the author, the LICRE code is now at https://sites.google.com/site/licrerepository/

Another CEGMA post: KOGs vs CEGs and 458 vs 248

I posted another answer about CEGMA on seqanswers.com last week. I thought I'd cover this in a little more detail here (note, questions edited from how they originally appeared):

Question 1: CEGMA uses a 'kogs.fa' file (containing 2,748 proteins) to compare to a user's genome sequence. These KOGs define a set of 458 core eukaryotic genes (CEGs). Some CEGMA publications report how many of the 458 CEGs are present, others list results from the 248 most-highly-conserved CEGs. Does anyone know why kogs.fa is the default? Does it get 'curated' down to a smaller set during a CEGMA run?

The kogs.fa file represents a subset of the published set of 4,852 KOGs (euKaryotic Orthologous Groups). The KOGs database — which is still available online — describes protein groups that are present among seven different eukaryotes (not all groups are present in all species). We excluded data from the microsporidian Encephalitozoon cuniculi as it is a parasite and may have an atypical protein complement and focused on the 1,788 groups that were present in all of the remaining six species. We then applied various filtering criteria — see methods in original paper — to reduce this to the 458 KOGs (renaming this subset as CEGs in the process). We also chose just one protein to represent each species.

So that's why our kogs.fa file contains 2,748 proteins (458 x 6). CEGMA tries to determine which of these 458 CEGs are present in your input file. It's worth pointing out that the original purpose of CEGMA was to try to find a handful of genes in a genome which may lack gene annotations. Someone could then use this small gene set to train a gene finder, by which to annotate the entire genome.

After CEGMA has found which of the 458 CEGs are present, it then performs its secondary role of assessing the completeness of the gene space. To do this, it only wants to use the most conserved and least paralogous of the 458 CEGs. Paralogy is a big issue here. The original KOGs database grouped together proteins when there were often many, many paralogs for each group. E.g. KOG0001 corresponds to the Ubiquitin gene family. Here is the number of proteins from each of the seven species in this KOG:

  • Arabidopsis thaliana - 28
  • Caenorhabditis elegans - 12
  • Drosophila melanogaster - 3
  • Encephalitozoon cuniculi - 1
  • Homo sapiens - 17
  • Saccharomyces cerevisiae - 2
  • Schizosaccharomyces pombe - 1

The high degree of paralogy from A. thaliana is one reason why this KOG is not included in our subset of 248 CEGs. In contrast, KOG0018  — Structural maintenance of chromosome protein 1 (sister chromatid cohesion complex Cohesin, subunit SMC1) — is included in the 248 CEGs:

  • Arabidopsis thaliana - 1
  • Caenorhabditis elegans - 4
  • Drosophila melanogaster - 1
  • Encephalitozoon cuniculi - 1
  • Homo sapiens - 3
  • Saccharomyces cerevisiae - 1
  • Schizosaccharomyces pombe - 1

This secondary role of CEGMA uses information in the completeness_cutoff.tbl file (inside the CEGMA data directory) to narrow the 458 CEGs results down to a subset of 248 CEGs. Because different filtering criteria are used, a CEG may be classed as present in the 458 CEG set, but not in the 248 CEG set, even if it was on the list of 248 candidate CEGs.
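The numbers quoted above can be laid out in a short Python sketch. It only tabulates the counts given in this post and summarizes paralogy as the largest copy number among the six species that were kept. That summary is an assumption chosen for illustration; it is not the filtering criterion CEGMA actually uses (see the methods in the original paper for that).

```python
N_CEGS = 458
N_SPECIES = 6  # the seven KOG species minus Encephalitozoon cuniculi
EXCLUDED = "Encephalitozoon cuniculi"

# Protein counts per species, copied from the two KOGs discussed above
copies = {
    "KOG0001": {
        "Arabidopsis thaliana": 28,
        "Caenorhabditis elegans": 12,
        "Drosophila melanogaster": 3,
        "Encephalitozoon cuniculi": 1,
        "Homo sapiens": 17,
        "Saccharomyces cerevisiae": 2,
        "Schizosaccharomyces pombe": 1,
    },
    "KOG0018": {
        "Arabidopsis thaliana": 1,
        "Caenorhabditis elegans": 4,
        "Drosophila melanogaster": 1,
        "Encephalitozoon cuniculi": 1,
        "Homo sapiens": 3,
        "Saccharomyces cerevisiae": 1,
        "Schizosaccharomyces pombe": 1,
    },
}


def max_copies(counts, excluded=EXCLUDED):
    """Largest number of paralogs in any single retained species."""
    return max(n for species, n in counts.items() if species != excluded)


print(N_CEGS * N_SPECIES)  # number of proteins expected in kogs.fa
for kog_id, counts in copies.items():
    print(kog_id, max_copies(counts))
```

The product of 458 CEGs and six species gives the 2,748 proteins in kogs.fa, and the ubiquitin family (KOG0001) stands out with up to 28 copies in one species compared with at most 4 for KOG0018.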

Question 2: CEGMA output includes many KOG IDs but no description of what protein name/function each KOG ID represents. This makes it not so useful for annotating new genomes. Is there a lookup table somewhere?

One of the reasons why we maintained KOG identifiers in the CEGMA output was so that people could, if so inclined, look up more information in the KOGs database. If you download the 'kog' file from the KOGs database, you will see each KOG has a one line description. E.g.

[O] KOG0019 Molecular chaperone (HSP90 family)
[KC] KOG0025 Zn2+-binding dehydrogenase (nuclear receptor binding factor-1)
[ZD] KOG0028 Ca2+-binding protein (centrin/caltractin), EF-Hand superfamily protein
[C] KOG0042 Glycerol-3-phosphate dehydrogenase
[T] KOG0044 Ca2+ sensor (EF-Hand superfamily)
[K] KOG0048 Transcription factor, Myb superfamily

The letters inside square brackets represent various functional categories annotated by the KOGs database. These are as follows:

INFORMATION STORAGE AND PROCESSING
 [J] Translation, ribosomal structure and biogenesis
 [A] RNA processing and modification
 [K] Transcription
 [L] Replication, recombination and repair
 [B] Chromatin structure and dynamics

CELLULAR PROCESSES AND SIGNALING
 [D] Cell cycle control, cell division, chromosome partitioning
 [Y] Nuclear structure
 [V] Defense mechanisms
 [T] Signal transduction mechanisms
 [M] Cell wall/membrane/envelope biogenesis
 [N] Cell motility
 [Z] Cytoskeleton
 [W] Extracellular structures
 [U] Intracellular trafficking, secretion, and vesicular transport
 [O] Posttranslational modification, protein turnover, chaperones

METABOLISM
 [C] Energy production and conversion
 [G] Carbohydrate transport and metabolism
 [E] Amino acid transport and metabolism
 [F] Nucleotide transport and metabolism
 [H] Coenzyme transport and metabolism
 [I] Lipid transport and metabolism
 [P] Inorganic ion transport and metabolism
 [Q] Secondary metabolites biosynthesis, transport and catabolism

POORLY CHARACTERIZED
 [R] General function prediction only
 [S] Function unknown
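If you want the lookup table as something a script can use, a rough Python sketch along these lines might do. It assumes that description lines in the 'kog' file look exactly like the examples shown above, and that any other lines in the file can be ignored. The function names are hypothetical and only a handful of the category letters are included here.

```python
import re

# Matches lines such as: [KC] KOG0025 Zn2+-binding dehydrogenase (...)
KOG_LINE = re.compile(r"^\[([A-Z]+)\]\s+(KOG\d+)\s+(.+)$")

# A few of the category letters from the list above (not the full set)
CATEGORIES = {
    "O": "Posttranslational modification, protein turnover, chaperones",
    "K": "Transcription",
    "C": "Energy production and conversion",
    "Z": "Cytoskeleton",
    "D": "Cell cycle control, cell division, chromosome partitioning",
    "T": "Signal transduction mechanisms",
}


def parse_kog_descriptions(lines):
    """Map each KOG ID to (list of category letters, description)."""
    table = {}
    for line in lines:
        match = KOG_LINE.match(line.strip())
        if match:  # silently skip anything that is not a description line
            letters, kog_id, description = match.groups()
            table[kog_id] = (list(letters), description)
    return table


def categories_for(kog_id, table):
    """Spell out the functional categories for one KOG ID."""
    letters, _ = table[kog_id]
    return [CATEGORIES.get(letter, "unknown category") for letter in letters]


example_lines = [
    "[O] KOG0019 Molecular chaperone (HSP90 family)",
    "[KC] KOG0025 Zn2+-binding dehydrogenase (nuclear receptor binding factor-1)",
    "[T] KOG0044 Ca2+ sensor (EF-Hand superfamily)",
]
table = parse_kog_descriptions(example_lines)
print(table["KOG0019"][1])
print(categories_for("KOG0025", table))
```

To use it on the real file you would pass an open file handle instead of `example_lines`, and then look up whichever KOG IDs appear in your CEGMA output.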

Maybe this is useful to someone. However, I would remind people that KOGs was published over a decade ago (and presumably the work to generate the KOGs database began in 2002 if not earlier). There were probably several gene annotations that were missing in the source genomes at that time, and many annotations have presumably since been updated (I bet many genes have had minor alterations to their structure).