An automated attempt to identify duplicated software names

From time to time I've been pointing out instances of duplicated software names in bioinformatics. I assume that many people reuse the name of an existing tool simply because they haven't first checked — or checked thoroughly — to see if someone else has already published a piece of software with the same name.

I am not alone in my concern over this issue and Neil Saunders (@neilfws on twitter) has gone one step further and written some code to try to track down instances of duplicated names. He recently wrote a post on his What You're Doing Is Rather Desperate blog to explain more:

In the article, he describes how he used a Ruby script to parse the titles of articles downloaded from PubMed Central and then fed this information into an R script to identify examples of duplicated software names. The result is a long list of duplicated software names (available on GitHub).
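To give a flavor of the general idea, here is a minimal Python sketch. It is not Neil's Ruby/R code. It assumes that software papers often use titles of the form 'NAME: description', and that assumption, the regular expression, the function names, and the example titles are all mine, invented purely for illustration.

```python
import re
from collections import defaultdict

# Assumed pattern: a short, name-like token at the start of a title,
# immediately followed by a colon (e.g. "FAST: a tool for ...").
NAME_PATTERN = re.compile(r"^\s*([A-Za-z][A-Za-z0-9+\-]{1,19})\s*:")


def candidate_name(title):
    """Return the upper-cased name-like prefix of a title, or None."""
    match = NAME_PATTERN.match(title)
    return match.group(1).upper() if match else None


def duplicated_names(titles):
    """Map each candidate name to its titles, keeping names seen more than once."""
    by_name = defaultdict(list)
    for title in titles:
        name = candidate_name(title)
        if name is not None:
            by_name[name].append(title)
    return {name: hits for name, hits in by_name.items() if len(hits) > 1}


# Hypothetical titles, invented for illustration only.
example_titles = [
    "FAST: a tool for aligning short reads",
    "Fast: rapid clustering of protein families",
    "PAIR: predicting interactions",
    "A survey of genome assemblers",
]

for name, hits in sorted(duplicated_names(example_titles).items()):
    print(name, len(hits))
```

A heuristic like this will miss names that appear elsewhere in a title and will flag some prefixes that are not software names at all, so any hits would still need checking by eye.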

I'm not surprised to learn that generic names like 'FAST' and 'PAIR' appear on this list. However, I was surprised to see that in the same year (2011), two independent publications both decided to name their software 'COMBREX':

BUSCO — the tool that will hopefully replace CEGMA — now has a plant-specific dataset

With the demise of CEGMA I have previously pointed people towards BUSCO. This tool replicates most of what CEGMA did but seems to be much faster and requires fewer dependencies. Most importantly, it is also based on a much more up-to-date set of orthologous genes (OrthoDB) than the aging KOGs database that CEGMA used.

The full publication of BUSCO appeared today in the journal Bioinformatics. I still haven't tried using the tool, but one criticism that I have seen from others is that there are no plant-specific datasets of conserved genes to use with BUSCO. This appears to be something that the developers are aware of, because the BUSCO website now indicates that a plant dataset is available (though you have to request it).

Anatomy of a mainstream science piece

A great blog post by Ewan Birney that describes the process of writing a commentary piece for the Guardian newspaper, and which also discusses the need for more involvement of scientists in the public discussion of science. I like the concluding remarks:

As practicing scientists, we need to continue laying the groundwork started long ago by many others…engaging consistently and non-judgmentally with our communities and policymakers about our work. There is a real task ahead of us in providing an accessible way for people to digest this information. We should take every opportunity to communicate on every level, from the most basic to state of the art. Only then can society really use the hard-earned information gleaned from genetics appropriately, and for the greater good.

A long time ago in a Galaxy browser far, far away…

Today marks exactly 10 years since the first publication that describes the popular Galaxy platform:

The tremendous success of the Galaxy project can be summarized by the following two graphs (taken from the statistics page on the Galaxy Wiki):

New user registrations on Galaxy main site

 

Publications that reference or mention Galaxy

I'm sure that the Galaxy team have a more official date to use as their anniversary, but I'll mark the 10th year since their initial publication to say congratulations and 'Happy Birthday'! I hope that Galaxy can emerge unscathed from those difficult 'teenage years' that lie just around the corner!

Twitter competition: win a signed copy of Bioinformatics Data Skills by Vince Buffalo

I have an extra copy of the fantastic Bioinformatics Data Skills book by Vince Buffalo (who you should all be following on twitter at @vsbuffalo). I've come up with a fun little competition to let someone have a chance of winning this signed copy.

All you have to do is write a tweet that includes the #ACGT hashtag (so I can track all of the answers), which provides the following information:

Name a useful bioinformatics data skill

The winner will be chosen randomly (hopefully by a powerful scripted solution that Vince will help me with) in two weeks' time. I will post details of any interesting (or funny) suggestions that you come up with on this site. The full details are below.

Competition rules

  1. Tweets should name a 'useful bioinformatics data skill'
  2. Tweets must contain the hashtag #ACGT
  3. Last day to enter into the competition is 25th September (midnight PST)
  4. One winner will be chosen randomly
  5. Only one entry per twitter account
  6. Retweets of tweets that use the #ACGT hashtag will be excluded
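For the curious, the random draw could look something like the Python sketch below. This is only a guess at a 'scripted solution', not the script that will actually be used. It assumes the tweets have already been collected into a list of dictionaries, and the field names ('account', 'text', 'is_retweet') and the example tweets are hypothetical. It does not check the entry deadline.

```python
import random


def eligible_entries(tweets, hashtag="#ACGT"):
    """Apply the rules: hashtag required, no retweets, one entry per account."""
    seen_accounts = set()
    entries = []
    for tweet in tweets:
        if tweet["is_retweet"]:
            continue  # retweets are excluded
        if hashtag.lower() not in tweet["text"].lower():
            continue  # the hashtag must be present
        if tweet["account"] in seen_accounts:
            continue  # keep only the first entry from each account
        seen_accounts.add(tweet["account"])
        entries.append(tweet)
    return entries


def pick_winner(tweets, seed=None):
    """Choose one eligible entry uniformly at random, or None if there are none."""
    entries = eligible_entries(tweets)
    if not entries:
        return None
    return random.Random(seed).choice(entries)


# Hypothetical tweets, invented for illustration only.
example_tweets = [
    {"account": "@user_a", "text": "Learn version control #ACGT", "is_retweet": False},
    {"account": "@user_a", "text": "Also learn Unix #ACGT", "is_retweet": False},
    {"account": "@user_b", "text": "RT Learn version control #ACGT", "is_retweet": True},
    {"account": "@user_c", "text": "Document your code", "is_retweet": False},
    {"account": "@user_d", "text": "Check your inputs #acgt", "is_retweet": False},
]

winner = pick_winner(example_tweets)
print(winner["account"])
```

Passing a fixed seed makes the draw repeatable, which could be handy if anyone wants to verify the result afterwards.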

Trying to locate the source of duplicated software names

Thanks to Andrew Su (@andrewsu) and Mick Watson (@BioMickWatson) for alerting me to the following:

The former paper is from 2009, the latter paper is from 2015. Neither paper has anything to do with this 2010 paper which introduced something called the Genome Positioning System (GPS). Most importantly, none of these papers have anything at all to do with GPS (as most people understand the term).

If I run a Google search for GPS Bioinformatics the top hit that I see is for the MSc course in Bioinformatics as part of Brandeis University's Graduate Professional Studies program.

The usual disclaimer applies:

  1. Check existing literature before you name your software (at the very least run a Google search).
  2. Double check the name by adding the word 'bioinformatics' or 'genomics' to the search terms.
  3. Avoid names which wholly or partially contain words or terms that have nothing to do with your software.

Freely & Unrepentantly Confessing to Heresy

There is a new post by Keith Robison in which he comments on my Excel/Bioinformatics post from August 28th:

Keith Bradnam reported a huge influx of traffic for a recent post -- not surprising, since he labeled it NSFW (Not Safe For Work)… Bradnam was, of course, kidding. His short item showered derision on a recent Microsoft announcement about importing sequences into Excel.

The spike in traffic really has been insane. The post has become my most viewed post of anything I have ever written on this blog (by quite a margin). Clearly I have tapped into some anti-Microsoft (or just anti-Excel?) sentiment.

Keith Robison then takes on the 'case for the defense' and makes some fair points about Excel:

But to dismiss Excel as unworthy of any use in bioinformatics is to miss the fact that buried under the residue of years of creeping featurism is a tool useful in specific contexts and with some key advantages. The first advantage is that it is ubiquitous…

He then goes on to include some legitimate examples of how you might want to use Excel in order to work with sequence data.

I will conclude by saying that I bear no ill feelings towards Excel users, even those using it for bioinformatics! I have also used Excel in the past to analyze some bioinformatics data. That was with Excel 2004 for Mac, which suffered from a 32,000 row limit that made it unsuitable for some datasets. It was this limitation that was partly the impetus for me to start learning how to do some things in R.

Fundamentally, I feel that Excel is not a great tool for bioinformatics because: it is not open, it is not obviously workable as part of any standard bioinformatics pipeline, and it is not available on Linux (where a lot of bioinformatics happens). But, as always, you should use the tools that help you get the job done.

Why I still think people should be jumping on the ORCID bandwagon

Adapted from photo by flickr user nanoprobe67

So I wrote this tweet…

Which triggered a lot of twitter discussion.

Which led to this blog post by Mick Watson.

Which led to more discussion on twitter.

Which led to this blog post by Brian Kelly.

Which leads us to this blog post…

Much of what I was going to say has already been said by others (I especially encourage readers to jump straight to Brian Kelly's blog post), but I wanted to add a few comments…

This is bad

A tremendous, mind-boggling, and frustrating number of hours are lost every year by people performing mindless, thankless, and painful academic administration tasks. A lot of this work is stuff that happens 'behind the scenes', but which is essential for science to happen. Processes such as grant renewals often have to pull together all of the papers generated over the preceding grant period, and connect those papers to the researchers who were on the previous grant (and who may be named on the renewal grant). For a large research center with 100s of PIs this can involve a lot of work (tracking 100s of publications), and ultimately can end up relying on a lot of emails being sent to individual PIs. The problems get worse when people have changed their names, or have submitted papers with their names written in slightly different formats.

All of this pain is because we don't have unique identifiers for academic researchers that are consistently used across all parts of the academic system.

In taxonomy we are blessed with the widespread adoption of NCBI's taxonomy IDs. This means that I could write a publication in which I choose to describe Mick Watson as belonging to species NCBI TaxID 9606 and others would be able to work with that data. Indeed, I can go to UniProt and browse species 85621 and know that this will be the same ID as used at NCBI (and many other places).

Species names can, and do, sometimes change (remember Fugu rubripes?) and biological research would be in a mess if we didn't have standardized identifiers for species. The same should be true for academics. No-one should have to waste time checking whether this 2015 paper by 'M Watson' is the same author as in this 2015 paper by 'M Watson' (this type of problem is greatly compounded for certain names).

This would be nice

I envisage a future where any publication that I contribute to is connected to my ORCID ID (strictly just 'ORCID'). Furthermore, any grant that I am named on should use the same ORCID. But why stop there? Why not connect all of my scientific endeavors? GitHub accounts could be connected to ORCID to tag all of my scientific coding with the same ID. Publish research slides on Figshare or Slideshare? Why not use ORCID? Even blog posts like this one could potentially be connected to the wider universe of ORCID-tagged material.

This is why I originally tweeted my excitement about the adoption of ORCID by arXiv. I want to see ORCID everywhere. Once ORCID gains critical mass by being adopted by enough 'key' services, then the ORCID bandwagon can really start to accelerate.

How I see ORCID being used

Echoing the views of Brian Kelly (and others), I just see ORCID primarily as a service to generate the unique ID and then act as a central authentication server for any other service that may wish to also use ORCID. In many ways, ORCID then acts like twitter, Google, and Facebook in letting you have a single sign-on system across multiple sites. Except ORCID is open and will not be mining your data to sell you stuff.
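One nice property of these identifiers is that any service can sanity-check them without contacting ORCID at all: an ORCID is 16 characters long, and the final character is a check character calculated using the ISO 7064 MOD 11-2 algorithm. Here is a minimal Python sketch of that check. The function names are my own, and the identifier used is the sample one from ORCID's own documentation. Passing the check only shows that an identifier is well formed, not that it has actually been registered.

```python
def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid):
    """Return True if the identifier has 16 characters and a valid check character."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16:
        return False
    if not all(ch in "0123456789" for ch in compact[:15]):
        return False
    return orcid_check_character(compact[:15]) == compact[15]


print(is_well_formed_orcid("0000-0002-1825-0097"))
```

A typo in a single digit changes the expected check character, so most copy-and-paste errors can be caught before an identifier is ever stored.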

I have no interest in maintaining an ORCID page of publications. I want others to use the ORCID API to build clever tools that will leverage all of the rich information that could come about when you connect people to all of the scientific output that they have helped create. If Mick Watson ever decides to start being known as Sir Michael of Grimsby, and if he switches from using GitHub to BitBucket, this should not be a barrier to someone using the ORCID API to write a tool that generates a list of 'All of Mick's Public Code'.
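To make that concrete, here is a small Python sketch of why the identifier matters more than the name or the hosting service. It makes no calls to the real ORCID API. The records, the field names, the placeholder identifiers ('id-1', 'id-2'), and the repository names are all hypothetical, and simply stand in for whatever such a tool might one day retrieve.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only. The identifiers are
# placeholders and are not real ORCIDs.
records = [
    {"id": "id-1", "name": "M Watson", "host": "GitHub", "repo": "tool-one"},
    {"id": "id-1", "name": "Sir Michael of Grimsby", "host": "BitBucket", "repo": "tool-two"},
    {"id": "id-2", "name": "M Watson", "host": "GitHub", "repo": "other-tool"},
]


def public_code_by_id(records):
    """Group repositories by researcher identifier, ignoring name and host."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["id"]].append((record["host"], record["repo"]))
    return {key: sorted(repos) for key, repos in grouped.items()}


for researcher_id, repos in sorted(public_code_by_id(records).items()):
    print(researcher_id, repos)
```

Grouping by name would have merged two different people called 'M Watson' and split one person across two names. Grouping by identifier avoids both mistakes.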

ORCID may not succeed, but the promise of what it could deliver is so important that we should all give it the benefit of the doubt and try to make it work. If you have problems with ORCID, let them know. Most importantly, if you don't yet have one you should register for your ORCID identifier now! It is an open platform, run by a non-profit organization (these are good things). It takes just 30 seconds, and apart from those 30 seconds you have nothing to lose.