Why I still think people should be jumping on the ORCID bandwagon

Adapted from photo by flickr user nanoprobe67

So I wrote this tweet…

Which triggered a lot of twitter discussion.

Which led to this blog post by Mick Watson.

Which led to more discussion on twitter.

Which led to this blog post by Brian Kelly.

Which leads us to this blog post…

Much of what I was going to say has already been said by others (I especially encourage readers to jump straight to Brian Kelly's blog post), but I wanted to add a few comments…

This is bad

A tremendous, mind-boggling, and frustrating number of hours are lost every year by people performing mindless, thankless, and painful academic administration tasks. A lot of this work happens 'behind the scenes', but it is essential for science to happen. Processes such as grant renewals often have to pull together all of the papers generated over the preceding grant period, and connect those papers to the researchers who were on the previous grant (and who may be named on the renewal grant). For a large research center with 100s of PIs this can involve a lot of work (tracking 100s of publications), and ultimately can end up relying on a lot of emails being sent to individual PIs. The problems get worse when people's names have changed and/or when they have submitted papers using slightly different formats of their names.

All of this pain is because we don't have unique identifiers for academic researchers that are consistently used across all parts of the academic system.

In taxonomy we are blessed with the widespread adoption of NCBI's taxonomy IDs. This means that I could write a publication in which I choose to describe Mick Watson as belonging to species NCBI TaxID 9606 and others would be able to work with that data. Indeed, I can go to UniProt and browse species 85621 and know that this will be the same ID as used at NCBI (and many other places).

Species names can, and do, sometimes change (remember Fugu rubripes?) and biological research would be in a mess if we didn't have standardized identifiers for species. The same should be true for academics. No-one should have to waste time checking whether this 2015 paper by 'M Watson' is the same author as in this 2015 paper by 'M Watson' (this type of problem is greatly compounded for certain names).

This would be nice

I envisage a future where any publication that I contribute to is connected to my ORCID ID (strictly just 'ORCID'). Furthermore, any grant that I am named on should use the same ORCID. But why stop there? Why not connect all of my scientific endeavors? GitHub accounts could be connected to ORCID to tag all of my scientific coding with the same ID. Publish research slides on Figshare or Slideshare? Why not use ORCID? Even blog posts like this one could potentially be connected to the wider universe of ORCID-tagged material.
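For anyone wondering what an ORCID looks like to software: it is 16 characters in four hyphen-separated groups, and the final character is a check digit computed with ISO 7064 MOD 11-2 (so it can be 0-9 or X). Here is a minimal sketch that checks whether a string is a well-formed ORCID. The function names are my own invention, and the example iD belongs to Josiah Carberry, the fictional researcher that ORCID uses in its own documentation. Passing this check only means the string is well formed, not that the iD has actually been registered.

```python
import re

# Four groups of four characters; only the final character may be 'X'.
ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_digit(base_digits: str) -> str:
    """Return the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """True if the string has the ORCID layout and a matching check digit."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_digit(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))
print(is_valid_orcid("0000-0002-1825-0098"))
```

The first call should report a valid iD and the second an invalid one, because only the final digit differs.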

This is why I originally tweeted my excitement about the adoption of ORCID by arXiv. I want to see ORCID everywhere. Once ORCID gains critical mass by being adopted by enough 'key' services, then the ORCID bandwagon can really start to accelerate.

How I see ORCID being used

Echoing the views of Brian Kelly (and others), I just see ORCID primarily as a service to generate the unique ID and then act as a central authentication server for any other service that may wish to also use ORCID. In many ways, ORCID then acts like twitter, Google, and Facebook in letting you have a single sign-on system across multiple sites. Except ORCID is open and will not be mining your data to sell you stuff.

I have no interest in maintaining an ORCID page of publications. I want others to use the ORCID API to build clever tools that will leverage all of the rich information that could come about when you connect people to all of the scientific output that they have helped create. If Mick Watson ever decides to start being known as Sir Michael of Grimsby, and if he switches from using GitHub to BitBucket, this should not be a barrier to someone using the ORCID API to write a tool that generates a list of 'All of Mick's Public Code'.
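To make that last idea a little more concrete, here is a toy sketch of why keying on the identifier rather than the name matters. Everything in it is hypothetical: the records, the repository names, and the record layout are invented for illustration, and nothing here calls the real ORCID API. The first iD is the fictional Josiah Carberry's; the second is an obvious placeholder, not a real iD.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only.
EXAMPLE_RECORDS = [
    {"orcid": "0000-0002-1825-0097", "name": "J Carberry",
     "host": "GitHub", "repo": "tool-a"},
    {"orcid": "0000-0002-1825-0097", "name": "Sir Josiah of Providence",
     "host": "BitBucket", "repo": "tool-b"},
    # Same display name as the first record, but a different person.
    {"orcid": "0000-0000-0000-0000", "name": "J Carberry",
     "host": "GitHub", "repo": "tool-c"},
]


def code_by_orcid(records):
    """Group 'host:repo' labels by ORCID, ignoring whatever name was used."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["orcid"]].append(f'{record["host"]}:{record["repo"]}')
    return dict(grouped)


for orcid, repos in code_by_orcid(EXAMPLE_RECORDS).items():
    print(orcid, repos)
```

Grouping by name would merge two different people called 'J Carberry' and split one person across two names; grouping by ORCID does neither.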

ORCID may not succeed, but the promise of what it could deliver is so important that we should all give it the benefit of the doubt and try to make it work. If you have problems with ORCID, let them know. Most importantly, if you don't yet have one you should register for your ORCID identifier now! It is an open platform, run by a non-profit organization (these are good things). It takes just 30 seconds, and apart from those 30 seconds you have nothing to lose.

Twenty years of bacterial genome sequences

Take-Home Message comic #5 celebrates an amazing milestone. During my PhD, I kept a little list pinned to the filing cabinet next to my desk, a list which contained details of every sequenced genome. This was something that was much easier when the number of published genomes was still in the single digits!

This is my favorite Take-Home Message comic to date. I feel that we are slowly settling in on a style that works well for this medium, and Abby's drawings just seem to get better and better.

101 questions with a bioinformatician #33: Sarah Teichmann

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Sarah Teichmann is a Group Leader at the European Bioinformatics Institute and a Senior Group Leader at the Wellcome Trust Sanger Institute — the Genome Campus (at Hinxton, UK) is one of those strange places where you can walk 10 meters and become a different (and more senior) person!

Her research focuses on elucidating the principles of protein structure evolution, higher order protein structure and protein folding. She also has a longstanding interest in understanding gene expression regulation. As part of her work, she is involved with developing and maintaining a number of useful bioinformatics resources including the 3D Complex database.

Sarah was a recent recipient of the prestigious European Molecular Biology Organization (EMBO) Gold Award for her use of 'computational and experimental methods to better understand genomes, proteomes and evolution'. She was also recently interviewed by CrossTalk (the blog of Cell Press): The Unstoppable Sarah Teichmann on Programming, Motherhood, and Protein Complex Assembly. I particularly liked Sarah's general advice to junior scientists:

Follow your heart and work on things you are excited about and enjoy. Life is too short—and academic careers too unpredictable—to settle for anything less. Try to work with people who are reasonable and considerate of others, yet driven and focused, and generous in investing time and resource to projects and careers of lab members and colleagues.

You can find out more about Sarah by visiting her group's website. And now, on to the 101 questions...



001. What's something that you enjoy about current bioinformatics research?

The data deluge! So much and so many kinds of biological data — ranging from all the versions of next-generation sequencing data to protein structures — it is such a gift. As computational biologists, we are in an unprecedented position to make new discoveries by mining this data, and we’re all having a ball.



010. What's something that you don't enjoy about current bioinformatics research?

I’m thinking hard to come up with something. One issue that has always puzzled me is why mainstream journals don’t recognise the value of pure theoretical and computational biology. The prediction of the structure of the double helix was recognised with a Nobel Prize, and celebrated more than the Franklin/Wilkins crystal structure. Predictions are generally given scant notice, and the experimental validation (often years later) is considered the key achievement. This strikes me as incongruous.



011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Take programming and computer science seriously, and get some formal training in it.



100. What's your all-time favorite piece of bioinformatics software, and why?

R came after my time as a hands-on researcher (I’m more of a 90s Perl girl) but it seems to have revolutionised how quickly people can implement methods and visualise data. I also like the fact that there are now notebook-style ways of documenting whole workflows in R and Python. This can be included as supplementary material in publications and should help in making analyses easily reproducible by others.



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

Please can I choose three? A then U then G codes for "go bioinformatics" ☺

Microsoft's vision for bioinformatics research (caution: NSFW)

I came across this disturbing image on the web today. Warning, may cause offense:

Even more disturbing was the text that accompanied the image, text that appears on Microsoft Research's flickr account (emphasis mine):

The Microsoft Biology Initiative includes several Microsoft biology tools that enable biology and bioinformatics researchers to be more productive in making scientific discoveries. One such tool, the Microsoft Research Biology Extension for Excel, displays the contents of a FASTA file containing an Influenza A virus sequence. By importing FASTA data into Excel, researchers are better able to visualize and analyze information.

The point at which you want to import FASTA files into Excel is the point at which you should probably think about quitting bioinformatics.

101 questions with a bioinformatician #32: Aaron Quinlan

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Aaron Quinlan is an Associate Professor of Human Genetics and Biomedical Informatics at the University of Utah and the Associate Director of the USTAR Center for Genetic Discovery.

His research focuses on "developing and applying computational methods towards the understanding of genetic variation in diverse contexts". This work has led to Aaron's involvement in the development of many popular bioinformatics tools, with Bedtools being one of the most well known. I wish he had time to blog more, because then we could all enjoy more writing like this:

Have you ever been incensed by the ridiculous number of chromosome naming and ordering schemes that exist in genomics? If the answer is “no”, then either you are an incredibly patient person, you enjoy unnecessary chaos, or you just haven’t done any detailed analysis of genomics datasets.

You can find out more about Aaron by visiting his lab's website, or by following him on twitter (@aaronquinlan). And now, on to the 101 questions...



001. What's something that you enjoy about current bioinformatics research?

I come from a creative family and have always enjoyed building things. There is pure joy in having the power to conceive and apply an algorithmic idea that has the potential to improve our understanding of the biology of the genome and the genetic basis of disease.



010. What's something that you don't enjoy about current bioinformatics research?

The fashion.



011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Take every math and statistics course possible and read constantly while you still have the time.



100. What's your all-time favorite piece of bioinformatics software, and why?

Without question, PolyBayes (Marth et al, 1999). I came to computational biology as a former software engineer without substantial training in biology. PolyBayes was the first Bayesian method for polymorphism detection and was written by my Ph.D. mentor, Gabor Marth. I spent much of my first year in graduate school dissecting the PolyBayes code (and the ACE file format!!!) to understand the mathematical and data analysis strategies that were required at the time. That learning process has influenced much of the work I have done since.



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

N, since I constantly feel as though I am doing everything while also doing nothing.

 

DVD bonus materials


KRB: Because of the relative brevity of this interview, I thought that I would also share a couple of answers that Aaron gave me to some of the questions I also include when asking people to do these interviews (this info sometimes helps me write my introductions):


0111. What is the correct way of describing your current position or title(s)

  1. Associate Professor of Human Genetics and Biomedical Informatics
  2. Associate Director of the USTAR Center for Genetic Discovery
  3. Sender of the emails and bringer of the donuts.


1001. In 1–2 sentences, describe what your role entails

Basically doing everything I can to not be a bottleneck for the people in my lab.

BioDocker and BioBoxes: the containerization of bioinformatics

Thanks to a post on the BioCode's Notes blog I have discovered that there is a project called BioDocker which aims to generate lots of Docker containers to help make bioinformatics more reproducible by standardizing how bioinformatics software is packaged. From the BioDocker website:

The main purpose of this project is to spread the use of Docker on the Bioinformatics and Computational Biology areas. By using pre-configured containers with different bioinformatic softwares some critical aspects of Bioinformatics like reproducibility are minimized. Here you will find a list of containers with different bioinformatics software and how to use it.

BioDocker was created by Felipe da Veiga Leprevost in 2014, and the associated GitHub repository currently has a dozen or so containers.

When I first read about BioDocker I was confused because I know that there is also the Bioboxes project which aims to er…make bioinformatics more reproducible by standardizing how bioinformatics software is packaged. From the Bioboxes manifesto:

Software has proliferated in bioinformatics and so have the problems associated with it: missing or unobtainable code, difficult to install dependencies, unreproducible workflows, all with terrible user experiences. We believe a community standard, using software containers, has the opportunity to solve these problems and increase the standard of scientific software as a whole.

I think the aims of these two projects are similar but not identical, and Bioboxes probably has a broader remit. Both projects are aware of each other and it looks like they have had some productive exchanges.

All of this makes me feel that the bioinformatics community seems to be slowly, but steadily, embracing Docker. Any approaches to standardize how we do bioinformatics should be welcomed, but some of us with long memories will recall that we have been in this situation before. Anyone remember the promises of how CORBA and then SOAP were going to increase interoperability in bioinformatics?

The name of this bioinformatics tool merits close inspection

  1. Bogus bioinformatics acronyms = mildly annoying
  2. Names that clash with previously published tools = mildly annoying
  3. Bogus bioinformatics acronyms that clash with previously published tools = very annoying

Step forward a new paper published in the journal Bioinformatics:

How does INSPEcT derive its name?

  • INSPEcT (INference of Synthesis, Processing and dEgradation rates in Time-course analysis)

Inclusion of the 'E' from 'degradation' and omission of 'R', 'C', or 'A' (from 'Rates', 'Course', and 'Analysis') earns this tool a JABBA award. It also earns a 'Duplications' award because of:

Bioinformatics is just like bench science and should be treated as such

A great post by Richard Edwards on his Cabbages of Doom blog, which includes a list of 8 shocking ways that bioinformatics is just like bench science. Highly recommended reading. His conclusion bears repeating here:

Bioinformatics is science. Full stop. It is no better than other science. It is no worse than other science. People do it right. People do it wrong.