ORCID: Over 1.5 million IDs served

I was pleased to read that there are now over 1.5 million ORCID IDs in existence. If you didn't know, ORCID provides unique identifiers to researchers. Once you have an ORCID identifier, you can start linking all of your research to that identifier. If you change your name, or if you suffer from having a very common surname, ORCID makes it easier to track your contributions to science.

From their about page:

ORCID is an open, non-profit, community-driven effort to create and maintain a registry of unique researcher identifiers and a transparent method of linking research activities and outputs to these identifiers. ORCID is unique in its ability to reach across disciplines, research sectors and national boundaries. It is a hub that connects researchers and research through the embedding of ORCID identifiers in key workflows, such as research profile maintenance, manuscript submissions, grant applications, and patent applications.

I'm hopeful that ORCID will one day become the glue to tie together all scientific output. Written a blog post, or grant, or git commit message for some piece of scientific software? There is no reason why these couldn't all be 'digitally signed' with your ORCID identifier.

If you don't have an ORCID ID — and yes I appreciate that this is somewhat of a clumsy nomenclature — you really should register now.

Mining UniProt: the Roy Chaudhuri quest to find DNA-like protein sequences

Image adapted from original image by flickr user lwr

So I recently came up with this idea for #tweetsascode: using twitter to write tweets which contain functional programs within their 140 characters. I posted one such example last Friday, a Perl script which checks a FASTA file (specified on the command-line), in order to determine whether it contains a protein or DNA sequence:

#!/usr/bin/perl
use strict; 
use warnings;

while(<>){
    next if m/(^>)|(^$)/;
    die "Protein" if (m/[EFILOPQ]/i);
    die "DNA";
}
#tweetsascode

This code skips FASTA definition lines (those starting with '>') and blank lines, and then asks: does the first line of sequence contain any of the seven amino-acid characters which are not IUPAC nucleotide characters? If so, then it must be protein; otherwise the sequence is probably DNA.
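As a quick check on where those seven characters come from, here is a minimal Python sketch (not part of the original tweet). It assumes the amino-acid alphabet is the 20 canonical residues plus U and O, and it uses the 16 IUPAC nucleotide characters; the variable names are just for illustration.

```python
# Letters used for amino acids: the 20 canonical residues plus
# U (selenocysteine) and O (pyrrolysine). Ambiguity codes are left out.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY") | set("UO")

# The 16 IUPAC nucleotide characters (A, C, G, T, U plus ambiguity codes).
IUPAC_NUCLEOTIDES = set("ACGTUNWSKMRYBDHV")

# Amino-acid letters that can never appear in a nucleotide sequence.
protein_only = sorted(AMINO_ACIDS - IUPAC_NUCLEOTIDES)
print("".join(protein_only))
```

The set difference gives the same character class that the Perl regular expression tests for.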

This led @RoyChaudhuri to comment:

Roy's point is that there are so many IUPAC nucleotide characters that a protein sequence which only contained 14 of the 20 canonical amino acids (those other than E, F, I, L, P, and Q) would also pass the test as a valid nucleotide sequence. Is it therefore possible to determine how many 'DNA-like' proteins there are?

Experiment

With the help of a little Perl script, I did the following:

  1. I first downloaded the FASTA files for SWISS-PROT and TrEMBL, which collectively comprise the UniProt protein database. If you didn't know, SWISS-PROT contains manually annotated entries whereas the much larger TrEMBL database is automatically annotated.
  2. For every protein sequence in SWISS-PROT or TrEMBL, my script counted the use of various protein ambiguity characters (this was just out of curiosity). These are B (aspartic acid or asparagine), J (leucine or isoleucine), Z (glutamate or glutamine), and X (unknown amino acid).
  3. The script also counted usage of the 21st and 22nd amino acids (selenocysteine and pyrrolysine, which have the valid IUPAC characters U and O respectively).
  4. The script counted any protein sequences which only contained amino acids that have equivalent IUPAC characters for the set of four canonical nucleotides (i.e. alanine, cysteine, glycine, and threonine).
  5. Finally, the script counted any protein sequences which only contained amino acids that have equivalents from any of the 16 IUPAC nucleotide characters.
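
The counting steps above could be sketched roughly as follows. This is a Python illustration and not the Perl script that I actually used; the function names (`read_fasta`, `tally`), the category labels, and the two 'toy' records are all made up for the example.

```python
import re
from collections import Counter

CLASSIC_AA = re.compile(r"[ACDEFGHIKLMNPQRSTVWY]+")  # 20 canonical residues
CLASSIC_DNA = re.compile(r"[ACGT]+")                  # four canonical bases
IUPAC_DNA = re.compile(r"[ACGTNUWSKMRYBDHV]+")        # all 16 nucleotide codes


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def tally(records):
    """Count how many sequences fall into each category from steps 2-5."""
    counts = Counter()
    for _header, seq in records:
        counts["total"] += 1
        if CLASSIC_AA.fullmatch(seq):
            counts["classic_only"] += 1
        for char in "UOBJZX":
            if char in seq:
                counts["contains_" + char] += 1
        if CLASSIC_DNA.fullmatch(seq):
            counts["acgt_only"] += 1
        elif IUPAC_DNA.fullmatch(seq):
            counts["iupac_only"] += 1  # DNA-like, but not purely A/C/G/T
    return counts


# One real SWISS-PROT fragment plus two invented records.
example = """>sp|P02732|ANP3_PAGBO
AATAATAATAATAATAATAATAATAATAATA
>toy1 hypothetical
MCCNYYGNSC
>toy2 hypothetical
MKLVXXE
""".splitlines()

print(tally(read_fasta(example)))
```

For the real databases you would pass an open file handle to `read_fasta` instead of a list of lines, so that the whole file never has to sit in memory.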

Results (SWISS-PROT)

Dataset = 549,008 proteins

  • 546,360 only contained the 20 'classic' amino acid characters
  • 254 contained selenocysteine characters (U)
  • 29 contained pyrrolysine characters (O)
  • 138 contained alternative amino acid character B (representing D or N)
  • 0 contained alternative amino acid character J (representing L or I)
  • 114 contained alternative amino acid character Z (representing E or Q)
  • 2,222 contained unknown amino acid characters (X)
  • Only 1 protein consisted entirely of A, C, G, and T
  • An additional 123 proteins consisted entirely of valid IUPAC nucleotide characters ([ACGTNUWSKMRYBDHV])

The sequence that contained only 'classic' DNA characters was a 31 amino acid fragment, which turned out to contain only two different amino acids (alanine and threonine):

>sp|P02732|ANP3_PAGBO Ice-structuring glycoprotein 3 (Fragments) OS=Pagothenia borchgrevinki PE=1 SV=1
AATAATAATAATAATAATAATAATAATAATA

Of the 123 proteins that used various characters from the full set of IUPAC nucleotide characters, this 128 amino acid protein was the longest:

>sp|Q925H4|KR211_MOUSE Keratin-associated protein 21-1 OS=Mus musculus GN=Krtap21-1 PE=2 SV=2
MCCNYYGNSCGGCGYGSRYGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYG
SGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSGYGCGYGSG
YGCGYGSRYGCGYGSGCCSYRKCYSSCC

This SWISS-PROT entry (accession Q925H4) is a mouse protein which has experimental evidence, and which has an Annotation score of 5 out of 5.

Results (TrEMBL)

Dataset = 50,011,027 proteins

  • 49,373,499 only contained the 20 'classic' amino acid characters
  • 1,697 contained selenocysteine characters (U)
  • 199 contained pyrrolysine characters (O)
  • 15,314 contained alternative amino acid character B (representing D or N)
  • 0 contained alternative amino acid character J (representing L or I)
  • 5,842 contained alternative amino acid character Z (representing E or Q)
  • 632,742 contained unknown amino acid characters (X)
  • Only 2 proteins consisted entirely of A, C, G, and T
  • An additional 1,827 proteins consisted entirely of valid IUPAC nucleotide characters ([ACGTNUWSKMRYBDHV])

In this much larger set of proteins we still don't see a sequence that resembles 'classic DNA' that is any longer than the 31 amino acid fragment found in SWISS-PROT. Instead, the longest sequence was a 24 amino acid fragment (which only contains alanine and glycine):

>tr|U6PUI2|U6PUI2_HAECO ISE/inbred ISE genomic scaffold, scaffoldpathogensHcontortusscaffold6340 (Fragment) OS=Haemonchus contortus GN=HCOI01698300 PE=4 SV=1
GAAAAGGGGGGGGGGGGAAAAGGA

However, in TrEMBL there was a much longer sequence which contains various characters from the full set of IUPAC nucleotide characters. This uncharacterized protein is 495 amino acids long, and contains mostly serine, arginine, and cysteine:

>tr|A0A0E9H024|A0A0E9H024_STREE Uncharacterised protein OS=Streptococcus pneumoniae GN=ERS23250802220 PE=4 SV=1
MRSRSYYTSVSRRKSSSSSSRSSSSSRSSSSCSSCRSSSSSRSSSSCRSS
SSCSSCRSSRSSRSSSSCSSSRSCRSCSSSRSCSSCRSSSSCSSCRSSRS
SRSSSSSRSSRSSSSCSSSRSSSSCRSSRSSRSSRSSRSCSSCRSSSSCS
SCRSCRSSSSCSSCRSSRSSRSSSSCRSSSSCSSCSSCRSSRSSSSCSSS
RSCRSCRSCRSSSSSSSCSSSRSSRSSSSSRSCSSCRSSSSCSSCRSSSS
CSSCRSSRSSSSCSSCSSCRSSRSSSSCSSSRSSRSSRSCRSSSSSRSCR
SCSSCRSSSSCSSCRSSRSSRSCSSCRSSRSSRSCRSSSSSRSSRSSSSS
RSSRSSSSCRSSRSSSSSRSCSSCRSSRSSSSCSSSRSSSSSRSCSSSRS
CSSCRSSSSCSSCRSSRSSRSSSSSRSSRSSSSCSSSRSSSSCRSSRSSR
SSRSSRSSSSSRSSRSSSSSRSSRSSSSCRSCRSSSSCSSCSSSR

Honorable mention to XXX-rated protein

It is worth giving a shout out to UniProt accession W4XLU5 (from the TrEMBL database). This uncharacterized protein has a length of 21,842 amino acids…21,292 of which are represented by unknown amino acids!!! This is probably why the protein has an Annotation score of just 1 out of 5.

Conclusions

  1. To answer Roy's initial question, only about 0.004% of proteins in UniProt (1,953 / 50,560,035) fulfil the requirement of only containing amino acids that have equivalent IUPAC nucleotide characters.
  2. From a coding point of view, one should possibly account for the fact that you can see almost 500 DNA-like characters in a sequence, but you still could be looking at a protein sequence.
  3. A ~22,000 amino acid protein which contains 97% 'unknown' residues should maybe take the award for 'least-useful protein annotation'.
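
The proportions behind conclusions 1 and 3 can be recomputed from the counts reported above. This is just a short Python sanity check of the arithmetic, and the variable names are my own.

```python
# Counts reported in the two results sections above.
swissprot_total, trembl_total = 549_008, 50_011_027
dna_like = (1 + 123) + (2 + 1_827)  # SWISS-PROT + TrEMBL

total = swissprot_total + trembl_total
percent_dna_like = 100 * dna_like / total
print(f"{dna_like:,} / {total:,} = {percent_dna_like:.4f}%")

# W4XLU5: unknown (X) residues as a share of its total length.
percent_unknown = 100 * 21_292 / 21_842
print(f"{percent_unknown:.1f}% unknown residues")
```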

Insights into the mind and practices of a bioinformatics developer

Michael Barton (who has the coolest twitter handle: @bioinformatics) is a talented bioinformatician who has a great, though infrequently updated, blog (Bioinformatics Zen).

His latest post (How I Develop) takes a look at his overall coding and working practices, reflecting on his decade-long experience in bioinformatics. I particularly like the section on 'Focus':

Getting up very early. I built nucleotid.es by getting up at 6am every morning for about 6 months. An average of 2-3 hours of time everyday allowed me to create the prototype site to prove that this could be useful. These hours in the morning feel like my most productive and usually there's no one else awake to interrupt you.

Well worth a read, as is his FAQ.

Does your bioinformatics software pass the 'elevator test'?

The name of your bioinformatics software is important. A good name should be clear, unambiguous, pronounceable, memorable, and meaningful. Sadly many (most?) names of existing tools do not satisfy all of these criteria. Here is a simple thought experiment that you can use when trying to decide on a new name for your software; it might help you avoid many common naming problems.

Imagine that you are in an elevator going from the 6th floor of a building to the ground floor. The elevator stops at the 5th floor and a visiting bioinformatics/genomics scholar steps in. He/she is someone that you admire and someone who you would really like to know about the latest software tool that you've been working on.

They press the button for the 2nd floor. You have maybe 30 seconds to introduce the tool and hopefully make them curious enough to check it out when they next get back to their computer. You say something like:

Hi. I'm a big fan of your work. I wanted to let you know that I've been working on a tool that you might be interested in…it's called 'X'

In this example, we will assume that you may never see this person again and that you don't know when they will have time to look up your software tool. It might be days, so the name has to be something that they will remember. The more meaningful and pronounceable the name, the more chance that it will be memorable.

Now, let's consider the names of some recently published bioinformatics tools…do these pass the elevator test? You should always consider how you might have to spell out the name of your software:

  • tmle.npvi — tee-em-el-ee-dot-en-pee-vee-aye
  • EW_dmGWAS — Ee-double-you-underscore-dee-em-gee-was
  • do_x3dna — dee-oh- (or do?) -underscore-ex-three-DNA
  • R3D-2-MSA — ar-three-dee-dash-two-dash-em-ess-ay 
  • Pse-in-One — pee-ess-ee- (or see?) -dash-in-dash-one
  • (PS)2 — open-parentheses-pee-ess-close-parentheses-superscript-two

In these examples you would probably choose to omit details of the dots, dashes, underscores, parentheses, and superscript characters that are part of the name. So you should ask yourself whether you really need to include them in the first place.
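
If you want a rough way to spot such characters before you commit to a name, something like the following Python sketch would do. The function name is made up for this example, and treating every ASCII punctuation character as 'needs spelling out' is my own simplification.

```python
import string


def characters_to_spell_out(name):
    """Return the punctuation characters in a tool name, in order of appearance."""
    return [char for char in name if char in string.punctuation]


for name in ["tmle.npvi", "EW_dmGWAS", "do_x3dna", "R3D-2-MSA", "Pse-in-One", "(PS)2"]:
    print(name, characters_to_spell_out(name))
```

An empty list doesn't mean that the name passes the elevator test, of course, but a non-empty one is a hint that you will end up either spelling the name out or silently dropping characters from it.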

The bottom line is that it is not enough for the name of your software to be comprehensible when read from a screen or page…it should also sound good!

Bioinformatics and genomics resources on reddit

 

Although many people in this field turn to sites like SEQanswers and Biostars to get help with bioinformatics problems, there are a number of subreddits that are devoted to discussion of bioinformatics and genomics. Reddit isn't just a forum for asking questions though, and people also share lots of relevant links (to papers, resources, news etc.). As a new subreddit appeared this week, I thought I'd present a quick roundup:

 

1. bioinformatics

This is the most popular subreddit in this list (in terms of readers). The posts are split roughly equally between sharing links of interest and asking questions. The questions frequently come from people wanting career advice on getting into this field.

 

2. genomics

  • URL: https://www.reddit.com/r/genomics/
  • Description: Genomics, genetics, DNA, health, and personalized medicine
  • Frequency of posts: Infrequent, ~1–5 new items per week
  • Current readership: 3,085 readers

This subreddit seems to be declining in popularity. It mostly features shared links rather than people asking questions.

 

3. genome

  • URL: https://www.reddit.com/r/genome/
  • Description: Please submit primary genomics literature, discussions of primary literature (e.g. blog posts and serious news stories), and resources for genomics research.
  • Frequency of posts: Moderate, ~5–10 new items per week
  • Current readership: 211 readers

This is a relatively new subreddit and it is focused on people sharing new and interesting papers, and using the subreddit to discuss those papers.

 

4. learnbioinformatics

  • URL: https://www.reddit.com/r/learnbioinformatics/
  • Description: This is a subreddit for providing you with the most relevant academic papers, textbooks, websites, and tutorials in the field of bioinformatics. If you have any recommended resources, please feel free to post away!
  • Frequency of posts: Too early to reliably predict
  • Current readership: 299 readers

This is the newest subreddit (just a few days old), and so has only attracted about a dozen posts. The intended role of this subreddit has obvious overlaps with the other subreddits.

Concerning the gender ratio of speakers at the 2015 Genome Science Meeting

The Genome Science 2015 meeting has announced their speaker line-up. At the time of writing, not all of the speaker positions are finalized, but currently the published agenda reveals:

  • 13 men
  • 9 women

So currently 41% of the speakers are women which is excellent. Hoping that the remaining 15 slots keep this conference free from notable gender bias.

Time for a couple of new JABBA awards for bogus bioinformatics acronyms

It's been a while since I handed out any JABBA awards (Just Another Bogus Bioinformatics Acronym). These are awarded to developers of bioinformatics tools who name their software using tenuously derived acronyms or initialisms. Both awardees featured in a recent issue of Nucleic Acids Research. First up, we have:

I'm always slightly worried when the publication doesn't reveal the source of an acronym. I had to go to the I-TASSER website to find out the grim truth:

Iterative Threading ASSEmbly Refinement

It's not an altogether terrible name, but to me it doesn't include any mention of protein structure or function prediction (which is the purpose of the tool). Even if you see this full name written down somewhere, you might assume that it is something to do with other types of assembly.

Next up is a software tool which has a nice short name:

Just five letters in the name, they probably correspond neatly to five words in the software's full name, right? Er, no.

Signature-based ClUstering for DiagnOstic purposes

This is one of the most tenuous initialisms that I have seen. In these situations, I would urge developers not to try to turn the name into an acronym or initialism at all. Just calling the tool 'SCUDO' would be preferable, in my opinion, to using such a clumsy initialism.

Another take on our new Unix & Python book

Following on from my announcement yesterday that I am involved in writing a new book (Unix and Python to the Rescue!), my co-author Michelle has written a few words about it on her blog which I encourage you to read. She tackles the issue of why people in our field (the life-sciences) should learn to code:

It is my belief--based on my own experience--that using a prefabricated tool, such as a spreadsheet or graphing program, inherently limits you to someone else's idea of what analytical questions you should be asking about your data.

In today's scientific world, the amount and type of data we need to understand changes rapidly, and these programs can quickly become limiting. By taking the time to learn a set of basic tools that can be combined in limitless ways, you empower yourself to ask the kinds of analytical questions you want to ask about your data

Taking steps to write a new book about programming!

The old book

I am very excited to announce that I am involved in writing another book about programming! The 2012 book that I wrote with Ian Korf, Unix and Perl to the Rescue!: A Field Guide for the Life Sciences (and Other Data-rich Pursuits), was enjoyable to write and seemed to be well received (4.5 star average on Amazon.com), and so we both wanted to do something else.

We wrote about Perl because it is the language that we had both used since the mid-1990s, and for a long while Perl was the language du jour for people working in bioinformatics. This has changed. The TIOBE software index uses search engine queries to track the popularity of all programming languages over time. In 2000, Perl was the 4th most popular language whereas Python ranked 24th. As of July 2015, Python has risen to 5th place, overtaking Perl which has dropped to 11th place. Not only is Python proving an extremely popular language, it is swiftly overtaking Perl in many areas involving the processing of biological data.

The new book

So we made a proposal to Cambridge University Press to write what we are provisionally calling Unix and Python to the Rescue! (this will no doubt be the start of a successful series which will culminate in Unix and Minecraft to the Rescue!). Happily, they have accepted our proposal and so we have recently started the process of writing the new book (hopefully due to be published in 2016).

We intend for this book to fulfill many of the same goals that we had for our earlier book:

  1. Contain basic material that introduces Unix & Python to someone who has never sat down at a terminal or written a line of code before.
  2. Include many advanced programming concepts in addition to the basics.
  3. Where possible, only introduce one new concept at a time.
  4. Write in a lively, engaging style in order to make the concepts fun!

For item #2, we envisage our book addressing topics such as NumPy, SciPy, IPython Notebooks, and the pandas package, to name but a few!

 

The new co-author

For this new endeavor we have recruited the many talents of Michelle Gill (@modernscientist) who will bring her Python skills and all-round coding expertise to the project. Michelle is a scientist at the National Cancer Institute and has been using Python to analyze research data — and also using it for fun — for most of the last decade. You can find out more about Michelle, and see examples of her coding expertise on her excellent blog, themodernscientist. I asked her to say a few words about the new book:

"The purpose of this book is to equip scientists with the tools necessary to understand and analyze data in the way that directly suits their needs and can be reproduced in the future. With the ever increasing pace of research and volume of data generated, I am convinced the best way to accomplish this is by learning Python".

Michelle Gill

 

The new website

We previously had a website (unixandperl.com) and twitter account (@unixandperl) to support the old book and related materials. However, it seems fitting that we need to 'expand the brand', and so we have an updated website that can now be found at rescuedbycode.com (the old URL should still redirect here). This website continues to host information about the completely free Unix and Perl Primer for Biologists that we previously released, as well as the (also free) Command-line Linux Bootcamp that I recently added.

I expect that we will add some more posts to the new website in the coming months, and we will also continue to publish occasional items of relevance to the twitter account (also renamed to @rescuedbycode). Much of Python is new to me, and I hope to share some of my experiences from the point of view of someone who comes from a Perl background.

Okay, time for me to go and do some more research for the book.

101 questions with a bioinformatician #29: Jane Loveland

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Jane Loveland is a Senior Computer Biologist at The Wellcome Trust Sanger Institute where she is involved in a number of key projects relating to genome annotation and training.

As a manager in the HAVANA group (Human and Vertebrate Analysis and Annotation), she helps oversee the valuable work in using manual annotation to provide a reference gene set for the human, mouse, and zebrafish genomes. HAVANA's annotation is made publicly available via the Vega genome browser, which is in turn merged with the annotation in Ensembl to produce the reference GENCODE gene set.

Jane also leads a team of instructors for Wellcome Trust Advanced Courses which teach workshops all over the world, in particular the Open Door Workshops:

The Open Door Workshop provides an introduction to bioinformatics tools freely available on the internet, focussing primarily on the Human Genome data. The workshops provide hands-on training in the use of public databases and web-based sequence analysis tools, and are taught by experienced instructors.

And now, on to the 101 questions...



001. What's something that you enjoy about current bioinformatics research?

The speed of change. From an annotation view point, we are constantly having to find ways to use new data sources which in turn adds value to the annotation that we produce.

When I’m putting together a manual for a workshop I have to update everything, every time. I have come into bioinformatics from wet lab biochemistry/molecular biology and I once spent an entire week hand-crafting a multiple alignment figure for my thesis. I can do this in a few minutes now.



010. What's something that you don't enjoy about current bioinformatics research?

Everyone assumes that all genome sequences are 'finished' (KRB: I don't!). They may be sequenced but the quality is often pretty poor compared to the sequence that we were producing at the Sanger Institute about a decade ago.

You can’t interpret what’s going on in a genome if the underlying reference sequence is of poor quality. I do a lot of teaching and spend a lot of time explaining this to researchers.



011. If you could go back in time and visit yourself as an 18-year-old, what single piece of advice would you give yourself to help your future bioinformatics career?

Just go for it. Bit of a cliché I know. I had a crippling lack of confidence when I was younger which I think really held me back.



100. What's your all-time favorite piece of bioinformatics software, and why?

For annotation: Blixem. This is an interactive graphical BLAST viewer — old but essential for gene annotation. Means that I can view alignments to the genome at base pair level really quickly and simply.

For workshops: Ensembl. You have to be able to browse a genome.



101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality, and why?

Can I have I for inosine? Reminds me of making degenerate primers for PCR. It's a multi-tasker, which is also how I see myself. It's not on the list though (KRB: everyone keeps breaking the rules!).