A timely call to overhaul how scientists publish supplementary material [Link]

Great new editorial piece in BMC Bioinformatics by Mihai Pop and Steven Salzberg that tackles a subject that people probably don't think about too much:

They highlight some of the problems that arise from the growing trend in some journals to publish very short articles that are accompanied by extremely lengthy supplementary material. They single out a few particularly lop-sided papers, including a 6-page article that has 165 pages of supplementary material, and make some solid observations about why this facet of publishing has become a problem. Perhaps most importantly, citations that are buried in supplementary material do not get tracked by citation indices.

They conclude the paper with a proposal:

The ubiquitous use of electronic media in modern scientific publishing provides an opportunity for the better integration of supplementary material with the primary article. Specifically, we propose that supplementary items, irrespective of format, be directly hyper-linked from the text itself. Such references should be to specific sections of the supplementary material rather than the full supplementary text.

Yes, yes, a thousand times yes!

We asked 272 bioinformaticians…name something that makes you angry: more reflections on the poor state of software documentation.

I'd like to share the details of a recent survey conducted by Nick Loman and Thomas Connor that tried to understand current issues with bioinformatics practice and training.

The survey was announced on twitter and attracted almost 300 responses. Nick and Tom have kindly placed the results of the survey on Figshare so that others can play with the data (it seems fitting to talk about this today as it is International Open Access Week):

When you ask a bunch of bioinformaticians the question "What things most frustrate you or limit your ability to carry out bioinformatics analysis?" you can be sure that you will attract some passionate, and often amusing, answers (I particularly liked one person's response: "Not enough Heng Li").

I was struck by how many people raised the issue of poor, incomplete, or otherwise terrible software documentation as a problem (there were at least 42 responses that mentioned this). The availability of 'good documentation' was also listed as the 2nd most important factor when choosing software to use.

I recently wrote about whether this problem is something that really needs to be dealt with by journals and by the review process. It shouldn't be enough that software is available and that it works; we should have some minimal expectations for the documentation that accompanies bioinformatics software.

Keith's 10-point checklist for reviewing software

If you are ever in a position to review a software-based manuscript, please check for the following:

  1. Is there a plain text README file that accompanies the software and which explains what the program does and who created it?
  2. Is there a comprehensive manual available somewhere that describes what every option of the program does?
  3. Is there a clear version number or release date for the software?
  4. Does the software provide clear installation instructions (where relevant) that actually work?
  5. Is the software accompanied by an appropriate license?
  6. For command-line programs, does the program give some sensible output when no arguments are provided?
  7. For command-line programs, does the program give some sensible output when -h and/or --help is specified (see this old post of mine for more on this topic)? A minimal sketch illustrating this item and the previous one appears after this list.
  8. For command-line programs, does the built-in help/documentation agree with the external documentation (text/PDF), i.e. do they both list the same features/options?
  9. For script-based software (Perl, Python, etc.), does the code contain a reasonable level of comments that allow someone with relevant coding experience to understand what the major sections of the program are trying to do?
  10. Is there a contact email address (or link to support web page) provided so that a user can ask questions and get more help?
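
To make items 6 and 7 concrete, here is a minimal sketch of a well-behaved command-line program in Python. Everything about it is hypothetical (the program name, the option, and the FASTA-counting task were invented for this example); the point is simply that Python's argparse module gives you -h/--help for free, and that a no-argument invocation should produce usage information rather than a stack trace.

    #!/usr/bin/env python
    """count_seqs: a hypothetical program illustrating checklist items 6 and 7."""

    import argparse
    import sys


    def count_sequences(path, min_length=0):
        """Count FASTA records whose sequences are at least min_length long."""
        kept = 0
        seq_len = None  # None until we have seen the first header line
        with open(path) as handle:
            for line in handle:
                line = line.rstrip()
                if line.startswith(">"):
                    if seq_len is not None and seq_len >= min_length:
                        kept += 1
                    seq_len = 0
                elif seq_len is not None:
                    seq_len += len(line)
        if seq_len is not None and seq_len >= min_length:
            kept += 1
        return kept


    def main():
        parser = argparse.ArgumentParser(
            prog="count_seqs",
            description="Count sequences in a FASTA file (illustrative only).",
        )
        # Item 7: argparse responds to -h and --help automatically
        parser.add_argument("fasta", help="input FASTA file")
        parser.add_argument("--min-length", type=int, default=0,
                            help="ignore sequences shorter than this (default: 0)")

        # Item 6: print usage, not a stack trace, when run with no arguments
        if len(sys.argv) == 1:
            parser.print_usage(sys.stderr)
            sys.exit(1)

        args = parser.parse_args()
        print(count_sequences(args.fasta, args.min_length))


    if __name__ == "__main__":
        main()

Running count_seqs with no arguments prints a one-line usage summary, and count_seqs --help documents every option, which covers most of what items 6 and 7 ask for.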

I'm not expecting every piece of bioinformatics software to tick all 10 of these boxes, but most of these are relatively low-hanging fruit. If you are not prepared to provide useful documentation for your software, then you should also be prepared for people to choose not to use your software, and for reviewers to reject your manuscript!

Making code available: lab websites vs GitHub vs Zenodo

Our 2007 CEGMA paper was received by the journal Bioinformatics on December 7, 2006. This means that it was about nine years ago that we needed to have the software on a website somewhere, with a URL that we could use in a manuscript. This is what ended up in the paper:

Even though we don't want people to use CEGMA anymore, I'm at least happy that the link still works! I thought it would be prudent to make the paper link to a generic top-level page of our website (/Datasets) rather than to a specific page. This is because experience has taught me that websites often get reorganized, and pages can move.

If we were resubmitting the paper today and if we were still linking to our academic website, then I would probably suggest linking to the main website page (korflab.ucdavis.edu) and ensuring that the link to CEGMA (or 'Software') was easy to find. This would avoid any problems with moving/renaming pages.

However, it would be even better to use a service like Zenodo to permanently archive the code; this would also give us a DOI for the software. Posting code to repositories such as GitHub is (possibly) better than using your lab's website, but people can delete GitHub repositories! Zenodo can take a GitHub repository and make it (almost) permanent.

The Zenodo policies page makes it clear that even in the 'exceptional' event that a research object is removed, they still keep the DOI URL working, and the page it resolves to will state what happened to the code:

Withdrawal
If the uploaded research object must later be withdrawn, the reason for the withdrawal will be indicated on a tombstone page, which will henceforth be served in its place. Withdrawal is considered an exceptional action, which normally should be requested and fully justified by the original uploader. In any other circumstance reasonable attempts will be made to contact the original uploader to obtain consent. The DOI and the URL of the original object are retained.

There is a great guide on GitHub for how you can make your code citable and archive a repository with Zenodo.
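
If you use that integration, Zenodo also lets a repository supply its own metadata for each archived release via a .zenodo.json file placed in the repository root. Here is a minimal sketch, written as a Python script that generates the file; every field value is a placeholder rather than real project metadata, and you should check Zenodo's documentation for the current list of accepted fields and license identifiers.

    import json

    # All values below are placeholders, not real project metadata
    metadata = {
        "upload_type": "software",
        "title": "my_tool: one-line description of the software",
        "description": "A longer description of what the software does.",
        "creators": [
            {"name": "Bradnam, Keith", "affiliation": "UC Davis Genome Center"}
        ],
        "license": "MIT",  # assumption: a license identifier Zenodo recognizes
        "keywords": ["bioinformatics", "genome annotation"],
    }

    # .zenodo.json lives in the root of the repository
    with open(".zenodo.json", "w") as out:
        json.dump(metadata, out, indent=4)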

Academic link rot seems to be getting faster: should a published URL last more than 100 days?

Consider this paper, which was recently published in the journal Bioinformatics and which showed up today in my RSS feed:

Presumably it is a typo when the journal says that it was received on November 14th 2014:

I'll assume that this is meant to be 2013! The paper first appeared online on June 13th 2014, just 103 days ago. The text of this paper links to some software that should be available at http://ww2.cs.mu.oz.au/~gwong/LICRE. Except that this URL doesn't work. Neither does http://ww2.cs.mu.oz.au/~gwong/. Only when I visit http://ww2.cs.mu.oz.au/ do I discover the following:

The new website for the merged departments says that the merger happened in 2012, and this is confirmed by the redirection page, which has a date of 18th January 2012. It is also confirmed by the Internet Archive's Wayback Machine, which shows that the redirection page has been in place since at least February 2012.

All of which suggests that the software link in the paper may not even have worked properly at the time the manuscript was submitted. I'm sure there are other examples of speedy link rot, but this one seems particularly striking, especially since a search for 'LICRE' on the new website doesn't return any hits (nor can I find any mention of it on Google or in various search engine caches).
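
Incidentally, checking whether a dead URL was ever archived is easy to script. Here is a minimal sketch that queries the Internet Archive's public 'availability' API using only the Python standard library (I'm assuming the documented JSON response format, in which an empty archived_snapshots object means nothing was saved):

    import json
    import urllib.parse
    import urllib.request

    def closest_snapshot(url):
        """Ask the Wayback Machine for its closest archived copy of a URL."""
        api = ("https://archive.org/wayback/available?url="
               + urllib.parse.quote(url, safe=""))
        with urllib.request.urlopen(api) as response:
            data = json.load(response)
        # 'archived_snapshots' is an empty dict when nothing was archived
        return data["archived_snapshots"].get("closest")

    snapshot = closest_snapshot("http://ww2.cs.mu.oz.au/~gwong/LICRE")
    if snapshot:
        print(snapshot["url"], snapshot["timestamp"])
    else:
        print("No archived copy found")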

I will contact the lead author to let him know about the disappearance of the software. In the meantime, I'll remind people of this previous post of mine:

Update 2014-09-24 19.52: I heard back from the author; the LICRE code is now at https://sites.google.com/site/licrerepository/

When is a citation not a citation?

Today I received a notification from Google Scholar that one of my papers had been cited. I often have a quick look at such papers to see how our work is being referenced. The article in question was from the Proceedings of the 3rd Annual Symposium on Biological Data Visualization: Data Analysis and Redesign Contests:

FixingTIM: interactive exploration of sequence and structural data to identify functional mutations in protein families

The paper describes a tool that helps "identify protein mutations across a family of structural models and to help discover the effect of these mutations on protein function". I was a bit surprised by this because this isn't a topic that I've published on. So I looked to see what paper of mine was being cited and how it was being cited. Here is the relevant sentence from the background section of the paper:

To improve the exploration process, many efforts have been made, from folding the sequences through classification [1,2], to tools for 3D view exploration [3] and to web-based applications which present large amounts of information to the users [4].

Citation number 2 is the paper on which I am a co-author:

  • Chen N, Harris TW, Antoshechkin I, Bastiani C, Bieri T, Blasiar D, Bradnam K, Canaran P, Chan J, Chen C, Chen WJ, Cunningham F, Davis P, Kenny E, Kishore R, Lawson D, Lee R, Muller H, Nakamura C, Pai S, Ozersky P, Petcherski A, Rogers A, Sabo A, Schwarz EM, Van Auken K, Wang Q, Durbin R, Spieth J, Sternberg PW, Stein LD: WormBase: a comprehensive data resource for Caenorhabditis biology and genomics. Nucleic Acids Res 2005, 33(1):383-389.

The cited paper simply describes the WormBase database; it includes only a passing reference to the fact that WormBase contains some links to protein structures (when known). The WormBase paper doesn't mention 'folding' or 'classification' anywhere, which makes it a really odd choice of paper to cite. It makes me wonder how many other papers end up gaining seemingly spurious citations like this one.

5 things to consider when publishing links to academic websites

Preamble

One of the reasons I've been somewhat quiet on this blog recently is that I've been involved with a big push to finish the new Genome Center website. This has been in development for a long time and provides a much-needed update to the previous website, which was really showing its age. Compare and contrast:

The old Genome Center website…what's with all that whitespace in the middle?

The new Genome Center website, less than 24 hours old at the time of writing.

This type of redesign is a once-in-a-decade event, and it provides the opportunity not just to add new features (e.g. a proper RSS news feed, a twitter account, a YouTube channel, responsive website design, etc.), but also to clean up a lot of legacy material (e.g. webpages for people who left the Genome Center many years ago).

This cleanup prompted me to check Google Scholar to see if there are any published papers that include links to Genome Center websites. This includes links to the main site and also to the many subdomains that exist (for different labs, core facilities, etc.). It's pretty easy to search Google Scholar for the core part of a URL, e.g. genomecenter.ucdavis.edu, and I would encourage anyone else who is looking after an aging academic website to do the same.

Here are some of the key things that I noticed:

  1. Most mentions of Genome Center URLs are to resources from Peggy Farnham's lab. Although Peggy left UC Davis several years ago (she is now here), her very old and out-of-date lab page still exists (http://farnham.genomecenter.ucdavis.edu).
  2. Many people link to Craig Benham's work using http://genomecenter.ucdavis.edu/benham/. This redirects to Craig's own lab site (http://benham.genomecenter.ucdavis.edu), but the redirect doesn't quite work when people have linked to a specific tool (e.g. http://genomecenter.ucdavis.edu/benham/sidd). That URL redirects to http://benham.genomecenter.ucdavis.edu/sidd, which then produces a 404 error (page not found).
  3. There are many papers that link to resources from Jonathan Eisen's group, and these papers all point to various pages on a domain that is either down or no longer in existence (http://bobcat.genomecenter.ucdavis.edu).

There is an issue here of just how long it is reasonable to try to keep links active and working. In the case of Peggy Farnham, she no longer works at UC Davis, so would it be okay to redirect all of her web traffic to her new website? I plan to do this, but I will let Peggy know first so that she can arrange to copy some of the existing material over to her new site.

In the case of Craig's lab, maybe he should be adding his own redirects for tools that now have new URLs; a sketch of what that might look like appears below. It would also help to have a dedicated 404 page that points people towards the pages they are most likely looking for (a completely blank 'not found' page is rarely helpful).
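
For what it's worth, if the server in question runs Apache (an assumption on my part), both of those fixes are one-liners. A sketch, in which the old path and the new target URL are invented placeholders:

    # Hypothetical Apache configuration (e.g. a .htaccess file on
    # benham.genomecenter.ucdavis.edu); the target URL below is an
    # invented placeholder, not the tool's actual new location.
    Redirect permanent /sidd http://benham.genomecenter.ucdavis.edu/new/path/to/sidd/

    # Point lost visitors somewhere useful instead of a blank 404 page
    ErrorDocument 404 /page_not_found.html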

In the case of Jonathan's lab, there is a bigger problem: all of the papers are tied to a very specific domain name (which itself has no obvious naming connection to his lab). You can always name a new machine 'bobcat', but maybe there are better things we should be doing to avoid these situations arising in the first place…

5 things to consider when publishing links to academic websites

  1. Don't do it! Use resources like Figshare, GitHub, or Dryad if at all possible. Of course, this might not be possible if you are publishing some sort of online software tool.
  2. If you have to link to a lab webpage, consider spending $10 or so a year to buy your own domain name, one that you can take with you if you ever move anywhere else in the future. I bought http://korflab.com for my boss, and I see that Peggy Farnham is now using http://farnhamlab.com.
  3. If you can't, or don't want to, buy your own domain name, use a generic lab domain name rather than a machine-specific one. E.g. our lab's website is on a machine called 'raiden' and can be accessed at http://raiden.genomecenter.ucdavis.edu, but we only ever use the domain name http://korflab.ucdavis.edu, which allows us to switch the server to a different machine without breaking any links.
  4. If you must link to a specific machine, avoid URLs that get too complex, e.g. http://supersciencelab.ucdavis.edu/Tools/Foo/v1/foo_v1.cgi. The more complex the URL, the more likely it is to break in the future. Instead, link to your top-level domain (http://supersciencelab.ucdavis.edu) and provide clear links on that page showing how to find things.
  5. Any time you publish a link to a URL, make sure you keep a record of it in a simple text file somewhere. This might really help if/when you decide to redesign your website five years from now and want to know whether you might be breaking any pre-existing links. A minimal script for checking such a record file is sketched below.
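
To make point 5 actionable, here is a minimal sketch that reads such a record file (one URL per line; the filename published_urls.txt is just an example) and reports any links that no longer resolve, using only the Python standard library:

    import urllib.error
    import urllib.request

    # One published URL per line; the filename is just an example
    with open("published_urls.txt") as handle:
        urls = [line.strip() for line in handle if line.strip()]

    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                print(f"OK   {response.status}  {url}")
        except urllib.error.HTTPError as err:
            print(f"DEAD {err.code}  {url}")  # e.g. a 404
        except OSError as err:  # DNS failure, timeout, refused connection, etc.
            print(f"DEAD ---  {url}  ({err})")

Running this once a year, and after any website reorganization, should catch most cases of self-inflicted link rot before your readers do.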

 

Is there ever a valid reason for storing bioinformatics data in a Microsoft Word document?

Short answer

No.

Long answer

Noooooooooo!!!

Background

Yesterday I finished reviewing a paper. My review was generally very positive and I enjoyed reading the manuscript. The authors linked to some supplementary files that were available on another website. As I'm the type of reviewer who likes to look at every file that is part of a submission, I logged on to the website to see what files were there.

The first file that was listed had a 'docx' extension. Someone might argue that if this file contained a textual description of how the other files were generated, then maybe there is nothing wrong with using Microsoft Word. I would disagree. Any sort of documentation should ideally be in plain text, with PDF as a possible alternative.

In any case, I opened the file to see what we were dealing with. It contained a list of several thousand gene identifiers, one identifier per line. There was nothing else in the thirty-six-page file.

This is not an acceptable practice! Use of Microsoft Word to store bioinformatics data will only ever result in unhappiness, frustration, and anger. And we all know what anger leads to…
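
If you ever find yourself on the receiving end of such a file, the data can at least be rescued programmatically. A minimal sketch using the third-party python-docx package (both filenames are hypothetical):

    # Requires the third-party python-docx package: pip install python-docx
    from docx import Document

    # Both filenames are hypothetical
    doc = Document("gene_identifiers.docx")
    with open("gene_identifiers.txt", "w") as out:
        for paragraph in doc.paragraphs:
            text = paragraph.text.strip()
            if text:  # skip empty paragraphs
                out.write(text + "\n")

But the real fix is upstream: save the list as plain text in the first place.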

Supplemental madness: on the hunt for 'Figure S1'

I've just been looking at this new paper by Vanneste et al. in Genome Research:

Analysis of 41 plant genomes supports a wave of successful genome duplications in association with the Cretaceous–Paleogene boundary

I was curious as to where their 41 plant genomes came from, so I jumped to the Methods section to see: 

No surprise there; this is exactly the sort of thing you expect to find in the supplementary material of a paper. So I followed the link to the supplementary material, only to see this:

So the 'Supplemental Material' contains 'Supplemental Information' as well as the (recursively named) 'Supplemental Material'. Where do you think Supplemental Table S1 is? It turns out that this table is in the Supplemental Material PDF. But when looking at both of these files, I noticed something odd. Here is Figure S1 from the Supplemental Information:

And here is part of another Figure S1 from the Supplemental Material file:

You will notice that the former Figure S1 (in the Supplemental Information) is actually called a Supporting Figure. I guess this helps distinguish it from the completely-different-and-in-no-way-to-be-confused Supplementary Figure S1.

This would possibly make some sort of sense if the main body of the paper distinguished between the two different types of Figure S1. Except the paper mentions 'Supplemental Figure S1' twice (not even 'Supplementary Figure S1') and doesn't mention Supporting Figure S1 at all (or any Supporting Figures, for that matter)!

What does all of this mean? It means that supplementary material is a bit like the glove compartment in your car: a great place to stick all sorts of stuff that will possibly never be seen again. Maybe we need better reviewer guidelines to stop this sort of confusion from happening?

 

Impactstory: Publications are an important part of research…but they’re not the only part

I'm a great fan of the Impactstory service, which makes it easy to aggregate all of your research output in one place and then see how people are engaging with your research. I like it so much that I signed up to be an Impactstory Advisor.

Today I'm giving a talk at UC Davis about Impactstory, and so that everyone can see why I like this service so much, I've made a video version of my presentation. 

Visit the Impactstory website to find out more, or follow them on twitter (@Impactstory). For an example of the types of things that Impactstory can track, have a look at my own Impactstory page (impactstory.org/KeithBradnam).