We asked 272 bioinformaticians…name something that makes you angry: more reflections on the poor state of software documentation.

I'd like to share the details of a recent survey conducted by Nick Loman and Thomas Connor that tried to understand current issues with bioinformatics practice and training.

The survey was announced on twitter and attracted almost 300 responses. Nick and Tom have kindly placed the results of the survey on Figshare so that others can play with the data (it seems fitting to talk about this today as it is International Open Access Week):

When you ask a bunch of bioinformaticians the question What things most frustrate you or limit your ability to carry out bioinformatics analysis? you can be sure that you will attract some passionate, and often amusing, answers (I particularly liked someone's response to this question "Not enough Heng Li").

I was struck by how many people raised the issue of poor, incomplete, or otherwise terrible software documentation as a problem (there were at least 42 responses that mentioned this). The availability of 'good documentation' was also listed as the 2nd most important factor when choosing software to use.

I recently wrote about whether this problem is something that really needs to be dealt with by journals and by the review process. It shouldn't be enough that software is available and that it works, we should have some minimal expectation for what documentation should accompany bioinformatics software.

Keith's 10 point checklist for reviewing software

If you are ever in a position to review a software-based manuscript, please check for the following:

  1. Is there a plain text README file that accompanies the software and which explains what the program does and who created it?
  2. Is there a comprehensive manual available somewhere that describes what every option of the program does?
  3. Is there a clear version number or release date for the software?
  4. Does the software provide clear installation instructions (where relevant) that actually work?
  5. Is the software accompanied by an appropriate license?
  6. For command-line programs, does the program give some sensible output when no arguments are provided?
  7. For command-line programs, does the program give some sensible output when -h and/or --help is specified (see this old post of mine for more on this topic)?
  8. For command-line programs, does the built-in help/documentation agree with the external documentation (text/PDF), i.e. do they both list the same features/options?
  9. For script based software (Perl, Python etc.), does the code contain a reasonable level of comments that allow someone with relevant coding experience to understand what the major sections of the program are trying to do?
  10. Is there a contact email address (or link to support web page) provided so that a user can ask questions and get more help?

I'm not expecting every piece of bioinformatics software to tick all 10 of these boxes, but most of these are relatively low-hanging fruit. If you are not prepared to provide useful documentation for your software, then you should also be prepared for people to choose not to use your software, and for reviewers to reject your manuscript!

Should reviewers of bioinformatics software insist that some form of documentation is always included alongside the code?

Yesterday I gave out some JABBA awards and one recipient was a tool called HEALER. I found it disappointing that the webpage that hosts the HEALER software contains nothing but the raw C++ files (I also found it strange that none of the filenames contain the word 'HEALER'). This is what you would see if you go to the download page:

Today, Mick Watson alerted me to a piece of software called ScaffoldScaffolder. It's a somewhat unusual name, but I guess it at least avoids any ambiguity about what it does. Out of curiosity I went to the the website to look at the software and this is what I found:

Ah, but maybe there is some documentation inside that tar.gz file? Nope.

At the very least, I think it is good practice to include a README file alongside any software. Developers should remember that some people will end up on these software pages, not from reading the paper, but by following a link somewhere else. The landing page for your software should make the following things clear:

  1. What is this software for?
  2. Who made it?
  3. How do I install it or get it running?
  4. What license is the software distributed under?
  5. What is the version of this software?

The last item can be important for enabling reproducible science. Give your software a version number — the ScaffoldScaffolder included a version number in the file name — or, at the very least, include a release date. Ideally, the landing page for your software should contain even more information:

  1. Where to go for more help, e.g. a supplied PDF/text file, link to online documentation, or instructions about activating help from within the software
  2. Contact email address(es)
  3. Change log

This is something that I feel that reviewers of software-based manuscripts need be thinking about. In turn, this means that it is something that the relevant journals may wish to start including in the guidelines for their reviewers.

Excellent blog post about coding and documentation

There was an exchange on twitter today between several bioinformaticians regarding the need to have good documentation for bioinformatics tools. I was all set to write something about my own thoughts on this topic, but Robert Davey (@froggleston) has already written an excellent post on the subject (and probably done a better job of expressing my own views than I could):

I highly recommend reading his post as he makes some great points, including the following:

We need, as a community, usable requirements and standards for saying “this is how code should go from being available to being reusable“. How do we get our lab notebook code into that form via a number of checkpoints that both programmers and reviewers agree on?