Dovetail takes flight [Link]

If you ever want to know about the latest developments in sequencing, you owe it to yourself to follow Keith Robison's blog. In his latest post he talks about the launch of the new de novo assembly service from Dovetail Genomics. Keith concludes:

Personally, a pure service offering is very attractive, since that means not having to find internal resources to learn the new technology and then execute on it. I checked with Dovetail, and while I don't have $40K burning a hole in my pocket, if I did I could grab something out of the garden or from the local seafood market and really have a complex genome scaffold of my very own in about two months. That's an exciting vision, and perhaps will be a major force in the sunsetting of science's tolerance for highly fragmented draft genomes.

Readers may also enjoy Bio-IT World's report on this new Dovetail service.

Another survey on bioinformatics practices

I recently wrote about the bioinformatics survey that Nick Loman and Tom Connor published. Well, if people are interested, there is another bioinformatics survey happening, organised by Elia Brodsky (@EliaBrodsky).

Elia works at Pine Biotech and he says that the results of the survey will be publicized on their website.

You can take the survey here and you can read more details about it on Elia's LinkedIn post: Bioinformatics - useful or just frustrating?

Another hard-to-pronounce bioinformatics software name

This was from a few months ago, published in the journal Nucleic Acids Research:

So how do you pronounce 'FunFHMMer'? I can imagine several possibilities:

  1. Fun-eff-aitch-em-em-er
  2. Fun-eff-aitch-em-mer
  3. Fun-eff-hammer
  4. Fünf-hammer

Reading the manuscript suggests that 'FunF' stems from 'FunFam(s)', which in turn is derived from 'functional families'. This suggests that option 1 or option 3 above might be the correct way to pronounce this software's name.

The fully expanded description of this web server's name becomes a bit of a mouthful:

Class Architecture Topology Homologous Superfamily Functional Families Hidden Markov Model (maker?)

We asked 272 bioinformaticians…name something that makes you angry: more reflections on the poor state of software documentation.

I'd like to share the details of a recent survey conducted by Nick Loman and Thomas Connor that tried to understand current issues with bioinformatics practice and training.

The survey was announced on Twitter and attracted almost 300 responses. Nick and Tom have kindly placed the results of the survey on Figshare so that others can play with the data (it seems fitting to talk about this today, as it is International Open Access Week):

When you ask a bunch of bioinformaticians the question 'What things most frustrate you or limit your ability to carry out bioinformatics analysis?', you can be sure that you will attract some passionate, and often amusing, answers (I particularly liked one response to this question: "Not enough Heng Li").

I was struck by how many people raised the issue of poor, incomplete, or otherwise terrible software documentation as a problem (there were at least 42 responses that mentioned this). The availability of 'good documentation' was also listed as the 2nd most important factor when choosing software to use.

I recently wrote about whether this problem is something that really needs to be dealt with by journals and by the review process. It shouldn't be enough that software is available and that it works; we should have some minimal expectation of what documentation should accompany bioinformatics software.

Keith's 10 point checklist for reviewing software

If you are ever in a position to review a software-based manuscript, please check for the following:

  1. Is there a plain text README file that accompanies the software and which explains what the program does and who created it?
  2. Is there a comprehensive manual available somewhere that describes what every option of the program does?
  3. Is there a clear version number or release date for the software?
  4. Does the software provide clear installation instructions (where relevant) that actually work?
  5. Is the software accompanied by an appropriate license?
  6. For command-line programs, does the program give some sensible output when no arguments are provided?
  7. For command-line programs, does the program give some sensible output when -h and/or --help is specified (see this old post of mine for more on this topic)?
  8. For command-line programs, does the built-in help/documentation agree with the external documentation (text/PDF), i.e. do they both list the same features/options?
  9. For script based software (Perl, Python etc.), does the code contain a reasonable level of comments that allow someone with relevant coding experience to understand what the major sections of the program are trying to do?
  10. Is there a contact email address (or link to support web page) provided so that a user can ask questions and get more help?
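To make the command-line points (items 3, 6 and 7) concrete, here is a minimal sketch of a tool that behaves sensibly; the program name, options, and contact address are illustrative assumptions, not any real package:

```python
#!/usr/bin/env python3
"""toolname: a hypothetical example tool used to illustrate the checklist."""
import argparse
import sys

__version__ = "1.0.0"  # item 3: a clear version number


def main(argv=None):
    parser = argparse.ArgumentParser(
        prog="toolname",
        description="Example: process sequences from a FASTA file.",
        epilog="Questions? See the contact address in the README (item 10).",
    )
    parser.add_argument("input", nargs="?", help="input FASTA file")
    parser.add_argument("--version", action="version",
                        version=f"%(prog)s {__version__}")
    args = parser.parse_args(argv)

    # Item 6: with no arguments, print a usage message rather than
    # crashing or silently doing nothing.
    if args.input is None:
        parser.print_usage(sys.stderr)
        return 1
    print(f"Processing {args.input}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Because the script uses argparse, item 7 comes for free: `-h`/`--help` prints the description of every option, and that built-in help can be kept in sync with the external documentation.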

I'm not expecting every piece of bioinformatics software to tick all 10 of these boxes, but most of these are relatively low-hanging fruit. If you are not prepared to provide useful documentation for your software, then you should also be prepared for people to choose not to use your software, and for reviewers to reject your manuscript!

Your help needed: readers of ACGT can take part in a scientific study and win prizes

I’ve teamed up with researcher Paige Brown Jarreau (@fromthelabbench on Twitter) to create a survey of ACGT readers, the results of which will be combined with feedback from readers of other science blogs.

Paige is a postdoctoral researcher at the Manship School of Mass Communication, Louisiana State University and her research focuses on the intersection of science communication, journalism, and new media. She also writes on her popular From the Lab Bench blog.

By participating in this 10–15 minute survey, you’ll be helping me improve ACGT, but more importantly you will be contributing to our understanding of science blog readership. You will also get FREE science art from Paige's Photography for participating, as well as a chance to win a t-shirt and a $50 Amazon gift card!

Click on the following link to take the survey: http://bit.ly/mysciblogreaders

Thanks!

Keith

P.S. Even if you don't take part in the survey, you should still check out Paige's amazing photography; her picture of a Western lowland gorilla is stunning.

Making code available: lab websites vs GitHub vs Zenodo

Our 2007 CEGMA paper was received by the journal Bioinformatics on December 7, 2006. This means that, about nine years ago, we needed to have the software hosted on a website somewhere, with a URL that we could use in the manuscript. This is what ended up in the paper:

Even though we don't want people to use CEGMA anymore, I'm at least happy that the link still works! I thought it would be prudent to make the paper link to a generic top-level page of our website (/Datasets) rather than to a specific page. This is because experience has taught me that websites often get reorganized, and pages can move.

If we were resubmitting the paper today and if we were still linking to our academic website, then I would probably suggest linking to the main website page (korflab.ucdavis.edu) and ensuring that the link to CEGMA (or 'Software') was easy to find. This would avoid any problems with moving/renaming pages.

However, even better would be to use a service like Zenodo to permanently archive the code; this would also give us a DOI for the software. Posting code to repositories such as GitHub is (possibly) better than using your lab's website, but people can remove GitHub repositories! Zenodo can take a GitHub repository and make it (almost) permanent.

The Zenodo policies page makes it clear that even in the 'exceptional' event that a research object is removed, the DOI URL keeps working, and a tombstone page states what happened to the code:

Withdrawal
If the uploaded research object must later be withdrawn, the reason for the withdrawal will be indicated on a tombstone page, which will henceforth be served in its place. Withdrawal is considered an exceptional action, which normally should be requested and fully justified by the original uploader. In any other circumstance reasonable attempts will be made to contact the original uploader to obtain consent. The DOI and the URL of the original object are retained.

There is a great guide on GitHub for how you can make your code citable and archive a repository with Zenodo.

The Bioboxes paper is now out of the box! [Link]

The Bioboxes project now has its first formal publication, with the software being described today in the journal GigaScience:

I love the concise abstract:

Software is now both central and essential to modern biology, yet lack of availability, difficult installations, and complex user interfaces make software hard to obtain and use. Containerisation, as exemplified by the Docker platform, has the potential to solve the problems associated with sharing software. We propose bioboxes: containers with standardised interfaces to make bioinformatics software interchangeable.

Congratulations to Michael Barton, Peter Belmann, and the rest of the Bioboxes team!

 

Updated 2015-10-15 18.18: Added specific acknowledgement of Peter Belmann.

Should reviewers of bioinformatics software insist that some form of documentation is always included alongside the code?

Yesterday I gave out some JABBA awards and one recipient was a tool called HEALER. I found it disappointing that the webpage that hosts the HEALER software contains nothing but the raw C++ files (I also found it strange that none of the filenames contain the word 'HEALER'). This is what you would see if you go to the download page:

Today, Mick Watson alerted me to a piece of software called ScaffoldScaffolder. It's a somewhat unusual name, but I guess it at least avoids any ambiguity about what it does. Out of curiosity I went to the website to look at the software and this is what I found:

Ah, but maybe there is some documentation inside that tar.gz file? Nope.

At the very least, I think it is good practice to include a README file alongside any software. Developers should remember that some people will end up on these software pages, not from reading the paper, but by following a link somewhere else. The landing page for your software should make the following things clear:

  1. What is this software for?
  2. Who made it?
  3. How do I install it or get it running?
  4. What license is the software distributed under?
  5. What is the version of this software?

The last item can be important for enabling reproducible science. Give your software a version number (ScaffoldScaffolder does at least include one in its file name) or, at the very least, include a release date. Ideally, the landing page for your software should contain even more information:

  1. Where to go for more help, e.g. a supplied PDF/text file, link to online documentation, or instructions about activating help from within the software
  2. Contact email address(es)
  3. Change log
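Much of this can be checked mechanically. As an illustration, here is a sketch of how a reviewer (or author) might audit a software download for the basics; the file names searched for are common conventions, not requirements of any journal:

```python
#!/usr/bin/env python3
"""Sketch of a quick audit of a software download directory.
The candidate file names below are assumptions based on common
conventions (README, LICENSE, CHANGELOG), not a formal standard."""
from pathlib import Path

CHECKS = {
    "README present":   ["README", "README.md", "README.txt"],
    "License present":  ["LICENSE", "LICENSE.md", "LICENSE.txt", "COPYING"],
    "Version recorded": ["VERSION", "CHANGELOG", "CHANGELOG.md"],
}


def audit(directory):
    """Return a dict mapping each check name to True/False for `directory`."""
    names = {p.name for p in Path(directory).iterdir()}
    return {check: any(candidate in names for candidate in candidates)
            for check, candidates in CHECKS.items()}


if __name__ == "__main__":
    import sys
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for check, ok in audit(target).items():
        print(f"[{'x' if ok else ' '}] {check}")
```

Run against the ScaffoldScaffolder or HEALER downloads described above, a script like this would report every box unticked, which is rather the point.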

This is something that I feel reviewers of software-based manuscripts need to be thinking about. In turn, it is something that the relevant journals may wish to start including in their guidelines for reviewers.

In need of some bogus bioinformatics acronyms? Try these on for size…

Three new JABBA awards for you to enjoy (not that anyone should be enjoying these crimes against acronyms). Hold on to your hats, because this might get ugly.

 

1: From the journal Scientific Reports

The paper helpfully points out that the name of this tool is pronounced 'Cadbury'. I'm not sure if they are saying this to invite a trademark complaint, but generally I feel that it is never a good sign when someone has to tell you how to pronounce something. The acronym CADBURE is derived as follows:

Comparing Alignment results of user Data Based on the relative reliability of Uniquely aligned REads

On the plus side, CADBURE mostly uses the initial letters of words. On the negative side, only six out of the fifteen words contribute to the acronym and this is why it earns a JABBA award.

 

2: From the journal BMC Bioinformatics

SPINGO is generated as follows:

Species-level IdentificatioN of metaGenOmic amplicons

So only two words contribute their initial letters, 'identification' donates its second N (but not its first), and 'amplicons' gives us nothing at all. Very JABBA-award worthy.

 

3: From the journal Bioinformatics (h/t to James Wasmuth)

It is very common for people to wait until the end of the Introduction before they reveal how the acronym/initialism in question came to be, but in this paper they don't waste any time at all…it's right there in the title.

It is the mark of a tenuously derived acronym when the initial letter of a word isn't used, but the same letter from a different position in the same word is used, e.g. the second 'R' of 'regression'.

While the code for HEALER is available online, none of the five C++ files contain the word HEALER in their name or anywhere in their code. Nor is there any form of README or accompanying documentation. This is all that you see…

Congratulations to our three worthy JABBA award winners!