Supplemental madness: on the hunt for 'Figure S1'

I've just been looking at this new paper by Vanneste et al. in Genome Research:

Analysis of 41 plant genomes supports a wave of successful genome duplications in association with the Cretaceous–Paleogene boundary

I was curious as to where their 41 plant genomes came from, so I jumped to the Methods section to see: 

No surprise there, this is exactly the sort of thing you expect to find in the supplementary material of a paper. So I followed the link to the supplementary material only to see this:

So the 'Supplemental Material' contains 'Supplemental Information' and the — recursively named — 'Supplemental Material'. So where do you think Supplemental Table S1 is? Well it turns out that this table is in the Supplemental Material PDF. But when looking at both of these files, I noticed something odd. Here is Figure S1 from the Supplemental Information:

And here is part of another Figure S1 from the Supplemental Material file:

You will notice that the former figure S1 (in the Supplemental Information) is actually called a Supporting Figure. I guess this helps distinguish it from the completely-different-and-in-no-way-to-be-confused Supplementary Figure S1.

This would possibly make some sort of sense if the main body of the paper distinguished between the two different types of Figure S1. Except the paper mentions 'Supplemental Figure S1' twice (not even 'Supplementary Figure S1') and doesn't mention Supporting Figure S1 at all (or any supporting figures for that matter)!

What does all of this mean? It means that Supplementary Material is a bit like the glove compartment in your car: a great place to stick all sorts of stuff that will possibly never be seen again. Maybe we need better reviewer guidelines to stop this sort of confusion from happening?

 

The Assemblathon Gives Back (a bit like The Empire Strikes Back, but with fewer lightsabers)

So we won an award for Open Data. Aside from a nice-looking slab of glass that is weighty enough to hold down all of the papers that someone with a low K-index has published, the award also comes with a cash prize.

Naturally, my first instinct was to find the nearest sculptor and request that they chisel a 20-foot recreation of my brain out of Swedish green marble. However, this prize has been — somewhat annoyingly — awarded to all of the Assemblathon 2 co-authors.

While we could split the cash prize 92 ways, this would probably only leave us with enough money to buy a packet of pork scratchings each (which is not such a bad thing if you are a fan of salty, fatty, porcine goodness).

Instead we decided — and by 'we', I'm really talking about 'me' — to give that money back to the community. Not literally of course…though the idea of throwing a wad of cash into the air at an ISMB meeting is appealing.

Rather, we have worked with the fine folks at BioMed Central (that's BMC to those of us in the know), to pay for two waivers that will cover the cost of Article Processing Charges (that's APCs to those of us in the know). We decided that these will be awarded to papers in a few select categories relating to 'omics' assembly, Assemblathon-like contests, and things to do with 'Open Data' (sadly, papers that relate to 'pork scratchings' are not eligible).

We are calling this event the Assemblathon 'Publish For Free' Contest (that's APFFC to those of us in the know), and you can read all of the boring details and contest rules on the Assemblathon website.

The Tesla index: a measure of social isolation for scientists

Abstract

In the era of social media there are now many different ways that a scientist can build their public profile; the publication of high-quality scientific papers being just one. While publishing journal and book articles is a valuable tool for the dissemination of knowledge, there is a danger that scientists become isolated, and remain disconnected from reality, sitting alone in their ivory towers. Such reclusiveness has long been all too common among academic scientists, and we are losing sight of other key outreach efforts such as the use of social media as a tool for communicating science. To help quantify this problem of social isolation, I propose the ‘Tesla Index’, a measure of the discrepancy between the somewhat stuffy, outdated practice of generating peer-reviewed publications and the growing trend of vibrant, dynamic engagement with other scientists and the general public through use of social media.

Introduction

There are many scientists who actively take the time to pursue their science in as much of a public manner as possible. They work hard to ensure that their peers, and the public at large, are kept informed of their latest research. Consider Titus Brown, a genomics and evolution professor at Michigan State University[1]. Although he has contributed to a meagre number of — largely uninteresting — publications[2], he has instead embraced social media[3] to excite and stimulate others with news of his past, current, and future work.

Now consider Nikola Tesla[4]; although he may have forever changed the world through his many scientific inventions[5], he was a famous recluse[6] and surprisingly did not contribute to any blog, nor did he even bother to set up an account on twitter. I am concerned that the anti-social and secretive behavior of Nikola Tesla is something that is all too common in many other scientists, particularly in those who continue their obsession with publishing work that will forever live behind pay-walls, invisible to all but the privileged few.

I therefore think it’s time that we develop a metric that will clearly indicate if a scientist is a reclusive introvert with no interest in sharing their work with others or engaging with the wider community. This will allow us to adjust our expectations of them accordingly. In order to quantify the problem and to devise a solution, I have compared the numbers of followers that research scientists have on twitter with the number of citations they have for their peer-reviewed work. This analysis has identified clear outliers, or ‘Teslas’, within the scientific community. I propose a new metric, which I call the ‘Tesla Index’, which allows a simple quantification as to the degree of social isolation of any particular scientist.

Results and Discussion

I took the number of Twitter followers as a measure of ‘social outreach and engagement’ while the number of citations was taken as a measure of ‘boring scientific output’. The data gathered are shown in Figure 1.

Figure 1: Twitter followers versus number of scientific citations for a sort-of-random sample of researcher tweeters

I propose that the Tesla Index (T-index) can be calculated as simply the number of Twitter followers a user has, divided by their total number of citations. A low T-index is a warning to the community that researcher 'X' may be forsaking all methods of publicly sharing their work in favor of solely publishing manuscripts. In contrast, a very high T-index suggests that a scientist is being active in the community, informing and educating their peers, colleagues, and the wider public. They are thus playing a positive role in society. Here, I propose that those people whose T-index is lower than 0.5 can be considered ‘Science Teslas’; these individuals are highlighted in Figure 1.
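For those who wish to compute their own degree of social isolation, the T-index calculation can be sketched in a few lines of Python (a minimal illustration in keeping with the spirit of this piece; the function names and the example numbers are mine, not part of any published tool):

```python
def tesla_index(twitter_followers: int, citations: int) -> float:
    """Satirical Tesla Index: Twitter followers divided by total citations."""
    if citations == 0:
        raise ValueError("citations must be greater than zero")
    return twitter_followers / citations


def is_science_tesla(twitter_followers: int, citations: int,
                     threshold: float = 0.5) -> bool:
    """Flag a 'Science Tesla': a researcher whose T-index falls below the threshold."""
    return tesla_index(twitter_followers, citations) < threshold


# Hypothetical researcher: 400 followers but 2000 citations -> T-index of 0.2,
# comfortably below the 0.5 cut-off, so a 'Science Tesla'.
print(tesla_index(400, 2000))       # 0.2
print(is_science_tesla(400, 2000))  # True
```

The guard against zero citations is, of course, only needed for scientists who have somehow amassed Twitter followers before publishing anything at all.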


References

  1. http://ged.msu.edu  ↩

  2. http://scholar.google.com/citations?user=O4rYanMAAAAJ&hl=en  ↩

  3. https://twitter.com/ctitusbrown  ↩

  4. http://en.wikipedia.org/wiki/Nikola_Tesla#Literary_works  ↩

  5. http://theoatmeal.com/comics/tesla  ↩

  6. http://www.viewzone.com/tesla.html  ↩

Acknowledgments

This research was inspired by a piece of completely unrelated work by Neil Hall. 

A CEGMA Virtual Machine (VM) is now available!

Last week I blogged about the ever growing popularity of CEGMA and also the problems of maintaining this difficult-to-install piece of software. In response to that post, people helpfully pointed out that you can more easily install/run CEGMA by using package managers such as Homebrew, or even run CEGMA on a dedicated Amazon Machine Image.

These responses led me to update the CEGMA FAQ to list all of the alternative methods of getting CEGMA to run (including running it as an iPlant application). I’m happy that I can today announce a new addition to this list: CEGMA is now available through virtualization:

Our CEGMA VM runs the Ubuntu operating system and is pre-configured to have everything installed that CEGMA needs. I’ve tested the VM using the free VirtualBox software and it seems to work just fine [1].

This also means that I will no longer be offering a service to run CEGMA on behalf of others. I had previously offered to run CEGMA for people who had trouble installing the software (or more commonly, the pieces of software that CEGMA requires). I’ve run CEGMA over 100 times for others and this has been a bit of a drain on my time to say the least. Hopefully, our CEGMA VM is a viable alternative. Many thanks are due to Richard Feltstykket at the UC Davis Genome Center’s Bioinformatics Core for setting this up.


  1. Words that will come back to haunt me I expect!  ↩

101 questions with a bioinformatician #12: Karen Eilbeck


This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.


Karen Eilbeck is an Associate Professor of Biomedical Informatics at the University of Utah. Karen comes from a long line of distinguished bioinformaticians who learned their skills at the highly regarded Bioinformatics M.Sc. program at the UK's University of Manchester (although they do let some riff-raff in).

If you read Karen's research statement, you will see that there is a clear focus to her work:

Quality control of genomic annotations; Management and analysis of personal genomics data; Ontology development to structure biological, genomic and phenotypic data

In helping build both the Gene Ontology and Sequence Ontology resources, Karen's work has led to the development of powerful structured vocabularies that help ensure that all biologists can speak the same language. Developing ontologies is harder than you might imagine, especially when you are trying to generate precise definitions for very nebulous concepts such as 'what is a gene?'

You can find out more about Karen from the Eilbeck Lab website. And now, on to the 101 questions...

 

 

001. What's something that you enjoy about current bioinformatics research?

I think genomic analysis is fascinating. The human genetics stories suck me in, where bioinformatics is used to find the variant causing the phenotype. The story does not end there: tests are developed, or therapies targeted.

 

010. What's something that you *don't* enjoy about current bioinformatics research?

This is a positive and a negative. I like being part of collaborative projects. It is exciting and things get done. The downside is the amount of time spent on the phone. It is not something I would ever have anticipated. Conference calls either go OK, or someone is heavy breathing in a train station and hasn’t put their phone on mute. The video conference is either delayed or the resolution is not great. One of my colleagues shared this video with me, which has a lot of truth to it.

 

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Only a single piece? OK, take your math classes more seriously. I wish I had known how to program when I was doing statistics classes. Instead of using packages like SPSS it may have been more educational to implement tests myself. 

 

100. What's your all-time favorite piece of bioinformatics software, and why?

I am totally in love with a piece of software right now called Phevor, which re-ranks variant prioritization based on phenotype descriptions and uses a variety of ontologies to do its magic. Which brings me to my all-time fave tool: OBO-Edit. I think that OBO-Edit was underrated. This tool was developed by the Gene Ontology consortium to build their ontology, and it rapidly became adopted by the biological community. It is easy to use and underpinned many of the ontologies in the bioinformatics domain today. The lead developer for a long time was John Richter, who is also a stand-up comedian and who went on to work for Google. OBO-Edit will always have a place in my heart.

 

101. IUPAC describes a set of 15 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

W (A or T). On the one hand it's reserved and to the point (A), on the other hand it's full of energy and works well with others (T). Also, much like my name, there is confusion when it comes to pronunciation (Eel-beck or I’ll-Beck).

W and its skinny friend V seem interchangeable regarding pronunciation. A friend of mine calls a character from Star Wars 'Darth Wader', which makes me smile.

Good news: CEGMA is more popular than ever — Bad news: CEGMA is more popular than ever

I noticed from my Google Scholar page today that our 2007 CEGMA paper continues to gain more and more citations. It turns out that there have now been more citations to this paper in 2014 than in any previous year (69 so far and we still have almost half a year to go):

Growth of citations to CEGMA paper, as reported by Google Scholar


I've previously written about the problems of supporting software that a) was written by someone else and b) is based on an underlying dataset that is now over a decade old. These problems are not getting any easier to deal with.

In a typical week I receive 3–5 emails relating to CEGMA; these are mostly requests for help with installing and/or running CEGMA, but we also receive bug reports and feature requests. We hope to shortly announce something that will help with the most common problem, that of getting CEGMA to work. We are putting together a virtual machine that will come pre-installed and configured to run CEGMA. So you'll just need to install something like VirtualBox, and then download the CEGMA VM. Hopefully we can make this available in the coming week or two.

Unfortunately, we have almost zero resources to devote to the continuing development of this old version of CEGMA; any development that does happen is therefore extremely limited (and slow). A forthcoming grant submission will request resources to completely redevelop CEGMA and add many new capabilities. If this grant is not successful then we may need to consider holding some sort of memorial service for CEGMA as it is becoming untenable to support the old code base. Seven years of usage in bioinformatics is a pretty good run and the website link in the original paper still works (how many other bioinformatics papers can claim this I wonder?).

 

Update: 2014-07-21 14.44

Shaun Jackman (@sjackman on twitter) helpfully reminded me that CEGMA is available as a homebrew package. There is also an iPlant application for CEGMA. I've added details of both of these to a new item in the CEGMA FAQ:

 

Update: 2014-07-22 07.36

Since publishing this post, I've been contacted by three different people who have pointed out different ways to get CEGMA running. I'm really glad that I blogged about this else I may not have found out about these other methods.

In addition to Shaun's suggestion (above), it seems that you can also install CEGMA on Linux using the Local Package Manager software. Thanks to Masahiro Kasahara for bringing this to my attention. Finally, Matt MacManes alerted me to the fact that there is a public Amazon Machine Image called CEGMA on the Cloud. More details here.

 

Update: 2014-07-30 19.31

Thanks to Rob Syme, there is now a Docker container for CEGMA. And finally, we have now made an Ubuntu VM that is pre-installed with CEGMA (thanks to Richard Feltstykket at the UC Davis Genome Center's Bioinformatics Core).

None shall stare into the face of Medusa (or Medusa, or MeDUSA): more bioinformatics tools that use the same name

Following on from yesterday's post where I pointed out that there are three completely different bioinformatics tools that are all called 'Kraken', I bring you more news of the same. Scott Edmunds (@SCEdmunds on twitter) brought to my attention today that there is some bioinformatics software called Medusa that is either:

  1. A tool from 2005 for interaction graph analysis
  2. Some software published in 2011 that explores and clusters biological networks
  3. Or an acronym for a 2012 resource (MeDUSA) that can be used for methylome analysis

So if someone asks you to install Kraken and Medusa, it's good to know that there are only nine different combinations of tools that they might be referring to.

You wait ages for somebody to develop a bioinformatics tool called 'Kraken' and then three come along at once

I recently wrote about the growing problem of duplicated names for bioinformatics tools. A couple of weeks ago, Stephen Turner (@genetics_blog) pointed out another case:

So Kraken is either a universal genomic coordinate translator for comparative genomics, or a tool for ultrafast metagenomic sequence classification using exact alignments, or even a set of tools for quality control and analysis of high-throughput sequence data. The last of these publications is from 2013, and the other two are from this year (2014).

I feel sorry for the poor Grad Student who is going to lose a day of their life trying to install one of these tools before realizing that they have been installing the wrong Kraken.

Which 'omics' assembly tools are currently the most popular?

I recently organized an online poll to find out which tools for genome, transcriptome, and metagenome assembly are currently the most popular with researchers. After a week or so of collecting results, I ended up with 116 responses that describe over 30 different assembly tools.

Thanks to everyone who took part. I've posted the results to Figshare as a PDF report, and have also embedded this below (I suggest downloading the PDF so that you can use all of the embedded hyperlinks in the report).

Impactstory: Publications are an important part of research…but they’re not the only part

I'm a great fan of the Impactstory service that makes it easy to aggregate all of your research output in one place, and then see how people are engaging with your research. I like it so much, that I signed up to be an Impactstory Advisor.

Today I'm giving a talk at UC Davis about Impactstory, and so that everyone can see why I like this service so much, I've made a video version of my presentation. 

Visit the Impactstory website to find out more or follow them on twitter (@Impactstory). For an example of the types of things that Impactstory can track, have a look at my own Impactstory page (impactstory.org/KeithBradnam).