Searching for sausage rolls: using Google Scholar to look at the popularity of British culinary delights

Sometimes it can be fun to search Google Scholar for words or phrases that you might not expect to ever appear in the title of an academic article. So last night, I conducted an important scientific study and looked at the popularity of various quintessential items of British cuisine:

Updated: 2014-12-10: includes addition of 'Spotted Dick' thanks to reader @MattBashton.

Is yours bigger than mine? Big data revisited

Google Scholar lists 2,090 publications that contain the phrase 'big data' in their title. And that's just from the first 9 months of 2014! The titles of these articles reflect the interest/concern/fear in this increasingly popular topic:

One paper, Managing Big Data for Scientific Visualization, starts out by identifying a common challenge of working with 'big data':

Many areas of endeavor have problems with big data…while engineering and scientific visualization have also faced the problem for some time, solutions are less well developed, and common techniques are less well understood

They then go on to discuss some of the problems of storing 'big data', one of which is listed as:

Data too big for local disk — clearly, not only do some of these data objects not fit in main memory, but they do not even fit on local disk on most workstations. In fact, the largest CFD study of which we are aware is 650 gigabytes, which would not fit on centralized storage at most installations!

Wait, what!?! 650 GB is too large for storage? Oh yes, that's right. I forgot to mention that this paper is from 1997. My point is that 'big data' has been a problem for some time now and will no doubt continue to be a problem.

I understand that having a simple, user-friendly label like 'big data' helps with the discussion, but it remains an ambiguous and highly relative term. It's relative because whether you deem something to be 'big data' might depend heavily on the size of your storage media and/or the speed of your networking infrastructure. It's also relative in terms of your field of study; a typical set of 'big data' in astrophysics might be much bigger than a typical set of 'big data' in genomics.

Maybe it would help to use big data™ when talking about any data that you like to think of as big, and then use BIG data for those situations where your future data acquisition plans cause your sys admin to have sleepless nights.

5 things to consider when publishing links to academic websites


One of the reasons I've been somewhat quiet on this blog recently is that I've been involved with a big push to finish the new Genome Center website. This has been in development for a long time and provides a much-needed update to the previous website, which was really showing its age. Compare and contrast:

The old Genome Center website…what's with all that whitespace in the middle?

The new Genome Center website, less than 24 hours old at the time of writing.

This type of redesign is a once-in-a-decade event, and it provides the opportunity not just to add new features (e.g. a proper RSS news feed, Twitter account, YouTube channel, responsive website design, etc.), but also to clean up a lot of legacy material (e.g. webpages for people who left the Genome Center many years ago).

This cleanup prompted me to check Google Scholar to see whether there are any published papers that include links to Genome Center websites. This includes links to the main site and also to all of the many subdomains that exist (for different labs, core facilities, etc.). It's pretty easy to search Google Scholar for the core part of a URL, e.g. and I would encourage anyone else who is looking after an aging academic website to do so.

Here are some of the key things that I noticed:

  1. Most mentions of Genome Center URLs are to resources from Peggy Farnham's lab. Although Peggy left UC Davis several years ago (she is now here), her very old and out-of-date lab page still exists (
  2. Many people link to Craig Benham's work using This redirects to Craig's own lab site (, but the redirect doesn't quite work when people have linked to a specific tool (e.g. This redirects to which then produces a 404 error (page not found).
  3. There are many papers that link to resources from Jonathan Eisen's group and these papers all point to various pages on a domain that is either down or no longer in existence (

There is an issue here of just how long it is reasonable to keep old links active and working. In the case of Peggy Farnham, she no longer works at UC Davis, so would it be okay if I redirected all of her web traffic to her new website? I plan to do this, but will let Peggy know first so that she can arrange to copy some of the existing material over to her new site.

In the case of Craig's lab, maybe he should be adding his own redirect links for tools that now have new URLs. It would also help to have a dedicated 404 page that points visitors towards the pages they are probably looking for (a completely blank 'not found' page is rarely helpful).
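Both of these fixes can be expressed in a few lines of web-server configuration. Here is a minimal sketch for Apache, using entirely made-up paths as placeholders (equivalent directives exist for other servers):

```apache
# Permanently redirect a moved tool page to its new home
# (both paths here are hypothetical examples, not real URLs)
Redirect permanent /tools/oldtool.html http://newlab.example.org/tools/oldtool/

# Serve a helpful custom page instead of a blank 404
ErrorDocument 404 /notfound.html
```

A 'permanent' (HTTP 301) redirect also tells search engines to update their indexes, which a plain 404 never does.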

In the case of Jonathan's lab, there is a big problem here in that all of the papers are tied to a very specific domain name (which itself has no obvious naming connection to his lab). You can always rename a new machine to be called 'bobcat', but maybe there are better things we should be doing to avoid these situations arising in the first place…

5 things to consider when publishing links to academic websites

  1. Don't do it! Use resources like Figshare, GitHub, or Dryad if at all possible. Of course, this might not be possible if you are publishing some sort of online software tool.
  2. If you have to link to a lab webpage, consider spending $10 or so a year to buy your own domain name, which you can take with you if you ever move anywhere else in future. I bought for my boss, and I see that Peggy Farnham is now using
  3. If you can't, or don't want to, buy your own domain name, try using a generic lab domain name and not a machine-specific domain name. E.g. our lab's website is on a machine called 'raiden' and can be accessed at But we only ever use the domain name which allows us to use a different machine as the server without breaking any links.
  4. If you must link to a specific machine, try to avoid URLs that are too complex. E.g. The more complex the URL, the more likely it is to break in future. Instead, link to your top-level domain ( and provide clear navigation on that page so that people can find things.
  5. Any time you publish a link to a URL, make sure you keep a record of this in a simple text file somewhere. This might really help if/when you decide to redesign your website 5 years from now and want to know whether you might be breaking any pre-existing links.
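Point 5 can even be automated: keep your published links in a simple text file and periodically check that each one still resolves. Here is a minimal sketch in Python, where the tab-separated record format and the sample contents are entirely my own invention:

```python
#!/usr/bin/env python3
"""Check that published URLs still resolve.

Hypothetical record format: one link per line, as
'<URL><TAB><where it was published>'. Blank lines and
lines starting with '#' are ignored.
"""
import urllib.error
import urllib.request


def parse_records(text):
    """Parse tab-separated 'URL<TAB>source' lines into (url, source) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        url, _, source = line.partition("\t")
        records.append((url, source))
    return records


def check_url(url, timeout=10):
    """Return the HTTP status code for url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a link that has rotted
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, timeout, connection refused, etc.


if __name__ == "__main__":
    # Placeholder record; replace with the contents of your own links file
    sample = "http://example.org/mylab/tool\tHypothetical 2014 paper"
    for url, source in parse_records(sample):
        print(f"{url} ({source}): status {check_url(url)}")
```

Run against your real links file once or twice a year (or before any site redesign), and anything that reports a 404 or None is a link you are about to break, or already have.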


Good news: CEGMA is more popular than ever — Bad news: CEGMA is more popular than ever

I noticed from my Google Scholar page today that our 2007 CEGMA paper continues to gain more and more citations. It turns out that there have now been more citations to this paper in 2014 than in any previous year (69 so far and we still have almost half a year to go):

Growth of citations to CEGMA paper, as reported by Google Scholar

I've previously written about the problems of supporting software that a) was written by someone else and b) is based on an underlying dataset that is now over a decade old. These problems are not getting any easier to deal with.

In a typical week I receive 3–5 emails relating to CEGMA; these are mostly requests for help with installing and/or running CEGMA, but we also receive bug reports and feature requests. We hope to shortly announce something that will help with the most common problem, that of getting CEGMA to work. We are putting together a virtual machine that will come pre-installed and configured to run CEGMA. So you'll just need to install something like VirtualBox, and then download the CEGMA VM. Hopefully we can make this available in the coming week or two.

Unfortunately, we have almost zero resources to devote to the continuing development of this old version of CEGMA; any development that does happen is therefore extremely limited (and slow). A forthcoming grant submission will request resources to completely redevelop CEGMA and add many new capabilities. If this grant is not successful, then we may need to consider holding some sort of memorial service for CEGMA, as it is becoming untenable to support the old code base. Seven years of usage in bioinformatics is a pretty good run, and the website link in the original paper still works (how many other bioinformatics papers can claim this, I wonder?).


Update: 2014-07-21 14.44

Shaun Jackman (@sjackman on Twitter) helpfully reminded me that CEGMA is available as a Homebrew package. There is also an iPlant application for CEGMA. I've added details of both of these to a new item in the CEGMA FAQ:


Update: 2014-07-22 07.36

Since publishing this post, I've been contacted by three different people who have pointed out different ways to get CEGMA running. I'm really glad that I blogged about this, as otherwise I might not have found out about these other methods.

In addition to Shaun's suggestion (above), it seems that you can also install CEGMA on Linux using the Local Package Manager software. Thanks to Masahiro Kasahara for bringing this to my attention. Finally, Matt MacManes alerted me to the fact that there is a public Amazon Machine Image called CEGMA on the Cloud. More details here.


Update: 2014-07-30 19.31

Thanks to Rob Syme, there is now a Docker container for CEGMA. And finally, we have now made an Ubuntu VM that is pre-installed with CEGMA (thanks to Richard Feltstykket at the UC Davis Genome Center's Bioinformatics Core).