The problem that arises when Google Scholar indexes papers published to pre-print servers

The Assemblathon 2 paper, on which I was lead author, was ultimately published with the online journal Gigascience. However, like an increasing number of papers, it was first released to the arXiv.org pre-print server.

If you are a user of the very useful Google Scholar service and you have also published a paper such that it appears in two places, then you may have run into the same problems that I have. Namely, Google Scholar appears to only track citations to the first place where the paper was published.

It should be said that it is great that Google tracks citations to these pre-print articles at all, though see another post of mine that illustrates just how powerful (and somewhat creepy), Google Scholar's indexing power is. However, most people would expect that when a paper is formally published, that Google Scholar should track citations to that as well. Preferably separately from the pre-print version of the article.

For a long time with the Assemblathon 2 paper, Google Scholar only seemed to show citations to the pre-print version of the paper, even when I knew that others were citing the Gigascience version. So I contacted Google about this, and after a bit of a wait, I heard back from them:

Hi Keith,

It still get indexed though the information is not yet shown:

http://scholar.google.com/scholar?q=http%3A%2F%2Fwww.gigasciencejournal.com%2Fcontent%2F2%2F1%2F10+&btnG=

If one version (the arXiv one in this case) was discovered before the last major index update, the information for the other versions found after the major update would not appear before the next major update.

Their answer still raises some issues, and I'm waiting to hear back from my follow up question...how often does the index get updated? Checking Google Scholar today, it initially appears as if they are still only tracking the pre-print version of our paper:

2014-01-27 at 9.36 AM.png

However, after checking I see that 9 out of 10 of the most recent citations are all citing the Gigascience version of the paper. So in conclusion:

  1. Google Scholar will start to track formal versions of a publication even after the paper was first published on a pre-print server.
  2. Somewhat annoyingly, they do not separate out the citations and so one Google Scholar entry ends up tracking two versions of a paper.
  3. The Google Scholar entry that is tracking the combined citations only lists the pre-print server in the 'Journal' name field; you have to check individual citations to see if they are citing the formal version of the publication.
  4. Google Scholar has a 'major' indexing cycle and you may have to wait for the latest version of the index to be updated before you see any changes.