Some free code editors for Macs (that work in a UC Davis computer lab)

Every year I help teach a course[1] to grad students that hopefully leaves them with an understanding of how to use Unix and Perl to solve bioinformatics-related problems. We use a Mac-based computer lab because all Macs come with Unix and Perl installed. Many of our students are new to programming and many are new to Macs. Because of this, and because they need to use a code editor to write their Perl scripts, we have previously pointed them towards Fraise. Despite its age [2], this relatively lightweight editor has proved fine for the last few years that we have taught this course.

This year, however, Fraise proved problematic. The computer lab has now upgraded to OS X 10.8, which provides extra safeguards over which apps can be run. This Gatekeeper technology has been set up to only allow ‘signed’ apps[3]. The version of Fraise that we were using required administrator privileges to be opened (not possible in this computer lab).

My first thought was to direct students to download and install TextWrangler. This is an excellent, powerful, and well-maintained code editor for Macs. Most importantly, it is free and also a signed app. However, it tries to install a helper app, which caused a persistent dialog window to keep popping up during the installation. Clicking ‘cancel’ worked…but only after about 20 attempts[4]. I like TextWrangler as an app, but prefer the cleaner look of Fraise. So today I set out to find code editors for Macs that:

  • were free
  • could be run on the Macs in our computer lab (i.e. had to be signed apps)
  • were relatively simple to use and/or were easy on the eye

Here is what I came up with. These are all apps that seem to be under current development (updated at some point in 2013):

App | Size in MB | Free? | Notes
Komodo Edit | 301.1 | Yes | Big because it is part of a larger IDE tool which is not free[5]
Sublime Text 2 | 27.3 | Sort of[6] | Gaining in popularity (a version 3 beta is also available)
TextMate 2 | 30.3 | Yes | While this is technically an ‘alpha’[7] release, it seems very stable
TextWrangler | 19.2 | Yes | Very robust and venerable app; free since 2005
Tincta 2 | 5.6 | Yes | Small app, similar to Fraise in appearance

If I had to suggest one, it would probably be Sublime Text 2 (though I will encourage students to buy this if they like it). TextMate 2 is a good second choice, particularly because it is also a very clean-looking app. Of course, at some point we need to tell students about the joys of real text editors such as vi, vim, and emacs…but that might lead to hostilities![8]

  1. This course material is available for free and became the basis for our much more expansive book on the same topic  ↩

  2. Fraise is itself a fork of Smultron which stopped development in 2009 but which reappeared as a paid app in the Mac App Store in 2011.  ↩

  3. Those apps that are approved by Apple, even if they are not in the Mac App Store.  ↩

  4. Seriously, it takes a lot of clicks to make this dialog box go away. It then produces more pop-up dialogs asking whether you want to register, or install command-line tools.  ↩

  5. Currently $332 for a single license  ↩

  6. This is a paid app, but can be used in trial mode indefinitely with occasional reminders.  ↩

  7. TextMate 2 has been in alpha status since 2009  ↩

  8. Editor wars should be avoided if possible  ↩

How well do UC Davis Graduate Groups communicate their work to the wider world?

PhD students in our lab are mostly split between a couple of UC Davis's many graduate groups. A conversation with some of the students today about 'outreach' and 'social media' led me to wonder how well these graduate groups are communicating their presence to the outside world. The simplest ways of doing this would be:

  • maintain a current website for your graduate group (i.e. with news items)
  • use Facebook (ideally with an open group)
  • establish a blog
  • use twitter

I looked at 11 different graduate groups to see how well they ticked the above boxes. I might be missing some blogs, Facebook groups, and twitter accounts, but if I can't find the relevant details from a Google/Facebook/Twitter search, then I'm assuming that others won't discover them either. This is what I found:

Headline links take you to the home page for the respective graduate group.

Biochemistry, Molecular, Cellular and Developmental Biology (BMCDB)

  • No news page but has an actively maintained blog (easily linked to from above site)
  • Facebook group (open)
  • Active twitter account

Biomedical Engineering (BME)

Biostatistics

  • No news page, though there is a short 'announcements' box on the main page
  • No Facebook group
  • No twitter account

Ecology

Epidemiology

  • No news page
  • Facebook group (closed)
  • No twitter account

Integrative Genetics and Genomics (IGG, formerly GGG)

  • Has a news page, but only one item is from 2013; the remaining items are from 2009 and 2008!
  • Facebook group (closed, and you have to search for GGG or IGG to find it)
  • No twitter account

Immunology

  • No news page
  • No Facebook group
  • No twitter account

Microbiology

  • No news page
  • Facebook group (closed)
  • No twitter account
  • Has separate website
  • Other: the website told me I had to enable JavaScript to view their home page, even though I have JavaScript enabled

Nutritional Biology

  • No news page
  • No Facebook group
  • No twitter account

Plant Biology

  • No news page
  • No Facebook group
  • No twitter account

Population Biology

  • No news page
  • No Facebook group
  • No twitter account

Please let me know of any updates or additions that I can make to this list.

So overall it is pretty poor. BMCDB outshines the others, though BME and Ecology also have a good presence on the web. In many ways, I think it looks worse to do these things badly than not to do them at all. Closed Facebook groups don't send out an inviting message, and having a 'news' page for your graduate group with items from 5 years ago also sends out the wrong signals.

It takes time and effort to maintain a social media presence, but it doesn't take much effort to at least maintain a news page or simple twitter account (even posting just 1–2 times a week is better than nothing).

Furthermore, the ability to show that you can communicate your work to the wider world is of increasing relevance when applying for grants. It can also raise your profile with your peers and be a useful addition to a resume that helps you stand out from other applicants. Finally, starting a blog or twitter account also helps you hone your writing skills (the latter is great for making you think about how to condense complex thoughts into 'bite-size' chunks).

I hope that some of UC Davis's graduate groups make more of an effort in this area (and of course the same can be said for many of UC Davis's departmental and lab websites).

 

Updated 26th September: Added details of some graduate groups that do have blogs and/or websites but which, unhelpfully, are not linked to from their official graduate group webpage.

New JABBA awards for crimes against bioinformatics acronyms

JABBA is an acronym for Just Another Bogus Bioinformatics Acronym. I hand out JABBA awards to bioinformatics papers that reach just a little bit too far when trying to come up with acronyms (or initialisms). See details of previous winners here.

So without further delay, let's see who the new recipients are. As always, journals like Bioinformatics produce many strong candidates for JABBA awards. Here are three winners from the latest issue:

  1. mpMoRFsDB: a database of molecular recognition features in membrane proteins - What is it with the need for so much use of mixed-case characters these days? It makes names harder to read, and this database has a name which doesn't really roll off the tongue.
  2. CoDNaS: a database of conformational diversity in the native state of proteins - More mixed case madness. I'm really unsure how this database should be pronounced. 'Cod Naz'? 'Code Nass'? 'Coe Dee-en-ays'? The abstract suggests that the acronym stems from the name 'Conformational Diversity of Native State', so I guess we should be thankful that they avoided the potential confusion of naming this database 'CoDoNS'.
  3. GALANT: a Cytoscape plugin for visualizing data as functional landscapes projected onto biological networks - If you want your bioinformatics tool to have a cool sounding name, just give it that name. Don't feel that you somehow have to tenuously arrive at that name from a dubious method of selecting just the letters that make it work. The abstract of this paper reveals that GALANT stems from 'GrAph LANdscape VisualizaTion'. So the full name only consists of three words, and the initial letter of one of those words doesn't even make it into the acronym/initialism. They may as well call their tool 'GREAT' (GRaph landscapE visualizATion).

The joys of dealing with well-annotated 'reference' genomes

Important update included at the bottom of this post

Arabidopsis thaliana has one of the best-curated eukaryotic genome sequences, so you might expect working with Arabidopsis data to be a joy. Well, not always. Consider gene AT1G79920. This gene has two coding variants (AT1G79920.1 and AT1G79920.2) which only seem to differ in their 5' UTR regions:

[Screenshot: gene models for the two coding variants of AT1G79920]

The primary source for Arabidopsis genome information is The Arabidopsis Information Resource (TAIR) but another site called Phytozome has started collating all plant-related genome data (often pulling it from sites like TAIR).

Both sites allow you to download coding sequences for each gene from their FTP sites, or view the same sequence information on their websites. You could also download GFF files and, using the coordinate information, extract the coding sequences from the whole genome sequence.

A simple sanity check when working with coding sequences is that the length of a coding sequence should be divisible by 3; otherwise there might be a frameshift. There seems to be an issue with AT1G79920 in that it sometimes has a frameshift. I say 'sometimes' because it depends on where you look at the data.
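
This kind of check is easy to script. Here is a minimal Perl sketch (illustrative only; it assumes a FASTA file of coding sequences is supplied on the command line) that reports any sequence whose length is not a multiple of 3:

#!/usr/bin/env perl
# Minimal sketch: flag coding sequences whose length is not divisible by 3.
use strict;
use warnings;

my ($id, %cds);
while (my $line = <>) {
    chomp $line;
    if ($line =~ /^>(\S+)/) {
        $id = $1;                 # start of a new sequence record
    } elsif (defined $id) {
        $line =~ s/\s+//g;        # ignore whitespace within sequence lines
        $cds{$id} .= $line;
    }
}

for my $seq_id (sort keys %cds) {
    my $len = length $cds{$seq_id};
    print "$seq_id\t$len bp\tpossible frameshift\n" if $len % 3;
}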

Here is a sequence alignment for exons 5 and 6 of both coding variants of this gene. The sequence identifiers include a 'P' or 'T' to indicate whether the information came from Phytozome or TAIR, and they also denote whether it was taken from the web or FTP site. I also inserted 'gtag' into one sequence to illustrate where the intron between these exons would occur.

[Screenshot: sequence alignment of exons 5 and 6 from both coding variants]

The first boxed rectangle highlights a base that looks like a SNP, but these are coding variants with UTR variation, so the underlying genome sequence in the coding regions should be identical.

The second box is perhaps more disturbing. The web site versions of exon 7 all have a 1 bp deletion which would lead to a frameshift. I guess this error started at TAIR and propagated to Phytozome, but the fact that both sites also have a correct version available on their FTP site is confusing and troubling.

My boss first discovered this by looking at the GFF files for the Arabidopsis genome and this was one of 25 genes with a 'not-divisible-by-3' length error. So it pays to always check — and double check — your data.

Time to send an email to TAIR and Phytozome to report this.

Update

I heard back from TAIR and Phytozome and it seems that there are a small number of likely genome sequence errors in the latest (TAIR10) release of the A. thaliana genome. When these affect genes and would introduce a frameshift error, TAIR make manual edits to the resulting files of CDSs, transcripts, and proteins. They do this when there is clear homology from genes in other species that suggests the change should be made.

So if you work with downloaded FASTA files from the FTP site, you won't see these errors. If you work from GFF files (which is presumably what some of their web displays do), you'll run into these issues. There is a small file (TAIR10sequenceedits.txt) included as part of the TAIR10 release which documents these changes.

Thanks go to both sites for their speedy and helpful responses. Perhaps I should retitle this post 'The Joys of Dealing with Fast and Knowledgeable Genome Curators'?

Crappy science spam from Photon Journals

I've started receiving more and more science-related spam recently. Some of these are semi-legitimate, like those from the OMICS group, but should still be avoided (here's a good cautionary tale of what to expect if you go to an OMICS Group conference).

Sometimes I'm just disappointed by how little effort these spammers take to make their emails look professional. Over the last couple of weeks I've received four emails from 'Photon Journals' asking me to submit my research to some of their (many) journals. To give you an idea of how lame their attempts at spamming people are, consider that:

  1. Their email comes from a Hotmail account.
  2. They address me as 'Dear Dr. Krbradnam' (clearly they have phished my academic email address from somewhere and just assumed that this is my surname).
  3. Their email is written in purple!
  4. Two emails on the same day ask me to submit material to two completely different journals (see below).
  5. The 'website' for the journal is hosted on Google Sites.
  6. The website looks like it was designed by a five-year-old on acid (see below).
[Screenshots: the two spam emails and the journal's Google Sites website]

The modus operandi of these fake journals is just to charge you to submit your paper (and really it can be any paper; no one is going to check it, let alone read it). I guess we should be grateful that some of these are so obviously fake that it makes it easier to ignore them. Several sites are now collating lists of all of these bogus journals (there are a lot).

New JABBA award recipients for the bogus use of bioinformatics acronyms

The latest issue of the journal Bioinformatics was released today and, as with most issues, it features a large number of bioinformatics tools and resources. Many of these tools make questionable use of acronyms and initialisms, and so it is time to hand out some new JABBA awards.

A quick reminder: JABBA stands for 'Just Another Bogus Bioinformatics Acronym'. I recently gave out the inaugural JABBA award and have since discovered that JABBA is not the only game in town (if the game is critically evaluating the overreaching use of acronyms in science). Step forward ORCA - the Organisation of Really Contrived Acronyms, a blog that was started by an old bioinformatics colleague of mine. I highly recommend checking this out to see more acronym-derived crimes.

Anyway, here are several nominations for JABBA awards from the latest issue of Bioinformatics:

  1. MaGnET: Malaria Genome Exploration Tool - My main issue with this one is the ungainly capitalization. There is also the issue that the tool is in no way related to magnets, so if you don't remember that it is called MaGnET then it is hard to find. A Google search for 'malaria tool' doesn't feature MaGnET on the first page of hits.
  2. MMuFLR: missense mutation and frameshift location reporter - Okay, so part of me thinks that this is probably meant to be pronounced 'muffler'? If this is not the case, then it doesn't really roll off the tongue. "Oh you're looking to find missense and frameshift mutations? Then you should try em-em-yoo-ef-el-ar". It doesn't help when you follow the link in the article to the resource (a Galaxy workflow) only to find no mention of MMuFLR on the resulting page (unless you search for it).
  3. mRMRe: an R package for parallelized mRMR ensemble feature selection - Some of the same reasons that applied to MMuFLR also apply to mRMRe. Try saying this three times fast and you'll see what I mean. It seems that mRMRe is an 'ensemble' implementation of something called mRMR (minimum redundancy maximum relevance). Still not sure why the first 'm' is free from the need of capitalization (or the 'e' for 'ensemble').
  4. TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference - Always a bit suspicious when you have a tool name that features 15 words but somehow gets reduced to 5 letters for the abbreviated name. You could equally make a case that this tool should be called TIRBI. In any case, TIGAR is quite an unusual spelling and you might think that a Google search for TIGAR would put this resource near the top of the results. However, it also seems that TIGAR is The Inland Gateway Association of Realtors as well as The International Gymnastics Academy of the Rockies. But perhaps more importantly, TIGAR already has a scientific connection as it is also the name of a gene (TIGAR: TP53-induced glycolysis and apoptosis regulator).

Thanks to Bioinformatics journal for providing some new JABBA recipients.

Mining Altmetric data to discover what types of research article get the most social media engagement

Altmetric is a service that tracks the popularity of published research articles via the impact that those articles make on social media sites. Put simply, the more an article is tweeted and blogged about, the higher its Altmetric score will be. The scoring system is fairly complex as it also tracks who is reading your article on sites such as Mendeley.

I was pleased to see that the recent Assemblathon 2 paper — on which I am the lead author — gained a lot of mentions on twitter. Curious, I looked up its Altmetric score and was surprised to see that it was 61. This puts it in the 99th percentile of all articles tracked by Altmetric (almost 1.4 million articles). I imagine that the score will possibly rise a little more in the coming weeks (at the time of writing the paper is only four days old).

I was then curious as to how well the Assemblathon 1 paper fared. Published in September 2011, it has an Altmetric score of 71. This made me wonder where both papers ranked in the entire list of 1,384,477 articles tracked by this service. So I quickly joined the free trial of Altmetric (they have a few paid services) and was able to download details of the top 25,000 articles. This revealed that the two Assemblathon papers came in at a not-too-shabby 5,616th and 10,250th place overall. If you're interested, this paper — with an off-the-charts Altmetric score of 11,152 — takes the top spot.

Just to satisfy my curiosity, I made a word cloud based on the titles of the research papers that appear in the top 10,000 Altmetric articles.

[Word cloud of titles from the top 10,000 Altmetric articles]

Perhaps unsurprisingly, this reveals that analyses (and meta-analyses) of data relating to human health prompt a lot of engagement via social media. Okay, time to go and write my next paper 'A Global Study of Human Health Risks'.

Thinking of writing a FASTA parser? 12 scenarios that might break your code

In today's Bits and Bites Coding class, we talked about the basics of the FASTA and GFF file formats. Specifically, we were discussing the problems that can arise when you write scripts to parse these files, and the kinds of issues in a file that might break (or confuse) your script.

Even though you can find code to parse FASTA files from projects such as BioPerl, it can be instructive to try to do this yourself when you are learning a language. Many of the problems that occur when trying to write code to parse FASTA files fall into the 'I wasn't expecting the file to look like that' category. I recently wrote about how the simplicity of the FASTA format is a double-edged sword. Because almost anything is allowed, someone will — accidentally or otherwise — produce a FASTA file at some point that contains one of the following 12 scenarios. These are all things that a good FASTA parser should be able to deal with and, if necessary, warn the user about (a minimal parser sketch follows these examples):

> space_at_start_of_header_line
ACGTACGTACGTACGT

>Extra_>_in_FASTA_header
ACGTACGTACGTACGT

>Spaces_in_sequence
ACGTACGT ACGTACGT

>Spaces_in_sequence_and_between_lines
A C 

G T 

A C

A G 

A T

>Redundant_sequence_in_header_ACGTACGTACGT
ACGTACGTACGTACGT

><- missing definition line
ACGTACGTACGTACGT

>mixed_case
ACGTACGTACGTACGTgtagaggacgcaccagACGTACGTACGTACGT

>missing_sequence

>rare, but valid, IUPAC characters 
ACGTRYSWKMBDHVN

>errant sequence
Maybe I accidentally copied and pasted something
Maybe I used Microsoft Word to edit my FASTA sequence

>duplicate_FASTA_header
ACGTACGTACGTACGT
>duplicate_FASTA_header
ACGTACGTACGTACGT

>line_ending_problem^MACGTACGTACGTACGT^MACGTACGTACGTACGT^M>another_sequence_goes_here^MACGTACGTACGTACGT^MACGTACGTACGTACGT
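
For what it's worth, here is a minimal Perl sketch of a parser that warns about a few of the cases above (spaces at the start of headers, missing definition lines, extra '>' characters, duplicate headers, spaces and unexpected characters in sequences, and missing sequences). It is illustrative only; an established parser such as BioPerl's Bio::SeqIO handles far more:

#!/usr/bin/env perl
# Minimal sketch of a FASTA parser that warns about some common problems.
use strict;
use warnings;

my (%seq, $id);
while (my $line = <>) {
    chomp $line;
    $line =~ s/\r$//;                             # cope with Windows-style line endings
    next if $line =~ /^\s*$/;                     # skip blank lines
    if ($line =~ /^>(.*)/) {
        my $header = $1;
        warn "Space at start of header: '$header'\n" if $header =~ /^\s/;
        warn "Missing definition line\n"              if $header =~ /^\s*$/;
        warn "Extra '>' in header: '$header'\n"       if $header =~ />/;
        ($id) = $header =~ /^\s*(\S+)/;
        $id //= 'unnamed';
        warn "Duplicate FASTA header: $id\n" if exists $seq{$id};
        $seq{$id} = '';
    } elsif (defined $id) {
        warn "Spaces in sequence for $id\n" if $line =~ /\s/;
        $line =~ s/\s+//g;
        warn "Non-IUPAC characters in $id\n"
            if $line =~ /[^ACGTURYSWKMBDHVNacgturyswkmbdhvn]/;
        $seq{$id} .= $line;
    } else {
        warn "Sequence data found before any FASTA header\n";
    }
}

for my $seq_id (keys %seq) {
    warn "Missing sequence for $seq_id\n" unless length $seq{$seq_id};
}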

Why is N50 used as an assembly metric (and what's the deal with NG50)?

This post started out as a lengthy reply on a thread at SEQanswers.com. I thought it was worth reproducing here (though I've also added some more information)...

Why N50?

Maybe it is worth remembering why we even have N50 as a statistic. The final assembly for any genome project, i.e. the one that is described in a paper or uploaded to a database, is not necessarily the biggest assembly that was generated.

This is because there will always be contigs that are too short to be of any use to anyone. In the pre-NGS era of assembly, these contigs could potentially be represented by a single Sanger read that didn't align to anything else.

However, one might consider that there is still some utility in including a long Sanger read in an assembly, even if it doesn't overlap with any other reads. After quality trimming, there are many Sanger reads that are over 1,000 bp in length and this is long enough to contain an exon or two, or even an entire gene (depending on the species). But what if a contig was 500 bp? Or 250 bp? Or 100 bp? Where do you draw the line?

Clearly it is not desirable to fill up an assembly with an abundance of overly short sequences, even if they represent accurate and unique biological sequences from the genome of interest. So it has always been common to remove very short contigs from an assembly. The problem is that different groups might use very different criteria for what to keep and what to exclude. Imagine a simple assembly consisting of the following contigs:

  • 5,000 x 100 bp
  • 100 x 10 Kbp
  • 10 x 1 Mbp
  • 1 x 10 Mbp

Now let's say that Distinguished Bioinformatics Center #1 decides to produce an assembly (DBC1) that includes all of these contigs. However, another DBC decides to make another assembly from the same data, but they remove the 100 bp contigs. A third DBC decides to also remove the 10 Kbp contigs. What does this do to the mean contig length of each assembly?

Mean contig lengths

  • DBC1 = 4,207 bp
  • DBC2 = 189,189 bp
  • DBC3 = 1,818,182 bp

Hopefully it becomes obvious that if you only considered mean contig length, it would be extremely easy to 'game' the system by deliberately excluding shorter contigs (no matter their utility) just to increase the average contig length. But what do we get if we instead rely on N50?

N50 contig lengths

  • DBC1 = 1 Mbp
  • DBC2 = 1 Mbp
  • DBC3 = 10 Mbp

This now reduces the overall discrepancies, and puts DBC1 and DBC2 on an equal footing. But you might still, naively, conclude that DBC3 is the better assembly, and if you were extremely wide-eyed and innocent, then maybe you would conclude that DBC3 was ten times better than the other two assemblies.

So N50 does a better, though still imperfect, job of avoiding the dangers inherent in relying on the mean length. In some ways, the actual method you use to calculate N50 does not matter too much, as long as you use the same method when comparing all assemblies. Back in the day of Sanger-sequence-derived assemblies, it was fairly common to see assembly statistics report not just N50, but everything from N10 through to N90. This gives you a much better insight into the overall variation in contig (or scaffold) lengths.
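
To make the calculation concrete, here is a minimal Perl sketch of an N50 function, using one common convention: N50 is the length of the contig at which the running total, summed from longest to shortest, first reaches 50% of the total assembly size. The example input is the DBC2 assembly from above:

#!/usr/bin/env perl
# Minimal sketch: calculate N50 from a list of contig lengths.
use strict;
use warnings;

sub n50 {
    my @lengths = sort { $b <=> $a } @_;    # longest to shortest
    my $total = 0;
    $total += $_ for @lengths;
    my $running = 0;
    for my $len (@lengths) {
        $running += $len;
        return $len if $running >= $total / 2;
    }
    return 0;
}

# DBC2 from the example above: 1 x 10 Mbp, 10 x 1 Mbp, 100 x 10 Kbp
my @dbc2 = (10_000_000, (1_000_000) x 10, (10_000) x 100);
print 'N50 = ', n50(@dbc2), " bp\n";        # prints 'N50 = 1000000 bp'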

In the Assemblathon and Assemblathon 2 contests, we actually plotted all N(x) values to see the full distribution of contig lengths. Except we didn't use the N50 metric; we used something called NG50. This normalizes for differences in overall assembly sizes by asking 'at what contig/scaffold length — from a set of sorted scaffold lengths — do we see a sum length that is greater than 50% of the estimated or known genome size?'

Returning to my earlier fictional example, let's assume that the estimated size of the genome being assembled was 25 Mbp. This means we want to see what contig length takes us past 12.5 Mbp (when summing all contig lengths from longest to shortest):

NG50 contig lengths

  • DBC1 = 1 Mbp
  • DBC2 = 1 Mbp
  • DBC3 = 1 Mbp

We now see parity between all assemblies, at least when assessing their contig lengths. The actual differences in the assemblies mostly reflect variations in which short sequences are included which may, or may not, be of any utility to the end user of the assembly. In my mind, this gives us a much fairer way of comparing the assemblies, at least in terms of their average contig lengths.
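
The NG50 calculation is almost identical to N50; the only difference is that the target is half of the estimated genome size rather than half of the assembly size. Here is a minimal sketch, generalized to any NG(x) value, so looping x from 1 to 100 would generate the data behind the NG(X) plots discussed in the next section:

#!/usr/bin/env perl
# Minimal sketch: NG(x) uses the estimated genome size instead of the assembly size.
use strict;
use warnings;

sub ngx {
    my ($genome_size, $x, @lengths) = @_;
    my $running = 0;
    for my $len (sort { $b <=> $a } @lengths) {
        $running += $len;
        return $len if $running >= $genome_size * $x / 100;
    }
    return 0;    # assembly sums to less than x% of the estimated genome size
}

# DBC3 from the example above, with an estimated genome size of 25 Mbp
my @dbc3 = (10_000_000, (1_000_000) x 10);
print 'NG50 = ', ngx(25_000_000, 50, @dbc3), " bp\n";   # prints 'NG50 = 1000000 bp'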

NG(X) graphs

Here is an example of an NG(X) chart that borrows some results from the Assemblathon 2 contest:

[NG(X) graph: scaffold length (log scale) versus NG(X) for several Assemblathon 2 assemblies]

Some things to note:

  • Y-axis (scaffold length) is plotted on a log scale
  • Each line represents scaffold lengths from a genome assembly
  • The NG50 value is shown as a dotted line. If you only looked at this you might consider that the red assembly is the best and purple the worst.
  • The first data point in each series reveals the longest scaffold length. This is why you can see horizontal lines for some entries. E.g. the longest scaffold for the red assembly is about 70 Mbp; this exceeds the value of NG1, NG2, NG3 etc.
  • From NG1 to NG50, the relative differences between the assemblies are fairly constant. One exception: the longest scaffold in the purple assembly is longer than that in the orange assembly, but by NG5 the orange assembly has longer scaffolds.
  • Around NG75, we start to see changes. The lengths of scaffolds in the blue assembly start to drop off.
  • At NG90, the lengths of scaffolds in the green assembly plummet. The fact that this series touches the x-axis (at about NG91) tells us that the sum size of this assembly is less than the estimated genome size of the species in question (i.e. you have to have a value at NG100 to have 100% of the estimated genome size present in your assembly).
  • By NG95, the orange assembly has moved from 5th to 2nd. This just means that the assembly contains a higher proportion of longer scaffolds. However...
  • Three of the assemblies reach NG100 with positive values on the y-axis. This reveals that these assemblies are all bigger than the estimated genome size. Ideally we could continue this plot to see where they would eventually touch the x-axis. If they reached NG200, then we would know that the genome assembly is twice the estimated genome size.
  • Hopefully, this shows that you can get a lot of information out of a plot like this.

In summary

N50 is just a better, though still flawed, way of calculating an average sequence length from a set of sequences. It can still be biased, and you really should consider what the estimated/known genome size is (when comparing two or more assemblies of the same species), and/or look to see the full distribution of sequence lengths (e.g. with an NG(x) graph).