Get shorty: the decreasing usefulness of referring to 'short read' sequences

I came across a new paper today:

When I see such papers, I always want to know 'what do you consider short?'. This particular paper makes no attempt to define what 'short' refers to, with the only mention of the 'S' word in the paper being as follows:

Finally, qAlign stores metadata for all generated BAM files, including information about alignment parameters and checksums for genome and short read sequences

There are hundreds of papers that mention 'short read' in their title and many more which refer to 'long read' sequences.

But 'short' and 'long' are tremendously unhelpful terms to refer to sequences. They mean different things to different people, and they can even mean different things to the same person at different times. I think that most people would agree that Illumina's HiSeq and MiSeq platforms are considered 'short read' technologies. The HiSeq 2500 is currently capable of generating 250 bp reads (in Rapid Run Mode), yet this is an order of magnitude greater than when Illumina/Solexa started out generating ~25 bp reads. So should we refer to these as long-short reads?

The lengths of reads generated by the first wave of new sequencing technologies (Solexa/Illumina, ABI SOLiD, and Ion Torrent) were initially compared to the 'long' (~800 bp) reads generated by Sanger sequencing methods. But these technologies have evolved steadily. The latest reagent kits for the MiSeq platform offer the possibility of 300 bp reads. However, if you perform paired-end sequencing of libraries with insert sizes of ~600 bp, then you may end up generating single consensus reads that approach 600 bp. Thus we are already at the point where a 'short read' sequencing technology can generate some reads that are longer than some of the reads produced by the former gold-standard 'long read' technology.
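The overlap arithmetic behind this is simple enough to sketch. The following is a hypothetical helper (not code from any of the tools mentioned): a read pair can only be merged into a single consensus read when the insert is shorter than the combined length of the two reads.

```python
# Hypothetical helper illustrating the paired-end overlap arithmetic
# described above; the example numbers are illustrative only.
def merged_consensus_length(read_len, insert_size):
    """Return the merged consensus length if the two reads of a pair
    overlap, or None if the insert is too long for them to meet."""
    overlap = 2 * read_len - insert_size
    if overlap <= 0:
        return None  # reads never meet; no single consensus read
    return insert_size  # the merged read spans the whole insert

# 2 x 300 bp MiSeq reads over a 550 bp insert overlap by 50 bp,
# giving one 550 bp consensus read that approaches Sanger read lengths.
print(merged_consensus_length(300, 550))
```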

But the read lengths of any of these technologies pale in comparison when we consider the output of instruments from Pacific Biosciences and Oxford Nanopore. By their standards, even Sanger sequencing reads could be considered 'short'.

If someone currently has reads that are 500–600 bp in length, it is not clear whether any software tool that claims to work with 'short reads' is suitable. Just as the 'Short Read Archive' (SRA) became the more-meaningfully-named Sequence Read Archive, so we as a community should banish these unhelpful names. If you develop tools that are optimized to work with 'short' or 'long' read data, please provide explicit guidelines as to what you mean!

To conclude:

There are no 'short' or 'long' reads, there are only sequences that are shorter or longer than other sequences.

101 questions with a bioinformatician #23: Todd Harris

Todd Harris is a Bioinformatics Consultant and Project Manager at WormBase. I first came to know Todd when I was also working on the WormBase project. As part of the UK operation (based at the Sanger Institute), we would frequently refer to him as 'SuperTodd' for his amazing skills at single-handedly keeping the WormBase website updated and working smoothly.

Read More

Details of GFF version 4 have emerged

gff4.png

One of the most widely used file formats in bioinformatics is the General Feature Format (GFF). This venerable tab-delimited format uses 9 columns of text to help describe any set of features that can be localized to a DNA or RNA sequence.

It is most commonly used to provide a set of genome annotations that accompany a genome sequence file, and the success of this format has also spawned the similar Gene Transfer Format (GTF), which focuses on gene structural information.
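As a concrete illustration of that 9-column layout, here is a minimal parsing sketch. The field names follow the GFF3 specification; the sample line and the helper itself are my own invented example, not code from any GFF tool.

```python
# Minimal sketch of splitting one tab-delimited GFF line into its
# 9 named columns (per the GFF3 spec); the sample record is invented.
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]

def parse_gff_line(line):
    """Return a dict mapping GFF column names to values for one
    non-comment GFF line, with start/end converted to integers."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GFF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record

example = "chrI\tcuration\texon\t201\t350\t.\t+\t0\tID=exon001"
print(parse_gff_line(example)["type"])  # exon
```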

GFF has been an evolving format, and the widely adopted 2nd version has largely been superseded by GFF version 3, which was developed by Lincoln Stein from around 2003 onwards.

As version 3 is now over a decade old, work has been ongoing to develop a new GFF version 4 that is suitable for the rigors of modern-day genomics. The principal change in version 4 will be the addition of a 10th GFF column. This 'Feature ID' column is defined in the spec as follows:

Column 10: Feature ID

Format: FeatureID=<integer>

Every feature in a GFF file should be referenced by a numerical identifier which is unique to that particular feature across all GFF files in existence.

This field will store an integer in the range 1–999,999,999,999,999 (no zero-padding) and identifiers will be generated via tools available from the GFF 4 consortium. If you wish to generate a GFF 4 file, you will need to obtain official sanctioned Feature IDs for this mandatory field.

The advantage of this new field is that all bioinformatics tools and databases will have a convenient way to uniquely reference any feature in any GFF file (as long as it is version 4 compliant).

Large institutions may wish to work with the GFF 4 consortium to reserve blocks of consecutive numeric ranges for Feature IDs.

It is intended that the GFF 4 consortium will act as a gatekeeper to all Feature IDs, and that via their APIs you will be able to check whether any given Feature ID exists; if it does, you will be able to extract the relevant details of that feature from whatever GFF file in the world contains that specific Feature ID.

Here is an example of how GFF version 4 would describe an intron from a gene:

## gff-version 4
## sub-version 1.02
## generated: 2015-02-01
## sequence-region   chrX 1 2097228
chrX    Coding_transcript   intron   14192   14266   .   -   .   gene=Gene00071   FeatureID=125731789

In this example, the intron is the 125,731,789th feature to be registered globally with the GFF 4 consortium. The big advantage of this format is a researcher can now guarantee that this particular Feature ID will not exist in any other GFF file anywhere in the world. The use of unique identifiers like this will be a huge leap forward for bioinformatics as we will no longer have to worry about lines in our GFF files possibly existing in someone else's GFF files as well.

Update: check the date

Bogus bioinformatics acronyms…there's a lot of them about

Time for some new JABBA awards to recognize the ongoing series of crimes perpetrated in the name of bioinformatics. Two new examples this week…

 

Exhibit A (h/t @attilacsordas): from arxiv.org we have…

CoMEt derives from 'Combinations of Mutually Exclusive Alterations'. Of course, the best way of making it easy for people to find your bioinformatics tool is to give it an identical name to an existing tool which does something completely different. So don't be surprised if you search the web for 'CoMEt' only to find a bioinformatics tool called 'CoMet' from 2011 (note the lower-case 'e'!). CoMet is a web server for comparative functional profiling of metagenomes.

 

Exhibit B: from the journal Bioinformatics — the leading provider of bogus bioinformatics acronyms since 1998 — we have…

MUSCLE is derived from 'Multi-platform Unbiased optimization of Spectrometry via Closed-Loop Experimentation'. Multi-platform you say? What platforms would those be? From the paper:

MUSCLE is a stand-alone desktop application and has been tested on Windows XP, 7 and 8

What, no love for Windows Vista?

Of course, it should be obvious to anyone that this bioinformatics tool called MUSCLE should in no way be confused with the other (pre-existing) bioinformatics tool called MUSCLE.

Is Amazon's new 'unlimited' cloud drive suitable for bioinformatics?

Amazon have revealed new plans for their cloud drive service. Impressively, their 'Unlimited Everything' plan offers the chance to store an unlimited number of files and documents for just $59.99 per year (after a 3-month free trial no less).

News of this new unlimited storage service caught the attention of more than one bioinformatician:

If you didn't know, bioinformatics research can generate a lot of data. It is not uncommon to see individual files of DNA sequences stored in the FASTQ format reach 15–20 GB in size (and this is just plain text). Such files are nearly always processed to remove errors and contamination, resulting in slightly smaller versions of each file. These processed files are often mapped to a genome or transcriptome, which generates even more output files. The new output files in turn may be processed with other software tools, leading to yet more output files. The raw input data should always be kept in case experiments need to be re-run with different settings, so a typical bioinformatics pipeline may end up generating terabytes of data. Compression can help, but the typical research group will always be generating more and more data, which usually means a constant struggle to store (and back up) everything.

So could Amazon's new unlimited storage offer a way of dealing with the common file-management headache which plagues bioinformaticians (and their sys admins)? Well probably not. Their Terms of Use contain an important section (emphasis mine):

3.2 Usage Restrictions and Limits. The Service is offered in the United States. We may restrict access from other locations. There may be limits on the types of content you can store and share using the Service, such as file types we don't support, and on the number or type of devices you can use to access the Service. We may impose other restrictions on use of the Service.

You may be able to get away with using this service to store large amounts of bioinformatics data, but I don't think Amazon are intending for it to be used by anyone in this manner. So it wouldn't surprise me if Amazon quietly started imposing restrictions on certain file types or slowing bandwidth for heavy users such that it would make it impractical to rely on for day-to-day usage.

Google and WormBase: these are not the search results you're looking for

Today I wanted to look up a particular gene in the WormBase database. Rather than go to the WormBase website, I thought I would just search Google for the word 'wormbase' followed by the gene name (rpl-22). Surely this would be enough to put the Gene Summary page for rpl-22 at the top of the results?

Sadly no. Here are the results that I was presented with:

All ten of these results include information from the WormBase database regarding the rpl-22 gene and/or link to the WormBase page for the gene. But there are no search results for wormbase.org at all.

Very odd. Is WormBase not allowing themselves to be indexed by search engines? I see a similar lack of wormbase.org results when using Bing, Ask, or DuckDuckGo. However, if I search Google for flybase rpl-22 or pombase rpl-22, I find the desired fly/yeast orthologs of the worm rpl-22 gene as the top Google result.

Regarding the current state of bioinformatics training

Todd Harris (@tharris) is on a bit of a roll at the moment. Last month I linked to his excellent blog post regarding community annotation, and today I find myself linking to his latest blog post:

Todd makes a convincing argument that bioinformatics education has largely failed, and he lists three reasons for this, the last of which is as follows:

Finally, the nature of much bioinformatics training is too rarefied. It doesn’t spend enough time on core skills like basic scripting and data processing. For example, algorithm development has no place in a bioinformatics overview course, more so if that is the only exposure to the field the student will have.

I particularly empathize with this point. There should be a much greater emphasis on core data processing skills in bioinformatics training, but ideally students should be getting access to some of these skills at an even earlier age. Efforts such as the Hour of Code initiative are helping raise awareness regarding the need to teach coding skills — and it's good to see the President join in with this — but it would be so much better if coding was part of the curriculum everywhere. As Steve Jobs once said:

"I think everybody in this country should learn … a computer language because it teaches you how to think … I view computer science as a liberal art. It should be something that everybody learns, takes a year in their life, one of the courses they take is learn how to program" — Steve Jobs, 1995.

Taken from 'Steve Jobs: The Lost Interview'.

Maybe this is still a pipe dream, but if we can't teach useful coding skills for everyone, we should at least be doing this for everyone who is considering any sort of career in the biological sciences. During my time at UC Davis, I've helped teach some basic Unix and Perl skills to many graduate students, but frustratingly this teaching has often come at the end of their first year in Grad School. By this point in their graduate training, they have often already encountered many data management problems and have not been equipped with the necessary skills to help them deal with those problems.

I think that part of the problem is that we still use the label 'bioinformatics training' and this reinforces the distinction from a more generic 'biological training'. It may once have been the case that bioinformatics was its own specialized field, but today I find that bioinformatics mostly just describes a useful set of data processing skills…skills which will be needed by anybody working in the life sciences.

Maybe we need to rebrand 'bioinformatics training', and use a name which better describes the general importance of these skills ('Essential data training for biologists?'). Whatever we decide to call it, it is clear that we need it more than ever. Todd ends his post with a great piece of advice for any current graduate students in the biosciences:

You should be receiving bioinformatics training as part of your core curriculum. If you aren’t, your program is failing you and you should seek out this training independently. You should also ask your program leaders and department chairs why training in this field isn’t being made available to you.


Thinking of naming your bioinformatics software? Be afraid (of me), be very afraid (of me)

Saw this tweet today by Jessica Chong (@jxchong):

I like the idea that people might be fearful of choosing a name that could provoke me into writing a JABBA post. If my never-ending tirade about bogus bioinformatics acronyms causes some people to at least think twice about their intended name, then I will take that as a minor victory for this website!

More madness with MAPQ scores (a.k.a. why bioinformaticians hate poor and incomplete software documentation)

I have previously written about the range of mapping quality scores (MAPQ) that you might see in BAM/SAM files, as produced by popular read mapping programs. A very quick recap:

  1. Bowtie 2 generates MAPQ scores between 0–42
  2. BWA generates MAPQ scores between 0–37
  3. Neither piece of software describes the range of possible scores in their documentation
  4. The SAM specification defines the possible ranges of the MAPQ score as 0–255 (though 255 should indicate that mapping quality was not available)
  5. I advocated that you should always take a look at your mapped sequence data to see what ranges of scores are present before doing anything else with your BAM/SAM files
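Point 5 is easy to act on before any downstream analysis. Here is a minimal sketch of tallying the MAPQ values actually present in a SAM stream (the sample records below are invented for illustration; real data would come from something like `samtools view`):

```python
from collections import Counter

# Sketch of point 5 above: count the MAPQ values (5th column) present
# in SAM records before trusting any assumptions about their range.
def mapq_counts(sam_lines):
    """Return a Counter of MAPQ values from non-header SAM lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        counts[int(fields[4])] += 1  # MAPQ is column 5 (0-based index 4)
    return counts

sample = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t50\t36M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t0\t36M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t300\t50\t36M\t*\t0\t0\tACGT\tIIII",
]
print(sorted(mapq_counts(sample).items()))  # [(0, 1), (50, 2)]
```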

So what is my latest gripe? Well, I've recently been running TopHat (version 2.0.13) to map some RNA-Seq reads to a genome sequence. TopHat uses Bowtie (or Bowtie 2) as the tool to do the initial mapping of reads to the genome, so you might expect it to generate the same range of MAPQ scores as the standalone version of Bowtie.

But it doesn't.

From my initial testing, it seems that the BAM/SAM output file from TopHat only contains MAPQ scores of 0, 1, 3, or 50. I find this puzzling and incongruous. Why produce only four MAPQ scores (compared to >30 different values that Bowtie 2 can produce), and why change the maximum possible value to 50? I turned to the TopHat manual, but found no explanation regarding MAPQ scores.

Turning to Google, I found this useful Biostars post which suggests that five MAPQ values are possible with TopHat (you can also have a value of 2 which I didn't see in my data), and that these values correspond to the following:

  • 0 = maps to 10 or more locations
  • 1 = maps to 4-9 locations
  • 2 = maps to 3 locations
  • 3 = maps to 2 locations
  • 50 = unique mapping

The post also reveals that, confusingly, TopHat previously used a value of 255 to indicate uniquely mapped reads. However, I then found another Biostars post which says that a MAPQ score of 2 isn't possible with TopHat, and that the meaning of the scores are as follows:

  • 0 = maps to 5 or more locations
  • 1 = maps to 3-4 locations
  • 3 = maps to 2 locations
  • 255 = unique mapping

This post was in reference to an older version of TopHat (1.4.1) which probably explains the use of the 255 score rather than 50. The comments on this post reflect some of the confusion over this topic. Going back to the original Biostars post, I then noticed a recent comment suggesting that MAPQ scores of 24, 28, 41, 42, and 44 are also possible with TopHat (version 2.0.13).

As this situation shows, when there is no official explanation that fully describes how a piece of software should work, it can lead to mass speculation by others. Such speculation can sometimes be inconsistent, which can end up making things even more confusing. This is what drives bioinformaticians crazy.

I find it deeply frustrating when so much of this confusion could be removed with better documentation by the people that developed the original software. In this case the documentation needs just one paragraph added; something along the lines of…

Mapping Quality scores (MAPQ)
TopHat outputs MAPQ scores in the BAM/SAM files with possible values 0, 1, 2, 3, or 50. The first four values indicate mappings to 10 or more, 4–9, 3, or 2 locations respectively, whereas a value of 50 represents a unique match. Please note that older versions of TopHat used a value of 255 for unique matches. Further note that the standalone versions of Bowtie and Bowtie 2 (used by TopHat) produce a different range of MAPQ scores (0–42).

Would that be so hard?

New paper provides a great overview of the current state of genome assembly

The following paper by Stephen Richards and Shwetha Murali has just appeared in the journal Current Opinion in Insect Science:

Best practices in insect genome sequencing: what works and what doesn’t

In some ways I wish they had chosen a different title, as the focus of this paper is much more on genome assembly than genome sequencing. Regardless, it provides a great overview of all of the current strategies in genome assembly, and it should be of interest to any non-insect researchers interested in the best way of putting a genome together. Here is part of the legend from a very informative table in the paper:

Table 1 — De novo genome assembly strategies:
Assembly software is designed for a specific sequencing and assembly strategy. Thus sequence must be generated with the assembly software and algorithm in mind, choosing a sequence strategy designed for a different assembly algorithm, or sequencing without thinking about assembly is usually a recipe for poor un-publishable assemblies. Here we survey different assembly strategies, with different sequence and library construction requirements.