Shaun Jackman added a comment to my previous post about the ongoing development of a new format for representing genome assemblies. I thought I would reproduce it in a separate blog post to bring more attention to this issue.
But first, a quick reminder that currently nearly all genome assemblies are ultimately stored as DNA sequences in FASTA format. This format was developed over 25 years ago and is not best suited to representing a genome assembly.
One obvious reason for this is that we commonly sequence the genomes of diploid individuals who have two genomes present in every cell (one derived from each parent). We often know that a particular region of the genome should be represented as sequence X or sequence Y, but the FASTA format requires you to choose one or the other.
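To make that limitation concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the segment names, the sequences, and the `spell_paths` helper are hypothetical, and the dictionaries are not FASTG or GFA syntax. It only shows the underlying idea that a graph can store both alleles, whereas a FASTA record stores one linear sequence.

```python
# Toy illustration only: names and sequences are invented, and this is
# not the syntax of FASTG, GFA, or any other real assembly format.

# A FASTA record is one header plus one linear sequence, so at a
# heterozygous site we must pick one allele and discard the other.
fasta_record = ">contig1\nACGTTGCA\n"  # allele X kept, allele Y lost

# A graph stores the shared flanks once and both alleles as
# alternative segments between them.
segments = {
    "left": "ACG",
    "allele_X": "T",
    "allele_Y": "C",
    "right": "TGCA",
}
links = {
    "left": ["allele_X", "allele_Y"],
    "allele_X": ["right"],
    "allele_Y": ["right"],
    "right": [],
}


def spell_paths(node, segments, links):
    """Yield each sequence spelled by a path from `node` to a segment with no outgoing links."""
    if not links[node]:
        yield segments[node]
        return
    for next_node in links[node]:
        for suffix in spell_paths(next_node, segments, links):
            yield segments[node] + suffix


# Both haplotypes can be recovered from the graph.
haplotypes = sorted(spell_paths("left", segments, links))
print(haplotypes)  # ['ACGCTGCA', 'ACGTTGCA']
```

The FASTA sequence is just one of the two paths through the graph; the other path is the information that gets thrown away when an assembly is flattened to FASTA.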
There has already been one effort to develop a new file format that better represents the variation present in an assembly, and a final specification was formalized. However, as far as I know, this FASTG format has not been widely adopted by the community.
At this point, I will simply reproduce Shaun's comment from the earlier post (minor edits made to restructure some of the links and layout):
There have been three fantastic blog posts in the past three months on the topic of devising a common file format for a sequence overlap graph to enable modular assembly pipelines.
Heng Li (@lh3lh3) has proposed a Graphical Fragment Assembly (GFA) file format. An implementation will be included in the next release of ABySS. Jared Simpson (@jaredtsimpson) is working on an implementation for String Graph Assembler (SGA). I hope that other implementations will follow.
- Dear assemblers, we need to talk … together by Páll Melsted (@pmelsted) and Michael R. Crusoe (@biocrusoe). tl;dr we need a common file format for contig graphs and other stuff too
- A proposal of the Grapical Fragment Assembly format by Heng Li and…
- First update on GFA by Heng Li
Please add your comments to this posting with your thoughts on the GFA file format.
There are a lot of comments on the two blog posts by Heng, and I tweeted my (minor) concerns regarding how this format proposal has developed. This led to some further discussion on Twitter, some of which I have storified.
I hope that Heng takes up Shaun's suggestion to move the spec to GitHub. The FASTG proposal used a mailing list to help focus some of the discussion and I feel that something similar needs to happen to ensure that any future debate about the GFA format is productive.