In a recent blog post, Lex Nederbragt explains why we all need to be moving to graph-based representations of genome sequences (and getting away from linear representations).
In this post I will provide some background, explain the reasons for moving towards graph-based representations, and indicate some challenges associated with this development.
After listing the many challenges involved in moving towards a graph-based future, he refers to the fact that current efforts have not been widely adopted:
Two file formats to represent the graph been developed: FASTG and GFA. FASTG has limited uptake, only two assembly programs (ALLPATHS_LG and SPAdes) will output in that format. GFA parsing is currently only experimentally in the ABYSS assembler, and [the VG program] is able to output it.
The lack of a widely recognized and supported standard for representing variation inherent in a genome sequence is, in my opinion, a major barrier to moving forward. Almost all bioinformatics software that works with genome sequence data expects a single sequence with no variation. It will require a whole new generation of tools to work with a variant-based format, but tool developers will be reluctant to write new code if there is no clear agreement on what is the new de facto file format.