Genome Graphs and Genome Assembly
As a first activity update for this OpenCollective, I'd like to describe some of the development activity being done this year.
Since work on BioJulia started several years ago, a featureful set of packages has become available.
You can find utilities for working with sequences, alignments, annotations/intervals, databases and more.
Implementing these packages resulted in some consistent frameworks and APIs that make tasks (for example, working with any text-based bioinformatics format) consistent and intuitive.
However, what is less prevalent in the BioJulia ecosystem are packages that provide higher-level, task-oriented frameworks.
This year I am aiming to change that with GenomeGraphs.jl. This package will provide a library with types and methods for manipulating genome assembly graphs.
More than providing some types and methods that are generically useful, it will also provide a framework for working through a high-level task or workflow, namely constructing and analysing genome assemblies. GenomeGraphs already allows you to construct a collapsed de Bruijn graph from raw sequencing read data, and to clip erroneous tip structures from that graph, using just two high-level methods exported by the package.
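To make those two steps concrete for readers new to assembly graphs, here is a minimal, library-independent sketch in Python. It is not the GenomeGraphs.jl API: the function names (`build_debruijn`, `collapse`, `clip_tips`) and the `max_len` threshold are hypothetical and chosen only for illustration. It also makes simplifying assumptions a real assembler would not: reverse complements, k-mer coverage and base qualities are ignored, and isolated cycles are skipped when collapsing.

```python
def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_debruijn(reads, k):
    """Node-centric de Bruijn graph: nodes are k-mers, and an edge joins
    two k-mers when the (k-1)-suffix of one is the (k-1)-prefix of the other."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    succ = {n: set() for n in nodes}
    pred = {n: set() for n in nodes}
    for n in nodes:
        for base in "ACGT":
            m = n[1:] + base
            if m in nodes:
                succ[n].add(m)
                pred[m].add(n)
    return succ, pred


def collapse(succ, pred):
    """Merge unbranched paths of k-mers into single sequences (unitigs).
    Isolated cycles have no start node and are skipped in this sketch."""
    def is_start(n):
        if len(pred[n]) != 1:
            return True
        (p,) = pred[n]
        return len(succ[p]) != 1

    unitigs, visited = [], set()
    for n in sorted(succ):
        if not is_start(n):
            continue
        path, cur = [n], n
        visited.add(n)
        while len(succ[cur]) == 1:
            (nxt,) = succ[cur]
            if len(pred[nxt]) != 1 or nxt in visited:
                break
            path.append(nxt)
            visited.add(nxt)
            cur = nxt
        # Each extra k-mer on the path contributes one new base.
        unitigs.append(path[0] + "".join(p[-1] for p in path[1:]))
    return unitigs


def clip_tips(succ, pred, max_len):
    """Remove dead-end chains of at most max_len k-mers that hang off a branch."""
    def find_tip(end, fwd, bwd):
        chain, cur = [end], end
        while len(bwd[cur]) == 1:
            (prev,) = bwd[cur]
            if len(fwd[prev]) > 1:
                return chain      # reached a branch point: the chain is a tip
            if len(chain) == max_len:
                return []         # too long to be treated as an error
            chain.append(prev)
            cur = prev
        return []                 # isolated path or merge point: leave it alone

    doomed = set()
    for n in succ:
        if not succ[n]:           # dead end with no outgoing edge
            doomed.update(find_tip(n, succ, pred))
        if not pred[n]:           # dead end with no incoming edge
            doomed.update(find_tip(n, pred, succ))
    keep = set(succ) - doomed
    return ({n: succ[n] & keep for n in keep},
            {n: pred[n] & keep for n in keep})


# Two overlapping reads from "ATGGCGTGCA", plus one read ending in an error.
reads = ["ATGGCGTG", "GCGTGCA", "GGCGTA"]
succ, pred = build_debruijn(reads, 4)
print(collapse(succ, pred))   # ['ATGGCGT', 'CGTA', 'CGTGCA']
succ, pred = clip_tips(succ, pred, 2)
print(collapse(succ, pred))   # ['ATGGCGTGCA']
```

The erroneous final base in the third read creates the short dead-end branch `CGTA`. Once that tip is clipped, the remaining k-mers collapse back into a single sequence. Note that a length threshold alone cannot tell a short true end from an error, which is one reason real tools also use coverage information.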
I hope that this package will represent a change in BioJulia development. Now that BioJulia has a lot of packages covering common data types and file formats, packages can start to bring these together to provide productive frameworks. In the future, following GenomeGraphs' example, we may have framework packages for such high-level workflows as annotating genomes, performing a population genetic analysis, or comparative and structural analyses.
Full documentation of GenomeGraphs can be found here; it includes a full justification of the package and a walkthrough of constructing and analysing a genome assembly graph.
I am updating a post here with development notes and discussion with the Julia community.
Feedback is welcome. The package is in the early stages of development, but it isn't too far from a limited v0.1.0 already, given the aforementioned graph construction functionality. Expect performance to be ok for small genomes and to suck for big ones at present, as "working and correct" is more important to me at this stage of development than "fastest".