R Allan Barker  science | technology | history | philosophy + curiosity
Last update April 28, 2022

Incredibly fast genome assembly and variant calling

Your genome is very long. Every human carries two copies of it, each with roughly 3 billion DNA bases, enough to circle the Earth if every base were about the size of an ant. However, DNA scanning machines only produce small pieces about 100 bases long. Analyzing just one human genome requires billions of these pieces, totaling up to a trillion bases, and it can take days to assemble them and then search for the millions of variants that make you you. What's worse, these millions of variants make the assembly much more difficult, like assembling a jigsaw puzzle where many pieces are blurred, missing, or just wrong.
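The numbers above are easy to check on the back of an envelope. The short sketch below does the arithmetic, assuming one genome copy of 3 billion bases, an Earth circumference of about 40,075 km, 100-base pieces, and the upper figure of one trillion sequenced bases. The constant names are just illustrative labels.

```python
# Back-of-envelope check of the scale described above.
# All constants are rounded assumptions, not measured values.
GENOME_BASES = 3_000_000_000               # bases in one copy of the genome
EARTH_CIRCUMFERENCE_M = 40_075_000         # equatorial circumference in metres
READ_LENGTH = 100                          # bases per sequenced piece (read)
TOTAL_SEQUENCED_BASES = 1_000_000_000_000  # "up to a trillion" bases

# How large each base must be for one genome copy to circle the Earth.
base_size_cm = EARTH_CIRCUMFERENCE_M / GENOME_BASES * 100

# How many pieces that is, and how many times over the genome is covered.
read_count = TOTAL_SEQUENCED_BASES // READ_LENGTH
mean_coverage = TOTAL_SEQUENCED_BASES / GENOME_BASES

print(f"base size: {base_size_cm:.2f} cm")
print(f"pieces: {read_count:,}")
print(f"mean coverage: {mean_coverage:.0f}x")
```

Under these assumptions each base works out to a little over a centimetre, which is a fairly large ant, and a trillion bases means ten billion pieces covering each position of one genome copy a few hundred times.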

Illumina, the world's leading maker of DNA scanning equipment, recently announced a series of new machines that will lead to $100 genomes and process up to one genome per hour. In a recent presentation, Illumina's CEO acknowledged that there is currently no way to process genomes in such a short time.

As the old saw goes, the difficult we do immediately, the impossible takes a bit longer. The video below shows the complete assembly and variant calling of a full human genome in just minutes.
Video contents in order:

  • Live assembly output as the program runs, true elapsed time at the top.
  • The initial blank background represents a 3-billion-base human genome compressed 1000x.
  • As the assembly runs, the background is marked where data is placed.
  • Placement is uniform except for bright/dark spots of extensive duplication in the human genome.
  • After genome alignment, eight challenging variant calls are shown sequentially.
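The background display described above can be pictured as a simple binned map. The sketch below is only an illustration of that idea, not the program in the video: it assumes each cell of the map stands for 1000 reference bases and counts how many pieces start in each cell. The function names `placement_map` and `unusual_bins` and the threshold `factor` are hypothetical.

```python
def placement_map(genome_length, read_starts, bin_size=1000):
    """Count pieces per bin; each bin stands for `bin_size` reference bases."""
    n_bins = -(-genome_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for start in read_starts:
        if 0 <= start < genome_length:
            counts[start // bin_size] += 1
    return counts


def unusual_bins(counts, factor=2.0):
    """Return bins whose count is far above or far below the mean (bright/dark spots)."""
    mean = sum(counts) / len(counts)
    return [i for i, c in enumerate(counts)
            if c > mean * factor or c < mean / factor]


# Toy data: uniform placement over a 5000-base "genome" ...
starts = [i * 50 for i in range(100)]
# ... plus a pile-up in one region, as duplicated sequence might cause.
starts += [2100] * 60

counts = placement_map(5000, starts)
print(counts)                # [20, 20, 80, 20, 20]
print(unusual_bins(counts))  # [2]
```

At 1000 bases per cell, a 3-billion-base genome needs about 3 million cells, which is roughly the pixel count of an ordinary display.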

Part two of the video shows the results of eight challenging alignments. The assembly window shows:

  • Upper screen: the alignment of individual 100-base pieces (reads) from the data.
  • Lower screen: the base sequence ACGT by color of a 'standard' human reference genome.
  • Mid screen: one (or two) sets of ACGT base comparisons between this person's genome and the standard reference.
  • Dual comparisons usually show two different homologs in this person's genome with respect to the reference.
  • Humans have two sets of genes (homologs); both may differ from each other and from the reference.
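The dual comparison in the mid screen can be thought of as a column-by-column check of two homologs against the reference. The sketch below is a toy illustration under strong assumptions: the three sequences are already aligned to equal length, and '-' marks a deleted base. The function name `classify_sites` and the labels "homozygous" and "heterozygous" are this example's own choices, not output of the software in the video.

```python
def classify_sites(reference, homolog_a, homolog_b):
    """Compare two pre-aligned homologs with a reference, column by column.

    All three strings must have the same length; '-' marks a gap.
    Returns (position, ref_base, base_a, base_b, kind) for every column
    where at least one homolog differs from the reference.
    """
    if not (len(reference) == len(homolog_a) == len(homolog_b)):
        raise ValueError("sequences must be aligned to equal length")
    calls = []
    for pos, (r, a, b) in enumerate(zip(reference, homolog_a, homolog_b)):
        if a == r and b == r:
            continue  # both copies match the reference here
        kind = "homozygous" if a == b else "heterozygous"
        calls.append((pos, r, a, b, kind))
    return calls


ref   = "ACGTACGT"
hom_a = "ACGAACGT"  # differs from the reference at position 3
hom_b = "ACGAAC-T"  # same change at 3, plus a deleted base at position 6

for call in classify_sites(ref, hom_a, hom_b):
    print(call)
```

Here position 3 differs from the reference in both homologs, while position 6 differs in only one of them, which is the kind of case where the display needs two comparison rows.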

Video: fullscreen recommended





Discussion

The software runs on ordinary PC/GPU computer hardware and can be easily implemented anywhere at low cost. The underlying algorithm can be described as an indexed neural network. The same method is applicable to a number of computational problems where the data can be described in terms of features.

The novel aspect of applying this method to genome assembly is the ability to fuse the alignment and variant calling steps into a single application, thus eliminating a multitude of intermediate data processing steps.
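The indexed neural network itself is not described here, so the sketch below does not attempt to reproduce it. Instead it swaps in a much simpler and well-known stand-in, a k-mer index with offset voting, purely to illustrate what "indexing features" and "placing a piece and reading off its differences in one pass" might look like. Every name in it (`build_kmer_index`, `place_read`, `mismatches`, the toy sequences, `k = 4`) is an assumption made for this example.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Map every k-base substring (a 'feature') to its reference positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def place_read(read, index, k):
    """Each k-mer of the read votes for the reference offset it implies."""
    votes = Counter()
    for j in range(len(read) - k + 1):
        for ref_pos in index.get(read[j:j + k], ()):
            votes[ref_pos - j] += 1
    if not votes:
        return None
    return votes.most_common(1)[0]  # (best offset, number of votes)


def mismatches(read, reference, offset):
    """List (reference position, reference base, read base) where they differ."""
    out = []
    for j, base in enumerate(read):
        pos = offset + j
        if 0 <= pos < len(reference) and reference[pos] != base:
            out.append((pos, reference[pos], base))
    return out


k = 4
reference = "GATTACAGGCTTAACCGGTATCGA"
read = "GGCTTAACTGGT"  # reference[7:19] with one base changed

index = build_kmer_index(reference, k)
offset, votes = place_read(read, index, k)
print(offset, votes)                        # 7 5
print(mismatches(read, reference, offset))  # [(15, 'C', 'T')]
```

In this toy version the placement and the candidate variant come out of the same lookup, with no intermediate files in between. A real genome would need far longer features, handling of insertions and deletions, and evidence from many overlapping pieces before anything is called a variant.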


Example variant call

A dual homolog indel (insertion/deletion) variant.