My thoughts as an enterprise Java developer.

Friday, January 25, 2008

Genome assembler

I have been listening to "The Genome War" and it seems like the the difficulty of assembling the pieces from the Shotgun sequencing isn't has hard as described in the book.

Data points:
Base pairs in the Human Genome: 3 billion.
Different types of base pairs: 4
Sequence length: 2,000
Maximum number of sequences per whole genome: 1,500,000
Number of reads to get coverage: 12
Total number of sequences to assemble: 18,000,000
Total ends to match: 36,000,000
Length of an end: 50 base pairs (13 bytes can store 52 base pairs)

Technique:
So if you made a database table with a column to store the end code (from the outside) and a sequence column to store the sequence that for that end, you would have 2 rows per sequence for a total of 36,000,000 rows. Each row would have 513 bytes of info for a total of 17 GB of data.
Then make a table called End48 with a Key48 column for 48 base pairs (12 bytes) and a Value52 for 52 base pairs (13 bytes).
Fill that table so that each end in the main table is put in the Value52 column and and the first 48 base pairs of that end are put in the Key48 column.
Continue stripping down the end base pairs into tables End44, End40, End36, ..., End8, End4. (You actually only have to go as far down as the DB can easily sort the key column -- I think current DBs wouldn't need any End tables).
Then you can start by sorting the key column of the smallest End table and continue until all of the sequences are sorted by their end pieces. Finally go through the sorted results and join any sequences that have the same end value. A lot of this processing could occur as the data comes in so you wouldn't have to wait until the end to do all of the computing.

This assumes that sequences always end at the same point which I am not sure is true. If that isn't true then you might end up with multiple copies of the genome in the final assembly stage but that would not be a problem.

That technique could be easily done with a few thousand dollars of hardware today and I assume it would not have taken a super-computer in 1999-2000.

So what am I missing because apparently the problem was much more difficult to solve?

Reference: http://en.wikipedia.org/wiki/Human_genome_project

No comments: