My thoughts as an enterprise Java developer.

Tuesday, January 29, 2008

Explaing zero-based arrays

Sometimes people have a hard time understand zero-based arrays but I just realized that ages are zero-based.

The first year of life someone is 0 years old.
The 2nd year of life someone is 1 year old.
Etc.

Friday, January 25, 2008

Genome assembler

I have been listening to "The Genome War" and it seems like the the difficulty of assembling the pieces from the Shotgun sequencing isn't has hard as described in the book.

Data points:
Base pairs in the Human Genome: 3 billion.
Different types of base pairs: 4
Sequence length: 2,000
Maximum number of sequences per whole genome: 1,500,000
Number of reads to get coverage: 12
Total number of sequences to assemble: 18,000,000
Total ends to match: 36,000,000
Length of an end: 50 base pairs (13 bytes can store 52 base pairs)

Technique:
So if you made a database table with a column to store the end code (from the outside) and a sequence column to store the sequence that for that end, you would have 2 rows per sequence for a total of 36,000,000 rows. Each row would have 513 bytes of info for a total of 17 GB of data.
Then make a table called End48 with a Key48 column for 48 base pairs (12 bytes) and a Value52 for 52 base pairs (13 bytes).
Fill that table so that each end in the main table is put in the Value52 column and and the first 48 base pairs of that end are put in the Key48 column.
Continue stripping down the end base pairs into tables End44, End40, End36, ..., End8, End4. (You actually only have to go as far down as the DB can easily sort the key column -- I think current DBs wouldn't need any End tables).
Then you can start by sorting the key column of the smallest End table and continue until all of the sequences are sorted by their end pieces. Finally go through the sorted results and join any sequences that have the same end value. A lot of this processing could occur as the data comes in so you wouldn't have to wait until the end to do all of the computing.

This assumes that sequences always end at the same point which I am not sure is true. If that isn't true then you might end up with multiple copies of the genome in the final assembly stage but that would not be a problem.

That technique could be easily done with a few thousand dollars of hardware today and I assume it would not have taken a super-computer in 1999-2000.

So what am I missing because apparently the problem was much more difficult to solve?

Reference: http://en.wikipedia.org/wiki/Human_genome_project

Tuesday, January 22, 2008

What kind of programming work do you like best?

When interviewing candidates I would ask them the following question:
What kind of programming work do you like best?

They always (3-4 interviews) gave then same answer: Creating new software.
Since that isn't the answer that I would give I asked my co-workers and between 4 of us we had 3 different answers:
  1. Creating something new
  2. Debugging problems.
  3. Improving (adding features, improving performance, etc).
Would you give one of those answers? Do you have a different answer?

Wednesday, January 02, 2008

Zooming to better use desktop space

On a real-life desktop I will often move toward an object to work with it. i.e. When I want to work with the calendar I will move closer to it. That allows better use of my desktop because some objects are "zoomed out" and therefore take less space (in my field of view).

Has anyone ever tried to incorporate a similar idea into a GUI. Icons, minimize/maximize, and resizing are somewhat like that but not quite enough. It would be nice to resize something to a "zoomed out size". i.e. I could set a size for my email program when it isn't in use and it would be shrunk to fit in that space when desired.