My thoughts as an enterprise Java developer.

Tuesday, September 08, 2009

Distributed Relational Database Architecture

Distributed Relational Database Architecture

James Stauffer

August 1, 2009

Prepared for CS425 Summer 2009

 

 

Table of Contents

 

        I. Introduction

                A.  Focus

                B.  Implementation

        II.  Relevent Features 

                A.  Availability

                B.  Failover

                C.  Replication

                D.  Partitioning

                E.  Single Point of Failure

                F.  Shared Storage

                G.  Inter-node Communication

        III.  Shared Everything: Oracle RAC (Real Application Cluster

                 A. Maintenance availability

                 B.  Inter-node communication

                 C.  Permanent Storage

                 D.  Client interaction

                 E.  Further information 

        IV:  Shared Nothing:  My SQL Cluster  

                 A. Maintenance availability

                 B.  Inter-node communication

                 C.  Permanent Storage

                 D.  Client interaction

                 E.  Further information                

        V.  Sharding      

                 A. Maintenance availability

                 B.  Inter-node communication

                 C.  Permanent Storage

                 D.  Client interaction

                 E.  Further information                 

         VI.  Conclusion

 

 

 

  1. Introduction
    1. Focus

This paper will cover distributed relational database architecture, primarily as it affects availability to the client.  Features that allow the database to run more complex queries or handle larger data sets (scalability) will not be covered.  All discussion will focus on systems with multiple servers (call nodes) involved.

    1. Implementations

This paper will cover three basic types of implementations: Shared Everything, Shared Nothing, and Sharding.  Shared Everything implementations don't actually share everything but just share the data storage between nodes, while shared nothing systems have nodes that are completely independent (including storage) and self-sufficient.  Systems that present the nodes as one or as interchangeable are generally called clusters and include shared everything and shared nothing.  Sharding is similar to shared nothing in that nothing in the nodes is shared, and the nodes are independent.  However, sharding nodes are seen as completely separate and independent (but related) by the client.  Shared everything systems need to synchronize data access and can do it either through a shared storage device or with direct communication.  Oracle RAC will be the example reviewed for shared everything, and MySQL will be the example reviewed for shared nothing.  Sharding doesn’t have specific support in any major database, so it will be generically reviewed with no example.  For each type of architecture, the following will be addressed: maintenance availability, inter-node communication, permanent storage, and client interaction.

  1. Relevant Features

Some major aspects relevant to highly available distributed relational database architecture include: availability, failover, replication, partitioning, single point of failure, shared storage, and inter-node communication.

    1. Availability

Availability refers to the percent of time that a service is responsive and working correctly (synonymous with uptime).  It is usually expressed in a percent with 99.999% (5 nines) considered a very high level of availability.  Downtime refers to the time that a service is not available (i.e. the opposite of availability).  Both planned and unplanned events can affect availability, therefor both need to be addressed.  All types of database systems try to minimize unplanned downtime to some extent, but vary in how much maintenance can be done without downtime.  Techniques to minimize unplanned downtime include: automatic transferring of connections from failed to working nodes, automatic restart of failed nodes/services, and optimal checkpointing to minimize restart time.  Techniques to minimize planned downtime include supporting maintenance operations while the service running is one of the following: allowing patches, updates, reconfiguration, adding or removing nodes, adding or removing databases.

    1. Failover

Failover refers to what happens when one node fails, and how another node (or nodes) starts handling the service that had been provided by the failed node.  Failover may be mostly transparent to the client, or can require the client to connect to another node.  The client can have a front-end, such as a JDBC driver, that can handle some of the cluster activity (i.e. connecting to another node if the cluster doesn't handle that automatically).  Whether or not the client is automatically redirected to another node, any transactions open on the failed node will also fail.

    1. Replication

Replication refers to the copying of data to another system or node – especially so that when a node fails, no data is lost and the data is immediately available.  Asynchronous replication happens in the background so that the client receives the result of its request much quicker, and does have the risk that the node will fail before all data is replicated.  Synchronous replication happens before the result is sent back to the client, but makes each client request take longer.  There can also be multiple levels of replication. Replication can be done to the memory of two nodes synchronously to provide fast response (and minimilize the chance of data loss): and asynchronous replication can be made to permanent storage.

    1. Partitioning

Partitioning refers to splitting the data into parts, and distributing each part to a separate node.  This allows each node to have less data to handle. One problem,  however, is that it can cause the need for requests to contact many nodes in which to access all needed data.  Therefore, partitioning may not be appropriate for all types of databases.  When a node has less data to handle, it can more effectively cache data in memory, it can have more efficient indexes, and therefore it can increase performance.

    1. Single point of failure

A single point of failure is a single component of the system that is used by the whole system, and if that component fails, the whole system will fail.  A system with a single point of failure is much more at risk for failure, therefor a single point of failure should be avoided.  In a highly available system, as many components as possibleare duplicated to help avoid as many single points of failure as possible.  Even when nodes are duplicates, components on each node are often duplicated to reduce the chance of failure on an individual node (i.e. network interface, storage interface, power supply, etc).  The single point of failure assessment is done on many levels – from the storage system (drives, connectors, controllers, power), all the way up to datacenter power and network connection.

    1. Shared storage

Shared storage is permanent database storage that is shared between all or many nodes.  Like anything else shared, shared storage can be a risk as a single point of failure.  Shared storage can also be used as a communication channel.  Types of shared storage include a database (however, using a database for shared storage of a database isn't used), NAS(Network attached storage), SAN(Storage Area Network), external SCSI disk, cloud storage (such as Amazon S3), etc.  The connection to the shared storage is generally duplicated (i.e. two network cards, SCSI controllers, etc).  The single point of failure risk can be addressed by duplicating the data into two identical shared storage systems through a process called mirroring.

    1. Inter-node communication

Inter-node communication refers to how the nodes communicate with each other.  Nodes can communicate through shared storage, or over a network.  When communication is over a network, it is generally a fast network (gigabit or faster), a network that is private to the nodes (accessed only by the nodes), and a network that is called an interconnect.  Each node generally has two network interfaces for improved fault tolerance. Because the network is slower than memory, the inter-node communication of a cluster can make each action slower in a cluster, as opposed to communication on a single database.

  1. Shared Everything: Oracle RAC (Real Application Cluster)
    1. Maintenance Availability

Adding a new node to the cluster doesn't involve cluster downtime, however clients may not completely use the new node until they are told about it. Some actions, like parallel queries, will immediately take advantage of the new node. All cluster nodes can be managed as one, or managed separately, as deemed necessary. Some code upgrades (patches) to the DBMS (database management system) can be applied to the cluster.  This is done one node at a time, in a rolling fashion, so that all nodes can be upgraded without service downtime.

    1. Inter-node Communication

The nodes communicate updates (for cache), locking, etc. between each other over an interconnect (older versions communicated over the file system).  When a node needs to write to a data block, it first sends a request across the cluster to transfer ownership of the data block to itself. It appears that each cluster has a master that tracks which node owns each block, therefor this design doesn't slow a single update as nodes are added to the cluster because ownership transfer only involves threenodes: requester, master, and current owner.  Since only one node can own a block at one time, blocks that are updated often can cause ownership to jump around between nodes, often degrading performance.

    1. Permanent Storage

Oracle RAC uses shared storage.  The shared storage can be NAS, SAN, or SCSI disk.  ASM (Automatic Storage Management) can address the storage single point of failure risk by mirroring data across different failure groups (a set of disks sharing something in common, such as a controller).  Since the file system is shared, all volume management must be designed to work with the cluster.

    1. Client Interaction

Clients connect to the nodes with Virtual IP (VIP) addresses so that if a node fails, the VIP can be redirected to another node. The client needs to know the VIP for all nodes. Clients can use Fast Application Notification (FAN), Fast Connection Failover (FCF), and Transparent Application Failover (TAF) to detect and/or handle node failure.  However, it may require the client to re-do some work, and it may be hard to determine which options are current, and which options will work best.  It can be a huge code change to change a program to detect and redo actions every place that the database is used.  Load balancing is supported by the client having the list of all nodes, and randomly choosing a node for a connection (so that connections are spread across nodes).  Alternatively, the client can get load information from the listener running on the chosen node.  This is done so that the listener can direct the client to another node that has more resources available (which all depends on the connection option chosen.)

    1. Further Information

http://en.wikipedia.org/wiki/Oracle_RAC

http://www.oracle.com/technology/products/database/clustering/index.html

  1. Shared Nothing: MySQL Cluster

A MySQL cluster has 3 node types: data nodes to store the data, SQL nodes to run a MySQL server, and management nodes.  Therefore a minimum of five nodes is generally needed for high availability (1 management, 2 SQL, and 2 data each with 1 replica of the data).

    1. Maintenance Availability

Adding and dropping data nodes requires that the cluster be restarted.  One exception is that data nodes for new partitions can be added while the cluster is running.  If one node fails, failover automatically happens to another node, and any transaction information on the failed node is lost.  The failover only takes sub-second time to happen.  Adding space to an existing database by adding a data node, requires that the data be repartitioned to include the new node (so that the data is spread out over the new set of nodes). Therefore, adding and repartitioning causes downtime and uses a lot of resources. Rolling software updates are supported.

    1. Inter-node Communication

Nodes communicate on a private interconnect.  Because there is only one primary owner of each block (the primary replica), ownership of the block doesn’t need to be transferred, and updates only have to involve 2 nodes (the SQL node processing the request, and the master data node that owns the block).

    1. Permanent Storage

The data is partitioned across the data nodes.  Each piece of data can have multiple replicas (with one node being the master, and the others being slaves) so that the data exists on multiple nodes, and so that there is no single point of failure.  When a node needs to write a block, it can replicate the data either asynchronously or synchronously. If the node is configured to replicate asynchronously, it first replicates the data to all other data nodes that have a replica, and then asks them if they can commit the change.  If all reply affirmatively,the node then sends another message to tell all other nodes to commit the change (two phase commit).  Since the data is replicated to other nodes, each node can replicate to permanent storage asynchronously.

    1. Client Interaction

Clients can connect to the cluster through a load balancer, or through a proxy, so that they don't have to be aware of the individual SQL nodes.  Client reads can be done on the replicated nodes to provide better performance because the access can be spread out over more data nodes.  Read and write lock conflicts can also be reduced if the masters are setup to only handle writes, and the slaves only handle reads. The reduced contention also increases performance.

    1. Further Information

http://en.wikipedia.org/wiki/MySQL_Cluster

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster.html

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-replication.html

http://dev.mysql.com/doc/refman/5.1/en/replication.html

  1. Sharding
    1. Maintenance Availability

Since each shard is independent, all maintenance on one shard has no affect on the rest of the shards.  The only exception is if the data needs to be repartitioned across the shards, then all shards can be affected.  However, repartitioning can be done without affecting availability.  Adding a shard can also require repartitioning.  Sharding takes more work to manage because of the work involved in repartitioning to keep the load balanced, or when adding or removing a shard.  Each shard could have many of the techniques for normal clusters used, but they are used independently, so any bottlenecks are for a smaller data set.  Balancing the load between servers can be difficult.  For example, some users may use more resources, thus using some shards much more than others.

    1. Inter-node Communication

There is no inter-node communication with sharding, and problems on one server can't affect other servers.

    1. Permanent Storage

Sharding splits the data into partitions/shards (i.e. by groups of users) and puts each portion on a completely separate DB system.  This removes write bottlenecks across shards without the potential for data inconsistency.  There is no specific storage architecture since each shared can use any storage architecture.

    1. Client Interaction

The client either has to know the sharding algorithm (so that the client knows which shard to contact), or there must be a shard lookup service that the client uses.  The shard lookup service will have the potential to be both a single point of failure, and a bottleneck (however, it should be used much less so that it isn't a bottleneck).  Determining the correct shard to contact can increase the complexity of the client code.  Also, queries across groups are more difficult, therefore, systems that require a lot of queries across shards may not work well with a sharding system.

    1. Further Information

http://highscalability.com/unorthodox-approach-database-design-coming-shard

  1. Conclusion

Sharding has the highest maintenance availability, MySQL cluster has the lowest maintenance availability.  Sharding has the lowest inter-node communication, Oracle cluster has the highest inter-node communication.  MySQL cluster has the most fault-tolerant permanent storage, sharding has the least fault-tolerant.  MySQL cluster has the least complex client interaction, Oracle RAC has the most complex.  Depending on the exact features needed, and the type of data used, each could be the best solution  However, for this focused comparison, MySQL cluster provides the best combination.

Friday, August 28, 2009

Bill would give president emergency control of Internet | Politics and Law - CNET News

Bill would give president emergency control of Internet | Politics and Law - CNET News: "The new version would allow the president to 'declare a cybersecurity emergency' relating to 'non-governmental' computer networks and do what's necessary to respond to the threat. Other sections of the proposal include a federal certification program for 'cybersecurity professionals,' and a requirement that certain computer systems and networks in the private sector be managed by people who have been awarded that license."

Why should we expect the executive branch to have the expertise and information to know how to best use that control?

Tuesday, August 04, 2009

Defeating Character Frequency Analysis of Substitution Ciphertext

(This is a paper I wrote in the fall of 2008 for CS461: Computer Security I at UIUC. If you are interest in the code leave a comment and contact info.)

Introduction


Character frequency analysis is a common way to crack substitution ciphertext because of the uneven distribution of letters in languages. In order to defeat that analysis, the techniques considered were: synonyms, dropping letters or words, adding extra letters or words, alternate spelling (Leet, spam, etc), and Two cipher letters per plaintext letter. Most of these techniques would be applied before a standard substitution cipher encryption. To evaluate the techniques, test data was obtained from Project Gutenberg(http://www.gutenberg.org/) and consisted of the texts of Psalms (Ps), Pride and Prejudice (P&P), Alice in Wonderland (AiW), and Sherlock Holmes (SH). After applying the technique, the output character frequency was compared to the standard character frequency (http://en.wikipedia.org/wiki/Letter_frequencies and http://www.data-
compression.com/english.html
) and then the standard deviation of the proportion of the changed frequency for each letter compared to the standard character frequency. i.e. If the letter "e" occurred 10% of the time in the changed text then the proportion for "e" is .787 (.10/.12702). That gives an overall measure of how close the changed text character frequency is to the standard character frequency. A standard deviation of 0% would mean that it exactly matches the standard frequency. Initial standard deviations are: Ps: 25%, P&P: 28%, AiW: 25%, SH: 14%. For comparison, an equal distribution produces a standard deviation of 1,439%. In LetterFrequencies.zip there are programs for most of these options and for computing the standard deviation.

Synonyms


Replacing words with synonyms that have a higher standard deviation can help defeat character frequency analysis. The thesaurus used has synonyms in categories, so when a word was looked up, the first category was arbitrarily used. Within the first category, the word that had the largest standard deviation from the normal character frequency was chosen. The chosen work often didn't fit grammatically, so the results are skewed, but this is somewhat offset by only looking in the first category. To make this usable, grammar checking would have to be used. It would probably need a UI that would suggest alternatives (based on grammar rules and the standard deviation). Attacks against this technique could include creating a list of the best synonyms, creating a new standard character frequency distribution for those words, and therefore largely defeat the benefit of this technique. Final standard deviation achieved: Ps: 182%, P&P: 180%, AiW: 135%, SH: 149%. The increase in standard deviation was very significant, but not enough to make a large impact on frequency analysis.

Dropping letters or words


Randomly dropping letters or words could easily hide the meaning from the intended recipient so it would be hard to keep the meaning while still randomly dropping enough to make an impact. An attempt to always drop vowels and rare letters (a,e,i,o,u,x,q) and words (I, the, is, was, of, a, & an) made an insignificant difference (Drop Words Ps: 8%, P&P: 6%, AiW: 8%, SH: 8%; Drop Letters: Ps: 73%, P&P: 91%, AiW: 69%, SH: 70%). Also, some of each letter would need to be dropped to prevent the non-dropped letters from serving as landmarks.

Adding extra letters or words


To be effective, extra letters and words must be randomly placed. They could be chosen based on the inverse of distribution. However, additions that have a significant impact on the standard deviation may garble the message too much. Add letters results(based on inverse of distribution): P&P: 2,154%, SH: 2,795%. The result is readable but takes significantly more work to understand the message.

Alternate spelling (Leet, spam, etc)


Alternate spelling includes l33t (http://en.wikipedia.org/wiki/Leet) and spam spelling (http://en.wikipedia.org/wiki/E-mail_spam#Obfuscating_message_content) and is similar to synonyms, in that the possibility of a new new standard character frequency has to be taken into account. This technique was not implemented or programatically evaluated because it was non-obvious what to use as the standard distribution.

Two cipher letters per plaintext letter


Using two letters in the cipher text for each letter in the plaintext can be a good way to create a flat character distribution. The algorithm is to partition the 676 2-letter combinations based on the standard character frequency. i.e. if the standard frequency for a letter is 5% then it will get 5% of the 2-letter combinations (randomly selected). This doubles the size of the data, could include spaces & punctuation, and makes a much larger key. Note that some letters may get dropped because they occur less than 1/676 (0.15%) of the time. Both 1-gram and 2-gram frequency analysis produce a nearly uniform histogram (variation appears to only be caused by rounding). Two-gram results: P&P=5,117%; SH=5,013%. Therefore this technique was extremely effective with no obvious weaknesses.

Conclusion


Changing the plaintext proved problematic because of inadvertent changes to the meaning, but it did make a significant impact on standard deviation even though it probably wasn't significant enough. Applying all of the plaintext changes to a short message while using safeguards to protect the meaning would probably be effective. Two cipher letters per plaintext letter appears to be the easiest way to defeat character frequency analysis.

Friday, July 31, 2009

Sample chapter from Don't Make Me Think

Sample chapter from Don't Make Me Think: "Why are things always in the last place you look for them?
Because you stop looking when you find them.
—Children’s riddle"

Great info on usability.

Thursday, July 30, 2009

Pressure sensitive keyboard for repeat rate

Would it be helpful to have a pressure sensitive keyboard where the pressure determines the repeat rate (harder=faster, etc)? It seems that it would be especially useful for navigation keys (arrow, page up/down) and keys commonly repeated (enter, backspace, delete, space, tab). Does such a thing exist? Was it tried and how well did it work?

Wednesday, July 29, 2009

Ex-Google CIO breaks his own security rules | InSecurity Complex - CNET News

Ex-Google CIO breaks his own security rules | InSecurity Complex - CNET News: "But 'it's not security's responsibility to go out there and say 'Users want to use Gmail. Let them use it,'' Johnson added. 'If we decide to use Gmail we need to have a project and treat it in a formal way and pay money to do it right.'"

So are you going to have a project for every website that users might find useful? There is no way that you will keep up and the barrier to using useful websites will significantly hurt productivity.

Beyond the hype: Where open source actually saves you money | The Open Road - CNET News

Beyond the hype: Where open source actually saves you money | The Open Road - CNET News: "One way, as Urlocker points out on his blog, is that open source allows enterprise IT projects to succeed or fail with little risk. You know before you pay anything--if you pay anything--that open-source software is going to work, or not."

"Open source tends to offer best-of-breed solutions that aim to do a limited range of functions well, rather than to be all things to all people."

Saturday, July 25, 2009

MIT OpenCourseWare | Electrical Engineering and Computer Science | 6.854J Advanced Algorithms, Fall 2008 | Home

MIT OpenCourseWare |


Electrical Engineering and Computer Science | 6.854J Advanced Algorithms, Fall 2008 | Home
: "This is a graduate course on the design and analysis of algorithms, covering several advanced topics not studied in typical introductory courses on algorithms. It is especially designed for doctoral students interested in theoretical computer science."

Monday, June 22, 2009

Pattern Formatter for java.util.logging

Pattern Formatter for java.util.logging: "The Log Formatter will use the following formatting tokens:

* LoggerName %LOGGER%
* Level %LEVEL%
* Time %TIME%
* Message %MESSAGE%
* SourceClassName %SOURCECLASS%
* SourceMethodName %SOURCEMETHOD%
* Exception Message %EXCEPTION%"

Thursday, May 07, 2009

Save Icon

The most common save icon is a picture of a floppy disk.

But with floppy disks not being used anymore will most programs change the picture? In 1-2 decades what picture will be used for save icons?

Thursday, April 16, 2009

FBI seizures highlight law as cloud impediment | The Wisdom of Clouds - CNET News

FBI seizures highlight law as cloud impediment | The Wisdom of Clouds - CNET News: "The articles report that the FBI raided at least two Texas data centers last week, serving search and seizure warrants for computing equipment, including servers, routers, and storage. The FBI was seeking equipment that may have been involved in fraudulent business practices by a handful of small VoIP vendors.
The problem is that they didn't just grab the systems belonging to the VoIP vendors, but grabbed hundreds of servers serving a wide variety of businesses, the vast majority of which had never dealt with or even heard of the companies under investigation, according to Threat Level. Company officials interviewed complained of losing millions of dollars in lost revenue and equipment with no warning whatsoever."

"If the court upholds that servers can be seized despite no direct warrants being served on the owners of those servers (or the owners of the software and data housed on those servers), then imagine what that means for hosting your business in a cloud shared by thousands or millions of other users."

"Here is what I argue must happen:
The law must respect digital assets in the same way that it respects physical assets. This means that search and seizure rules should apply to data and software run on third party infrastructure (or wholly owned infrastructure run in third party facilities) in the same way that they protect my home and personal property if I rent an apartment in a building housing hundreds of tenants. The fact that one tenant commits a crime is not enough for the civil liberties of all of the other tenants to be null and void. I argue the same goes for digital assets "renting" space in the cloud.
The federal government should adopt a cloud computing bill of rights. (Here is a rudimentary example.) Each state should as well. Declare loud and clear that you suffer little or no loss of rights if you choose to run your business in the cloud over running it within your own facilities. Repeal or revise the laws that make it impossible for foreign businesses and governments to allow communications and data to pass within U.S. borders (including relevant elements of the Patriot Act).
It is time for our policy makers to step up and really understand the influence that the Internet and cloud computing will have on the future growth of this country. It is scary how little technical understanding most congressional and senate members have. However, that alone is not an excuse for not grasping the policy gaps that are brought about as our commerce and society rely increasingly upon Internet-based services."

Wednesday, January 28, 2009

Caching strategy

Generally Least Recently Used (LRU) is considered a good strategy for removing items from the cache. Instead it might be useful to measure the cost (i.e. time) to produce a cache item, and how many hits that item gets and then remove the item that has the lowest value of hits * cost.

A quick Google search didn't find anything like this. Comments?

Saturday, November 01, 2008

Lego Mindstorm NXT program

I temporarily had a Lego Mindstorm NXT and wrote the following program for it using LeJOS.  Enjoy:
import java.util.*;
import lejos.nxt.*;
import lejos.util.*;

public class Dingbot {
public static void main (String[] aArg) throws Exception {
new ButtonWatcherThread(new TimerListener() {
public void timedOut() {
running = !running;
}
}).start();
motorRight.smoothAcceleration(true);
motorLeft.smoothAcceleration(true);
final float circumference = 17.5f;
final int distanceBuffer = 50;
final int rotateDistance = 80;
UltrasonicSensor distSensor = new UltrasonicSensor(SensorPort.S4);
LightSensor litSensor = new LightSensor(SensorPort.S2);
litSensor.setFloodlight(true);

LCD.drawString("Hi",1,1);
int startTime = (int)System.currentTimeMillis();
boolean blinkLight = true;
Random random = new Random();
while(startTime + 60*1000 > (int)System.currentTimeMillis()) {
if(running) {
int cmDist = distSensor.getDistance();
while(running && cmDist > distanceBuffer) {
LCD.drawString("   ",1,2);
LCD.drawInt(cmDist,1,2);
int distance = cmDist - distanceBuffer;
if(distance > 5) {
distance = 3;
}
int degrees = (int)(360 * distance / circumference);
setSpeed();
motorRight.rotate(degrees, true);
motorLeft.rotate(degrees, true);
cmDist = distSensor.getDistance();
litSensor.setFloodlight(blinkLight);
blinkLight = !blinkLight;
LCD.drawInt(litSensor.readNormalizedValue(),1,4);
}
boolean turnPositive = random.nextBoolean();
while(running && cmDist <>
LCD.drawString("   ",1,2);
LCD.drawInt(cmDist,1,2);
setSpeed();
if(turnPositive) {
motorRight.rotate(180);
} else {
motorRight.rotate(-180);
}
cmDist = distSensor.getDistance();
}
} else {
try {
Thread.sleep(500);
} catch(InterruptedException ie) {
//Ignore
}
}
}
}

public static void setSpeed() {
int soundLevel = sound.readValue();
LCD.drawString("   ",1,3);
LCD.drawInt(soundLevel,1,3);
int speed = (100 - soundLevel) * 9;
motorRight.setSpeed(speed);
motorLeft.setSpeed(speed);
}
static Motor motorRight = Motor.B;
static Motor motorLeft = Motor.C;
private static SoundSensor sound = new SoundSensor(SensorPort.S1);
private static boolean running = false;
}

class ButtonWatcherThread extends Thread {
public ButtonWatcherThread(TimerListener tl) {
this.tl = tl;
this.setDaemon(true);
}
public void run() {
wasPressedLast = touchSensor.isPressed();
while(true) {
boolean isPressed = touchSensor.isPressed();
if(wasPressedLast != isPressed) {
if(isPressed) {
tl.timedOut();
}
wasPressedLast = isPressed;
}
try {
Thread.sleep(100);
} catch(InterruptedException ie) {
//Ignore
}
}
}
private TimerListener tl;
private boolean wasPressedLast;
private TouchSensor touchSensor = new TouchSensor(SensorPort.S3);
}

Friday, October 31, 2008

Thursday, October 02, 2008

Digital money

I think digital money should have these features:
  1. Hard to counterfit
  2. Hard to duplicate (and use twice)
  3. Easy to transfer to another person with minimal involvement of a 3rd party
  4. Easy to use, track, store, verify, etc
  5. Transaction fees must be very, very low.
  6. Minimal Big Brother tracking.
Here is one idea for how to accomplish that:
A user would go to a website and enter payment info and the website would send the user an image file. The file would be a picture of the amount of money that it represents (for ease of use) and would have comments in it that record the originating website, amount and any other needed info. The website would generate the image so that is unique (i.e. the bits are unique) and would store the image bits and amount in a database. The file would have to be large enough so that a brute force attempt to generate random files would be too expensive.
When the user needs to split an amount (make change), they would send it to the website and request the new amounts. The website would verify that it was valid (in their DB), send new image files to the user, and delete the record for the old file.
When the user needs to send money to another person they would just get the correct amount (using splitting or multiple files) and send those files to the other person. The other person would send them to the website to get their own files so that the old file is invalid and they have a valid file. Once the other person saw that they were valid they would complete the transaction with the first person.
Verification consists of sending to the website to get an new file. If the file was invalid the webiste would just give an error message. Anytime a user receives money from another user they must verify it (get a new file) to make sure that it is valid.
The user could cash out the file by going to the website and requesting payment.
Transaction fees would be limited to the initial payment and requesting payment.

The user could have requested multiple files with various amounts to make it easier.
If someone tries to make a copy of the file in order to use it twice only the first copy would be validated.

The user would need to remember to not make any changes to the image (i.e. rotation) because that would invalidate the file. Therefore the image formats that Windows Explorer can rotate should be avoided or the files always marked as read-only.

This idea could work for trading anything (different currencies, stocks, etc). Each website would have its own file type but websites or users could translate between the different file types.

Comments?

Wednesday, August 06, 2008

Spell check what others write

I think programs should spell and grammer check what others wrote. i.e. If I am reading a webpage it should have some way of marking bad spelling or grammer with a way to see suggested improvements. That would allow me to understand the intent better when the grammer or spelling is poor.

Do you agree that adding this feature to browers, IM clients, etc would be useful?

Monday, June 16, 2008

XSL: Preserving line feeds, tabs, and spaces in data while still wrapping text

I had data in XML that had line feeds, spaces, and tabs that I wanted to preserve (so I couldn't use

) but I also wanted the lines to wrap when the side of the screen was reached (so I couldn't use

).

After some research and help from a co-worker (Patricia Eromosele) I made the XSL template below that achieves that. An example of how to call it follows:


<p>
<xsl:call-template name="prewrap">
<xsl:with-param name="text" select="text"/>
</xsl:call-template>
</p>



<xsl:template name="prewrap">
<xsl:param name="text" select="."/>
<xsl:variable name="spaceIndex" select="string-length(substring-before($text, ' '))"/>
<xsl:variable name="tabIndex" select="string-length(substring-before($text, '&#x09;'))"/>
<xsl:variable name="lineFeedIndex" select="string-length(substring-before($text, '&#xA;'))"/>
<xsl:choose>
<xsl:when test="$spaceIndex = 0 and $tabIndex = 0 and $lineFeedIndex = 0"><!-- no special characters left -->
<xsl:value-of select="$text"/>
</xsl:when>
<xsl:when test="$spaceIndex > $tabIndex and $lineFeedIndex > $tabIndex"><!-- tab -->
<xsl:value-of select="substring-before($text, '&#x09;')"/>
<xsl:text disable-output-escaping="yes">&amp;nbsp;</xsl:text>
<xsl:text disable-output-escaping="yes">&amp;nbsp;</xsl:text>
<xsl:text disable-output-escaping="yes">&amp;nbsp;</xsl:text>
<xsl:text disable-output-escaping="yes">&amp;nbsp;</xsl:text>
<xsl:call-template name="prewrap">
<xsl:with-param name="text" select="substring-after($text,'&#x09;')"/>
</xsl:call-template>
</xsl:when>
<xsl:when test="$spaceIndex > $lineFeedIndex and $tabIndex > $lineFeedIndex"><!-- line feed -->
<xsl:value-of select="substring-before($text, '&#xA;')"/>
<br/>
<xsl:call-template name="prewrap">
<xsl:with-param name="text" select="substring-after($text,'&#xA;')"/>
</xsl:call-template>
</xsl:when>
<xsl:when test="$lineFeedIndex > $spaceIndex and $tabIndex > $spaceIndex"><!-- two spaces -->
<xsl:value-of select="substring-before($text, ' ')"/>
<xsl:text disable-output-escaping="yes">&amp;nbsp;</xsl:text>
<xsl:text disable-output-escaping="yes">&amp;nbsp;</xsl:text>
<xsl:call-template name="prewrap">
<xsl:with-param name="text" select="substring-after($text,' ')"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise><!-- should never happen -->
<xsl:value-of select="$text"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>

Tip: Java thread dumps on linux

To get a thread dump in Java on linux run "pkill -3 java"

Ethical programming

It should be noted that no ethically-trained software engineer would ever consent to write a DestroyBaghdad procedure. Basic professional ethics would instead require him to write a DestroyCity procedure, to which Baghdad could be given as a parameter.
-- Nathaniel Borenstein
(Passed on by Reid Nimz.)

Friday, February 15, 2008

Rename multiple files

Options for renaming multiple files on Windows
  1. Windows explorer: It only renames them to xxxx (y).ext where the only thing that varies is y (an incrementing number).
  2. Various utilities available for download: I don't like to risk running some unknown program for something that seems so simple.
  3. Command line for in do command. To prepend "3-" to every directory that starts with "60-" run: for /d %i in (60-*) do ren %i 3-%i

Monday, February 11, 2008

Geek Status

In my IM client I set my status to the following. I admin that I'm a geek but it's fun!
echo "JANS" | sed "s/A//" | sed "s/J/Ava/" | sed "s/N/il/" | sed "s/S/able/"

Monday, February 04, 2008

3 paychecks in February

I get paid bi-weekly on Fridays and this year I get 3 paychecks in February. I was wondering how often that happens and with Excel's help I found that it happens every 56 years. I also found that a month with 3 paychecks will be 5-7 months later than the last month with 3 paychecks. In 2010 there will be 3 months with 3 paychecks.

Tuesday, January 29, 2008

Explaing zero-based arrays

Sometimes people have a hard time understand zero-based arrays but I just realized that ages are zero-based.

The first year of life someone is 0 years old.
The 2nd year of life someone is 1 year old.
Etc.

Friday, January 25, 2008

Genome assembler

I have been listening to "The Genome War" and it seems like the the difficulty of assembling the pieces from the Shotgun sequencing isn't has hard as described in the book.

Data points:
Base pairs in the Human Genome: 3 billion.
Different types of base pairs: 4
Sequence length: 2,000
Maximum number of sequences per whole genome: 1,500,000
Number of reads to get coverage: 12
Total number of sequences to assemble: 18,000,000
Total ends to match: 36,000,000
Length of an end: 50 base pairs (13 bytes can store 52 base pairs)

Technique:
So if you made a database table with a column to store the end code (from the outside) and a sequence column to store the sequence that for that end, you would have 2 rows per sequence for a total of 36,000,000 rows. Each row would have 513 bytes of info for a total of 17 GB of data.
Then make a table called End48 with a Key48 column for 48 base pairs (12 bytes) and a Value52 for 52 base pairs (13 bytes).
Fill that table so that each end in the main table is put in the Value52 column and and the first 48 base pairs of that end are put in the Key48 column.
Continue stripping down the end base pairs into tables End44, End40, End36, ..., End8, End4. (You actually only have to go as far down as the DB can easily sort the key column -- I think current DBs wouldn't need any End tables).
Then you can start by sorting the key column of the smallest End table and continue until all of the sequences are sorted by their end pieces. Finally go through the sorted results and join any sequences that have the same end value. A lot of this processing could occur as the data comes in so you wouldn't have to wait until the end to do all of the computing.

This assumes that sequences always end at the same point which I am not sure is true. If that isn't true then you might end up with multiple copies of the genome in the final assembly stage but that would not be a problem.

That technique could be easily done with a few thousand dollars of hardware today and I assume it would not have taken a super-computer in 1999-2000.

So what am I missing because apparently the problem was much more difficult to solve?

Reference: http://en.wikipedia.org/wiki/Human_genome_project

Tuesday, January 22, 2008

What kind of programming work do you like best?

When interviewing candidates I would ask them the following question:
What kind of programming work do you like best?

They always (3-4 interviews) gave then same answer: Creating new software.
Since that isn't the answer that I would give I asked my co-workers and between 4 of us we had 3 different answers:
  1. Creating something new
  2. Debugging problems.
  3. Improving (adding features, improving performance, etc).
Would you give one of those answers? Do you have a different answer?

Wednesday, January 02, 2008

Zooming to better use desktop space

On a real-life desktop I will often move toward an object to work with it. i.e. When I want to work with the calendar I will move closer to it. That allows better use of my desktop because some objects are "zoomed out" and therefore take less space (in my field of view).

Has anyone ever tried to incorporate a similar idea into a GUI. Icons, minimize/maximize, and resizing are somewhat like that but not quite enough. It would be nice to resize something to a "zoomed out size". i.e. I could set a size for my email program when it isn't in use and it would be shrunk to fit in that space when desired.

Monday, December 17, 2007

Telecommuting

I telecommute one day per week and used to telecommute 2 days per week. I would like to telecommute more but my employer isn't very open to doing that (except short-term for a specific project). I don't know of any Java enterprise developers that do major telcommuting (4 days per week or more). I have also not seen very many telecommuting jobs for enterprise Java developers. Why don't more exist?

Technorati Profile

Wednesday, December 12, 2007

Subversion: Moving tags

I wish Subversion had a better way of moving tags. The only way that I know to move a tag is to remove the file from the tag and then copy it again. Revision tree browsers don't seem to handle that very well. This also requires keeping the directory structure under the trunk and tag in sync.

Use case: We have thousands of "maps" and we want to tag which version of each map is the "production" version. We need to be able to easily get the production version of all maps.

Can anyone suggest a better way to address our use case?
I have considered properties also but then we can't get the prod version of all files easily. Merging to the tag doesn't appear to be very easy either.

Monday, November 26, 2007

IntelliJ default class javadoc comment

Why does the default IntelliJ default class javadoc comment use non-standard syntax? Instead of creating a line with "User: jstauffer" it could create a line with "@author jstauffer". The other lines that it creates (Date and Time) probably don't have javadoc syntax to use but why not use the javadoc syntax when available?

For reference here is an example:
/**
* Created by IntelliJ IDEA.
* User: jstauffer
* Date: Nov 13, 2007
* Time: 11:15:10 AM
* To change this template use File | Settings | File Templates.
*/

Naming abstract classes

Normally I see abstract classes named as AbstractClass. But when there are many abstract classes that leads to needing to type at least AbstractC when using code completion. Therefore I suggest that abstract classes be named ClassAbstract so that code completion is more usable. "Abstract" is a modifier on the class name anyway and it doesn't need to be in the primary position. What do you think?



I know this isn't a huge deal -- I am just suggesting it as a minor improvement.

Wednesday, November 21, 2007

Defaults have consequences: Microsoft Word margins

How much paper would be saved each year it Microsoft changed the default margins in Word to 1" on each side (or even 1/2")?

Friday, October 19, 2007

Autoboxing quiz

Does widening or autoboxing happen first? Which param method will be called from the autoboxing method? If one param method is commented out then it will have no problem calling the other.

private static void autoboxing() {
Integer integer = new Integer(2);
param(integer);
}

private static void param(int param) {
System.out.println("param(int " + param + ")");
}

private static void param(Object param) {
System.out.println("param(Object " + param + ")");
}

Thursday, October 18, 2007

Website idea

If you make your fortune with this idea, send me a few bucks.
Make a website where the user can keep a list of things that they need/want to buy.
Allow easy management of that list (through web, WAP, SMS, etc).

Then....
Work with stores so they the user can get a list of all of the items in the list that the given store has. The store could even allow printing the list from a kiosk in the store (registry computers) and would include the location, price, etc. The stores would benefit because people would probably buy more if they remembered what they wanted to buy and they would buy items that they didn't know the given store had. You could then charge the store a small fee so that you wouldn't need ads on the website or user fees.

For stores that don't support that use GPS to determine the store and try to send a list of items that the store has (scrape website?).

Does a website like this exist?

Wednesday, October 17, 2007

Haiku

One connection good.
Then two connections better.
Or maybe not so.

Tuesday, October 09, 2007

Form post with URL paramters (after question mark)

I had a URL that I wanted to POST with JavaScript so a made a

<form name="doForm" action="" method="post">
</form>
and used JavaScript to set the action and call submit. But Tomcat showed the request as a GET. I found that adding a dummy form input fixed the problem:

<form name="doForm" action="" method="post">
<input type="hidden" name="blah" value="blah"/><!-- need at least 1 input or the form uses method="get" even when "post" is specified. -->
</form>

Exception plan

In projects on which I have worked I find that a few common Exceptions are used over and over again and tend to lose their meaning. Does anyone have good resources for exception patterns?

Thursday, October 04, 2007

Subversion notes

We recently converted from cvs to Subversion and overall I like it much better than cvs but:

  • I wish cvs2svn converted .cvsignore files to svn:ignore properties.
  • I wish it handled tags and branches natively instead of using the copy method. Tags and branches are core parts of version control and I think something is lost by implementing them non-natively.
  • I hope svn:externals will be able to point to a specific file soon.
  • I wish update and status didn't print so much extra info about the externals. For status I have to run svn status --ignore-externals --show-updates %* | grep -v "^[XI]" | grep -v "^Status against revision" just to get what I want.

Tuesday, October 02, 2007

Converting .cvsignore file to Subversion svn:ignore property

If you have converted a cvs repository to a Subversion repository and need to convert .cvsignore files to subversion svn:ignore properties then the following script should be useful to you. Note that it probably doesn't handle spaces in filenames.

#!/bin/bash

find . -name ".cvsignore" -print | sed "s/\/.cvsignore//" | tee ignoredirs.txt

for i in `cat ignoredirs.txt`
do
echo Processing $i/.cvsignore
svn propset svn:ignore -F "$i/.cvsignore" "$i"
svn remove "$i/.cvsignore"

done

Friday, September 28, 2007

Garbage collecting file system

Why don't current file systems implement garbage collection so that deleting a directory takes constant time (regardless of how many descendants it has)?

Google has done that See section 4.4 (page 8) of http://209.85.163.132/papers/gfs-sosp2003.pdf

Subversion: Setting svn:keywords property

If you converted from cvs and need to set the svn:keywords property you can use the following script (doesn't handle spaces). I wrote this before I realized that cvs2svn already set the property and I therefore didn't need it.


#!/bin/bash

echo Seaching...
# Put Keywords up one directory so the 2nd-last greps don't search it.
grep -R -H "\$Date:" * | grep "\$Date:" > ../Keywords.txt
grep -R -H "\$Revision:" * | grep "\$Revision:" >> ../Keywords.txt
grep -R -H "\$Author:" * | grep "\$Author:" >> ../Keywords.txt
grep -R -H "\$Id:" * | grep "\$Id:" >> ../Keywords.txt

echo Setting Properties...
cat ../Keywords.txt | sed "s/:.*$//" | sed "s/^/\"/" | sed "s/$/\"/" | xargs -L 1 svn propset svn:keywords "Date Revision Author Id"
echo Results:
svn status

Tuesday, September 25, 2007

Remove empty Subversion directories

I couldn't find any scripts that would prune empty Subversion controlled directories so I wrote the following. Enjoy!

#!/usr/bin/bash

for i in `find . -type d -print -name ".svn" | grep -v "/\.svn"`
do
if [ -z "`ls -A $i | grep -v '\.svn' `" ]
then
echo Removing $i
ls -A $i
svn remove $i
fi
done
echo Results:
svn status


Update:
The script above doesn't handle spaces in the filename so use the following two scripts if you have spaces in any filename:


svnPrune.sh
#!/bin/bash

echo Searching...
find . -type d ! -path "*\.svn*" ! -name "\." ! -name "tags" ! -name "branches" -exec /svn/svnPruneDirectory.sh {} \;

echo Results:
svn status

svnPruneDirectory.sh
#!/bin/bash
directory="$*"
#echo Processing $directory ------------------------------------------------------

files=`ls -A "$directory" | grep -v '\.svn'`
#echo files=$files
childrenCount=`ls -A "$directory" | grep -v '\.svn' | wc -l`
#echo $directory: $childrenCount
if [ "$childrenCount" -eq "0" ]
then
echo Removing $directory
# echo $files
svn remove "$directory"
fi

Monday, September 24, 2007

Terminal window with 3 frames

I think it would be useful to have a terminal window with 3 frames.
  1. Output frame (At the top): Shows all program ouput and input/commands that the have been processed.
  2. Input frame (in the middle): Allows the user to enter standard input. Only show when a program is waiting for standard input.
  3. Command frame (at the bottom): Allows the user to enter commands.
Benefits:
  • It would be obvious when I wrote a command that is waiting for user input.
  • It could allow you to provide standard input via a command after running another command.
This could probably also be done in one frame with coloring to show the different parts. The prompt would change color when waiting for standard input (as opposed to waiting for the process to finish).

What do you think? Does something like this exist?

Friday, September 07, 2007

Java applet to check Java version

Out product only requires Java 1.4 (or higher) but we want to require Java 1.5. About 10% of applet requests used Java 1.4 so we want a way to notify the users who use Java 1.4 that they should upgrade. Searching with Google I couldn't find any such applet so here is my version.


JavaVersionCheck.class:
import javax.swing.*;

/**
* Prints the <code>message</code> if the java version doesn't match
* <code>javaVersionPattern</code>.
* @author James Stauffer
*/
public class JavaVersionCheck extends JApplet {
public void init() {
try {
String javaVersionPattern = getParameter("javaVersionPattern");
String message = getParameter("message");
if(VersionMatches(javaVersionPattern)) {
add(new JLabel(message));
}
} catch (Exception e) {
e.printStackTrace();
}
}

public static boolean VersionMatches(String javaVersionPattern) {
String javaVersion = System.getProperty("java.version");
return javaVersion.matches(javaVersionPattern);
}

public static void main(String args[]) {
if(args.length > 0) {
String javaVersionPattern = args[0];
if(VersionMatches(javaVersionPattern)) {
if(args.length > 1) {
System.out.println(args[1]);
} else {
System.out.println("Matches " + javaVersionPattern);
}
}
}
}
}


HTML:

<applet height="30" width="300" code="JavaVersionCheck.class">
<param name="javaVersionPattern" value="^1\.4.*">
<param name="message" value="Java 1.4 is not supported. Please upgrade to 1.5.">
</applet>

Friday, August 10, 2007

toString() cost

Do you assume that toString() on any given object has a low cost? I do. Is that assumption valid? If it has a high cost should that normally be changed? What are valid reasons to make a toString() method with a high cost?

Thursday, August 02, 2007

Dynamic/static language

What if a language allow both static and dynamic types. That might allow the best of both worlds. i.e.:
String str = "Hello";
var temp = str;
temp = 10;

1. Would that be possible?
2. Would that be beneficial?

Wednesday, June 27, 2007

Debugger for *nix pipe commands

As I build *nix piped commands I find that I want to see the output of one stage to verify correctness before building the next stage but I don't want to re-run each stage. Does anyone know of a program that will help with that? It would keep the output of the last stage automatically to use for any new stages. I usually do this by sending the result of each command to a temporary file but it would be nice for a program to handle this.

Monday, June 25, 2007

10 things to do today

Here is my list of 10 things to do today:
  • Code
  • Goto #1

Friday, May 18, 2007

Telecommuting tips

I have been telecommutting 1-2 days/week for a few years so I present the following tips to those who want to telecommute:
  1. Practice communicating by email (especially when it is easier to talk in person about something) so you get better at writing emails that are complete and easy to understand.
  2. Practice doing as many normal activities on your telecommuting days as possible. Minimize "waiting until you get back to the office".
  3. Practice minimizing differences between your telecommuting and in-office days so that your telecommuting doesn't negatively affect the company.
For me #1 was the hardest to learn.

Monday, May 07, 2007

Method return values for null objects

Would it be useful to be able to provide method return value for null objects?
For a List the null return values might be:
get(int) : null
size() : 0
iterator() would be an empty iterator

That would allow the following code that has less null checks.

List items = null;

if(something) {
items = ...
}

for(int index = 0; index < items.size(); index++) {
Object obj = items.get(index);
}