My thoughts as an enterprise Java developer.
Thursday, December 31, 2009
Official Google Blog: Browser Size: a tool to see how others view your website
Thursday, November 05, 2009
TLS (SSL) compromised!
Monday, November 02, 2009
Open source as an antitrust strategy | The Open Road - CNET News
As an added benefit, it's a great way for companies to collaborate without running afoul of antitrust laws. It's collusion without the collusion."
Wednesday, October 21, 2009
Google Public Policy Blog: Vint Cerf on the importance of keeping the Internet open
Government regulation will mean that it isn't open! Once the government takes control, it will only increase its control, and the regulation will limit good experimentation.
Tuesday, October 13, 2009
[JavaSpecialists 177] - Logging Part 3 of 3
import org.apache.log4j.*;
import org.apache.log4j.spi.*;

import java.io.*;

public class MultipleLineLayout extends Layout {
  public String format(LoggingEvent event) {
    Object o = event.getMessage();
    if (o instanceof Throwable) {
      return format((Throwable) o);
    }
    return format(String.valueOf(event.getMessage()));
  }

  private String format(Throwable t) {
    StringWriter out = new StringWriter();
    PrintWriter pw = new PrintWriter(out);
    t.printStackTrace(pw);
    pw.close();
    return format(out.toString());
  }

  private String format(String s) {
    return s.
        replaceAll("\r", "").     // remove Windows carriage returns
        replaceAll("\n*$", "").   // remove trailing newline chars
        replaceAll("\n", " |>> ") // replace newline with |>>
        + "\n";
  }

  public boolean ignoresThrowable() {
    return false;
  }

  public void activateOptions() {
    // not necessary
  }
}
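For reference, a minimal sketch (my own, not from the newsletter) of wiring the layout up programmatically:

import org.apache.log4j.ConsoleAppender;
import org.apache.log4j.Logger;

public class MultipleLineLayoutDemo {
  public static void main(String[] args) {
    // Attach the layout to a console appender on the root logger.
    Logger root = Logger.getRootLogger();
    root.addAppender(new ConsoleAppender(new MultipleLineLayout()));

    // A multi-line message comes out as a single log line with |>> separators.
    root.info("first line\nsecond line\nthird line");

    // A Throwable passed as the message exercises the Throwable branch of format().
    root.error(new RuntimeException("boom"));
  }
}

The layout should also be usable from log4j.properties (e.g. log4j.appender.xxx.layout=MultipleLineLayout), since the class has a default constructor.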
Thursday, September 24, 2009
Microsoft bashes Google's Chrome-in-IE plan | Beyond Binary - CNET News
However, some took Microsoft to task for criticizing plug-ins, noting that Redmond itself has more than a few.
'Microsoft scared of security of plug-ins. Uninstall Silverlight now,' Mozilla's Dion Almaer wrote in a Twitter posting."
It's not like MS or IE has a stellar security record.
Tuesday, September 08, 2009
Distributed Relational Database Architecture
Distributed Relational Database Architecture
James Stauffer
August 1, 2009
Prepared for CS425 Summer 2009
Table of Contents
I. Introduction
A. Focus
B. Implementation
II. Relevant Features
A. Availability
B. Failover
C. Replication
D. Partitioning
E. Single Point of Failure
F. Shared Storage
G. Inter-node Communication
III. Shared Everything: Oracle RAC (Real Application Cluster)
A. Maintenance availability
B. Inter-node communication
C. Permanent Storage
D. Client interaction
E. Further information
IV. Shared Nothing: MySQL Cluster
A. Maintenance availability
B. Inter-node communication
C. Permanent Storage
D. Client interaction
E. Further information
V. Sharding
A. Maintenance availability
B. Inter-node communication
C. Permanent Storage
D. Client interaction
E. Further information
VI. Conclusion
- Introduction
- Focus
This paper will cover distributed relational database architecture, primarily as it affects availability to the client. Features that allow the database to run more complex queries or handle larger data sets (scalability) will not be covered. All discussion will focus on systems with multiple servers (called nodes) involved.
- Implementations
This paper will cover three basic types of implementations: Shared Everything, Shared Nothing, and Sharding. Shared everything implementations don't actually share everything; they share only the data storage between nodes, while shared nothing systems have nodes that are completely independent (including storage) and self-sufficient. Systems that present the nodes as one unit, or as interchangeable, are generally called clusters; this includes both shared everything and shared nothing. Sharding is similar to shared nothing in that the nodes share nothing and are independent; however, sharding nodes are seen by the client as completely separate and independent (but related) databases. Shared everything systems need to synchronize data access and can do so either through a shared storage device or with direct communication. Oracle RAC will be the example reviewed for shared everything, and MySQL Cluster will be the example reviewed for shared nothing. Sharding doesn’t have specific support in any major database, so it will be reviewed generically with no example. For each type of architecture, the following will be addressed: maintenance availability, inter-node communication, permanent storage, and client interaction.
- Relevant Features
Some major aspects relevant to highly available distributed relational database architecture include: availability, failover, replication, partitioning, single point of failure, shared storage, and inter-node communication.
- Availability
Availability refers to the percentage of time that a service is responsive and working correctly (synonymous with uptime). It is usually expressed as a percentage, with 99.999% (five nines, roughly five minutes of downtime per year) considered a very high level of availability. Downtime refers to the time that a service is not available (i.e. the opposite of availability). Both planned and unplanned events can affect availability, therefore both need to be addressed. All types of database systems try to minimize unplanned downtime to some extent, but they vary in how much maintenance can be done without downtime. Techniques to minimize unplanned downtime include: automatic transfer of connections from failed to working nodes, automatic restart of failed nodes/services, and optimal checkpointing to minimize restart time. Techniques to minimize planned downtime center on supporting maintenance operations while the service is running: applying patches and updates, reconfiguring, adding or removing nodes, and adding or removing databases.
- Failover
Failover refers to what happens when one node fails, and how another node (or nodes) starts handling the service that had been provided by the failed node. Failover may be mostly transparent to the client, or it can require the client to connect to another node. The client can have a front-end, such as a JDBC driver, that can handle some of the cluster activity (e.g. connecting to another node if the cluster doesn't handle that automatically). Whether or not the client is automatically redirected to another node, any transactions open on the failed node will also fail.
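As a rough illustration of that kind of client-side front-end, here is a hypothetical helper (my sketch, not tied to any particular driver) that walks a list of node URLs until one accepts a connection:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;

public class FailoverConnector {
  private final List<String> nodeUrls; // hypothetical JDBC URLs, one per node

  public FailoverConnector(List<String> nodeUrls) {
    this.nodeUrls = nodeUrls;
  }

  // Try each node in turn until one accepts the connection.
  public Connection connect(String user, String password) throws SQLException {
    SQLException last = null;
    for (String url : nodeUrls) {
      try {
        return DriverManager.getConnection(url, user, password);
      } catch (SQLException e) {
        last = e; // node is down or unreachable; try the next one
      }
    }
    throw last != null ? last : new SQLException("no nodes configured");
  }
}

Note that this only covers getting a new connection; any transaction that was open on the failed node still has to be redone by the application.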
- Replication
Replication refers to the copying of data to another system or node, especially so that when a node fails, no data is lost and the data is immediately available. Asynchronous replication happens in the background, so the client receives the result of its request much more quickly, but it carries the risk that the node will fail before all data is replicated. Synchronous replication happens before the result is sent back to the client, but makes each client request take longer. There can also be multiple levels of replication: replication can be done synchronously to the memory of two nodes to provide a fast response (and minimize the chance of data loss), while asynchronous replication is made to permanent storage.
- Partitioning
Partitioning refers to splitting the data into parts and distributing each part to a separate node. This allows each node to have less data to handle. One problem, however, is that a single request may need to contact many nodes to access all the data it needs. Therefore, partitioning may not be appropriate for all types of databases. When a node has less data to handle, it can cache data in memory more effectively and its indexes can be more efficient, which increases performance.
- Single point of failure
A single point of failure is a single component that is used by the whole system and whose failure brings down the whole system. A system with a single point of failure is much more at risk of failure, therefore single points of failure should be avoided. In a highly available system, as many components as possible are duplicated to help avoid as many single points of failure as possible. Even when nodes are duplicated, components on each node are often duplicated to reduce the chance of failure on an individual node (e.g. network interface, storage interface, power supply, etc). The single point of failure assessment is done on many levels, from the storage system (drives, connectors, controllers, power) all the way up to datacenter power and network connection.
- Shared storage
Shared storage is permanent database storage that is shared between all or many nodes. Like anything else that is shared, shared storage is a risk as a single point of failure. Shared storage can also be used as a communication channel. Types of shared storage include NAS (Network Attached Storage), SAN (Storage Area Network), external SCSI disk, cloud storage (such as Amazon S3), etc. (a database could in theory serve as shared storage, but using a database to store a database isn't done in practice). The connection to the shared storage is generally duplicated (e.g. two network cards, SCSI controllers, etc). The single point of failure risk can be addressed by duplicating the data onto two identical shared storage systems through a process called mirroring.
- Inter-node communication
Inter-node communication refers to how the nodes communicate with each other. Nodes can communicate through shared storage or over a network. When communication is over a network, it is generally a fast network (gigabit or faster) that is private to the nodes (accessed only by them), commonly called an interconnect. Each node generally has two network interfaces for improved fault tolerance. Because the network is slower than memory, the inter-node communication of a cluster can make each action slower than it would be on a single-node database.
- Shared Everything: Oracle RAC (Real Application Cluster)
- Maintenance Availability
Adding a new node to the cluster doesn't involve cluster downtime; however, clients may not fully use the new node until they are told about it. Some actions, like parallel queries, will immediately take advantage of the new node. All cluster nodes can be managed as one, or managed separately, as deemed necessary. Some code upgrades (patches) to the DBMS (database management system) can be applied to the cluster one node at a time, in a rolling fashion, so that all nodes can be upgraded without service downtime.
- Inter-node Communication
The nodes communicate updates (for the cache), locking, etc. between each other over an interconnect (older versions communicated through the file system). When a node needs to write to a data block, it first sends a request across the cluster to transfer ownership of the data block to itself. Each cluster appears to have a master that tracks which node owns each block, therefore this design doesn't slow a single update as nodes are added to the cluster, because an ownership transfer only involves three nodes: the requester, the master, and the current owner. Since only one node can own a block at a time, blocks that are updated often can cause ownership to jump around between nodes, often degrading performance.
- Permanent Storage
Oracle RAC uses shared storage. The shared storage can be NAS, SAN, or SCSI disk. ASM (Automatic Storage Management) can address the storage single point of failure risk by mirroring data across different failure groups (a set of disks sharing something in common, such as a controller). Since the file system is shared, all volume management must be designed to work with the cluster.
- Client Interaction
Clients connect to the nodes with virtual IP (VIP) addresses so that if a node fails, the VIP can be redirected to another node. The client needs to know the VIPs for all nodes. Clients can use Fast Application Notification (FAN), Fast Connection Failover (FCF), and Transparent Application Failover (TAF) to detect and/or handle node failure. However, failover may require the client to redo some work, and it may be hard to determine which options are current and which will work best. Changing a program to detect and redo actions every place the database is used can be a huge code change. Load balancing is supported by the client having the list of all nodes and randomly choosing a node for each connection (so that connections are spread across nodes). Alternatively, the client can get load information from the listener running on the chosen node, so that the listener can direct the client to another node that has more resources available (all of which depends on the connection option chosen).
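For illustration only, a thin-driver connection that lists both VIPs with load balancing and connect-time failover turned on might look roughly like this (host names, service name, and credentials are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;

public class RacConnectionExample {
  public static void main(String[] args) throws Exception {
    // Requires the Oracle JDBC driver on the classpath.
    String url = "jdbc:oracle:thin:@(DESCRIPTION="
        + "(ADDRESS_LIST=(LOAD_BALANCE=on)(FAILOVER=on)"
        + "(ADDRESS=(PROTOCOL=TCP)(HOST=node1-vip)(PORT=1521))"
        + "(ADDRESS=(PROTOCOL=TCP)(HOST=node2-vip)(PORT=1521)))"
        + "(CONNECT_DATA=(SERVICE_NAME=orclsvc)))";
    Connection con = DriverManager.getConnection(url, "user", "password");
    System.out.println("connected: " + con.getMetaData().getURL());
    con.close();
  }
}

This only spreads and fails over new connections; FAN/FCF/TAF would still be needed to react to a node that fails while a connection is in use.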
- Further Information
http://en.wikipedia.org/wiki/Oracle_RAC
http://www.oracle.com/technology/products/database/clustering/index.html
- Shared Nothing: MySQL Cluster
A MySQL cluster has 3 node types: data nodes to store the data, SQL nodes to run a MySQL server, and management nodes. Therefore a minimum of five nodes is generally needed for high availability (1 management, 2 SQL, and 2 data each with 1 replica of the data).
- Maintenance Availability
Adding and dropping data nodes requires that the cluster be restarted; one exception is that data nodes for new partitions can be added while the cluster is running. If one node fails, failover to another node happens automatically, and any transaction information on the failed node is lost. The failover takes only sub-second time. Adding space to an existing database by adding a data node requires that the data be repartitioned to include the new node (so that the data is spread out over the new set of nodes); therefore, adding and repartitioning causes downtime and uses a lot of resources. Rolling software updates are supported.
- Inter-node Communication
Nodes communicate on a private interconnect. Because there is only one primary owner of each block (the primary replica), ownership of the block doesn’t need to be transferred, and updates only have to involve 2 nodes (the SQL node processing the request, and the master data node that owns the block).
- Permanent Storage
The data is partitioned across the data nodes. Each piece of data can have multiple replicas (with one node being the master and the others being slaves) so that the data exists on multiple nodes and there is no single point of failure. When a node needs to write a block, replication to the other replicas can be configured to happen synchronously or asynchronously. With synchronous replication, the node first replicates the data to all other data nodes that have a replica and asks them whether they can commit the change; if all reply affirmatively, the node then sends another message telling all of the other nodes to commit the change (a two-phase commit). Since the data is replicated to other nodes, each node can write to permanent storage asynchronously.
- Client Interaction
Clients can connect to the cluster through a load balancer or through a proxy, so they don't have to be aware of the individual SQL nodes. Client reads can be done against the replicas to provide better performance, because the access can be spread out over more data nodes. Read and write lock conflicts can also be reduced if the masters are set up to handle only writes and the slaves handle only reads; the reduced contention further increases performance.
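A minimal sketch of that read/write split on the client side, assuming one URL for writes and a list of replica URLs for reads (all URLs hypothetical; a real deployment would more likely put a load balancer or proxy in front):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;
import java.util.Random;

public class ReadWriteRouter {
  private final String writeUrl;       // master / SQL node that accepts writes
  private final List<String> readUrls; // replicas that serve reads
  private final Random random = new Random();

  public ReadWriteRouter(String writeUrl, List<String> readUrls) {
    this.writeUrl = writeUrl;
    this.readUrls = readUrls;
  }

  public Connection connectForWrite(String user, String pass) throws SQLException {
    return DriverManager.getConnection(writeUrl, user, pass);
  }

  public Connection connectForRead(String user, String pass) throws SQLException {
    // Spread read connections over the replicas.
    String url = readUrls.get(random.nextInt(readUrls.size()));
    return DriverManager.getConnection(url, user, pass);
  }
}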
- Further Information
http://en.wikipedia.org/wiki/MySQL_Cluster
http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster.html
http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-replication.html
http://dev.mysql.com/doc/refman/5.1/en/replication.html
- Sharding
- Maintenance Availability
Since each shard is independent, maintenance on one shard has no effect on the rest of the shards. The one exception is when the data needs to be repartitioned across the shards; then all shards can be affected. However, repartitioning can be done without affecting availability. Adding a shard can also require repartitioning. Sharding takes more work to manage because of the work involved in repartitioning to keep the load balanced, or when adding or removing a shard. Each shard can use many of the techniques of normal clusters, but they are used independently, so any bottlenecks apply to a smaller data set. Balancing the load between servers can be difficult; for example, some users may use more resources, making some shards much busier than others.
- Inter-node Communication
There is no inter-node communication with sharding, and problems on one server can't affect other servers.
- Permanent Storage
Sharding splits the data into partitions/shards (e.g. by groups of users) and puts each portion on a completely separate database system. This removes write bottlenecks across shards without the potential for data inconsistency. There is no specific storage architecture, since each shard can use any storage architecture.
- Client Interaction
The client either has to know the sharding algorithm (so that it knows which shard to contact), or there must be a shard lookup service that the client uses. A shard lookup service has the potential to be both a single point of failure and a bottleneck (though it should be queried far less often than the shards themselves, so it is unlikely to be a bottleneck). Determining the correct shard to contact can increase the complexity of the client code. Also, queries across groups are more difficult; therefore, systems that require a lot of queries across shards may not work well with sharding.
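As a sketch of the first option, where the client knows the sharding algorithm, a simple hash-based router might look like this (the user-id key and shard URLs are hypothetical):

import java.util.List;

public class ShardRouter {
  private final List<String> shardUrls; // one JDBC URL per shard

  public ShardRouter(List<String> shardUrls) {
    this.shardUrls = shardUrls;
  }

  // Map a user id to the shard that holds that user's data.
  public String shardFor(long userId) {
    int index = (int) Math.floorMod(userId, (long) shardUrls.size());
    return shardUrls.get(index);
  }
}

A simple modulo scheme like this forces repartitioning whenever the number of shards changes, which is one reason a lookup service (or consistent hashing) is often preferred.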
- Further Information
http://highscalability.com/unorthodox-approach-database-design-coming-shard
- Conclusion
Sharding has the highest maintenance availability; MySQL Cluster has the lowest. Sharding has the least inter-node communication; Oracle RAC has the most. MySQL Cluster has the most fault-tolerant permanent storage; sharding has the least. MySQL Cluster has the least complex client interaction; Oracle RAC has the most complex. Depending on the exact features needed and the type of data used, each could be the best solution. However, for this focused comparison, MySQL Cluster provides the best combination.
Friday, August 28, 2009
Bill would give president emergency control of Internet | Politics and Law - CNET News
Why should we expect the executive branch to have the expertise and information to know how to best use that control?
Tuesday, August 04, 2009
Defeating Character Frequency Analysis of Substitution Ciphertext
Introduction
Character frequency analysis is a common way to crack substitution ciphertext because of the uneven distribution of letters in languages. In order to defeat that analysis, the techniques considered were: synonyms, dropping letters or words, adding extra letters or words, alternate spelling (Leet, spam, etc), and two cipher letters per plaintext letter. Most of these techniques would be applied before a standard substitution cipher encryption. To evaluate the techniques, test data was obtained from Project Gutenberg (http://www.gutenberg.org/) and consisted of the texts of Psalms (Ps), Pride and Prejudice (P&P), Alice in Wonderland (AiW), and Sherlock Holmes (SH). After applying a technique, the output character frequency was compared to the standard character frequency (http://en.wikipedia.org/wiki/Letter_frequencies and http://www.data-compression.com/english.html), and the standard deviation of the proportion of each letter's changed frequency to its standard frequency was computed. For example, if the letter "e" occurred 10% of the time in the changed text, then the proportion for "e" is .787 (.10/.12702). That gives an overall measure of how close the changed text's character frequency is to the standard character frequency; a standard deviation of 0% would mean that it exactly matches the standard frequency. Initial standard deviations are: Ps: 25%, P&P: 28%, AiW: 25%, SH: 14%. For comparison, an equal distribution produces a standard deviation of 1,439%. LetterFrequencies.zip contains programs for most of these options and for computing the standard deviation.
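The measure described above can be computed along these lines (a sketch of the approach, not the code in LetterFrequencies.zip; the standard-frequency values are the Wikipedia ones cited above):

import java.nio.file.Files;
import java.nio.file.Paths;

public class FrequencyDeviation {
  // English letter frequencies (%) for a..z, from the Wikipedia table.
  private static final double[] STANDARD = {
    8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966, 0.153,
    0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987, 6.327, 9.056,
    2.758, 0.978, 2.360, 0.150, 1.974, 0.074
  };

  public static void main(String[] args) throws Exception {
    String text = new String(Files.readAllBytes(Paths.get(args[0]))).toLowerCase();

    // Count a..z occurrences in the (changed) text.
    long[] counts = new long[26];
    long total = 0;
    for (char c : text.toCharArray()) {
      if (c >= 'a' && c <= 'z') {
        counts[c - 'a']++;
        total++;
      }
    }

    // Proportion of each letter's observed frequency to its standard frequency,
    // e.g. 10% observed for 'e' gives 0.10 / 0.12702 = 0.787.
    double[] proportion = new double[26];
    double mean = 0;
    for (int i = 0; i < 26; i++) {
      double observed = 100.0 * counts[i] / total;
      proportion[i] = observed / STANDARD[i];
      mean += proportion[i] / 26;
    }

    // Standard deviation of the proportions; 0% means a perfect match.
    double variance = 0;
    for (double p : proportion) {
      variance += (p - mean) * (p - mean) / 26;
    }
    System.out.printf("standard deviation: %.0f%%%n", 100 * Math.sqrt(variance));
  }
}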
Synonyms
Replacing words with synonyms that have a higher standard deviation can help defeat character frequency analysis. The thesaurus used has synonyms in categories, so when a word was looked up, the first category was arbitrarily used. Within the first category, the word that had the largest standard deviation from the normal character frequency was chosen. The chosen word often didn't fit grammatically, so the results are skewed, but this is somewhat offset by only looking in the first category. To make this usable, grammar checking would have to be used; it would probably need a UI that suggests alternatives (based on grammar rules and the standard deviation). Attacks against this technique could include creating a list of the best synonyms and building a new standard character frequency distribution for those words, largely defeating the benefit of this technique. Final standard deviations achieved: Ps: 182%, P&P: 180%, AiW: 135%, SH: 149%. The increase in standard deviation was very significant, but not enough to make a large impact on frequency analysis.
Dropping letters or words
Randomly dropping letters or words could easily hide the meaning from the intended recipient, so it would be hard to keep the meaning while still randomly dropping enough to make an impact. An attempt to always drop vowels and rare letters (a, e, i, o, u, x, q) and common words (I, the, is, was, of, a, & an) made an insignificant difference (Drop Words: Ps: 8%, P&P: 6%, AiW: 8%, SH: 8%; Drop Letters: Ps: 73%, P&P: 91%, AiW: 69%, SH: 70%). Also, some of each letter would need to be dropped to prevent the non-dropped letters from serving as landmarks.
Adding extra letters or words
To be effective, extra letters and words must be randomly placed. They could be chosen based on the inverse of the standard distribution. However, additions that have a significant impact on the standard deviation may garble the message too much. Add-letters results (based on the inverse of the distribution): P&P: 2,154%, SH: 2,795%. The result is readable but takes significantly more work to understand.
Alternate spelling (Leet, spam, etc)
Alternate spelling includes l33t (http://en.wikipedia.org/wiki/Leet) and spam spelling (http://en.wikipedia.org/wiki/E-mail_spam#Obfuscating_message_content) and is similar to synonyms in that the possibility of a new standard character frequency has to be taken into account. This technique was not implemented or programmatically evaluated because it was not obvious what to use as the standard distribution.
Two cipher letters per plaintext letter
Using two letters in the ciphertext for each letter in the plaintext can be a good way to create a flat character distribution. The algorithm is to partition the 676 two-letter combinations based on the standard character frequency; for example, if the standard frequency for a letter is 5%, then it gets 5% of the two-letter combinations (randomly selected). This doubles the size of the data, could include spaces & punctuation, and makes for a much larger key. Note that some letters may get dropped because they occur less than 1/676 (0.15%) of the time. Both 1-gram and 2-gram frequency analysis produce a nearly uniform histogram (variation appears to be caused only by rounding). Two-gram results: P&P: 5,117%; SH: 5,013%. Therefore this technique was extremely effective, with no obvious weaknesses.
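A sketch of the assignment and encryption steps, assuming only the letters a-z are enciphered and using a simple floor when dividing up the combinations (an illustration of the idea, not the program used for the results above):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class TwoLetterCipher {
  // Same standard frequency table (%) for a..z as in the deviation program above.
  private static final double[] STANDARD = {
    8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966, 0.153,
    0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987, 6.327, 9.056,
    2.758, 0.978, 2.360, 0.150, 1.974, 0.074
  };

  private final Map<Character, List<String>> codes = new HashMap<>();
  private final Random random = new Random();

  public TwoLetterCipher() {
    // Shuffle all 676 two-letter combinations.
    List<String> pairs = new ArrayList<>();
    for (char a = 'a'; a <= 'z'; a++) {
      for (char b = 'a'; b <= 'z'; b++) {
        pairs.add("" + a + b);
      }
    }
    Collections.shuffle(pairs, random);

    // Give each letter a share of the pairs proportional to its standard frequency.
    // Flooring means letters rarer than 1/676 (q, z) get no pair and are dropped,
    // as noted above, and a few pairs go unused.
    int next = 0;
    for (int i = 0; i < 26; i++) {
      int share = (int) (STANDARD[i] / 100 * 676);
      codes.put((char) ('a' + i), new ArrayList<>(pairs.subList(next, next + share)));
      next += share;
    }
  }

  // Replace each plaintext letter with a randomly chosen pair from its share.
  public String encrypt(String plaintext) {
    StringBuilder out = new StringBuilder();
    for (char c : plaintext.toLowerCase().toCharArray()) {
      List<String> options = codes.get(c);
      if (options != null && !options.isEmpty()) {
        out.append(options.get(random.nextInt(options.size())));
      }
      // non-letters and letters with no assigned pair are dropped
    }
    return out.toString();
  }
}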
Conclusion
Changing the plaintext proved problematic because of inadvertent changes to the meaning, and while it did increase the standard deviation noticeably, the increase probably wasn't large enough. Applying all of the plaintext changes to a short message, with safeguards to protect the meaning, would probably be effective. Two cipher letters per plaintext letter appears to be the easiest way to defeat character frequency analysis.
Friday, July 31, 2009
Sample chapter from Don't Make Me Think
Because you stop looking when you find them.
—Children’s riddle"
Great info on usability.
Thursday, July 30, 2009
Pressure sensitive keyboard for repeat rate
Wednesday, July 29, 2009
Ex-Google CIO breaks his own security rules | InSecurity Complex - CNET News
So are you going to have a project for every website that users might find useful? There is no way that you will keep up, and the barrier to using useful websites will significantly hurt productivity.
Beyond the hype: Where open source actually saves you money | The Open Road - CNET News
"Open source tends to offer best-of-breed solutions that aim to do a limited range of functions well, rather than to be all things to all people."
Saturday, July 25, 2009
MIT OpenCourseWare | Electrical Engineering and Computer Science | 6.854J Advanced Algorithms, Fall 2008 | Home
Electrical Engineering and Computer Science | 6.854J Advanced Algorithms, Fall 2008 | Home: "This is a graduate course on the design and analysis of algorithms, covering several advanced topics not studied in typical introductory courses on algorithms. It is especially designed for doctoral students interested in theoretical computer science."
Monday, June 22, 2009
Pattern Formatter for java.util.logging
* LoggerName %LOGGER%
* Level %LEVEL%
* Time %TIME%
* Message %MESSAGE%
* SourceClassName %SOURCECLASS%
* SourceMethodName %SOURCEMETHOD%
* Exception Message %EXCEPTION%"
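A minimal sketch of how such a formatter could substitute those tokens (my own guess, not the code from the linked post):

import java.io.PrintWriter;
import java.io.StringWriter;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

public class PatternFormatter extends Formatter {
  private final String pattern;
  // Note: SimpleDateFormat isn't thread-safe; fine for a sketch.
  private final SimpleDateFormat timeFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

  public PatternFormatter(String pattern) {
    this.pattern = pattern;
  }

  public String format(LogRecord record) {
    String exception = "";
    if (record.getThrown() != null) {
      StringWriter sw = new StringWriter();
      record.getThrown().printStackTrace(new PrintWriter(sw, true));
      exception = sw.toString();
    }
    return pattern
        .replace("%LOGGER%", String.valueOf(record.getLoggerName()))
        .replace("%LEVEL%", record.getLevel().getName())
        .replace("%TIME%", timeFormat.format(new Date(record.getMillis())))
        .replace("%MESSAGE%", formatMessage(record)) // resolves {0}-style parameters
        .replace("%SOURCECLASS%", String.valueOf(record.getSourceClassName()))
        .replace("%SOURCEMETHOD%", String.valueOf(record.getSourceMethodName()))
        .replace("%EXCEPTION%", exception)
        + System.getProperty("line.separator");
  }
}

It could be attached to a handler with something like handler.setFormatter(new PatternFormatter("%TIME% %LEVEL% [%LOGGER%] %MESSAGE% %EXCEPTION%")).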
Thursday, May 07, 2009
Save Icon
Thursday, April 16, 2009
FBI seizures highlight law as cloud impediment | The Wisdom of Clouds - CNET News
The problem is that they didn't just grab the systems belonging to the VoIP vendors, but grabbed hundreds of servers serving a wide variety of businesses, the vast majority of which had never dealt with or even heard of the companies under investigation, according to Threat Level. Company officials interviewed complained of losing millions of dollars in lost revenue and equipment with no warning whatsoever."
"If the court upholds that servers can be seized despite no direct warrants being served on the owners of those servers (or the owners of the software and data housed on those servers), then imagine what that means for hosting your business in a cloud shared by thousands or millions of other users."
"Here is what I argue must happen:
The law must respect digital assets in the same way that it respects physical assets. This means that search and seizure rules should apply to data and software run on third party infrastructure (or wholly owned infrastructure run in third party facilities) in the same way that they protect my home and personal property if I rent an apartment in a building housing hundreds of tenants. The fact that one tenant commits a crime is not enough for the civil liberties of all of the other tenants to be null and void. I argue the same goes for digital assets "renting" space in the cloud.
The federal government should adopt a cloud computing bill of rights. (Here is a rudimentary example.) Each state should as well. Declare loud and clear that you suffer little or no loss of rights if you choose to run your business in the cloud over running it within your own facilities. Repeal or revise the laws that make it impossible for foreign businesses and governments to allow communications and data to pass within U.S. borders (including relevant elements of the Patriot Act).
It is time for our policy makers to step up and really understand the influence that the Internet and cloud computing will have on the future growth of this country. It is scary how little technical understanding most congressional and senate members have. However, that alone is not an excuse for not grasping the policy gaps that are brought about as our commerce and society rely increasingly upon Internet-based services."
Monday, March 30, 2009
Wednesday, January 28, 2009
Caching strategy
A quick Google search didn't find anything like this. Comments?