My thoughts as an enterprise Java developer.

Tuesday, September 08, 2009

Distributed Relational Database Architecture

James Stauffer

August 1, 2009

Prepared for CS425 Summer 2009


Table of Contents

 

        I. Introduction

                A.  Focus

                B.  Implementations

        II.  Relevant Features

                A.  Availability

                B.  Failover

                C.  Replication

                D.  Partitioning

                E.  Single Point of Failure

                F.  Shared Storage

                G.  Inter-node Communication

        III.  Shared Everything: Oracle RAC (Real Application Clusters)

                 A. Maintenance availability

                 B.  Inter-node communication

                 C.  Permanent Storage

                 D.  Client interaction

                 E.  Further information 

        IV.  Shared Nothing: MySQL Cluster

                 A. Maintenance availability

                 B.  Inter-node communication

                 C.  Permanent Storage

                 D.  Client interaction

                 E.  Further information                

        V.  Sharding      

                 A. Maintenance availability

                 B.  Inter-node communication

                 C.  Permanent Storage

                 D.  Client interaction

                 E.  Further information                 

         VI.  Conclusion


  I. Introduction

    A. Focus

This paper will cover distributed relational database architecture, primarily as it affects availability to the client.  Features that allow the database to run more complex queries or handle larger data sets (scalability) will not be covered.  All discussion will focus on systems that involve multiple servers (called nodes).

    B. Implementations

This paper will cover three basic types of implementations: Shared Everything, Shared Nothing, and Sharding.  Shared everything implementations don't actually share everything; they share the data storage between nodes, while shared nothing systems have nodes that are completely independent (including storage) and self-sufficient.  Systems that present the nodes as one system, or as interchangeable, are generally called clusters; both shared everything and shared nothing qualify.  Sharding is similar to shared nothing in that the nodes share nothing and are independent; however, the client sees shards as completely separate and independent (but related) systems.  Shared everything systems need to synchronize data access, and can do so either through a shared storage device or with direct communication.  Oracle RAC will be the example reviewed for shared everything, and MySQL Cluster will be the example reviewed for shared nothing.  Sharding doesn't have specific support in any major database, so it will be reviewed generically with no example.  For each type of architecture, the following will be addressed: maintenance availability, inter-node communication, permanent storage, and client interaction.

  II. Relevant Features

Some major aspects relevant to highly available distributed relational database architecture include: availability, failover, replication, partitioning, single point of failure, shared storage, and inter-node communication.

    A. Availability

Availability refers to the percentage of time that a service is responsive and working correctly (synonymous with uptime).  It is usually expressed as a percentage, with 99.999% ("five nines") considered a very high level of availability.  Downtime refers to the time that a service is not available (i.e. the opposite of availability).  Both planned and unplanned events can affect availability, therefore both need to be addressed.  All types of database systems try to minimize unplanned downtime to some extent, but vary in how much maintenance can be done without downtime.  Techniques to minimize unplanned downtime include automatic transfer of connections from failed to working nodes, automatic restart of failed nodes/services, and optimal checkpointing to minimize restart time.  Techniques to minimize planned downtime center on supporting maintenance operations while the service is running: applying patches and updates, reconfiguring, and adding or removing nodes or databases.
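As a quick worked example (plain arithmetic, nothing database-specific), five nines leaves only about 5.26 minutes of allowed downtime per year:

    public class Availability {
        public static void main(String[] args) {
            double availability = 0.99999;            // "five nines"
            double minutesPerYear = 365.25 * 24 * 60; // about 525,960 minutes
            double allowed = minutesPerYear * (1 - availability);
            // Prints roughly 5.26 minutes of allowed downtime per year
            System.out.printf("Allowed downtime: %.2f minutes/year%n", allowed);
        }
    }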

    B. Failover

Failover refers to what happens when one node fails and another node (or nodes) starts handling the service that had been provided by the failed node.  Failover may be mostly transparent to the client, or it can require the client to connect to another node.  The client can have a front-end, such as a JDBC driver, that handles some of the cluster activity (e.g. connecting to another node if the cluster doesn't do that automatically).  Whether or not the client is automatically redirected to another node, any transactions open on the failed node will fail.
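To make the front-end idea concrete, here is a minimal sketch of client-side failover in Java, assuming the client simply holds a list of JDBC URLs, one per node; the class and its retry policy are illustrative, not any particular vendor's driver API:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.List;

    public class FailoverConnector {
        private final List<String> nodeUrls; // one JDBC URL per cluster node (hypothetical)

        public FailoverConnector(List<String> nodeUrls) {
            this.nodeUrls = nodeUrls;
        }

        // Try each node in turn; the first node that accepts the connection wins.
        public Connection connect(String user, String password) throws SQLException {
            SQLException last = null;
            for (String url : nodeUrls) {
                try {
                    return DriverManager.getConnection(url, user, password);
                } catch (SQLException e) {
                    last = e; // node unreachable or down: try the next one
                }
            }
            throw last != null ? last : new SQLException("No nodes configured");
        }
    }

Note that even with this logic, any transaction that was open on the failed node is gone; the caller still has to detect the failure and re-run its work.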

    C. Replication

Replication refers to copying data to another system or node, especially so that when a node fails, no data is lost and the data is immediately available.  Asynchronous replication happens in the background, so the client receives the result of its request much more quickly, but it carries the risk that the node will fail before all data is replicated.  Synchronous replication happens before the result is sent back to the client, but makes each client request take longer.  There can also be multiple levels of replication: data can be replicated synchronously to the memory of two nodes, providing a fast response while minimizing the chance of data loss, and then replicated asynchronously to permanent storage.
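A minimal sketch of the difference between the two modes, with a hypothetical Replica interface standing in for whatever actually ships the change to another node:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class Replicator {
        interface Replica { void apply(byte[] change); } // hypothetical transport

        private final List<Replica> replicas;
        private final ExecutorService background = Executors.newSingleThreadExecutor();

        Replicator(List<Replica> replicas) { this.replicas = replicas; }

        // Synchronous: the client's request blocks until every replica has
        // the change, so a node failure afterward loses nothing.
        void writeSynchronously(byte[] change) {
            for (Replica r : replicas) r.apply(change);
        }

        // Asynchronous: return immediately and let replicas catch up in the
        // background; faster for the client, but the change is lost if this
        // node fails before the background task runs.
        void writeAsynchronously(byte[] change) {
            background.submit(() -> { for (Replica r : replicas) r.apply(change); });
        }
    }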

    D. Partitioning

Partitioning refers to splitting the data into parts and distributing each part to a separate node, so that each node has less data to handle.  One problem, however, is that a single request may need to contact many nodes to access all the data it needs, so partitioning may not be appropriate for all types of databases.  When a node has less data to handle, it can cache data in memory more effectively and maintain more efficient indexes, and therefore it can perform better.
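A sketch of the simplest partitioning scheme (hash the key, mod the node count); this is for illustration only, since real systems use more elaborate schemes so that adding a node doesn't move every row:

    public class Partitioner {
        private final int nodeCount;

        public Partitioner(int nodeCount) { this.nodeCount = nodeCount; }

        // Each key deterministically maps to one node, so each node holds
        // roughly 1/nodeCount of the data. floorMod handles negative hashes.
        public int nodeFor(Object key) {
            return Math.floorMod(key.hashCode(), nodeCount);
        }
    }

A query that touches many keys may map to many nodes, which is exactly the multi-node request problem described above.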

    E. Single Point of Failure

A single point of failure is a single component that the whole system depends on: if that component fails, the whole system fails.  A system with a single point of failure is much more at risk of failure, therefore single points of failure should be avoided.  In a highly available system, as many components as possible are duplicated to eliminate as many single points of failure as possible.  Even when nodes are duplicates, components within each node are often duplicated to reduce the chance of failure of an individual node (network interface, storage interface, power supply, etc.).  The single point of failure assessment is done on many levels, from the storage system (drives, connectors, controllers, power) all the way up to datacenter power and network connection.

    F. Shared Storage

Shared storage is permanent database storage that is shared between all or many nodes.  Like anything else shared, shared storage is a single point of failure risk.  Shared storage can also be used as a communication channel.  Types of shared storage include NAS (network-attached storage), SAN (storage area network), external SCSI disk, and cloud storage (such as Amazon S3); a database could in principle serve as shared storage as well, though in practice a database isn't used as the storage for another database.  The connection to the shared storage is generally duplicated (e.g. two network cards or two SCSI controllers).  The single point of failure risk can be addressed by duplicating the data onto two identical shared storage systems through a process called mirroring.

    G. Inter-node Communication

Inter-node communication refers to how the nodes communicate with each other.  Nodes can communicate through shared storage or over a network.  When communication is over a network, it is generally a fast network (gigabit or faster) that is private to the nodes (accessed only by the nodes), called an interconnect.  Each node generally has two network interfaces for improved fault tolerance.  Because the network is slower than memory, inter-node communication can make each action slower on a cluster than on a single database server.

  III. Shared Everything: Oracle RAC (Real Application Clusters)

    A. Maintenance Availability

Adding a new node to the cluster doesn't involve cluster downtime; however, clients may not fully use the new node until they are told about it.  Some actions, like parallel queries, will immediately take advantage of the new node.  All cluster nodes can be managed as one, or managed separately, as deemed necessary.  Some code upgrades (patches) to the DBMS (database management system) can be applied to the cluster one node at a time, in a rolling fashion, so that all nodes are upgraded without service downtime.

    B. Inter-node Communication

The nodes communicate cache updates, locking, etc. between each other over an interconnect (older versions communicated over the file system).  When a node needs to write to a data block, it first sends a request across the cluster to transfer ownership of the data block to itself.  It appears that each cluster has a master that tracks which node owns each block; therefore this design doesn't slow down an individual update as nodes are added to the cluster, because an ownership transfer only involves three nodes: the requester, the master, and the current owner.  Since only one node can own a block at a time, blocks that are updated often can cause ownership to jump between nodes, often degrading performance.
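The three-party protocol can be sketched in miniature; this is not Oracle's actual Cache Fusion code, just the idea of a master that records ownership and hands back the previous owner so the block can be shipped directly:

    import java.util.HashMap;
    import java.util.Map;

    public class BlockMaster {
        private final Map<Long, Integer> ownerByBlock = new HashMap<>();

        // The requester asks the master for ownership; the master records the
        // requester as the new owner and returns the previous owner (null if
        // unowned), so old owner and requester can transfer the block directly.
        public synchronized Integer transferOwnership(long blockId, int requesterNode) {
            return ownerByBlock.put(blockId, requesterNode);
        }
    }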

    C. Permanent Storage

Oracle RAC uses shared storage.  The shared storage can be NAS, SAN, or SCSI disk.  ASM (Automatic Storage Management) can address the storage single point of failure risk by mirroring data across different failure groups (a set of disks sharing something in common, such as a controller).  Since the file system is shared, all volume management must be designed to work with the cluster.

    D. Client Interaction

Clients connect to the nodes through virtual IP (VIP) addresses, so that if a node fails, its VIP can be redirected to another node; the client needs to know the VIPs of all nodes.  Clients can use Fast Application Notification (FAN), Fast Connection Failover (FCF), and Transparent Application Failover (TAF) to detect and/or handle node failure.  However, failover may require the client to redo some work, and it can be hard to determine which options are current and which will work best; changing a program to detect and redo actions every place the database is used can be a huge code change.  Load balancing is supported by the client having the list of all nodes and randomly choosing a node for each connection, so that connections are spread across nodes.  Alternatively, the client can get load information from the listener running on the chosen node, so that the listener can direct the client to another node with more resources available (all depending on the connection options chosen).
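The connect-time load balancing described above amounts to little more than this sketch (the VIP list is illustrative; in practice an Oracle client configures the address list in its connect descriptor rather than in code):

    import java.util.List;
    import java.util.Random;

    public class ConnectTimeBalancer {
        private final List<String> vips; // one VIP per node
        private final Random random = new Random();

        public ConnectTimeBalancer(List<String> vips) { this.vips = vips; }

        // Picking a node uniformly at random spreads connections evenly
        // across the cluster over many clients.
        public String chooseNode() {
            return vips.get(random.nextInt(vips.size()));
        }
    }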

    E. Further Information

http://en.wikipedia.org/wiki/Oracle_RAC

http://www.oracle.com/technology/products/database/clustering/index.html

  IV. Shared Nothing: MySQL Cluster

A MySQL Cluster has three node types: data nodes that store the data, SQL nodes that run a MySQL server, and management nodes.  A minimum of five nodes is therefore generally needed for high availability (one management node, two SQL nodes, and two data nodes, each holding one replica of the data).

    A. Maintenance Availability

Adding and dropping data nodes requires that the cluster be restarted; one exception is that data nodes for new partitions can be added while the cluster is running.  If one node fails, failover to another node happens automatically in sub-second time, though any transaction information on the failed node is lost.  Adding space to an existing database by adding a data node requires that the data be repartitioned to include the new node (so that the data is spread out over the new set of nodes); the repartitioning causes downtime and uses a lot of resources.  Rolling software updates are supported.

    B. Inter-node Communication

Nodes communicate over a private interconnect.  Because there is only one primary owner of each block (the primary replica), ownership of the block doesn't need to be transferred, and an update only involves two nodes: the SQL node processing the request and the data node that owns the block.

    C. Permanent Storage

The data is partitioned across the data nodes.  Each piece of data can have multiple replicas (with one node being the master and the others being slaves) so that the data exists on multiple nodes and there is no single point of failure.  When a node needs to write a block, it replicates the change to the other data nodes synchronously: it first sends the change to all other data nodes that hold a replica and asks whether they can commit it; if all reply affirmatively, it sends another message telling them to commit the change (a two-phase commit).  Since the data is replicated to other nodes, each node can write to its permanent storage asynchronously.
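The two-phase commit described above looks roughly like this sketch; the Replica interface is hypothetical, and in MySQL Cluster the real protocol runs inside the data nodes rather than in application code:

    import java.util.ArrayList;
    import java.util.List;

    public class TwoPhaseCommit {
        interface Replica {
            boolean prepare(byte[] change); // phase 1: "can you commit this?"
            void commit();                  // phase 2a: make it permanent
            void rollback();                // phase 2b: discard the change
        }

        static boolean write(List<Replica> replicas, byte[] change) {
            List<Replica> prepared = new ArrayList<>();
            for (Replica r : replicas) {               // phase 1: ask everyone
                if (r.prepare(change)) {
                    prepared.add(r);
                } else {
                    for (Replica p : prepared) p.rollback(); // any "no" aborts
                    return false;
                }
            }
            for (Replica r : replicas) r.commit();     // phase 2: all said yes
            return true;
        }
    }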

    D. Client Interaction

Clients can connect to the cluster through a load balancer or a proxy, so that they don't have to be aware of the individual SQL nodes.  Reads can be served from the replica nodes to provide better performance, because access is spread out over more data nodes.  Read and write lock conflicts can also be reduced if the masters are set up to handle only writes and the slaves only reads; the reduced contention further increases performance.
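A sketch of that read/write split from the client's side, assuming one connection to a master and several to slaves (the field names are illustrative, and a production router would also handle pooling and failures):

    import java.sql.Connection;

    public class ReadWriteRouter {
        private final Connection master;   // handles all writes
        private final Connection[] slaves; // handle all reads
        private int next = 0;

        public ReadWriteRouter(Connection master, Connection[] slaves) {
            this.master = master;
            this.slaves = slaves;
        }

        public Connection forWrite() { return master; }

        // Rotate reads across the replicas to spread the load.
        public Connection forRead() {
            Connection c = slaves[next];
            next = (next + 1) % slaves.length;
            return c;
        }
    }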

    E. Further Information

http://en.wikipedia.org/wiki/MySQL_Cluster

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster.html

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-replication.html

http://dev.mysql.com/doc/refman/5.1/en/replication.html

  V. Sharding

    A. Maintenance Availability

Since each shard is independent, maintenance on one shard has no effect on the rest of the shards.  The only exception is when the data needs to be repartitioned across the shards, which can affect all of them; repartitioning can, however, be done without affecting availability.  Adding a shard can also require repartitioning.  Sharding takes more work to manage because of the repartitioning needed to keep the load balanced and to add or remove shards.  Each shard can independently use many of the techniques of normal clusters, so any bottlenecks apply only to a smaller data set.  Balancing the load between servers can be difficult; for example, some users may use more resources, making some shards much busier than others.

    B. Inter-node Communication

There is no inter-node communication with sharding, and problems on one server can't affect other servers.

    C. Permanent Storage

Sharding splits the data into partitions/shards (e.g. by groups of users) and puts each portion on a completely separate database system.  This removes cross-shard write bottlenecks without the potential for data inconsistency.  There is no specific storage architecture, since each shard can use any storage architecture.

    D. Client Interaction

The client either has to know the sharding algorithm (so that it knows which shard to contact), or there must be a shard lookup service that the client uses.  A shard lookup service is potentially both a single point of failure and a bottleneck (though it should be consulted rarely enough that it doesn't become a bottleneck).  Determining the correct shard to contact can increase the complexity of the client code.  Queries across shards are also more difficult, so systems that require many cross-shard queries may not work well with sharding.
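Both lookup strategies fit in a few lines; the hash scheme and the directory map below are illustrative stand-ins for a real sharding algorithm and a real lookup service:

    import java.util.Map;

    public class ShardLookup {
        private final int shardCount;
        private final Map<String, String> directory; // userId -> shard JDBC URL

        public ShardLookup(int shardCount, Map<String, String> directory) {
            this.shardCount = shardCount;
            this.directory = directory;
        }

        // Option 1: the client embeds the sharding algorithm itself.
        public int shardByAlgorithm(String userId) {
            return Math.floorMod(userId.hashCode(), shardCount);
        }

        // Option 2: the client asks a lookup service (here, a simple map);
        // this is the component that risks becoming a single point of failure.
        public String shardByDirectory(String userId) {
            return directory.get(userId);
        }
    }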

    E. Further Information

http://highscalability.com/unorthodox-approach-database-design-coming-shard

  VI. Conclusion

Sharding has the highest maintenance availability; MySQL Cluster has the lowest.  Sharding has the least inter-node communication; Oracle RAC has the most.  MySQL Cluster has the most fault-tolerant permanent storage; sharding has the least.  MySQL Cluster has the least complex client interaction; Oracle RAC has the most complex.  Depending on the exact features needed and the type of data used, each could be the best solution.  However, for this focused comparison, MySQL Cluster provides the best combination.
