Data Sharding – Way for SQL Azure (go back »)
March 9 2011, 11:46 AM
“Data Sharding” is a concept of slicing your database into multiple smaller size databases providing faster access to the data. The term “Data Sharding” is coined by google engineers and popularized through their publication.
“Data Sharding” approach is on rise since the birth of SQL Azure due to its inherent size limitation to few gigabytes. Since enterprises requires to store tera bytes and peta bytes of data, it is utmost important to cater to the needs of enterprises even by living with the size limitation of SQL Azure.
Database sharding has multiple advantages. Few of them are listed below:
Ø Increased maintainability
Sharding divides your big database into multiple smaller ones increasing the overall manageability of your databases.
Ø Faster Query response time
Database queries can now be retrieved faster with distribution of database transactions across multiple smaller databases.
Ø Better DB availability
Sharding across multiple databases increases the chances of higher availability than being a single large database.
Ø Increase work throughput
Parallelizing work across multiple shards increases the overall work throughput.
The goal of this white paper is to provide perspective on Data Sharding approach by discussing various data sharding patterns and key practical considerations for successful sharding implementation on SQL Azure.
Let us first understand the need of Data Sharding in general.
Data Sharding is highly proven approach for improving the data access throughput and performance for large enterprise databases.
Over the period of time, enterprise databases grow exponentially. This steep growth of the database results in lowering the application performance exponentially due to increased database transactions.
Database throughput always depends upon CPU usage and memory of the machine. The increase in database transactions and size increases the CPU, disk and memory usage on the machine. This results in decrease of database performance due to increased query time. Adding more cores and memory to the machine can solve the problem to a certain extend. Above one point therefore, there is a need for a sophisticated solution such as “Data Sharding”.
Social networking sites such as Facebook, LinkedIn, Blogger are best suited example for data sharding. Tera bytes and Peta bytes of data is being stored on these sites today. Thousands of database transactions (read/write) gets executed every single second. In such scenario there are two options, viz. beef up your database server vertically so as to support these huge amount of data or shard your database across multiple database servers. The first option is more costlier and has its intrinsic limitations as till what we can beef up the resources vertically. The later provides much better option with low cost and better maintainability.
Thus to have the better performance for mission critical data centric applications, “Data Sharding” provides a better solution.
Figure 1: Data sharding Concept
In Context of SQL Azure, it is also important to note that, the database size is limited to few Gigabytes. In such scenario, it is must to have the enterprise database sliced enough so as to support SQL Azure.
There are multiple patterns of sharding. You need to choose what is suitable and optimized for your scenario. Most commonly use pattern is horizontal slicing of the database.
This strategy is based on slicing the data horizontally in multiple smaller databases based on a particular theme. For example, slicing can be done at demo graphics level so that all the data for a particular demo graphics can be separated in the separate databases. For such approach sharding strategy should be devised to have dynamic retrieval of partition (or shard key) at application level.
The metadata information such as for each of the shard should be stored and retrieved dynamically in the application layer. This can be achieved by storing the shard database information that includes shard database location and corresponding sharded tables’ workset (i.e. record) information.
It is recommended that the relational information for a slice should be in the same database as parent on the same server for easy retrieval. For e.g. slicing on customer table on each of the customer, the information related to individual customer (say orders of the customer) must reside in the same database as the customer in, thus providing easy way to retrieve the information and avoid any cross-linked server calls. Thus, this mandates that the related foreign key tables must also be sharded for ease of the retrieval.
It is always recommended to have copy of static (or master) tables in all shards to avoid any cross db / linked server calls. This will not only reduce the overall complexity of sharding but also will improve the query performance.
Key advantages of this pattern are:
Ø Better control over sharding
Ø More loosely coupled approach
Ø Re-sharding further possible in case of unexpected DB growth provided initial sharding done carefully
Key dis-advantages of this pattern are:
Ø Need careful design otherwise leads to unbalanced load
You can also think of slicing your database vertically at table or column level.
Table level partitioning typically requires separating tables of a particular business feature in a separate shard. While this provides an advantage of cleaner separation, over the period of time it becomes a unmanageable when the DB size increases. So this is mostly suitable for non-static / readonly databases or small databases.
Column level partitioning is also seen sometimes a useful sharding approach when heavier columns can be separated out in a separate database. The best example of this is large blobs columns.
Key advantages of this pattern are:
Ø Better suitable for small DBs
Ø Easier for better maintainability
Key dis-advantages of this pattern are:
Ø Not suitable for large Enterprise DBs
Ø Further sharding may not be possible or may be possible with increased complexity
One can also think of mirroring the databases so that one database can server set of CRUD operations. For example, one mirror can server all read operations and other mirror can serve add, update and delete operations (master partition). This way of partitioning can also improve performance if the load of CRUD operations can be divided symmetrically.
This model is also called as Master/Slave model where master serves all add, update and delete operations and slave(s) serves all read operations. It should be noted that there has to be synchronization between master and slaves for reflecting correct set of data in slaves. This will increase the system maintenance overhead. Also the master servers must also be mirrored and clustered so as to reduce the bottleneck and failures.
Advantages:
Ø Performance may be increased compared to other options if designed carefully
Ø Availability may be increased compared to other options
Dis-advantages:
Ø Slaves (Read servers) are not guaranteed to have up-to-date data at the time of read operations due to lag in the synchronization leading to read of dirty data
Ø Master server can easily become a bottleneck in case of not clustered or mirrored
Ø May not be suitable for OLTP databases due to nature of transaction updates
There could be multiple other patterns to shard your databases. Above discussed patterns are sample patterns and one has to do detailed analysis before choosing the right pattern that best fits his / her scenario.
Having understood about Data Sharding, Let us understand how it can be applied to SQL Azure scenario.
As it may be noted that SQL Azure limits the size of each database to few giga bytes, Data Sharding is the best possible way out for SQL Azure.
Few key considerations are listed below while sharding in SQL Azure.
Ø Choose your sharding scheme carefully so as to not have distributed queries. This is required on SQL Azure as it does not support distributed queries.
Ø Carefully consider your application performance needs before creating shards as sharding may hit the performance. Queries may need to be optimized at application layer. This also needs to be evaluated before selecting the affinity groups for your SQL Azure DB.
Ø SQL Azure does not support distributed transactions. Thus any of your code using SqlTransation class will fail in SQL Azure scenario. Consider using TransationScope object and System.Transactions namespace instead of SqlTransaction at application level to mimic the distributed transactions functionality in SQL Azure scenario.
Ø Consider your data compliance requirement. You may choose to keep your private data on-premise in the SQL Server database that may be required due to compliance requirement and shard out the rest of public data in SQL Azure.
Ø Consider database growth estimates over several years (at least 3-5 years) before choosing the sharding scheme.
Ø Sharding mandates to have global temporary table(s), if used, to be available in each shard. SQL Azure does not support global temporary tables and thus any of such code must be carefully considered for replacement.
Ø You may have to change your application logic so as to fulfill application’s reporting needs as it might require fetching the data now from multiple shards. You may also look at de-normalizing reporting part of data into a separate shard as an alternative option. This will further require synchronization between this reporting shard and master shards.
While sharding provides better performance and maintainability, in reality it increases the complexity of the application exponentially too.
The primary change sharding poses in the application is the way you access your database. You need to review your data layer classes for the change in data access.
For example, till now your application might be accessing the database as below:
This should now be changed to something like below:
This single change practically poses multiple complexities. Following key considerations need to be studied carefully:
Ø Manage data size and data with Referential Integrity
One of the key considerations one has to take into account is how to manage the data size considering the database growth aspect and manage the referential integrity (RI) of the data.
One of the major objectives of usage of relational database in business applications is to ensure RI within the data. Slicing this data across different shards, must be therefore done carefully so as to manage the RI in the data. Cross shards RI is not allowed. Thus sharding scheme needs to be chosen such a way that this is avoided, in absence of which, RI has to be managed manually at application level.
Ø Persist Database connection information for shards
You need to also consider how to persist different database connection information for different shards with corresponding sharding information. For e.g. in above example database connection is for specific customer.
Ø Avoid Cross shard Joins
Since now information will be split across multiple databases, you need to carefully consider the sharding scheme so as to avoid cross shard calls. Cross shard calls are resource intensive and will have impact on the overall query performance. In case of SQL Azure, since it does not support distributed queries, sharding should carefully designed to take care of this issue.
Ø Availability & Reliability of the shard databases
Production applications demands higher availability and reliability. Database tier is one of the most critical segments of the architecture. Therefore it is also important to make your shards as highly available and highly reliable as single database would be.
Considering this, you need to think of having database mirroring and clustering in place for each of your shard. Currently SQL Azure does not support database mirroring. You may need to manage it on your own by having two copies of databases and synchronizing both of them out of application process.
Ø Management of Auto-increment key
Your application might be using auto-increment key columns that may be generating sequential key values for different rows. In case of database sharding it is important to note that these columns need to be managed and to provide a manual key generation method at application level instead of database level so as to keep keys unique across all shards.
Ø Design carefully so as to Shard at granular level in future
Enterprise databases can grow exponentially out of the initial estimates. Therefore you need to choose your sharding scheme to have a room for further sharding possible. This way you will be able to manage the database sharding seamlessly.
There are multiple patterns available on Data Sharding. It is up to individual to choose the correct sharding scheme that suits his / her scenario best.
Data Sharding does provide certain advantages over traditional approach of beefing up database server vertically. However it does increases the complexity of the overall system multifold. It may not be always suitable to shard your database or always shard in particular way.
While working with SQL Azure you have to consider the Data Sharding as primary approach for data management in case of your data is large. While doing this, few key considerations may be observed as described in this article. In addition, multiple practical considerations should also be observed before designing data sharding approach for particular scenario.
In Architecture
Comments
Displaying 0 - 0 of 0 comments
No comments yet. Be the first one to comment!
Add Comment
You must be logged in to comment
Statistics
Comments | 0 |
Page views | 2886 |