Data Sharding – Way for SQL Azure

March 9 2011, 11:46 AM

“Data Sharding” is a concept of slicing your database into multiple smaller size databases providing faster access to the data. The term “Data Sharding” is coined by google engineers and popularized through their publication.
“Data Sharding” approach is on rise since the birth of SQL Azure due to its inherent size limitation to few gigabytes. Since enterprises requires to store tera bytes and peta bytes of data, it is utmost important to cater to the needs of enterprises even by living with the size limitation of SQL Azure.
Database sharding has multiple advantages. Few of them are listed below:
Ø Increased maintainability
Sharding divides your big database into multiple smaller ones increasing the overall manageability of your databases.
 
Ø Faster Query response time
Database queries can now be retrieved faster with distribution of database transactions across multiple smaller databases.
 
Ø Better DB availability
Sharding across multiple databases increases the chances of higher availability than being a single large database.
 
Ø Increase work throughput
Parallelizing work across multiple shards increases the overall work throughput.
 
The goal of this white paper is to provide perspective on Data Sharding approach by discussing various data sharding patterns and key practical considerations for successful sharding implementation on SQL Azure.
 


 
Let us first understand the need of Data Sharding in general.
Data Sharding is highly proven approach for improving the data access throughput and performance for large enterprise databases.
Over the period of time, enterprise databases grow exponentially. This steep growth of the database results in lowering the application performance exponentially due to increased database transactions.
Database throughput always depends upon CPU usage and memory of the machine. The increase in database transactions and size increases the CPU, disk and memory usage on the machine. This results in decrease of database performance due to increased query time. Adding more cores and memory to the machine can solve the problem to a certain extend. Above one point therefore, there is a need for a sophisticated solution such as “Data Sharding”.
Social networking sites such as Facebook, LinkedIn, Blogger are best suited example for data sharding. Tera bytes and Peta bytes of data is being stored on these sites today. Thousands of database transactions (read/write) gets executed every single second. In such scenario there are two options, viz. beef up your database server vertically so as to support these huge amount of data or shard your database across multiple database servers. The first option is more costlier and has its intrinsic limitations as till what we can beef up the resources vertically. The later provides much better option with low cost and better maintainability.

Thus to have the better performance for mission critical data centric applications, “Data Sharding” provides a better solution.

laxxal.createblog.com/photos/photo.php

Figure 1: Data sharding Concept
In Context of SQL Azure, it is also important to note that, the database size is limited to few Gigabytes. In such scenario, it is must to have the enterprise database sliced enough so as to support SQL Azure.
There are multiple patterns of sharding. You need to choose what is suitable and optimized for your scenario. Most commonly use pattern is horizontal slicing of the database.
This strategy is based on slicing the data horizontally in multiple smaller databases based on a particular theme. For example, slicing can be done at demo graphics level so that all the data for a particular demo graphics can be separated in the separate databases. For such approach sharding strategy should be devised to have dynamic retrieval of partition (or shard key) at application level.
The metadata information such as for each of the shard should be stored and retrieved dynamically in the application layer. This can be achieved by storing the shard database information that includes shard database location and corresponding sharded tables’ workset (i.e. record) information.
It is recommended that the relational information for a slice should be in the same database as parent on the same server for easy retrieval. For e.g. slicing on customer table on each of the customer, the information related to individual customer (say orders of the customer) must reside in the same database as the customer in, thus providing easy way to retrieve the information and avoid any cross-linked server calls. Thus, this mandates that the related foreign key tables must also be sharded for ease of the retrieval.

 

laxxal.createblog.com/photos/photo.php

Figure 2: How does sharding works

It is always recommended to have copy of static (or master) tables in all shards to avoid any cross db / linked server calls. This will not only reduce the overall complexity of sharding but also will improve the query performance.
Key advantages of this pattern are:
Ø Better control over sharding
Ø More loosely coupled approach
Ø Re-sharding further possible in case of unexpected DB growth provided initial sharding done carefully
Key dis-advantages of this pattern are:
Ø Need careful design otherwise leads to unbalanced load
 
You can also think of slicing your database vertically at table or column level.
Table level partitioning typically requires separating tables of a particular business feature in a separate shard. While this provides an advantage of cleaner separation, over the period of time it becomes a unmanageable when the DB size increases. So this is mostly suitable for non-static / readonly databases or small databases.
Column level partitioning is also seen sometimes a useful sharding approach when heavier columns can be separated out in a separate database. The best example of this is large blobs columns.
Key advantages of this pattern are:
Ø Better suitable for small DBs
Ø Easier for better maintainability
Key dis-advantages of this pattern are:
Ø Not suitable for large Enterprise DBs
Ø Further sharding may not be possible or may be possible with increased complexity
 
One can also think of mirroring the databases so that one database can server set of CRUD operations. For example, one mirror can server all read operations and other mirror can serve add, update and delete operations (master partition). This way of partitioning can also improve performance if the load of CRUD operations can be divided symmetrically.
This model is also called as Master/Slave model where master serves all add, update and delete operations and slave(s) serves all read operations. It should be noted that there has to be synchronization between master and slaves for reflecting correct set of data in slaves. This will increase the system maintenance overhead. Also the master servers must also be mirrored and clustered so as to reduce the bottleneck and failures.
Advantages:
Ø Performance may be increased compared to other options if designed carefully
Ø Availability may be increased compared to other options
Dis-advantages:
Ø Slaves (Read servers) are not guaranteed to have up-to-date data at the time of read operations due to lag in the synchronization leading to read of dirty data
Ø Master server can easily become a bottleneck in case of not clustered or mirrored
Ø May not be suitable for OLTP databases due to nature of transaction updates
 
There could be multiple other patterns to shard your databases. Above discussed patterns are sample patterns and one has to do detailed analysis before choosing the right pattern that best fits his / her scenario.


 
Having understood about Data Sharding, Let us understand how it can be applied to SQL Azure scenario.
As it may be noted that SQL Azure limits the size of each database to few giga bytes, Data Sharding is the best possible way out for SQL Azure.
Few key considerations are listed below while sharding in SQL Azure.
Ø Choose your sharding scheme carefully so as to not have distributed queries. This is required on SQL Azure as it does not support distributed queries.
 
Ø Carefully consider your application performance needs before creating shards as sharding may hit the performance. Queries may need to be optimized at application layer. This also needs to be evaluated before selecting the affinity groups for your SQL Azure DB.
 
Ø SQL Azure does not support distributed transactions. Thus any of your code using SqlTransation class will fail in SQL Azure scenario. Consider using TransationScope object and System.Transactions namespace instead of SqlTransaction at application level to mimic the distributed transactions functionality in SQL Azure scenario.
 
Ø Consider your data compliance requirement. You may choose to keep your private data on-premise in the SQL Server database that may be required due to compliance requirement and shard out the rest of public data in SQL Azure.
 
Ø Consider database growth estimates over several years (at least 3-5 years) before choosing the sharding scheme.
 
Ø Sharding mandates to have global temporary table(s), if used, to be available in each shard. SQL Azure does not support global temporary tables and thus any of such code must be carefully considered for replacement.
 
Ø You may have to change your application logic so as to fulfill application’s reporting needs as it might require fetching the data now from multiple shards. You may also look at de-normalizing reporting part of data into a separate shard as an alternative option. This will further require synchronization between this reporting shard and master shards.
 
 
 


 
While sharding provides better performance and maintainability, in reality it increases the complexity of the application exponentially too.
The primary change sharding poses in the application is the way you access your database. You need to review your data layer classes for the change in data access.

For example, till now your application might be accessing the database as below:

laxxal.createblog.com/photos/photo.php

This should now be changed to something like below:

laxxal.createblog.com/photos/photo.php

This single change practically poses multiple complexities. Following key considerations need to be studied carefully:
Ø Manage data size and data with Referential Integrity
One of the key considerations one has to take into account is how to manage the data size considering the database growth aspect and manage the referential integrity (RI) of the data.
One of the major objectives of usage of relational database in business applications is to ensure RI within the data. Slicing this data across different shards, must be therefore done carefully so as to manage the RI in the data. Cross shards RI is not allowed. Thus sharding scheme needs to be chosen such a way that this is avoided, in absence of which, RI has to be managed manually at application level.
Ø Persist Database connection information for shards
You need to also consider how to persist different database connection information for different shards with corresponding sharding information. For e.g. in above example database connection is for specific customer.
 
Ø Avoid Cross shard Joins
Since now information will be split across multiple databases, you need to carefully consider the sharding scheme so as to avoid cross shard calls. Cross shard calls are resource intensive and will have impact on the overall query performance. In case of SQL Azure, since it does not support distributed queries, sharding should carefully designed to take care of this issue.
 
Ø Availability & Reliability of the shard databases
Production applications demands higher availability and reliability. Database tier is one of the most critical segments of the architecture. Therefore it is also important to make your shards as highly available and highly reliable as single database would be.
Considering this, you need to think of having database mirroring and clustering in place for each of your shard. Currently SQL Azure does not support database mirroring. You may need to manage it on your own by having two copies of databases and synchronizing both of them out of application process.
 
Ø Management of Auto-increment key
Your application might be using auto-increment key columns that may be generating sequential key values for different rows. In case of database sharding it is important to note that these columns need to be managed and to provide a manual key generation method at application level instead of database level so as to keep keys unique across all shards.
 
Ø Design carefully so as to Shard at granular level in future
Enterprise databases can grow exponentially out of the initial estimates. Therefore you need to choose your sharding scheme to have a room for further sharding possible. This way you will be able to manage the database sharding seamlessly.
 


 
There are multiple patterns available on Data Sharding. It is up to individual to choose the correct sharding scheme that suits his / her scenario best.
Data Sharding does provide certain advantages over traditional approach of beefing up database server vertically. However it does increases the complexity of the overall system multifold. It may not be always suitable to shard your database or always shard in particular way.
While working with SQL Azure you have to consider the Data Sharding as primary approach for data management in case of your data is large. While doing this, few key considerations may be observed as described in this article. In addition, multiple practical considerations should also be observed before designing data sharding approach for particular scenario.

Posted in Architecture

0 comments

No comments yet. Be the first one to comment!

Cloud Computing – A Paradigm Shift

March 4 2011, 4:57 AM

While enterprises are compelled to reduce cost and their IT spending day by day, to meet this growing business needs, IT industry is going through a paradigm shift in different technological areas and breakthrough innovations.
“Cloud Computing” is another such evolution IT industry is banking upon that gives a whole new dimension to Web Computing.
Everyone these days seems to be talking about "Cloud Computing", but there is no clear cut definition available in the market. This paper meant to throw some light on the meaning of Cloud Computing to business and IT community.
The Web technology has evolved over the period of time from merely publishing the static information to the dynamic information sharing what we see today. This flexibility from Internet is now transforming into the new era of “Cloud Computing” where business can take advantage of what is already on internet in day to day life and need not have to really own that piece of the service.
What is it exactly?
Cloud computing is a evolving concept and a style of computing where tasks are assigned to immensely scalable IT services that are hosted on Internet and that exposes various IT capabilities to end customers uniformly such as application services, business services, storage, network etc.
CC can take advantage of IT resources over the cloud as well (out side enterprise), when need arises for e.g. may be a cart service can talk with payment service on the cloud. This type of computing even can use other resources such as SANs, network equipments. Operating at cloud level gives users a flexibility to scale vertically and thus provides easy access to the services using a very thin client such as Blackberry or iPhone. For this same reason alternatively it is also termed as “on-demand computing”. So on and all this new way of computing have a great potential and future for sure.
Wikipedia defines it correctly as
Cloud computing is Internet (“Cloud”) based development and use of computer technology (“Computing”). It is a style of computing where IT-related capabilities are provided “as a service” allowing users to access technology-enabled services from the Internet ("in the cloud") without knowledge of, expertise with, or control over the technology infrastructure that supports them.
It is important to note that any service or application over the internet is not a cloud. But it has to meet all the characteristics of “Cloud Computing” and few on the important ones can be described as below:
  1. A service or application over available over the internet (“Cloud”) and no special installation required.
  2. A service or application is immensely scalable on-demand.
  3. Services must be highly available.
  4. Provides Pay-as-you-go model for on-demand computing to the customers.
  5. The service must be a guaranteed delivery over the internet and not just merely using the internet.
Cloud computing is often confused as grid computing, but in reality it compliments it. Taking advantage of Grid Computing in cloud computing can give it natural power of super computer.
Essentially there are two types cloud can exist into as obvious; Public Cloud and Private Cloud.
As definition suggests, Public cloud is open for public to access, designed for immense salability. Private cloud is been built for an enterprise, providing more security.
The notable difference between them is the upfront investment an enterprise has to do. Public cloud generally based on pay-as-go-model giving enterprise a huge cost advantage upfront. Even though this is true, over the time private cloud can be a cost winner. Public cloud, since in public domain, will not provide access to you for your data whereas Private cloud control is with you, thus providing control on your own data. With increasing capabilities of public cloud day over day, vendors are working hard to provide such control to the customers even in Public Cloud.
The choice of a particular type is completely lies with enterprises, though public cloud is more suited for startups and SMB segment.
 
We can think of Cloud Computing consisting two components viz. Cloud Platform and Cloud Applications.
 
As the name suggests, this platform allows developers to host the applications and services in cloud or take use of services from cloud.
Think of the current scenario where basic infrastructure such as operating systems, hosting engines, network infrastructure, and storage space is required within the enterprise to host the applications in staging & production. Taking use of Cloud Computing Platform now eliminate all these needs. Unfortunately not many cloud platforms are available today and vendors are rising to this challenge. Microsoft’s Azure platform is one of such cloud platform that has capabilities such as store, compute & interact. The Windows Azure platform is based on Microsoft.NET framework version 3.5 and IIS v 7.0. It provides set of APIs for developers to create their cloud applications & services.
 
These are the applications in cloud hosted on cloud platforms such as Windows Azure.
A typical IT landscape for on-premises applications looks like this:
The on-premises solution landscape for any enterprise consists of:
  • Infrastructure – Consists of Platform, Operating System, Network resources, Development Tools etc.
  • Core Business applications & Services – Business application services that are hosted on the Infrastructure
  • Peripheral Applications & Services – Such as monitoring tools, common services etc.
  • Storage / Integration / Workflow Services – Such as databases, Integration servers, workflow services
  • Entitlement – Authentication & Authorization
Cloud computing provides exact similarities in addition to providing its own benefits.
As the cloud awareness is increasing day by day among enterprises, the field is open for an competition already.
One of the most discussed and powerful Cloud Platform today is Microsoft’s Windows Azure. Azure is a full-fletched non-commercial operating system from Microsoft that has capabilities of Storage, Compute, & Manage. The Platform comprises of many cloud services components such as .NET services, SQL Services & Live Services.
One of the widely used available Cloud Operating System today is Amazon’s Elastic Compute Cloud (EC2). EC2 provides way to specify how many Linux instances running on Virtual Machines customer requires. This is achieved through a commercial web service. It uses a technology viz. “Xen Virtualization”.
GoGrid is another vendor that provides cloud platform. Cloud platform services also provide windows support.
Microsoft also provides a CRM cloud platform through their Dynamic CRM Live Offering.
 
Cloud applications are the application services in cloud. Cloud application services are generally confused with SaaS but in reality they are not SaaS.
So what is the difference between SaaS and Cloud Applications?
SaaS is a first step towards Cloud Applications.
1.       SaaS typically resides in private cloud out of the enterprises and is more customized for a customer or set of customers; whereas Cloud applications are typically make use of public cloud and uniform in nature for different customers. Best example of cloud applications can be given as Salesforce.com.
2.       Cloud Applications can scale multiple times than SaaS applications since they are designed to run on one to many machines that gives them immense power to scale and you can add as many virtual servers as you need.
 
There are immense benefits of this approach. It gives direct benefits to business including profitability improvements.
1.       There is no on-premises IT infrastructure or installation required and hence a huge IT cost reduction.
2.       It gives enterprises a flexibility to use it need basis and no overheads for IT maintenance and hence operational efficiency is achieved.
3.       Since there is no infrastructure or IT tools required to be maintained by the enterprises, the obvious labor cost attached to it is saved.
4.       It can offer virtually unlimited computing power which can be utilized for large operations for e.g. search an email from within GBs of data.
5.       Users still have control on their Cloud Infrastructure, Application and Data.
 
On in all, having gone through all these discussions here are the takeaways from this paper:
-          Cloud Computing is a style of technology for future and it will change the way we develop our IT
-          Cloud Computing is surely a next generation approach towards an efficient IT solution and enterprises should start looking at it to take early benefits of it
-          Cloud is based on SOA model, thus provides more flexibility, agility to the business
-          Provides on-demand computing model giving huge cost advantage to enterprises
-          Immensely scalable providing high computing power and high availability

Posted in Architecture

1 comments

Profile

laxxal
  • Male
  • 35 years old

Statistics

Entries 3
Comments 1
Page views 1,600
Last update Mar 9, 2011