How to Shard with Windows Azure SQL Database

How to Shard with Windows Azure SQL Database

Database sharding is a technique of horizontal partitioning data across multiple physical servers to provide application scale-out. Windows Azure SQL Database is a cloud database service from Microsoft that provides database functionality as a utility service, offering many benefits including rapid provisioning, cost-effective scalability, high availability and reduced management overhead. SQL Database combined with database sharding techniques provides for virtually unlimited scalability of data for an application. This paper provides an overview of sharding with SQL Database, covering challenges that would be experienced today, as well as how these can be addressed with features to be provided in the upcoming releases of SQL Database.


 This topic is a how to.
Please keep it as clear and simple as possible. Avoid speculative discussions as well as a deep dive into underlying mechanisms or related technologies.

 Note
If you wish to contribute to this page, use the Edit tab at the top (sign-in required). If you wish to provide feedback for this documentation please either send e-mail to azuredocs@microsoft.com or use the Comment field at the bottom of this page (sign-in required).

Table of Contents

  1. Introduction
  2. An Overview of Sharding
  3. A Conceptual Framework for Sharding with SQL Database
  4. Application Design for Sharding with SQL Database and ADO.NET
  5. Applying the Prescribed ADO.NET Based Sharding Library
  6. SQL Database Federations
  7. Summary
  8. See Also

1 Introduction

This document provides guidance on building applications that utilize a technique of horizontal data partitioning known as sharding, where data is horizontally partitioned across multiple physical databases, to provide scalability of the application as data size or demand upon the application increases.

A specific goal of this guidance is to emphasize how SQL Database can facilitate a sharded application, and how an application can be designed to utilize the elasticity of SQL Database to enable for highly cost effective, on-demand, and virtually limitless scalability.

This guidance will present an overview of patterns and best practices for implementing sharding using SQL Database. Introduced are two concepts that both use .NET and SQL Database to assist with implementing sharding, an ADO.NET based provider for sharding, and extensions to SQL Database for shard federations.

This guidance will focus on providing a framework for understanding how to perform sharding today with SQL Database and how it will evolve in the near future. Future guidance will address additional facets of sharding in greater detail.

Specifically, this guidance will focus on:

  • The basic concepts involved in horizontal partitioning and sharding
  • The challenges when sharding an application
  • Common patterns for implementing sharding
  • The benefits of using SQL Database as your application’s sharding infrastructure
  • A high level design of an ADO.NET based sharding library, and
  • An introduction to SQL Database Federations which adds sharding capabilities directly into SQL Database

1.1 Sharding: The Horizontal Scaling of Data for an Application

Sharding is an application pattern for improving the scalability and throughput of large-scale data solutions. To “shard” an application is the process of breaking an application’s logical database into smaller chunks of data, and distributing the chunks of data across multiple physical databases to achieve application scalability. Each physical database in this architecture is what is referred to as a shard.

In a sharded application as is outlined in this guidance, it is the rows of a logical database that are spread across separate physical databases. This differs from a solution that is vertically partitioned by putting the entire set of data entities into a separate physical database, such as putting orders and customers in different databases. The partitioning of data by value (the shard key) provides for the potential of greater scalability than a vertically partitioned design by being able to continuously break the data into more chunks that are spread across an increasing amount of shards as increasing demand requires.

The proposed implementation will map data to specific shards by applying one or more strategies upon a “sharding key” which is the primary key in one of the data entities. Related data entities are then clustered into a related set based upon the shared shard key and this unit is referred to as an atomic unit. All records in an atomic unit are always stored in the same shard.

The atomic unit forms a core architectural concept that is used by the sharding infrastructure as a basis for consistent access of data as shards are dynamically added and removed from an application. The sharding infrastructure will ensure that all related entities in an atomic unit are always within the same physical shard. This facilitates joins and aggregations, and will also be a key benefit of SQL Database Federations that will eventually be implemented into the sharding infrastructure, enabling consistent elastic scale up and down of the sharding solution.

By focusing on the atomic unit as a consistent piece of data, the sharding infrastructure will eventually be able to take action automatically based upon rules to add or remove shards to handle increasing or decreasing demand. As shards are added and removed, atomic units will need to move between shards to optimize data access. During the process of moving data the sharding infrastructure can ensure that any specific atomic unit of data will always be available either in its current or new shard with the location transparent to the application.

1.2 The benefits of sharding with SQL Database

Sharding as a concept has been available in several forms for years and has mostly focused on implementations based upon self-hosted infrastructure that address the following needs:

  • Scale out using tens, hundreds or thousands of database nodes using commodity hardware instead of expensive scale-up systems,
  • To achieve scalable performance as the number of nodes increases, and
  • Build a solution with an excellent price-performance ratio derived from the use of commodity hardware instead of expensive application servers

It is possible to build a self-hosted sharding solution, and indeed most to date have been built in this manner, but self-hosted sharding solutions also have several significant issues:

  • Scale-out is done at the application level, and the application must manage all of the details of the partitioning of data (many of these topics are discussed later),
  • Redundancy and high availability is implemented at the application level instead of in hardware or the cloud fabric,
  • Rebalancing of shards is challenging and often an offline process,
  • Physical administration of the hardware and system level software becomes increasingly difficult as more databases are added, and
  • Capital expenditure on servers is prohibitive.

SQL Database provides a unique ability for solving these issues of building a self-hosted sharded application through the elastic capabilities provided by the Azure cloud platform as a service. Key benefits are:

  • All infrastructure is managed
       
    SQL Database abstracts the logical from the physical administration. SQL Database handles all of the physical level tasks such as server patching, hard drives, and storage while customers only need to handle the administration of the logical databases.
  • Elastic provisioning of nodes
    Creating a new shard and adding it to a sharded application can be performed by using Windows Azure Management Portal or through DDL. SQL Database eliminates the need to take months to procure, configure and deploy the new hardware and database systems. Additionally, applications that need tens, hundreds or even thousands of databases for a short period of time can do this seamlessly and then de-provision the databases when the demand drops.
  • Pay-as-you-go pricing
       
    SQL Database has a linear pricing model that is very attractive for sharding solutions as the amount per month per gigabyte of storage is linear as the database size increases. This allows customers to have the ability to very accurately predict the costs that will be incurred as a system grows (and shrinks). Also, because databases are available in different editions (web and business) with differing ceilings for size limits, different users can also have control over the granularity of the increased costs as a sharding solution expands in data size.
  • High availability
       
    SQL Database provides a high availability SLA of 99.9% for all databases, no need to implement RAID and other availability techniques yourself.

Cloud infrastructure does provide its own complexities, and building a sharded solution on today’s SQL Database requires careful design given their unique complexities:

  • Maximum resource limitations on individual SQL Databases
    There are practical and technical limitations imposed in SQL Database on the maximum data set size in a single database. While these sizes will likely continue to increase over time, they will still remain relatively small compared to the overall size required of some sharded solutions.
  • Multi-tenancy performance throttling and connection management
       
    SQL Databases can be performance throttled in effort to provide a minimum quality of service level. Queries issued by an application that consume more than their fair share of a tenant will be terminated by SQL Database, and since SQL Database is multi-tenant queries issued by other application could also impact the application’s performance. SQL Database also terminates idle sessions at regular intervals to preserve system resources, thereby requiring an application to be able to account for automatic session recovery. As shards are being added and removed dynamically, managing connections becomes a challenge, as an application must be able to reestablish connections as the set of shards changes.
  • Sharding with the current SQL Database is application level
    The current version of SQL Database does not have any explicit support for sharding, and therefore the application is currently responsible for all facets of sharding until the capabilities are in SQL Database Federations.
  • Rebalancing of shards is an offline process
    The addition or removal of shards from a sharded application can be a complicated process, as rules for finding data must be changed as the physical infrastructure is modified. In addition to the previously mentioned issues with connections, there are also data level issues such as key management that may require rewriting of keys as data is moved between shards. This often leads to the fact that sharding solutions, even with SQL Database being able to create databases almost instantly, still may need to go offline for an in-determinant duration as data is adapted to the new federation structure.

To address these complexities in effort to facilitate elastic sharding, upcoming releases of SQL Database will add support directly into the cloud infrastructure to manage the dynamic addition of shards and movement of atomic units of data consistently and transparently between shards. These capabilities are currently referred to as SQL Database Federations (described in section 6) with an outline of the concepts in Figure 1 below.

Figure 1: SQL Database Federations

Until Federations are added to SQL Database (and even after) there are still concerns that warrant the use of an application framework to facilitate sharding to offload many of these responsibilities from the application. In this document we’ll refer to the application framework as an ADO.NET based sharding library.

Creating a custom ADO.NET based sharding library combined with functionality to be provided in later releases of SQL Database including SQL Database Federations, will provide the most advanced, flexible, scalable and economic sharding application infrastructure available.

1.3 The Remainder of the Guidance

The remainder of this guidance will give a high-level overview of various concepts and patterns in any sharded application, and provide guidance on specific techniques commonly used to manage sharded data in SQL Database. This will be extended into a discussion of techniques using ADO.NET and SQL Database for sharding, followed with more detailed examples and a brief introduction of the features to be added to SQL Database when it supports sharding.


2 An Overview of Sharding

Sharding is an application pattern where the application partitions its data horizontally across multiple databases and for purposes of this guidance across multiple SQL Databases.

This section introduces a common sharding archetype often referred to as a multi-master sharding pattern that will be used as a reference model for the remainder of the guidance.

2.1 The Multi-Master Sharding Archetype

Figure 2: A Multi-master sharding archetype

 

In this sharding archetype all database shards are considered read/write, there is no inherent replication of data between shards, and the model is referred to as “shared nothing” since no data is shared or replicated between shards. An application would want to use a multi-master archetype if:

  • Clients of the system need read / write access to data
  • The application needs the ability for the data to continuously grow
  • Linear scalability of data access times is needed as data size increases
  • Data written to one shard must be immediately accessible to any client

In a multi-master archetype, because no data is shared or replicated across shards, the application is completely responsible for knowing how to distribute all database operations to the appropriate shards, and in how to manage serial or parallel access to all the shards. This can become challenging to manage as the size of the database and number of shards increases. Many of these complexities are discussed in detail in section 2.2, since this archetype presents the most challenges it is best used to explain issues that an application or sharding provider needs to address.

2.2 Challenges of Sharding

Sharding can provide great scalability to an application, but it also complicates the application for many reasons. An application has to distribute its data across multiple database servers, and coordinating access to the servers requires the application to take on significant responsibility for federating data access across the shards to coordinate data access.

In a sense, an application that takes on sharding its data becomes an orchestration service of data across the shards, which unfortunately distracts the focus of the application away from the problem domain it is supposed to address and into implementation details of sharding.

The following are the issues that commonly arise when building a sharded application of any significant scale:

  • Distributed Data Access
    The primary concern of a sharding solution is the scalable distribution and retrieval of data across multiple database instances. This distributed nature of the architecture affects all data access methods (create, read, update and delete) and the sharding system must have the ability to know how to manage data manipulation across multiple databases. This knowledge centers on the application being able to:
    • Determine which shard(s) a new entity should be stored in, and
    • How the application remembers which shard an entity is stored in so it can be efficiently retrieved, updated or deleted at a later time
  • Determining which shard to use to store a new piece of data
    Faced with a multitude of shards in which to write a piece of data, which would the application choose? There are multiple strategies for making this decision, including load balancing, round-robin, hashing based upon a key and simply to write the data to all of the shards. Each of these techniques has different design, performance and operational issues that must be dealt with by an application.
  • How to efficiently retrieve stored data
    Retrieval of data is complicated by needing to determine the shard that holds the item and depends upon what types of rules that application uses to pick the shard in which to store the data. The application must somehow determine the key to the data and then deduce the shard via one of various algorithms or use brute force parallel queries to all shards (which is not particularly efficient).
  • ID Generation that can be used to efficiently store and retrieve data
    Storage and retrieval of items in a sharded application revolves around being able to take a piece of data and generate an ID for it which the system can use to efficiently store and retrieve the item from the shards. Options for this use various hashing and mapping functions, application specific ID generating algorithms, or with a sharding solution based ID generator that can handle the process on behalf of an application.
  • Parallel or sequential execution of queries across shards
    When needing to query multiple shards for a data item or a set of items, does the application query the shards sequentially or in parallel? Each has performance implications for the application, and the parallel pattern introduces the need to handle concurrency in the application (multithreaded requests for data). While at first parallel access seems like a natural way to retrieve data from all the shards, if the target data is only in one shard then all of the other shards utilize resources unnecessarily by handling requests that they do not need to answer.
  • Tracking of data back to its source shard
    If the application wants to update or delete data that has been retrieved from the shards, it must determine from the retrieved data which shard it originated from to efficiently handle the change. It is often possible to simply send every shard the command where each shard that is not responsible simply filter (by key) and ignore the request, but this can be very inefficient especially if the target item is only in a single shard. A sharding framework therefore should provide a means of being able to track every piece of data back to its originating shard.
  • Joins across shards
    In a sharded application, queries or other statements using joins that span shards are complex and challenging to perform efficiently. Several techniques are possible to perform cross-shard joins, but as the number and elasticity of shards increases it often becomes difficult for the application to know which shards to use or where they exist at any given time to be included in a join.

    Many applications that manage their own sharding address this by avoiding cross-shard joins and by adopting several other design techniques that in essence trade off space for time, as in an elastic sharded solution space is the commodity resource and time is what is valuable. The three most common techniques are:
    • Denormalization
      Denormalization assists by providing a means which data from several related entities are collapsed to a common entity structure where the related data is saved in the same record and hence the same shard.
    • Reference Tables
      An example of using reference tables is the case of relatively static lookup tables that are commonly utilized when joining with much larger primary tables. Tables containing values such as status codes, countries, types, and products fall into this category. The application therefore writes reference data to all the appropriate shards or utilizes a synchronization technology to replicate these tables across shards.
    • Entity clustering using a shard key and atomic data units
      An application that uses this technique stores related entities in the same shard. This is performed by the solution knowing the entity relationships and using a shared shard key in related entities to ensure that they are written to the same shard. This enables the use of joins across these specific entities since they are in the same shard.
  • Managing data relationships
    Data relationships can become difficult in sharded systems as joins across shards in SQL Database are not supported, and keys for data items must be generated in a way that works across multiple databases. Specific issues are:
       
    • Data relationships inherently require that joins be performed between different tables / entities. In a sharded solution it is likely that the tables for related entities will be in different shards creating the issues previously discussed with cross-shard joins.
    • Referential integrity cannot be ensured by the database if related entities with primary / foreign key relationships are stored across shards

Therefore applications built to manage sharding on their own address these concerns by adopting several techniques instead of relying on data relationships:

    • Extensive use of de-normalized data
    • System enforced storage of related data in the same shard
    • Manual mapping of relationships in relationship tables
    • Adoption of a loose consistency model where the application manages cleanup tasks in the various databases
  • Rebalancing the shards
    Eventually a single shard may fill up or performance may slow as the size of the data increases. When this occurs it may be decided to split one or more shards into separate shards to scale out the application. Or on the flip side, it can be decided that there are now too many underutilized shards, so it would be desired to collapse them into fewer shards so that fewer databases are utilized and cost can potentially be lowered.

    The rebalancing process is highly dependent upon the strategies for determining IDs and shard selection algorithms for the data. It may require restructuring those IDs as data is moved between shards. It is very likely that the application needs to be offline during the process, which is not desirable as the whole purpose of sharding is scalability and availability.

    To facilitate rebalancing, the ADO.NET based sharding library may implement a concept known as virtual shards that assists in isolating the impact of adding and removing shards, and new features in SQL Database for sharding will offer the ability to rebalance online directly within SQL Database.
  • Security
    Sharded applications utilize many databases that are distributed across multiple database nodes. Since the databases are on different nodes it is normally required to manage different security accounts on each server. This can be cumbersome as the amount of shards grows since account credentials need to be independently managed on each and every shard. It may also become very difficult to perform as shards are added and removed dynamically, as the actual connection information required for a particular piece of data may change as the data is moved between shards.

        SQL Database provides an elegant means of solving this issue. Using a single SQL Database account it is possible to create multiple databases that share the security credentials from the master database in the account. Each individual database in the account is created by SQL Database with its own redundancy provided by the fabric, with access via the same server name but with the catalog being mapped to different nodes, and with accounts for all replicated instances provided by the definitions in the master database. All that needs to be changed in the connection information for each shard is the catalog name of the specific shard.
  • Data Ingestion / Bulk Data Loading
    Many of the archetypes for sharding involve inserting bulk data from external sources into the sharding solution. But sharding complicates ingestion, as each entity in the load process must be sharded as it is inserted, so it is not as easy as just executing a bulk copy process into a single database. To make this work effectively it is necessary to write an application to manually perform the bulk load and sharding based upon the strategies defined for the data. This does have a benefit however as bulk insertions can be parallelized across the shards, making ingestions run much faster than with a single database.
  • Cross-Shard Sorting and Aggregations
    Sorting and aggregations are extremely complicated to execute in a sharded application because of the need to process the grouping query on each individual shard and then to reprocess each of the result sets again in the aggregate before passing the composite result back to the client.

    In the case of grouping commands such as SUM or AVERAGE, the individual groups of records from each shard will need to be identified, correlated with the same sets from the other shards, and merged. Additionally any order statements will then need to be executed on any composite result sets.

    SUM’s and AVERAGES’s represent two different classes of functions that need different types of post-processing. SUM-like functions are relatively straightforward in that they only need to be reapplied to the result columns. AVERAGES’s require additional information be passed back along with the resulting average. In this case, the average and the number of records needs to be returned so that the average from each shard can be weighted into the composite result.

    A best practice for these types of scenarios in sharded solutions is to realize that this is multistep process that is managed at the application level, executed against individual shards, with intermediate results potentially through temporary tables, and additional processing to be done at multiple stages depending upon the complexity of the expressions.

    Because of this sharded database solutions tend to be more complex for ad-hoc reporting scenarios. If these types of capabilities are required it is often best to perform extracts from the shards into a standard database, applying base grouping equations before insertion, and then doing more complex and ad-hoc queries upon the intermediate data store.
  • Backup and disaster recovery
    Due to the number of databases, and further complicated by the distributed nature of SQL Database, sharded solutions are difficult to backup. When using SQL Database for sharding there are a few means by which this process can be simplified:
       
    • SQL Database maintains three copies of a database to provide high availability and in the case of a node failure provides transparent failover. If backups or copies of a database are needed to provide failover, then these automatic capabilities may be enough to satisfy your needs.
    • SQL Database provides a means for database copy discussed in Copying Databases in Windows Azure SQL Database that can be used to backup individual shards. Also under consideration is the ability for SQL Database Federations to automate this across all (or an important subset) of the application’s shards to facilitate the process of backing up multiple shards. However, database copy only works within the same Azure data center and cannot be relied upon as a general disaster recovery procedure.
  • Distributed Transactions
       
    SQL Database supports transactions within a single shard, but distributed cross-database transactions are not supported. If distributed transactions are required within a sharded application, then it is better to denormalize the data model and store related data needed for a transaction within a single shard.
  • Connection Management
    A sharded application inherently must maintain multiple database connections, and as the number of shards (and users) increases so does the number of connections. This can become very resource intensive as the amount of shards increases, on the client, the database nodes, and the network infrastructure in-between. An effective sharding solution must be able to pool and reuse connections effectively to maintain scalability.

    The difficulty of connection management is significantly increased in a solution that attempts to utilize the elasticity of SQL Database to scale-out/scale-in as demand increases and decreases. As shards are added and removed, and data rebalanced across the shards, any specific piece of data can be moved between shards. At this point in time this movement of data may not be a transparent process to the application. To retrieve a record the application needs to know which shard to access, and hence the connection information for any particular data item may change dynamically and therefore it is not possible to configure ahead of time.

    With ADO.NET, each unique connection string creates a new connection pool, and the proliferation of connection pools can become a resource limit on the client as well as with all systems between the client and server that require maintaining connection state. Future enhancements to SQL Database are planned that will provide a technique for performing dynamic connection routing directly within SQL Database to relieve this concern.
  • Schema Management
    All changes to table structures (including add, drops, and manipulation of keys) must be propagated transparently to all shards implementing those objects. As the number of shards increases, this becomes increasingly cumbersome and error prone. For issues of consistency it is usually the case that the application must be down while schema changes are made.

3 A Conceptual Framework for Sharding with SQL Database

Having outlined the multi-master sharding archetype and the challenges involved with utilizing it for sharding data, it is now possible to outline a set of guidelines and processes for effectively performing sharding using SQL Database.

The stated guidelines are intended to be prescriptive to remain applicable to any particular application technology and serve as general design concepts that provide a robust sharding implementation.

For purposes of example, more descriptive guidance will be given in subsequent sections of this guidance. Specifically, section 4 will summarize a potential implementation using ADO.NET, and section 5 gives a description of applying the concepts of section 4 to the multi-master sharding archetype.

3.1 High Level Design Concepts

The following table gives a summary of the various concepts that will be described to successfully implement application level sharding using SQL Database. A brief summary is provided to introduce them and provide a framework for terminology used in design decisions given in the following section.

Concept

Purpose

Physical Shards

An abstraction of the physical databases.

Virtual Shards

A logical construct which simulates a larger number of shards to provide isolation to the application from details of physical shards.

Entity Definitions

The structure of the data to be stored and used to generate queries and manage schema.

Atomic Units

Defines the logical grouping of data entities that must be stored in the same shard and the key used for those entities.

Sharding Strategies

Definition of how atomic unit keys are mapped and retrieved from shards.

Shard Metabase Directory

Stores the configuration information for the applications sharding structure.

3.2 General Design Decisions and Constraints

The following are various design criteria that have been defined to set a baseline that can be used to maximize the elastic capabilities of SQL Database and to provide an effective application level sharding solution that will leverage planned extensions for SQL Database Federations.

The guidelines can be broken down into several categories:

3.2.1 Data Entities and Atomic Units

  • A shard is a single SQL Database.
  • Data is described as a set of entities that contain attributes (essentially tables and columns).
  • Entity metadata is used to generate queries and manage schemas across multiple shards.
  • A shard implements storage for one or more sets of entities that are referred to as atomic units.
  • An atomic unit implements a set of related entities.
  • All entities in an atomic unit are bound together by the atomic unit’s shard key.
  • All entity instances within a single atomic unit are always stored within the same physical shard.
  • All tables implementing entities in an atomic unit will contain a column for the shared shard key.
  • Applications organize their data around atomic units that will leverage the elastic capabilities of SQL Database and future capabilities that will be added for sharding.
  • Keys for atomic units can be of data types big integer or GUID.
  • The specific shard used to store an atomic unit is defined using one of the given shard selection strategies, where a shard selection strategy maps a key to specific shards by applying different algorithms.
  • An application using SQL Database for sharding should be designed to take advantage of the elasticity of SQL Database.
  • Applications must assume that any atomic unit can be moved between shards by SQL Database or other processes during rebalancing.
  • The set of shards in a solution can change dynamically, and therefore connections required for a specific atomic unit can change dynamically.
  • Applications always discover where data resides instead of assuming it is within a specific shard.
  • Once an atomic unit is located, queries within the atomic unit should be constrained within the context of an atomic unit that means a single shard.
  • Atomic units are critical to the effective rebalancing (splitting and merging) of shards. The rebalancing processes will ensure that all related entities in an atomic unit are never split.
  • Since related entities in an atomic unit are all stored within the same shard, then joins and transactions can be issued within a single atomic unit.
  • A sharding application specifies a fixed number of “virtual shards”. Keys for atomic units are generated and mapped not to the actual physical shards but to the virtual shards, which are then mapped into physical shards. This level of indirection provides a flexible means of assisting rebalancing of shards by being able to remap results of mapping functions into different shards instead of having to rewrite identifiers for atomic units when rebalancing shards.
  • Applications do not depend upon strict integrity across logical entities residing in separate shards and integrity should not be checked across shards.
  • Transactions are not supported across SQL Databases and therefore should not be designed into the application. Transactions can be used supported within an atomic unit and its shard.
  • Joins are not supported across SQL Databases, and hence an application does not issue joins for data across shards.
  • Related data is mapped into the same shard or separate queries are run in different shards and the data manually joined at the application level.
  • Data is not grouped or aggregated across physical shards. If this is desired it is performed at the application level.
  • Shard keys are considered immutable and once assigned will not change through application level processes or through rebalancing processes.
  • Global data is stored in a separate non-sharded database and manually updated in results from queries in the shards, and reference data is stored in all shards in normal tables.

3.2.2 Elasticity and Location of Atomic Units

  • An application using SQL Database for sharding should be designed to take advantage of the elasticity of SQL Database.
  • Applications must assume that any atomic unit can be moved between shards by SQL Database or other processes during rebalancing.
  • The set of shards in a solution can change dynamically, and therefore connections required for a specific atomic unit can change dynamically.
  • Applications always discover where data resides instead of assuming it is within a specific shard.
  • Once an atomic unit is located, queries within the atomic unit should be constrained within the context of an atomic unit that means a single shard.
  • Atomic units are critical to the effective rebalancing (splitting and merging) of shards. The rebalancing processes will ensure that all related entities in an atomic unit are never split.

3.2.3 Design Constraints

  • Since related entities in an atomic unit are all stored within the same shard, then joins and transactions can be issued within a single atomic unit.
  • A sharding application specifies a fixed number of “virtual shards”. Keys for atomic units are generated and mapped not to the actual physical shards but to the virtual shards, which are then mapped into physical shards. This level of indirection provides a flexible means of assisting rebalancing of shards by being able to remap results of mapping functions into different shards instead of having to rewrite identifiers for atomic units when rebalancing shards.
  • Applications do not depend upon strict integrity across logical entities residing in separate shards and integrity should not be checked across shards.
  • Transactions are not supported across SQL Databases and therefore should not be designed into the application. Transactions can be used supported within an atomic unit and its shard.
  • Joins are not supported across SQL Databases, and hence an application does not issue joins for data across shards.
  • Related data is mapped into the same shard or separate queries are run in different shards and the data manually joined at the application level.
  • Data is not grouped or aggregated across physical shards. If this is desired it is performed at the application level.
  • Shard keys are considered immutable and once assigned will not change through application level processes or through rebalancing processes.
  • Global data is stored in a separate non-sharded database and manually updated in results from queries in the shards, and reference data is stored in all shards in normal tables.

3.3 Conceptual Elements for Application Sharding with SQL Database

3.3.1 Shard Metabase Directory

The shard metabase directory is a repository of all configuration information and will be consulted by the application regularly to perform sharding. This data consists of the descriptions of the physical shards, mappings of virtual shards to physical shards, entity definitions, atomic unit definitions, strategies for atomic unit key definition and their mapping to virtual shards, and the strategies for query distribution across shards.

3.3.2 Physical Shards

A physical shard is a single SQL Database that is used to store atomic units of data. Each shard will store a subset of the applications overall data based upon how the application entities are mapped to the shards through the use of various strategies defined within the shard metabase directory.

3.3.3 Virtual Shards

Virtual shards play an important role in providing flexibility to a sharded solution by being able to simulate a larger amount of shards while using a smaller (or equal) amount of physical shards, and to provide flexibility in rebalancing without having to rewrite shard keys.

The general guidance with virtual shards is to plan for the maximum number of shards that would ever be needed to store the data. Think big, as virtual shards come for free, but must be fixed in size from the beginning otherwise rewriting of shard keys may need to be performed if the number of virtual shards has to be changed.

To demonstrate, the following is a practical example of using virtual shards to provide a scalable solution based upon demand. Suppose the application is providing ticket sales for an event and an arena, and you want to distribute the data processing load as tickets go on sale, but keep the number of databases in use low when not needed.

Virtual shards can be used to facilitate this solution by defining a virtual shard for each section of the arena. Prior to sales starting, since transaction volume is low, all virtual shards can map to a single SQL Database shard. When tickets initially go on sale, the solution can be rebalanced by adding a shard for each individual section. Then the one existing shard is rebalanced into all of the other shards and the virtual shard map changed to be 1:1 with the individual shards to distribute the data processing load. When demand eventually lowers, the shards can be merged and the virtual shards all mapped back to the single shard, and the excess empty shards deleted.

3.3.4 Entity Definitions

An entity is a description of a data element that forms the fundamental unit of storage in a shard and basically maps to the table structure of the data within the shards. These definitions are used to determine how to perform various tasks including query generation and management of shard schema.

3.3.5 Atomic Units

An atomic unit is a set of related entities that share a common sharding key and are always stored together in the same shard. One entity in each atomic unit will be defined as having its key field as the shard key for the atomic unit. The shard key will be stored in all other entity tables and used by the data provider to group related entities and navigate relationships between the entities in the atomic unit. A brief example would be to define a customer and order to be in the same atomic unit, and that the atomic unit utilizes the customer entity’s key as the atomic unit key. Orders that map to a particular customer by the customer key should always be written to the same shard as the customer with the same key.

3.3.6 Sharding Strategies

Shard strategies define the rules by which an application distributes and accesses the data (atomic units) across shards. A policy is a set of strategies that will contain one each of a key generation, access, resolution, selection and combine strategy. A policy is then applied to each defined atomic unit to determine where to store and manipulate an atomic unit in the shards. The strategies also instruct how to parallelize queries when multiple shards are identified as candidates for a particular atomic unit, and how to combine results that potentially are returned from multiple shards.

Each category of strategy is now briefly introduced with recommended implementations.

3.3.6.1 Key Generation

Key generation strategies inform the application how to allocate keys for entities and related atomic units. There are three recommended key generation strategies: Application, Sequential Big Integer, and GUID.

3.3.6.1.1 Application

The application is responsible for generating the ID’s for the entities, and can use identifiers of type big integer or GUID for the entity ID’s. When using this strategy the application will need to select either a range or hashed selection strategy to define how to map the ID’s to shards.

3.3.6.1.2 Sequential Big Integer

Use of this strategy defines the allocation sequential big integers that are guaranteed to not collide across shards for a given entity. When using this strategy an application should initially define the minimum and maximum range of ID’s that you will ever use, typically 0 through max big integer.

3.3.6.1.3 GUID

The GUID strategy instructs allocation of GUIDs as identifiers and will therefore not have collisions, but you will likely need to deal with unfriendly ID’s for the applications entities. When using this strategy an application will need to use an ID hash selection algorithm to select shards.

3.3.6.2 Access

Access strategies are used to inform how to apply database operations across the shards. The selection of access strategies assists in defining the means by which queries are distributed and results collected from the various shards, and can be either sequential of parallel.

3.3.6.2.1 Sequential

A sequential access policy instructs how to distribute the queries sequentially across the shards, waiting for results from the current shard before distributing the query to the next. This is typically a poor performing strategy when data is evenly distributed across shards as it can significantly impact the shards that are first in the list. But it is useful as it can be used to optimize performance when it is known that a required entity is in one of the first shards to be queried (but not knowing specifically which) and the complexity of managing parallel queries is not needed.

3.3.6.2.2 Parallel

The parallel access strategy instructs how to distribute the queries simultaneously across all target shards and wait for the result from every shard, combing all the results together based upon the combine strategy. This is often referred to in sharding systems as fan-out and can provide high performance when the data is evenly distributed across the shards.

3.3.6.3 Selection

Selection strategies are used to determine which shard a new entity should be stored. There are four strategies that are commonly used in a sharding application: ID Range, ID Hash, Round-Robin, and All Shards.

  • ID Range
    This strategy informs application to distribute the atomic units across the shards based upon a range of ID’s assigned to each shard. To use this strategy the key generation must be sequential big integers, and each virtual shard is allocated a sequential subset of numeric ID’s.
  • ID Hash
    This strategy informs the application to apply a hashing algorithm to the atomic unit ID’s to distribute the atomic units evenly across the shards, and normally hashes big integers based upon the number of virtual shards, and GUID’s based upon a GUID hashing algorithm that takes into account the number of virtual shards.
  • Round-Robin
    Using this strategy specifies the rotation of writing new atomic units across the set of shards. The end result is that the data will be evenly distributed across those shards. Use of this strategy will necessitate the use of the All Shards resolution strategy as it will not be known how to determine which shard stores an atomic unit based upon its shard key.
  • All Shards
    The atomic unit will be written to all shards. This strategy is used to write the data to all shards. Multiple clients configured to access the shards in different sequences or from different geographic locations can have retrieval optimized in trade off of space.
3.3.6.4 Resolution

Resolution strategies are used to determine the set of shards on which an atomic unit may reside after being written. There are two main resolution strategies: All and ID Based.

  • All
    The all shards resolution strategy instructs the data provider to ask all known shards for the entity. This strategy is useful when it is not known exactly which shard an atomic unit resides in based upon its key (as with the round-robin selection strategy), or when the key is not known and an attribute-based query must be performed. How queries are fanned out over the shards depends upon the access strategy that is selected.
  • ID Based
    This strategy is used to determine the shard on which an atomic unit exists by applying an algorithm based on the ID and is tightly coupled with the selection strategy algorithm. Using any of the key generation and selection strategies can deduce the proper shard from the provided ID.
3.3.6.5 Combine

Combine strategies define how to merge data that is returned from multiple shards when a query is fanned-out, and there are two: All and First.

  • All
    Combine results from all shards and return the combined set of data.
  • First
    Return only the first entity in any shard. This is useful when the specific shard for an atomic unit is not known and all shards are queried, where only one shard will contain the entity, or more than one shard contains a copy of the same data and the first one to return data is satisfactory.

3.4 Atomic Unit ID Generation and Shard Mapping

Perhaps the most important part of the sharding process is the generation and mapping of shard keys to the actual shards. This is done in an application as prescribed by configuring the shard metabase directory to specify a key generation algorithm and mapping function to apply to the shard key of the atomic unit. The result of this function is used to determine the virtual shard used to store the entire atomic unit.

There are several key generation and mapping functions available to an application. The selection of key generation and mapping functions has significant ramifications on the operational characteristics of the shards.

The following of are the two recommended techniques:

Generation Technique

Description

Sequential Big Integer

Allocate unique big integers incrementing in order from a specific minimum value up to a maximum specified value.

GUID

Generates a stream of unique GUIDs.

The following are the recommended mapping algorithms:

Function

Applies to

Description

Range

Sequential Big Integer

Specifies that keys between and inclusive of two values that will be stored in the specified virtual shard.

Modulus

Sequential Big Integer

The key value modulus of the number of virtual shards specifies which shard to store the atomic unit.

GUID Hash

GUIDs

Hashes the GUIDs to one of the defined virtual shards.

The combination of key generation and mapping functions, also combined with the number of virtual shards and their mappings to physical shards, determines how data is distributed across the shards as it is added (and rebalanced).

Key Generation

Mapping Function

Characteristics

Sequential Big Integer

ID Range

Atomic units should be stored according to the range of values for each virtual shard. Each new atomic unit will be written to the same shard until the value reaches the maximum for the virtual shard, at which point the atomic units will be written to the next virtual shard where the ID fits. This in essence fills one shard at a time, spilling over into the next shard when the current shard is completely out of identifiers.

 

ID Hash

The identifier will be modulus the number of virtual shards. The effect is that the atomic units will be evenly spread across the virtual shards. The allocation will rotate across virtual shards from lowest to highest and then repeat.

GUID

GUID Hash

Each GUID will be hashed and mapped to the number of virtual shards. The effect is to evenly distribute the atomic units across the virtual shards, but the order of allocation is nondeterministic.

3.5 Atomic Unit Discovery

A specific atomic unit can be stored in any shard, and through the rebalancing processes it can be moved to different shards over time. Because of this, an application should always look up the value of the shard key for an atomic unit so it can be used to bind queries to the shard where the atomic unit resides.

Discovery of the shard key can be done in two ways. The first is when an atomic unit is first created and initially stored in a shard, the application can store the shard key in persistent, non-sharded storage for later lookup. The alternative is to query all of the shards by entity attribute values to return the entity, and thereby the shard key for the atomic unit containing the entity.

Once the atomic unit’s shard key is retrieved, the resolution strategy for the atomic unit can be used to determine the virtual shard, and hence physical shard, that currently contains the atomic unit. That shards connection can then be used to retrieve the desired entities.

3.6 Connection Management

A sharded application will require simultaneous open connections to one or more shards. To optimize the process, smart connection management pooling must be implemented in the application. The current implementation of connection pooling in ADO.NET creates a pool for each unique connection string. Therefore as the number of shards increases, the applications intelligence for optimizing connection management must also increase in order to optimize retaining only the open connections that are currently required.

3.7 Rebalancing

A rebalancing entails performing a split or merge of one or more shards. This involves the creation or removal of one or more shards and the movement of atomic units of data between shards. A split of a shard will add an additional shard and move a certain set of atomic units into the new shard, and adjust the virtual shard maps to spread the load across the new set of shards. Shards can also be merged by moving data from one shard into another and eliminating the originating shard.

Splits are undertaken when either a shard reaches storage capacity or the shard needs to be scaled-out to utilize more capacity. In the former situation atomic units will be moved into a new shard, leaving two shards that now distribute the load of the application.

In the latter scenario, where either the shard is no longer performing up to desired levels, or it is identified that the load on the shard will be increasing it can be decided to split the shard to better handle the increase in load. The shard can be split into two or more shards and the sharding strategies adjusted to evenly distribute the load. If the load then decreases, the additional shards can then be merged back into a smaller set of shards.

This splitting and merging of shards is core to the application being able to take advantage of the dynamic database capabilities of SQL Database to provide a scalable data solution. Applications should be designed to expect that shards would be rebalanced as needed to support the demands of the application and that no assumptions should be made as to the actual location of any atomic unit.

The implementation of rebalancing is different depending upon the type of mapping function used (details won’t be discussed here) but follows a similar pattern for all. In the case of a split, a new shard is added and a virtual shard is assigned to the new shard. Data from the physical shard that is being split and which should now be stored in the new shard is then moved into that new shard. Through the process of remapping the virtual shards the application now utilizes the new shard without any changes to the application. Merges are similar except that a specific shard is identified to be removed, which will have its data merged into another shard and the virtual shard map modified, pointing to the shard which now contains the data.

3.8 Schema Management

An application that shards its data will need to ensure that the schema for the entities used by the application is consistent across all shards. All changes to table structure (including add / drops of tables and columns, renaming of objects, and manipulation of keys) must be propagated transparently from the application to all shards. As the number of shards increases, this becomes increasingly cumbersome and error prone. To ensure consistency, it is common that the application must be down while schema changes are made.


4 Application Design for Sharding with SQL Database and ADO.NET

This section describes the key components that require implementation in a sharding solution built on SQL Database. The components are described as being built with .NET and using ADO.NET as the data provider for SQL Database. The key components described can take advantage of the native functionality in SQL Database with Federations when made publicly available. The expected feature set for SQL Database Federations is described in section 6.

4.1 Design Goals

Presented here is a high-level design for a sharding aware ADO.NET based library that implements the concepts described in section 3. Using an ADO.NET based approach conforms to a standard, familiar pattern and abstracts the implementation details from the application.

Specific design goals are:

  • Describe an ADO.NET based sharding library that handles many of the details of sharding
  • Provide support for many different types of strategies for generating entity and atomic unit keys and determining which shard(s) to store and retrieve data
  • Automatically handle query parallelization to multiple shards
  • Operate on a shard key / atomic unit model to ensure related data is stored within the same shard
  • Provide a sound foundation for supporting sharding features that will be introduced into SQL Database which will enable sharding applications to easily migrate to the new features in SQL Database when they are available by simply changing the implementation of the sharding library.

Following patterns utilized elsewhere in this guidance, this guidance is prescriptive, avoiding being constrained to a specific implementation, but a further refinement will be given in section 5 which will describe in slightly more detail a potential implementation.

4.2 The ADO.NET Based Sharding Library

This is a set of classes that implement the ADO.NET data set and data adapter patterns and implement the sharding patterns defined in section 3. The ADO.NET based sharding library will utilize the sharding metabase directory to gain the structure of the data, and the strategies to take for sharding on behalf of the application.

The sharding library then provides to the application a set of classes (and objects) that implements an ADO.NET like experience while the implementation handles the details of sharding. The sharding library provides two primary classes to the application, the connection class and the data adapter.

4.2.1 Connection class

The connection class will not be given a standard connection string, but information on how to load the shard metabase directory. It will read this information and understand how to manage connections to the various shards, how to work with the data adapters to construct and distribute queries, and how to return results to the application through the defined strategies.

4.2.2 Sharded Data Adapter

Data manipulation in a sharded application is inherently disconnected and therefore manipulated only through the data adapter classes. Also, when using a shard metabase directory there is no need for insert, delete, update and select command objects as the shard metabase directory entity descriptions provide this information and therefore it can be inferred and used for optimization.

New entities and atomic units can be stored in the shards by inserting new rows into data sets and tables representing the entities in the metabase. These tables can be constructed from the connection object or data adapters and have the appropriate structures put in place from the metadata. Upon issuing the update statement on the data adapter for a given data set or table, the data adapter will use the defined policies to select the shards to store the entities and atomic units as well as generate identifiers for the atomic units.

4.3 Usage Concepts

The following explains several of the common tasks when using the sharding library.

4.3.1 Creating Connections

The connection class is implemented to be shard aware. During instantiation it is provided a reference to the shard metabase directory (instead of a connection string) that it reads and uses to provide connection management to the sharding process. Connections are not explicitly managed by the application as the connection class is coded to optimize connection management across all of the shards on behalf of the application.

4.3.2 Data Adapters

All data manipulation in the application is done through the use shard aware data adapter objects. These data adapters are provided information from the sharded metabase directory by the connection object, and use this information to construct queries and execute strategies for data manipulation across the shards. The shard aware data adapters do not have Delete, Insert, Update and Select command properties as the T-SQL for these can be inferred from the shard metabase directory.

4.3.3 Data Manipulation

Data can be queried by constructing an expression tree and passing that expression to the data adapter, where the data adapter will parse the expression tree and generate the appropriate T-SQL and distribute it to the correct shards. Basic queries only work against a single entity. The data adapter class also provides methods that can be used to easily generate basic expression trees that are used to apply expressions to entity properties.

A useful case for a basic query is finding the shard key of an atomic unit by comparison of several of the primary entities properties to values in order to return a single entity along with the shard key for the containing atomic unit. Another case is in retrieving related entities based upon the shard key from the primary entity in the atomic unit.

Inserts are performed by requesting the data adapter to create a DataTable for a specific entity, and then by following the standard ADO.NET pattern for inserting new rows, by creating a new row, adding it to the table, setting its column values, and then performing an Update of the DataTable against the data adapter. The implementation of the Update method will consult the various strategies for the table and perform the appropriate actions against the required shards.

Updates are performed by first using the data adapter to fill a DataSet with result values from a basic query, which will populate the DataTable for the appropriate entity with the result values. The DataTable and data adapter will track both the originating shard for the rows as well as changes to any properties. Upon execution of the Update method on the data adapter, all rows that have changes will be identified and appropriate T-SQL delivered to only the specific shard containing the entity.

Deletes can be performed in two manners. The first is by retrieving a set of records via a basic query, then removing the rows from the DataTable, and executing the Update method in the data adapter. The data adapter will then deliver the appropriate T-SQL to delete those rows to the specific shard storing those rows. The second is an optimization provided by the data adapter that allows the creation of a basic expression tree (as in the basic query scenario) and then passing the expression to a method of the shard aware data adapter that will construct the delete T-SQL and deliver it to the appropriate shard(s).

4.3.4 Complex Queries (Joins, grouping, aggregates, …)

The ADO.NET based sharding library provides a shard aware implementation of two variations of a complex query command that can be used to execute free form T-SQL against the shards. Results from these queries will be returned as part of a Fill method call on the data adapter into a DataSet provided by the application. Rows returned by this process can be manipulated but will not be tracked and hence cannot be used to update, delete or add new rows. These command objects have a specific use in performing joins and groupings to drive calculated results or grouped data.

The first form of command object constrains the query to execute only within the context of a single atomic unit by providing the shard key of the atomic unit. This will send the query as-is to only that specific shard for execution. This is useful for performing joins within the atomic unit to derive meaning from the relationships. An example would be to retrieve the amount of products that are part of an order for a specific customer.

The second form will execute the T-SQL against all shards that implement a specific atomic unit’s entities. The results from each shard will be merged together into an aggregate result set. For example, if a SUM statement is executed against 10 shards, 10 rows will be returned. It is the responsibility of the application to then perform the higher-level aggregation of the data. This is useful to use when trying to determine results against all atomic units stored in all shards.

4.3.5 Transactions

Transactions are supported with the standard ADO.NET transaction coding patterns, but should only be supported against data sets that are results from a single atomic unit. This enforces the policy that transactions only work within a single shard.

4.3.6 Bulk data insert

Bulk data insertion into a sharded set of databases requires the creation of a custom application that reads the input data into data tables created from a connection and data adapter configured from the shard metadata directory.

4.3.7 Reporting

Reporting is complicated by sharding because of the distribution of data across multiple shards, and by the fact that a non-shard aware reporting system does not know how to retrieve data that is distributed across multiple shards. To facilitate sharding, the ADO.NET based sharding library can be used to define a reporting façade that can execute complex queries against the set of shards and return the results to the reporting system.


5 Applying the Prescribed ADO.NET Based Sharding Library

The following are examples of applying the prescribed ADO.NET based sharding library to the multi-master archetype described earlier in the guidance.

For demonstration purposes a common scenario will be applied and is one that is common to many social networking sites: A user of the system can create status updates and add photos into their account to share with others. The data storage needs to be scalable as it is expected that the popularity of the site will increase to many millions of users.

This is an excellent use case for a sharded data solution as the data can be partitioned and scaled in several different ways, as will be demonstrated, with the scenario being applied to three different sets of strategies, each providing a different distribution of data through the shards.

Common to all the different implementations are several tasks that are to be undertaken. The first of these is to define the entities that will be stored in the database shards. For purposes of these scenarios the following will be the entities utilized:

Entity

Attributes

Data Type

User

User ID

Big Integer

 

E-mail Address

Nvarchar(50)

 

Password

Nvarchar(50)

 

 

 

Status

Status ID

Big Integer

 

Text

Nvarchar(142)

 

Time

Date / Time

 

 

 

Photo

Photo ID

Big Integer

 

Blob URL

Nvarchar(250)

The next task is to create the shards that will be used by the application to store the entities. Each shard is simply a SQL Database, and these can be created in several manners including using the SQL Database Administrative portal or DDL. For our example we will assume that two shards are created and that the connection strings for each are as follows:

Shard ID

Connection String

0

Server=tcp:xxxxxx.database.windows.net;Database=Shard1;User ID=adminuser@ xxxxxx;Password=myPassword;Trusted_Connection=False;Encrypt=True;

1

Server=tcp:xxxxxx.database.windows.net;Database=Shard2;User ID=adminuser@ xxxxxx;Password=myPassword;Trusted_Connection=False;Encrypt=True;

(note, actual server names obfuscated).

To keep the solution simple, we will use two virtual shards, with virtual shard 0 mapped to physical shard 0, and virtual shard 1 mapped to physical shard 1.

Next to be completed is the definition of the atomic unit and the shard key. In these examples a single atomic unit containing all three entities will be created. The atomic unit will have the UserID property of the User entity defined as the key for that entity as well as being the shard key of the atomic unit.

When creating the tables that represent these entities a column will need to be added that contains the shard key and it should be defined as a foreign key to the UserID property of the user entity.

Once the entities, atomic unit, and shard key are defined, it is possible to create the database objects. From this structure the following T-SQL can be executed in both shards to create the database objects:

CREATE TABLE [dbo].[User](
    [UserID] [bigint] NOT NULL,
    [EmailAddress] [nvarchar](50) NOT NULL,
    [Password] [nvarchar](50) NOT NULL
  CONSTRAINT [PK_User] PRIMARY KEY CLUSTERED 
(
    [UserID] ASC
 )WITH (STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF)) 
 
CREATE TABLE [dbo].[Status](
    [StatusID] [bigint] NOT NULL,
    [Text] [nvarchar](142) NOT NULL,
    [UserID] [bigint] NOT NULL
  CONSTRAINT [PK_Status] PRIMARY KEY CLUSTERED 
(
    [StatusID] ASC
 )WITH (STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF))
 
ALTER TABLE [dbo].[Status]  WITH CHECK ADD  CONSTRAINT [FK_Status_User] FOREIGN KEY([UserID])
 REFERENCES [dbo].[User] ([UserID])
 
CREATE TABLE [dbo].[Photo](
    [PhotoID] [bigint] NOT NULL,
    [BlobURL] [nvarchar](255) NOT NULL,
    [UserID] [bigint] NOT NULL
  CONSTRAINT [PK_Photo] PRIMARY KEY CLUSTERED 
(
    [PhotoID] ASC
 )WITH (STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF))
 
ALTER TABLE [dbo].[Photo]  WITH CHECK ADD  CONSTRAINT [FK_Photo_User] FOREIGN KEY([UserID])
 REFERENCES [dbo].[User] ([UserID])

At this point the physical structure is in place to support the sharding needed by the application. In a real world scenario, details of this configuration would also be stored in a configuration file or available via a web service for the connection object to be able to load and gain an understanding of the sharding structure. This information would also contain the sharding strategies, but these differ by scenario and will be covered in subsequent sections.

The application would use the sharding aware connection object to load the configuration information from the configuration file, a web service, or a database. This information contains the entity, atomic unit, shard key, and all strategies defined for managing the sharding. The sharding aware connection object has the information needed to create the various other ADO.NET objects required to perform the sharding, primarily the shard aware data adapter, DataSet and DataTable objects that are configured by the application to support the sharding definition given in the configuration.

The next step is to insert one or more primary entities in the empty shards. This needs to be done first since the UserID for the User entity is the shard key, and it will be needed to store status and photo entities. In the case of an application key generation strategy, those ID’s will simply be those created by the application, but in the case of keys generated by a sharding library, the value in these columns will be updated to be the shard assigned keys.

To store an initial user entity, the application creates an instance of the sharding aware data adapter passing its constructor the connection object whereby the data adapter can refer to the sharding metadata in the connection object to know how to adapt data. The application then makes a method call on the data adapter object that returns a DataSet object containing a data table representing the User object.

This data set can be used as normal in ADO.NET to insert new users into the database by creating new row objects and performing an update call on the data adapter. The data adapter, using the strategies contained in the shard metadata, will then distribute the User entities to the appropriate shards based upon the defined selection strategy, and if using system generated keys it will generate keys for the new rows that can be used as the shard key.

Related entities (photo, status) can then be inserted similarly by creating new rows in the respective tables, with one caveat. Each row will require having the shard key column set to be the value of the related user ID. Upon performing an update on the data adapter, this value will be used to determine the appropriate shard to store the new entities by using it as the shard key.

5.1.1 Round-robin storage of new entities

This technique focuses on distributing data evenly across the shards using a round-robin process. It utilizes application-derived keys for the entities. Lookup of the primary entity is by attribute and not by key; this is actually required by the round-robin selection strategy, as a mapping from keys to shards is not available with the round-robin process.

The technique for key generation will be that the application will use its own process of generating keys for the entities. Since all keys are application defined, and the strategy for selection will be round-robin then no distribution (such as hashing) function need be defined. The strategy definitions for this are:

Strategy

Technique

Reasoning

Key Generation

Application

Application wants control of the ID’s allocated to the entities and atomic units.

Access

Parallel

Many queries will need to be distributed to all shards since the selection strategy is round-robin and which shard to use cannot be determined from the ID.

Selection

Round-robin

Round-robin writing new entities through the virtual shards to evenly distribute the data across the shards.

Resolution

All

All shards will be queried for data since we are doing a round-robin selection of the shard for new entities which means we will not know which shard any entity is stored in.

Combine

All

Queries for a specific entity should only come back from a single shard, but in the case of queries returning entities for multiple users we would like to combine all the results.

The process of inserting data follows the same process as is defined in the introduction. The sharding aware connection object will be instantiated, and a sharding aware data adapter created using that connection object, and then a data set created via the data adapter.

In this scenario we would want to create two user entities and store them in the shards to utilize the round-robin capabilities. This can be done by creating two new rows in the User DataTable, by setting the column values appropriately, and then performing an Update call on the data adapter. The data adapter will then rotate through the available shards storing each new User table row in the next shard of the round-robin.

Insertion of new Status and Photo entities can be done by creating new rows in those DataTable’s, and by setting the UserID shard key column to the proper value which is returned from the User entity inserts. When performing the update, the sharding aware adapter object will use this information to determine the shard that the related user entity is stored within and write those photos or statuses to that shard.

5.1.2 Automatic key generation, hashed distribution

This scenario varies from the one in section 5.1.1 in that the sharding library will generate keys, and that an ID hash strategy will be used to determine which shard is used to store new user entities. The strategy configuration is as follows:

Strategy

Technique

Reasoning

Key Generation

Sequential Big Integer

Min: 0, Max: Max Big Integer

Access

Sequential

The shard can be determined by using the key and ID hashing algorithm to identify the specific virtual shard, there is no reason for parallel access.

Selection

ID Hash

ID’s will be modulo 2 (for the two virtual shards)

Resolution

ID Based

Apply the ID hash to the ID to determine the virtual shard.

Combine

First

There should only be one returned, just take the first.

 

Shard keys will be generated by the library using a sequential allocation scheme that ensures unique keys across the shards and entities. The code to work with entities would be the same as in section 5.1.1 except that it does not need to specify the ID fields for the new entities. The system will allocate the ID’s, and upon return from the Update those fields will have been set by the data adapter. In this case when writing the first two entities the ID’s would be 0 and 1, and the first entity will be written to shard 0 and the second to shard 1.

5.1.3 Automatic key generation and sharding by ranges

This scenario modifies the one in section 5.1.2 by making shard selection based upon ranges of shard keys assigned to the virtual shards. Each virtual shard will be configured to use a range of IDs, for this example virtual shard 0 will have ID’s 0-99, and virtual shard 1 will have ID’s 100-199.

The configuration strategy differs from section 5.1.2 by the selection, resolution and combine strategies:

Strategy

Technique

Reasoning

Selection

Range Based

Virtual Shard1 -> 0 .. 99

Virtual Shard2 -> 100 .. 199

Resolution

ID Based

Use the selection strategy if needed to find the shard.

Combine

First

Data should only be coming back from one shard.

Coding is the same as in section 5.1.2, except storage will be different in that if 200 users are written to the shards, the first 100 (with ID’s 0-99) would be written to shard 0, and the second 100 written to shard 1.


6 SQL Database Federations

Direct support for sharding will be added to SQL Database via the addition of SQL Database Federations. A federation is a loosely coupled collection of shards that contain the data that makes up the sharded database. Each shard or federation member is implemented as its own SQL Database. There is also the notion of a federation root which contains metadata about the federation, its members, users, etc. and which serves as the basis for connections and operations on the members. Federation members are SQL Databases.

The following schematic outlines the concepts in SQL Database with Federations:


Figure 3: SQL Database Federations

The root database functions as the landing point for the federation. Applications connect as normal to the root database, and can manipulate objects in the root database as normal objects, but can also issue new T-SQL commands that re-route the connection to individual shards based upon the information defined within the federation scheme that will be managed by SQL Database as the shards are split and merged. This pattern of connecting to the root database and then selecting a shard based upon federation key provide a means within the fabric to reroute requests dynamically to exactly the shard where the data is located, which allows for both shards to be added and dropped on the fly, as well as to use only a single connection from the application instead of one for each shard.

A federation scheme defines the structure of how data will be federated (sharded) across one or more logical SQL Databases. This information defines the federation key(s), the data types of the key(s), and the distribution type of the key(s). There is exactly one federation scheme per federation. A root database may contain more than one federation, which is useful for defining multiple shard federations that use different federation keys.

Each federation will consist of one or more federation members. A federation member exists for several reasons. First, it defines the unit of mapping of data to physical SQL Databases within the shard federation. Each federation member will be implemented by one SQL Database.

A federation member will contain one or more tables that have a federation key relationship. Records for each table that share a common federation key will always be stored within the same federation member and are referred to as a federation unit (analogous to SQL Database Shards for ADO.NET’s atomic unit). Federation units cannot be further subdivided, and this ensures related data (based upon federation key) are always stored within the same physical database. Each federation member will hold data for one or more federation keys up to the capacity of the physical database instance.

Federation members may also contain non-federated tables, which are simply normal tables that are not annotated to have a relationship with the federation key. These tables can be used for reference data that can be used in assistance of look up of federated data as well as mapping of code-based fields to user-friendly strings.

Specifically which records are stored within a federation unit is defined by the value of the federation key, and by a range of values for that key which the federation unit stores. Upon creation of a new federation in the root database, a minimum and maximum value will be defined for the federation key and a single federation member will be created to support the federation, and all key values from the minimum to maximum values will be stored within this single federation unit.

A federation member can then be “split” into two federation members using a split T-SQL command, at which point an additional physical database will be created and have data entities moved from the original physical database based upon the parameters for the split.

The end result of the split is that the single physical database is sharded into two physical databases with the data spread across both. During the split, SQL Database will ensure that the view of the federation unit being split is consistent until the split is complete, thereby allowing data to be further sharded while the application remains online.

It is also intended that the system will eventually support a merge command that will combine two federation units into a single federation unit, collapsing the number of physical databases in use so as to save costs. Like the split command, the system will ensure the view of the two merging shards is consistent until the merge is complete.

This dynamic adding and removing of new shards to the federation can be done on command via T-SQL upon the user or application analyzing operational data stored in the federation schema in the root database. It is expected that this will be automated in SQL Database over its evolution.

Applications built for sharding to should be designed with these features in mind, providing a similar shard / federated key model for sharding that will be inherently provided by new functionality added to SQL Database. By following the patterns prescribed earlier in this guidance, an application will be able to immediately make use of SQL Database with Federations when it is available.


7 Summary

This guidance has discussed the means of performing sharding with SQL Database as it is today, along with guidance on building an ADO.NET based sharding library that implements recommended practices for sharding with SQL Database.

Also presented is information showing how sharding features are going to be built directly into SQL Database, and how the design of the ADO.NET based sharding library will provide a head start on sharding with SQL Database today and provide a migration path when those features are available in SQL Database.

In order to be brief and concise, the guidance has remained prescriptive, and certain concepts such as rebalancing, schema management, adding and removing of shards have only been given a cursory mention. Further detail will be provided in future guidance on these and all other topics mentioned.

To keep up to date on new sharding guidance with SQL Database and the new federation features, go to our blog at: http://blogs.msdn.com/sqlazure/


See Also

The author of the original information contained in this article is Michael Heydt, SunGard Consulting Services.
Sort by: Published Date | Most Recent | Most Useful
Comments
  • Would be really hard to find a worse written article => after spending good hour reading it, I'm only more confused, why Oracle supports horizontal scale-out and MS not. Please rewrite this stuff, try make it more readable and bring something of added value, rather trying to demonstrate own academical skills. Thank you, Andrej

  • Ursine Andrej, I invite you to reply with some specific guidance on which you'd like clarity.  I'll help shepherd your suggestions to resolution, hopefully to your satisfaction.  My alias is jimmymay, & you'll find me @microsoft.com. Thanks for the opportunity to be of service.

  • Ursine Andrej, I invite you to reply with some specific guidance on which you'd like clarity.  I'll help shepherd your suggestions to resolution, hopefully to your satisfaction.  My alias is jimmymay, & you'll find me @microsoft.com. Thanks for the opportunity to be of service.

Page 1 of 1 (3 items)