Microsoft recently transformed the strategic direction of the company by expanding its already extensive software products and cloud services capabilities as part of the process of establishing itself as a major cloud services provider. Microsoft Services describes key elements of that paradigm change — including actions taken to overcome specific challenges — in this white paper.

Authors:
Angus Foreman, Architect, Microsoft Services
David Howe, Director, Architecture Services, Microsoft Services
Ulrich Homann, Chief Architect, Worldwide Enterprise Services, Microsoft Services


Note:
The official version of this document is version 1.


1 Strategic Change: Microsoft Reinvents Itself


Strategic Change:
• What we did  
• Why we did it
• How we did it


Will you initiate a similar transformation?

Microsoft learned a key lesson when it embraced cloud computing: such a transition requires far more effort than just adopting new technology. A successful transition requires a broad analysis of an enterprise’s IT ecosystem and a holistic, phased approach to identifying and solving interrelated sets of organizational, architectural, and design challenges. Enterprises that face similar challenges can gain insight from the Microsoft experience as they plan how to anticipate and surmount potential obstacles in their own efforts to deliver cloud services to customers.

Many similarities exist between Microsoft and its enterprise-size customers. Many global enterprises offer large portfolios of products to their customers, make significant investments in information technology (including, for some companies, software development), and must improve products while maintaining existing assets to remain competitive. An increasing number of enterprises, whatever the nature of the products or services that they offer, encounter business drivers that compel them to consider moving to a cloud-based service provider model. Such enterprises can speed their own transformation to cloud computing by taking advantage of the extensive knowledge that Microsoft gained when it implemented change (at the levels of organization, architecture, and design) in its successful journey to the cloud.

Architectural principles Microsoft developed to re-engineer itself as a major cloud services provider

Characteristics of cloud computing include scalability, consumption-based pricing, and predictable levels of service availability. These characteristics can benefit both the service consumer and the service provider. For example, the consumer needs to understand the cost of the services they use and, at the same time, expects predictable service delivery based on a contract. To fulfill such a contract successfully, the provider learns to deliver the service within defined parameters. The effort to fundamentally transform Microsoft from a company that was an early provider of services in the cloud (such as MSN® and Hotmail®) to a major cloud services provider began with a consideration of additional commercial and technical opportunities offered by cloud computing.

The first step was to develop a systematic approach to identify and categorize high-level goals, and then to develop strategies to manage the associated challenges. Microsoft developed the following fundamental principles as part of its effort to transform itself into a cloud services provider:

  • Strive for continual service improvement. Pursue continual improvement in an IT service — including any cloud service — and measure success on an ongoing basis by determining how well technological advancements deliver business value.
  • Deliver the perception of infinite capacity. From the consumer’s perspective, an IT service should provide capacity on demand, limited only by the amount of capacity for which they are willing to pay.
  • Deliver the perception of continuous service availability. From the consumer’s perspective, an IT service should be available on demand from anywhere and on any device.
  • Take a service provider approach. The provider of any IT service should think and act as if they are running a service provider business rather than an IT department within an enterprise.
  • Optimize use of available resources. An IT service should automatically use all resources efficiently and effectively.
  • Take a holistic approach to availability design. The availability design for an IT service should involve all layers of the stack, should employ resilience wherever possible, and should remove redundant systems whenever possible (such as when resiliency makes redundancy unnecessary).
  • Minimize human involvement. The day-to-day operations of an IT service should have minimal human involvement to reduce the chance of human error.
  • Make service attributes predictable. An IT service’s attributes must be predictable, because the consumer expects consistency in the quality and functionality of the services they consume.
  • Provide incentives that change behavior. A provider will be more successful in meeting business objectives if it defines the services that it offers in a way that motivates desired behavior from the service consumer.
  • Create a seamless user experience. Consumers of an IT service should not encounter anything that disrupts their use of the service as a result of crossing a service provider boundary.

You can use these principles, which Microsoft developed to achieve its own transformation, to help direct the introduction of cloud computing into your organization. For an in-depth look at these principles, refer to the article Private Cloud Principles, Patterns, and Concepts.

2 Why Shift into Cloud Services?

Rapid change requires rapid innovation. The ever-escalating rate of technological change demands a commensurate aptitude for innovating at an accelerated pace. To remain profitable, an enterprise must anticipate changes and address the resulting challenges with agility, growing leaner yet offering higher quality products.

Improving products, although prerequisite, is not sufficient. As a technology company, Microsoft realized that it must deliver capabilities as services in addition to delivering packaged products. Recognizing this imperative required a rapid but well-organized and ongoing shift from its traditional software product-based business model to a new role as a service provider.

Business drivers experienced by Microsoft

Microsoft now delivers numerous capabilities as cloud services in response to a growing demand for such services. Increasing numbers of enterprises need to invest more of their time and money in their core business assets and less time and money supporting large IT departments. Increasing numbers of large enterprises look to service providers such as Microsoft to create, provide, and operate services that are commonly required by large enterprises: server and data platforms, integration capabilities, and communication and collaboration workloads (including email, instant messaging, scheduling, and portals).


Seismic Shift
Computing is undergoing a seismic shift from client/server to the cloud,
a shift similar in importance and impact to the transition from mainframe to client/server.
[Source: The Economics of the Cloud, Microsoft, November 2010]


An additional change that drove the Microsoft transformation to cloud services provider was the advent of new commercial agreements introduced by cloud computing. Consumers who buy services expect to pay for units of consumption, which, in turn, requires that the unit cost be predictable to both the consumer and to the provider. Achieving this new consumption-based business model was, for Microsoft, a central aspect of its shift in strategy away from a traditional product-based business model. Accomplishing this shift required considerable investment and advance preparation to ensure that it was done successfully.

A consumption-based pricing model for a service typically brings with it an assumption that, whenever demand grows for the service, a corresponding capacity to meet that increased demand is also available. Similarly, when demand recedes, the customer consumes less and expects a corresponding reduction in capacity. This model drives the development of an exceptionally flexible service delivery capability that must be supported by investments in appropriate technology. These investments can significantly affect revenue streams and therefore require astute risk management efforts.
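The arithmetic behind this model can be illustrated with a minimal sketch. The function names, the unit price, and the units-per-server figure are hypothetical values chosen only for illustration; they are not Microsoft prices or capacity figures.

```python
from decimal import Decimal


def monthly_charge(units_consumed, unit_price):
    """Charge = units consumed x price per unit (hypothetical billing rule)."""
    return Decimal(units_consumed) * Decimal(unit_price)


def servers_needed(units_consumed, units_per_server):
    """Capacity that tracks demand, rounded up to whole servers."""
    return -(-units_consumed // units_per_server)  # ceiling division


# Demand grows from 900 to 1,200 units; charge and capacity follow it.
for units in (900, 1200):
    print(units, monthly_charge(units, "4.00"), servers_needed(units, 500))
```

Because both the charge and the provisioned capacity are functions of the same consumption figure, the unit cost stays predictable for the consumer and the provider as demand rises or recedes.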

Technical innovation, and the increasing commoditization of IT resources that this innovation enables, requires adaptation to remain competitive but also introduces new opportunities to gain a competitive edge. Remaining competitive — or increasing competitiveness — in this kind of market environment demands exceptional leadership, focus, and investment.

Factors associated with business drivers

The following figure identifies the main drivers that have influenced the direction Microsoft has taken and shows related factors, including actions the company has taken to achieve goals developed in response to these drivers.


Figure 1: Benefits dependency network

The diagram represents a benefits dependency network, a graphical representation of business drivers and dependent elements. This technique is often used by Microsoft Services when working with customers to help them understand how to drive transformational or strategic change.

3 How Can Evolutionary Steps Achieve Revolutionary Change?

Making the revolutionary transition from marketing traditional IT products (designed for use in dedicated enterprise environments) to delivering multiple technologies as cloud services requires, paradoxically, an evolutionary approach. A phased approach enables the gradual implementation of the attributes that are required to introduce cloud services, and it allows incremental investment over time.

Example: Software as a Service (SaaS) on a global scale

Microsoft® Office 365, one of the largest software as a service (SaaS) offerings currently available on a global scale, is an example of investment in change at the level of application architecture. Office 365 provides email, collaboration, and communication workloads as a cloud service; it combines Microsoft Office desktop applications with Microsoft SharePoint® Online, Microsoft Exchange Online, and Microsoft Lync™ Online.

Developing Office 365 was one of the first Microsoft initiatives to transform a major set of traditional shrink-wrapped, boxed enterprise software products into an optimized cloud service. One of the architectural principles (described earlier) that this project demonstrates is Strive for continual service improvement, which was achieved using the following phased approach:

  • Phase 1: Learn from a limited test run. In the first phase, Microsoft gained a detailed understanding of the business side of delivering online services by running its on-premises server software much like any hosting company does. For example, the project introduced commercial agreements that were similar to those used in hosting companies, and it introduced highly standardized dedicated deployments in its data centers. This early cloud offering, offered only to limited markets, was called Microsoft Business Productivity Online Suite (BPOS).
  • Phase 2: Build a new delivery infrastructure. In the second phase, Microsoft built an online service delivery infrastructure to control and automate common tasks, such as provisioning and syndication. This phase allowed the company to scale the BPOS service to a larger number of customers and to a wider geographic market.
  • Phase 3: Deliver Office 365 to the world. In the third phase, which is marked by the introduction of Office 365 (the successor to BPOS), individual workloads (email, collaboration, and so on) are delivered together as a suite of cloud-optimized applications. The Office 365 online service delivery platform has been improved to such a degree that Microsoft can now also offer it as a shared platform to small businesses and educational institutions. As of the publication date of this paper, Office 365 represents Phase 3 of the evolution and exemplifies the principle Strive for continual service improvement.

This evolutionary approach allowed Microsoft to achieve the following goals:

  • Respond to new and existing markets. Deliver new, increasingly efficient, and progressively more effective services while sustaining support for existing products.
  • Establish reusable patterns. Identify standard patterns that facilitate and support the transformation of IT capabilities to cloud services.

Example: Making the Microsoft experience available to a global audience

Microsoft Services worked with each of the Microsoft product groups that helped engineer the evolutionary transformation to Office 365 to collect, consolidate, and organize information derived from first-hand experiences of their members. Any enterprise currently considering a similar shift toward delivering cloud services either internally or to their customers can take advantage of the expertise captured and documented by Microsoft Services.

This paper provides a preliminary introduction to the set of architectural principles that Microsoft used to guide technological change at a fundamental level. It also presents specific instances of the implementation of those principles in each “Mini Case Study” section later in the paper.

The Microsoft Services team also investigated the architectural changes required to create a dynamic infrastructure service that is the essential foundation for any cloud service. As one part of this effort, Microsoft Services researched the experiences of Global Foundation Services (GFS), the Microsoft group that provides much of the infrastructure on which Microsoft runs cloud services. Again, the goal of Microsoft Services is to develop a knowledge base that is useful to other global enterprises that want to enter the cloud service delivery market at an accelerated rate but without incurring excessive risk.

4 Synopsis: Re-Engineering Microsoft as a Cloud Services Provider

Making a strategic commitment to a technological innovation as elemental as the move from shrink-wrapped software purveyor to cloud services provider required Microsoft to make organizational changes, which set the context for architectural changes, which, in turn, drove changes in design.


Figure 2: Types of structural change required to achieve technological innovation

The following table summarizes the changes depicted in the preceding figure.

Structural area: Organizational Change
Description: When it adopted the cloud model, Microsoft modified its organizational structure to address the need for alternative models of control and governance. Organizational change is common at Microsoft; however, many recent changes specifically reflect shifting priorities driven by the many ways in which implementing the cloud changes the business.

Microsoft enterprise customers who incorporate cloud computing into their IT strategy will likely also encounter a need for organizational restructuring. For information about how Microsoft approached cloud-driven organizational change, see the section “Mini Case Study: Organizational Change Throughout Microsoft” later in this paper.

Structural area: Architectural Change
Description: When it adopted the cloud model, Microsoft recognized that it must alter important existing enterprise products in novel and unanticipated ways. Millions of customers depend on these products to deliver well-established end-user functionality, so it was essential to plan a new architecture for familiar functionality while ensuring that it continues to work as expected. Examples of emerging new requirements not anticipated in the original architecture included delivering these products as services at a much larger scale, to far more diverse user communities, or across more extensive geographies or networks.

As the adoption of cloud computing intensifies worldwide, many Microsoft enterprise customers can also expect to encounter the necessity to change existing architecture. For information about how Microsoft approached cloud-driven architectural change, see the section “Mini Case Study: Architectural Change at Microsoft Exchange” later in this paper.

Structural area: Design Change
Description: When it adopted the cloud model, Microsoft quickly determined that it must make multiple design changes to the code base of many established software products. These design changes would enable these products to provide the platform upon which Microsoft delivers innovative future services to its customers, and they had to be made within the context of competing resources, time pressures, and the intrinsically demanding development of very large-scale and complex software.

Microsoft enterprise customers who decide to support cloud services will face pressures similar to those experienced by Microsoft when implementing design change in the context of resource, time, and complexity challenges. For information about how Microsoft approached design changes to code, including background information on how such changes were managed, see the section “Mini Case Study: Design Change at Microsoft SQL Server” later in this paper.

Microsoft strategic plans include responding, on an ongoing basis, to any new requirements that cloud computing brings. As with recent cloud-based innovation at Microsoft, the architectural principles described earlier in this paper provide a general framework that can guide decisions about how to implement specific organizational, architectural, and design changes. Any large enterprise about to embark on its own effort to reshape the next generation of its services to incorporate cloud computing can make use of these general principles and can gain insight from recent Microsoft experience.

5 Mini Case Study: Organizational Change Throughout Microsoft

The recent Microsoft decision to restructure its organization to align with the Take a service provider approach principle signaled a profound change in the direction of the company. Ideally, an organization would make all changes described in this paper simultaneously to fully embrace the change in direction to support cloud computing. In reality, however, such an enormous shift in strategy requires extensive planning, a phased approach, and, most importantly, leadership commitment from the highest levels and throughout the organizational structure.

Make success achievable

To achieve success, it is important that each person within Microsoft participate in the company’s commitment to the new strategy. In March 2010, in a public speech at the University of Washington, Microsoft CEO Steve Ballmer announced that everyone within the company already understood the change: “For the cloud, we’re all in.” Although Microsoft began implementing the new strategy within the company much earlier than March 2010, the confidence projected by that public statement reinforced the commitment to change. Microsoft chose to focus every element of its business on ensuring its success in the cloud. This change in direction involved an explicit decision to put into practice within the company the following principles: Take a service provider approach (to customers), and Provide incentives that change behavior (in this case, to motivate employees to contribute to achieving the new goals).

One identifier of exceptional leadership is the ability to set objectives that are achievable and realistic, and a parallel recognition that a program of poorly planned, hasty change is apt to be unsuccessful. For Microsoft, attempting too large a change too quickly would affect existing commitments to customers who continue to use the traditional products. Therefore, defining attainable goals was — and is — an explicit part of the Microsoft approach as it reorients the company toward delivering cloud services.

Specifically, Microsoft leadership decided that change must occur incrementally over a period of software product releases. This approach maintains momentum towards achieving cloud computing while simultaneously meeting existing business demands. It also motivates employees to contribute actively by defining an achievable path to success.

Introduce ‘shared goals’ and ‘shared resources’ for key teams . . .

The most significant organizational change that Microsoft undertook in response to the opportunities and risks presented by adopting the cloud model was to create a unified business group that contains both software development teams (for individual products) and the teams that provision the services that use that software. In effect, Microsoft engineered an organizational structure in which the set of business goals that lead to successful delivery of cloud services are shared by the teams that control the development resources of those same services.

For an organization whose growth had been based on a model of releasing licensed software, this transition was significant because it required the teams that release traditional licensed products and those that build and maintain the current online cloud services to share resources.

. . . but establish clear ‘ownership boundaries’

As part of developing a service provider mindset, Microsoft had to identify relevant roles and responsibilities as well as establish boundaries between service elements. Consequently, Microsoft defined clear boundaries for multiple elements at various levels, which included identifying interface owners for individual components, identifying service components at a functional level (such as infrastructure), and identifying overall ownership for a delivered service.

Establishing ‘who is responsible for what’ and identifying available capabilities also help reduce duplication of effort and contribute to achieving a more efficient utilization of resources.

As part of establishing ownership boundaries, Microsoft realized that delivering large-scale cloud services to customers required a team whose role is to build a dynamic internal infrastructure that supports cloud services. This realization led Microsoft to form Global Foundation Services (GFS), a division whose role is to deliver a consistent and predictable cloud computing service. Initially, GFS developed a small number of prescribed infrastructure services to set the standard upon which to build more comprehensive cloud services.

At present, GFS delivers the infrastructure and foundational technologies for more than 200 online, live, and cloud services for Microsoft, including Bing™, MSN®, Business Online Services, Windows Live®, Entertainment and Devices, and Windows Azure™. GFS focuses on the infrastructure, online security, and environmental awareness considerations of operating services 7x24x365 around the world.

Identify relationships and dependencies

The following diagram highlights the relationships among drivers, objectives, and benefits of cloud-driven change at Microsoft as well as the resulting changes to the business and factors that enable these changes. For example, creating GFS to deliver a standardized core infrastructure capability with cloud attributes contributed to business growth by enabling the delivery, as services, of many existing server products and simultaneously increased customer satisfaction.


Figure 3: Align organizational change to goals and drivers

6 Mini Case Study: Architectural Change at Microsoft Exchange

Microsoft Services interviewed key members of the Microsoft Exchange Server product team as part of its effort to understand what kinds of architectural changes are required to transform a large, successful enterprise software product with a global customer base into a cloud service. Exchange Server was the first major Microsoft server product team to commit to identifying and overcoming challenges associated with the substantial product changes required to enable the delivery of server capabilities in the cloud. The team started by committing to develop the Live@edu service to offer Exchange messaging and communications capabilities to educational institutions at minimal or no cost.

The Exchange Server team takes on a formidable challenge

The challenge developers faced was how to take a feature-rich, highly configurable product designed for use in a single organization (which might, or might not, be geographically distributed) and extend it for use in a multi-tenant, globally available cloud service environment. The challenge grew even more difficult when demand for the newly developed cloud service (housed at Microsoft data center facilities and offered at a lower cost than a traditional installation) grew swiftly to tens of millions of users. For example, the Live@edu service currently hosts 45 million mailboxes [Source: Interview with Exchange Server development team, October 29, 2010]. Scaling to this size was not part of the design criteria at the time the product originated.

Techniques used by Exchange to implement cloud computing

The following sections summarize how the Exchange product team encountered, categorized, and solved multiple architectural challenges as they worked to transform traditional Exchange Server capabilities into a cloud service. The result is a cloud-based service that delivers novel, resilient, high-quality functionality to millions of users.

Specific examples of how the Exchange product team met and resolved some of these challenges are provided, including:

  • An analysis of how Internet latency requirements affected fundamental variables — such as networking and users — and required architectural changes to achieve functionality in the cloud.
  • An account of how unpredictable workloads and fluctuating consumption levels at an unprecedented scale required architectural changes to make cloud computing manageable.

6.1 Internet Latency — Now What?

When charged with delivering functionality as a cloud-based service, the Exchange product team quickly recognized that it could no longer develop software that depended on fast, reliable network connections between servers and between servers and clients. Internet latency (the time required for a signal to travel from one point on a network to another across the cloud) played a major role in causing a substantive architectural change, one that was required to make the service available in the cloud. Many clients that interact well with the traditional Exchange service are not tolerant of slower response times caused by Internet latency. This latency, when it occurs, negatively affects users’ perception of service availability and quality.

Resolving the Internet latency problem required rethinking the overall client/server relationship and architecture. The team realized immediately that achieving success mandated decoupling the user experience from the latency that is inherent in cloud computing. The following subsections describe some of the approaches used to achieve this decoupling.

6.1.1 Identify asynchronous transactions that do not need to ‘stop’ the user

One technique used to maintain a consistent and quick user response time is to adopt the assumptive behavior pattern. The Exchange product team recognized a pattern:

  • The system can assume that certain transactions done via the user interface (UI) are successful.
  • It is therefore possible to give the user immediate confirmation.
  • This confirmation enables the user to continue working without interruption.

For example, it is possible to return an immediate confirmation to the user for activities such as deleting email from a mailbox. The actual deletion transaction can occur asynchronously, and if a deletion occasionally fails, it is typically of little significance to the user. The email simply remains in the Inbox. This approach results in a far better user experience than forcing every user to wait each time a potentially lengthy synchronous transaction occurs.
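The deletion example can be sketched in a few lines of Python. This is a minimal illustration of the assumptive behavior pattern, not the Exchange implementation: the class names, the in-memory Mailbox, and the simulated failure set are all hypothetical.

```python
from collections import deque


class Mailbox:
    """Toy in-memory store standing in for a remote mailbox (hypothetical)."""

    def __init__(self, message_ids, fail_on=()):
        self.messages = set(message_ids)
        self.fail_on = set(fail_on)  # ids whose deletion fails (simulated)

    def delete(self, message_id):
        if message_id in self.fail_on:
            raise RuntimeError(f"could not delete {message_id}")
        self.messages.discard(message_id)


class AssumptiveClient:
    """Confirms a deletion to the user at once and applies it later."""

    def __init__(self, mailbox):
        self.mailbox = mailbox
        self.visible = set(mailbox.messages)  # what the UI currently shows
        self.pending = deque()                # deletions not yet applied

    def delete(self, message_id):
        # Assume success: update the view and return without waiting.
        self.visible.discard(message_id)
        self.pending.append(message_id)
        return "deleted"

    def flush(self):
        """Apply queued deletions; a real client would do this in the background."""
        failed = []
        while self.pending:
            message_id = self.pending.popleft()
            try:
                self.mailbox.delete(message_id)
            except RuntimeError:
                failed.append(message_id)
        # Resynchronize so a failed deletion simply reappears in the Inbox.
        self.visible = set(self.mailbox.messages)
        return failed
```

In this sketch the user never waits on the store: delete returns immediately, and a message whose deletion fails during flush is shown again the next time the view is synchronized.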

6.1.2 Load data first that the user expects to use immediately

Preemptive data loading is another pattern that the Exchange team implemented to achieve a better user experience. By monitoring the behavior of users who “opted in” to participate in service improvement, the team recognized that when a user logs on to a mailbox through the web UI, the user is likely to open one of the email messages positioned at the “top” of the Inbox. To support this typical behavior, when the UI loads, the system fetches and preloads into the UI a percentage of the messages at the top of the user’s Inbox. This preemptive data loading enables an “instant response” when the user clicks a message to open it.
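As a rough illustration of preemptive data loading, the following Python sketch preloads a fraction of the newest messages when the view is created. The class name, the fetch_body callable, and the preload_fraction parameter are assumptions made for this example; the source does not state which percentage the Exchange team used.

```python
class PreloadingInbox:
    """Fetches the bodies of the topmost messages before the user asks."""

    def __init__(self, fetch_body, message_ids, preload_fraction=0.2):
        self.fetch_body = fetch_body          # hypothetical slow call: id -> body
        self.message_ids = list(message_ids)  # ordered newest first
        self.cache = {}
        count = 0
        if self.message_ids:
            count = max(1, int(len(self.message_ids) * preload_fraction))
        for message_id in self.message_ids[:count]:
            self.cache[message_id] = fetch_body(message_id)

    def open(self, message_id):
        # Preloaded messages are served from the cache with no round trip.
        if message_id not in self.cache:
            self.cache[message_id] = self.fetch_body(message_id)
        return self.cache[message_id]
```

Opening a preloaded message costs no network round trip, while messages further down the list are fetched on demand and then cached.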

6.1.3 Maintain user functionality even during a failure

The decoupling of the UI from other elements of the Exchange service, which characterizes both of the preceding examples, also occurs deeper in the Exchange product architecture. In a broader scenario, such as a failure scenario, the use of code isolation (also referred to as code separation) throughout the Exchange cloud computing service lets the service maintain user functionality through the UI even when significant outages occur elsewhere in the system. To implement code isolation, developers separate the code for visible elements (the UI) from the rest of the code (which provides functionality not visible to the user).
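One way to picture this separation is a UI layer that talks to the rest of the system only through a narrow boundary and falls back to its last known data when that boundary fails. The Python sketch below is an assumption-laden illustration: the class names, the BackendUnavailable exception, and the "stale" flag are hypothetical and are not drawn from the Exchange code base.

```python
class BackendUnavailable(Exception):
    """Raised when the non-visible part of the service cannot respond."""


class MailBackend:
    """Stand-in for functionality that is not visible to the user."""

    def __init__(self):
        self.available = True
        self.folders = {"Inbox": ["Welcome", "Status report"]}

    def list_subjects(self, folder):
        if not self.available:
            raise BackendUnavailable(folder)
        return list(self.folders[folder])


class MailUI:
    """Visible layer, isolated from backend failures."""

    def __init__(self, backend):
        self.backend = backend
        self.last_known = {}  # most recent successful result per folder

    def render_folder(self, folder):
        try:
            subjects = self.backend.list_subjects(folder)
            self.last_known[folder] = subjects
            stale = False
        except BackendUnavailable:
            # Keep the UI usable by showing the last data that was received.
            subjects = self.last_known.get(folder, [])
            stale = True
        return {"folder": folder, "subjects": subjects, "stale": stale}
```

Because the UI code handles the failure at the boundary, an outage in the backend degrades the freshness of what the user sees rather than stopping the user from working.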

These examples, in which the product team identified Exchange-specific patterns that helped it solve technical issues arising from Internet latency, are instances of two more broadly applicable architectural principles: Take a holistic approach to availability design and Deliver the perception of continuous service availability.

Taking a holistic approach to availability design and delivering the perception of continuous service availability are also directly applicable to the following scenarios:

  • Moving an application from a traditional hosted environment to a shared environment.
  • Supporting an application from a single data center that is used by customers across disparate client environments, each with different network requirements (for example, networks that support an organization located in a single site, an international organization with offices around the globe, or an organization with a large mobile workforce).

6.2 Unpredictable Workloads and Consumption — On a Global Scale

Users of traditional Microsoft server software products, such as Exchange Server, work for an enterprise that purchases the product. Typically, members of an internal IT department manage and support the product. The IT department is, of course, familiar with various employee roles and requirements as well as with the policies and requirements of the organization. Thus, in a traditional scenario, the people who manage the software also have sufficient knowledge to implement some level of governance over what users can do with that software. Knowledge of typical behavior patterns within an organization allows that organization to design, manage, and tune enterprise software to perform as efficiently as possible for specific groups of users.

6.2.1 Quarantine an individual mailbox if it disrupts a shared service environment

By contrast, in a multi-tenant software-provisioning model, multiple organizations access software provided by an external organization (for example, by Microsoft cloud services). In such a scenario, the traditional relationship between an internal IT department and users who are fellow employees of the same organization no longer exists. In the context of email provided as a cloud service, the multi-tenant model can potentially lead to the introduction of many more instances of disruptive events (or “poison”) into the email system.

Occasionally, “poison” events are malicious. Typically, however, they arise because the service provider lacks knowledge of each tenant’s user behavior patterns and therefore cannot govern that behavior effectively. A shared service environment that includes a diverse mixture of usage profiles presents a management challenge because workload profiles fluctuate unpredictably, with little consistency in the peaks and valleys. Fluctuations occur because an individual tenant’s drivers are unknown to the service provider and because behavior patterns typical of that tenant might be very different from those of other tenants. For example, very large email messages sent within one organization to a very large number of individuals at “peak” times for the overall system can cause disruption across the shared service environment.

Microsoft quickly identified this type of initially unanticipated scenario as an area that required a new approach. The Exchange product team investigated various technical approaches to mitigate such risks. One appropriate response, which is based on the architectural principle Make service attributes predictable and which is beneficial both to the service provider and to consumers, is to implement a quarantine model. A quarantine model also supports the Minimize human involvement principle by automating the approach and handling the scenario dynamically.

In contrast to traditional Exchange Server (which has no “unit of consumption” because it is a purchased product, not a cloud service), Exchange Online defines an individual user’s mailbox to be the unit of consumption. A unit of consumption is the scope by which a service is consumed and charged (priced). Therefore, one technique to maintain consistent levels of service when a single mailbox causes a disruption across a multi-tenant cloud service is to quarantine the individual mailbox from which the service disruption originates.

A service can perform a quarantine action by using a workload profiling and management approach (described in the following subsection) to identify which workloads are affected and to determine which individual mailbox caused the reduction in service quality (referred to as degradation). The result might be a total quarantine of that mailbox, or it might be merely a throttling (restriction) of requests, resulting in a reduced level of service to that mailbox.
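
The choice between throttling and total quarantine can be illustrated with a minimal sketch. It is not the Exchange Online implementation; the class names, the "share of degraded load" measure, and both thresholds are hypothetical and chosen only to show the shape of the decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"
    THROTTLE = "throttle"      # restrict requests, reduced service level
    QUARANTINE = "quarantine"  # isolate the mailbox entirely


@dataclass
class MailboxLoad:
    mailbox_id: str
    # Fraction (0.0 to 1.0) of the degraded workload's load caused by this mailbox.
    share_of_degraded_load: float


# Hypothetical thresholds, for illustration only.
THROTTLE_SHARE = 0.20
QUARANTINE_SHARE = 0.50


def find_disruptive_mailbox(loads: list[MailboxLoad]) -> Optional[MailboxLoad]:
    """Return the mailbox contributing the most load, or None if no data exists."""
    return max(loads, key=lambda m: m.share_of_degraded_load, default=None)


def choose_action(load: MailboxLoad) -> Action:
    """Pick the mildest action that is expected to contain the disruption."""
    if load.share_of_degraded_load >= QUARANTINE_SHARE:
        return Action.QUARANTINE
    if load.share_of_degraded_load >= THROTTLE_SHARE:
        return Action.THROTTLE
    return Action.NONE
```

Because the mailbox is the unit of consumption, the action in this sketch is scoped to a single mailbox rather than to a tenant or a server.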

6.2.2 Profile workloads to better manage them in a multi-tenant cloud environment

Workload consumption profiles that vary from one tenant to another in a multi-tenant cloud environment present a considerable challenge. To maintain quality of service across the system, the Exchange product team needed to define more structured categories for the system’s combined workload (which, in the traditional version of Exchange Server, is divided into and profiled as individual transactions).

The team defined categories for a number of workloads based on logical activities. The purpose of the categorization was to distinguish one transaction from another and thus to allow for more accurate workload profiling. For example, based on profiling, the system can assign specific workloads relative priority levels; alternatively, the system can scale out to accommodate workloads more effectively.

When sporadic intense activity occurs in a specific workload, profiling enables the service to apply quarantine or throttling rules to (only) that workload to prevent it from affecting the quality of service for other workloads in the system. For Exchange Online, intermittent high-activity workloads might result from mailbox migrations or from mobile device synchronization.

The product team introduced into the system a workload ID that is tracked, just like transaction IDs, to facilitate workload profiling. The ability to track activity by workload ID enables monitoring of that activity and, therefore, also enables a subsequent dynamic reaction to rapidly changing consumption profiles.

The Exchange product team also implemented code separation so that the system can handle each workload independently. For example, this capability lets the system throttle resources with precision.
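
As a rough illustration of tracking by workload ID, the following sketch counts requests per workload and refuses only the workload that exceeds its budget. The class, the per-window budget model, and the workload names are assumptions made for this example; the paper does not describe how Exchange Online implements its tracking.

```python
from collections import defaultdict


class WorkloadTracker:
    """Counts requests per workload ID and throttles only the workload over budget."""

    def __init__(self, budgets: dict[str, int]):
        # Hypothetical model: maximum requests per time window, per workload ID.
        self.budgets = budgets
        self.counts = defaultdict(int)  # requests seen in the current window

    def admit(self, workload_id: str) -> bool:
        """Record a request; return False if this workload has exceeded its budget."""
        self.counts[workload_id] += 1
        budget = self.budgets.get(workload_id)
        # Workloads without a configured budget are never throttled here.
        return budget is None or self.counts[workload_id] <= budget

    def reset_window(self) -> None:
        """Start a new measurement window."""
        self.counts.clear()
```

In this sketch, a burst of mailbox migration requests would be refused once its budget is spent, while requests tagged with a different workload ID (for example, mobile device synchronization) continue to be admitted.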

Design changes such as these are Exchange-specific instances of the general architectural principle Strive for continual service improvement. These changes make it easier to manage resources for each workload in a multi-tenant cloud environment; they also make it easier to identify associated patterns that developers can use as a basis for making workload management more effective.

The workload profiling approach described in this section applies to applications that incorporate the following characteristics:

  • Significant variations in workload, such as applications that serve overlapping international audiences
  • Tasks that can consume significant resources, such as batch tasks, migrations, reporting, and so on

6.2.3 Combine patterns to mitigate fluctuating consumption

As described earlier, a technique that can help transform an existing server product into a cloud service is to identify and apply a pattern. In addition, the Exchange product team discovered that it is often useful to combine a specific pattern with one or more other patterns (or with other related components).

For example, the “tarpitting” pattern — established years ago to protect network resources from over-subscription during activity peaks — can also help protect a cloud service from abnormal activity. Tarpitting is suitable for handling either natural peaks of consumption or peaks caused by malicious activities. Effective use of tarpitting for cloud computing requires that it be combined with a number of other components and that it rely on a number of available services.

Tarpitting Pattern Components

Pattern Component | Description
Tarpitting | Initiates a temporary delay in processing requests, triggered by patterns that indicate unusually high activity, to preserve resources. Tarpitting reduces service quality but maintains availability.
System monitor | Captures activity in the system, helps identify (and sends notifications about) abnormal behavior, and acts as a reference for available resources.
Resource monitor | Identifies any available resources currently under stress by monitoring system monitor data and recommending (to the ‘single authority’) the application of a tarpit to help preserve remaining resources.
Single authority | Identifies an authority in the system that can make an informed decision on any flagged activity, taking into account knowledge gathered from the resource monitor about currently available resources.
Quarantine | An isolating action applied to requests (either specific or general) from service consumers for a specified time. Quarantine occurs after an activity is identified by the single authority to be negatively affecting the system.
Error containment | A technique that enables the single authority to recognize when a significant number of anomalies exist in the system so it can trigger a call for human attention. One result might be to reset thresholds to allow naturally increasing consumption patterns.
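
The way these components cooperate can be sketched in a few lines. This is a simplified illustration, not the Exchange design: the function names, the delay formula, and every threshold are hypothetical. The resource monitor's view is reduced to one utilization figure, and the single authority is reduced to one decision function.

```python
def tarpit_delay(request_rate: float, normal_rate: float,
                 base_delay: float = 0.5, max_delay: float = 8.0) -> float:
    """Seconds to delay a request: zero at or below the normal rate, capped above it."""
    if normal_rate <= 0:
        raise ValueError("normal_rate must be positive")
    if request_rate <= normal_rate:
        return 0.0
    # Delay grows with the ratio of observed to normal activity (an assumed formula).
    return min(base_delay * (request_rate / normal_rate), max_delay)


def decide(utilization: float, request_rate: float, normal_rate: float,
           prior_flags: int, stress_threshold: float = 0.8,
           quarantine_after: int = 3) -> str:
    """Single-authority decision: 'serve', 'tarpit', or 'quarantine'."""
    unusual = request_rate > normal_rate        # flagged by the system monitor
    stressed = utilization >= stress_threshold  # flagged by the resource monitor
    if not (unusual and stressed):
        return "serve"
    # Repeated flags for the same consumer escalate from delay to isolation.
    if prior_flags >= quarantine_after:
        return "quarantine"
    return "tarpit"
```

Keeping the decision in one function mirrors the role of the single authority: the monitors only report, and one place decides whether to delay or isolate.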

The Exchange product team, like other Microsoft teams, has identified and applied a number of useful patterns, sometimes in conjunction with related services, during its phased transition to cloud computing. Understanding and applying patterns such as these are crucial for the successful creation of cloud services. These patterns accelerate the architectural process, help reduce risk, and help attain quality levels expected from cloud services. You can find patterns that are relevant to the transformation from a software product to a cloud service in the book Patterns for Fault Tolerant Software, 2007, by Robert Hanmer.

6.2.4 Adopt scale units in a cloud topology to improve service quality and profit

The ability to scale a service dynamically is of critical importance in cloud environments. The traditional topology of an Exchange system consists of a three-tier client/server model: a client tier, a backend database tier, and a hub tier for routing mail between systems. Each tier has a different resource use profile, but all are required to scale out the system.

Enabling a system to scale requires a means to measure consumption of the service. Identifying a consumption unit makes it possible to achieve the following:

  • Assign resource allocations, which are used to enable more predictable growth across a complex system.
  • Adopt a scale unit concept, which is used to provide a predictable, consistent, and manageable way in which to scale and manage a complex system.

For Exchange (as explained earlier) the unit of consumption is a mailbox. At first, the Exchange product team configured the scale unit to include one client-tier server and 1,000 data stores, each of which had a set number of mailbox allocations. However, this scale unit configuration provides non-linear scaling because the client tier consumption profile is comparable to an individual data tier. To solve this problem, the Exchange team identified a scale unit that included one instance of each tier within the unit, each of which has physical resource allocations. This approach, which results in one unit from each tier, achieves a configuration that is closer to linear scale.

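One scale unit contains one instance of each tier (client, hub, and backend database), each with its own physical resource allocations.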

With this topology, the cloud-based version of Exchange adheres to the architectural principle Optimize use of available resources. The system monitors scale units for resource consumption; each scale unit is expected to operate at up to a defined percent utilization.

The utilization rate should:

  • Permit workloads to migrate from one scale unit to another if a failure occurs
  • Allow for an additional percentage of peak activity even when a scale unit carries both workloads at full utilization (base % + failure %). This is, essentially, a resource pooling concept.

When resource utilization rises consistently above 60 percent, an additional scale unit is deployed to handle the increased workload.
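
The two checks described here can be sketched as follows. The 60 percent threshold comes from the text; everything else is an assumption for illustration, including the reading of "consistently" as "every one of the last few samples" and the use of whole-number percentages for the headroom check.

```python
def needs_new_scale_unit(utilization_samples: list[float],
                         threshold: float = 0.60, window: int = 3) -> bool:
    """True if each of the last `window` samples exceeds the threshold.

    Treating "consistently above" as a run of consecutive samples is an
    assumption; the window length is hypothetical.
    """
    if len(utilization_samples) < window:
        return False
    return all(u > threshold for u in utilization_samples[-window:])


def can_absorb_failover(base_pct: int, failover_pct: int, peak_pct: int) -> bool:
    """True if base load, load taken over after a failure, and peak headroom
    together fit within 100 percent of a scale unit's capacity."""
    return base_pct + failover_pct + peak_pct <= 100
```

A single spike above the threshold does not trigger deployment in this sketch; only a sustained run does, which avoids adding scale units in response to short-lived peaks.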

7 Mini Case Study: Design Change at Microsoft SQL Server

Microsoft Services interviewed key members of the Microsoft SQL Server® development team to understand how the team modified the large, complex code base of this core Microsoft product to enable new capabilities. The SQL Server team described and categorized some of the key changes that they made.

One major change was to develop new deployment models that require the separation of previously tightly coupled functionality. Modularizing functionality at the level of code can, for example, contribute to improved performance. In the case of SQL Server, separated functionality is deployed to physically separate server instances to improve performance of the overall service. (Windows Server® and Windows® client operating system development teams overcame similar challenges when they rebuilt the operating system to remove dependencies among Windows components.)

7.1 Componentization and Abstraction

Componentization and abstraction are useful techniques for managing a complex code base, like that of SQL Server. The goal is to divide a large problem into smaller components with more manageable attributes.

The SQL Server team used visualization and modeling tools provided by the Visual Studio® 2010 development system to work through the code base, detail the interactions between components, and establish a map of those interactions. First, layers of abstraction or componentization were proposed, and then the traffic crossing the lines within the map was measured in terms of the performance impact of cross-boundary traffic.

Using this data, the team proposed only those changes that were justifiable in terms of having negligible performance impact. The advantage of this approach is that design improvements brought about by componentization were implemented based on real usage data rather than on a theoretical design. In some cases, this analysis of cross-boundary traffic actually resulted in improvements in performance through functional modification (in addition to the advantages conferred by componentization). For example, in one function, the team determined that, when calling a lower-level function, the component made unnecessary calls back up the component stack, which resulted in redundant cycle utilization that affected performance.
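
The idea of measuring traffic across proposed component boundaries can be illustrated with a small sketch. It does not represent the Visual Studio tooling that the team used; the call log format, the function names, and the component names are all hypothetical.

```python
from collections import Counter


def cross_boundary_traffic(calls: list[tuple[str, str]],
                           component_of: dict[str, str]) -> Counter:
    """Count calls that cross a proposed component boundary.

    `calls` is a log of (caller, callee) function names; `component_of`
    maps each function to the component it would belong to.
    """
    traffic = Counter()
    for caller, callee in calls:
        src, dst = component_of[caller], component_of[callee]
        if src != dst:  # calls inside one component cost nothing extra
            traffic[(src, dst)] += 1
    return traffic


# Hypothetical call log and proposed component map.
calls = [("parse", "lex"), ("parse", "plan"), ("plan", "parse"),
         ("plan", "exec"), ("parse", "plan")]
component_of = {"parse": "language", "lex": "language",
                "plan": "optimizer", "exec": "engine"}
traffic = cross_boundary_traffic(calls, component_of)
```

A boundary that carries heavy traffic is a poor candidate for separation, and an entry in the reverse direction, such as the optimizer calling back into the language component here, points to the kind of call back up the stack that the team found and removed.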

When the SQL Server team proposed lines of demarcation within the code base, they focused both on the standalone value of a given component and also on its suitability for reuse.

The team also focused on business value when identifying components for separation. A good example is the language parser built into the product. The language parser is used widely across many areas of the product, and in other products across the business. Microsoft teams had already recognized that the parser would be more valuable if it were available as a standalone component. In this case, the effort to create a more compact and efficient language services component delivered value both for SQL Server and for other business areas (for example, the Visual Studio team).

Working through an established code base, such as SQL Server, can evoke skepticism about whether or not proposed changes are beneficial. To counter this skepticism and to avoid wasting time in debate, the SQL Server team relied on facts. For each instance of proposed change, they created functional tests, based on measurement and analysis tools, from which to draw data to create a clear business case. Making the business case for a proposed change also helped to ensure the efficient utilization of valuable resources in a highly productive product group.

One empirical asset that the team created was a map of the complexity of each component and its interactions with other components or services. This map helped make the business case for a given component, because more complex code paths are more likely to cause support difficulties than simpler code paths. Therefore, for any component with complex code paths, the team used the map to make a case for change by targeting the issue of support cost. Producing a product whose code paths and execution are clean and well documented, and consequently easier to support, reduces the lifecycle cost of the product and increases customer satisfaction.

7.2 Performance Optimization

The introduction of componentization and abstraction typically introduces performance overhead imposed by the cost of crossing boundaries. SQL Server is a high-performance product, and its reputation for responsiveness is a critical quality attribute. Every new version carries expectations of improved performance as well as new product features. Consequently, it was important that any design change not cause any degradation in performance. The team achieved this goal through the effective use of measurement and analysis, as described earlier.

One of the more detailed elements of code refactoring was understanding the performance impact of different styles of code placement. The team determined that virtual functions have a performance profile that is similar to the profile for non-inline calls. This recognition meant that, if a non-inline function was shown not to cause any issues within a particular code path, then a virtual function call would not cause issues there either. With this knowledge, the SQL Server team used virtual functions as a means to provide encapsulation for some components. This required careful planning to prevent the introduction of virtual functions into critical code paths where they could introduce unacceptable performance overheads. This change was achieved by code sharing, linking some key functions in multiple binary files so as not to cause a cross-binary function call at runtime. The use of direct data exports also helped to prevent calls across boundaries.

Recognition of this benefit of virtual functions was reinforced when the SQL Server team worked with the compiler teams to understand the intelligence in modern compilers. Compilers now perform inlining of functions optimally by taking advantage of the profile-guided optimization switches in the Visual C++® compiler. Increasing the use of virtual functions supports the principle Make service attributes predictable by simplifying the code path to achieve predictable and more efficient performance. Using more virtual functions also supports the principle Optimize use of available resources. Finally, this analysis is also an example of applying the principle Strive for continual service improvement.

7.3 Guiding Future Change

One challenge encountered by the SQL Server development team was how to implement governance to ensure that, once a component is isolated for use, it is not duplicated in future development. To help prevent such duplication, the team adopted a policy that all planning for new functionality must first establish its dependencies on any existing components. The team used a peer review process to approve this policy.

In addition, when team members determine that a new component is required, they create the component and then use rigorous unit tests to analyze and profile the component. The goal is to ensure that all calls in and out of the component adhere to the newly established performance and optimization guidelines that apply throughout the product.

8 Next Steps: How You Can Build On the Microsoft Experience

Microsoft customers and partners who plan to enter into the world of cloud computing as service providers can take advantage of the Microsoft experience described in this paper to evaluate how they might want to handle similar issues. By describing its own journey to the cloud and lessons learned, Microsoft can help customers plan and carry out an effective strategy to make the most of opportunities presented by the cloud.

Important factors for a successful transformation include:

  • Strong leadership. The ability to motivate employees to welcome the challenge to ‘move mountains’ is essential.
  • Achievable goals. Goals, although complex, extensive, and sometimes formidable, must nevertheless be realistic and ultimately achievable.
  • Incremental change. The shift in direction and focus from a pre-cloud product delivery model to the role of cloud service provider must be incremental, for two reasons: to achieve success, and to avoid disrupting support for existing customers.
  • Solid architectural principles. Principles that are designed to guide the direction of such a paradigm change are crucial.

Greater opportunities exist now than ever before to deliver higher levels of service with increased functionality and lower investments over the long term, resulting in a more agile and supportable portfolio of assets. The Microsoft Services team is ideally placed to help enterprise customers achieve this transformation.

9 References

Hanmer, Robert. Patterns for Fault Tolerant Software. Hoboken, NJ: John Wiley, 2007.

Microsoft Services. Interviews with members of the following software development teams at Microsoft:

  • Microsoft Global Foundation Services (September 2010)
  • Microsoft Exchange Server, Exchange Online, Live@edu (October 2010)
  • Microsoft SQL Server (October 2010)

Microsoft IT Showcase. “Microsoft IT Enterprise Architecture and the Cloud.”

Microsoft Presspass. “The Economics of the Cloud.” November 2010.

MSDN Visual Studio. “Profile-Guided Optimizations.”