Data Center Architecture and Technologies in the Cloud

Chapter Description

This chapter provides an overview of the architectural principles and infrastructure designs needed to support a new generation of real-time-managed IT service use cases in the data center.

Architectural Building Blocks of a Data Center

Data center design is at an evolutionary crossroads. Massive data growth, challenging economic conditions, and the physical limitations of power, heat, and space are exerting substantial pressure on the enterprise. Finding architectures that can take cost, complexity, and associated risk out of the data center while improving service levels has become a major objective for most enterprises. Consider the challenges facing enterprise IT organizations today.

Data center IT staff is typically asked to address the following data center challenges:

  • Improve asset utilization to reduce or defer capital expenses.
  • Reduce capital expenses through better management of peak workloads.
  • Make data and resources available in real time to provide flexibility and alignment with current and future business agility needs.
  • Reduce power and cooling consumption to cut operational costs and align with "green" business practices.
  • Reduce deployment/churn time for new/existing services, saving operational costs and gaining competitive advantage in the market.
  • Enable/increase innovation through new consumption models and the adoption of new abstraction layers in the architecture.
  • Improve availability of services to avoid or reduce the business impact of unplanned outages or failures of service components.
  • Maintain information assurance through consistent and robust security posture and processes.

From this set of challenges, you can derive a set of architectural principles that a new services platform would need to exhibit (as outlined in Table 3-1). Those architectural principles can in turn be matched to a set of underpinning technological requirements.

Table 3-1. Technology to Support Architectural Principles

Architectural Principle | Technological Requirements
Efficiency | Virtualization of infrastructure with appropriate management tools. Infrastructure homogeneity is driving asset utilization up.
Scalability | Platform scalability can be achieved through explicit protocol choice (for example, TRILL) and hardware selection and also through implicit system design and implementation.
Reliability | Disaster recovery (BCP) planning, testing, and operational tools (for example, VMware's Site Recovery Manager, SNAP, or Clone backup capabilities).
Interoperability | Web-based (XML) APIs, for example, WSDL (W3C) using SOAP or the conceptually simpler RESTful protocol with standards compliance semantics, for example, RFC 4741 NETCONF or TMForum's Multi-Technology Operations Systems Interface (MTOSI) with message binding to "concrete" endpoint protocols.
Flexibility | Software abstraction to enable policy-based management of the underlying infrastructure. Use of "meta models" (frames, rules, and constraints of how to build infrastructure). Encourage independence rather than interdependence among functional components of the platform.
Modularity | Commonality of the underlying building blocks that can support scale-out and scale-up heterogeneous workload requirements with common integration points (web-based APIs). That is, integrated compute stacks or infrastructure packages (for example, a Vblock or a FlexPod). Programmatic workflows versus script-based workflows (discussed later in this chapter) along with the aforementioned software abstraction help deliver modularity of software tools.
Security | The appropriate countermeasures (tools, systems, processes, and protocols) relative to risk assessment derived from the threat model. Technology countermeasures are systems based, security in depth. Bespoke implementations/design patterns required to meet varied hosted tenant visibility and control requirements necessitated by regulatory compliance.
Robustness | System design and implementation: tools, methods, processes, and people that assist to mitigate collateral damage of a failure or failures internal to the administratively controlled system or even to external service dependencies to ensure service continuity.

Industry Direction and Operational and Technical Phasing

New technologies, such as multicore CPU, multisocket motherboards, inexpensive memory, and Peripheral Component Interconnect (PCI) bus technology, represent an evolution in the computing environment. These advancements, in addition to abstraction technologies (for example, virtual machine monitors [VMM], also known as hypervisor software), provide access to greater performance and resource utilization at a time of exponential growth of digital data and globalization through the Internet. Multithreaded applications designed to use these resources are bandwidth intensive and require higher performance and efficiency from the underlying infrastructure.

Over the last few years, there have been iterative developments to the virtual infrastructure. Basic hypervisor technology with relatively simple virtual switches embedded in the hypervisor/VMM kernel has given way to far more sophisticated third-party distributed virtual switches (DVS) (for example, the Cisco Nexus 1000V) that bring together the operational domains of the virtual server and the network, delivering consistent and integrated policy deployments. Other use cases, such as live migration of a VM, require orchestration of (physical and virtual) server, network, storage, and other dependencies to enable uninterrupted service continuity. Placement of capability and function needs to be carefully considered. Not every capability and function will have an optimal substantiation as a virtual entity; some might require physical substantiation because of performance or compliance reasons. So going forward, we see a hybrid model taking shape, with each capability and function being assessed for optimal placement within the architecture and design.

Although data center performance requirements are growing, IT managers are seeking ways to limit physical expansion by increasing the utilization of current resources. Server consolidation by means of server virtualization has become an appealing option. The use of multiple virtual machines takes full advantage of a physical server's computing potential and enables a rapid response to shifting data center demands. This rapid increase in computing power, coupled with the increased use of VM environments, is increasing the demand for higher bandwidth and at the same time creating additional challenges for the supporting networks.

Power consumption and efficiency continue to be some of the top concerns facing data center operators and designers. Data center facilities are designed with a specific power budget, in kilowatts per rack (or watts per square foot). Per-rack power consumption and cooling capacity have steadily increased over the past several years. Growth in the number of servers and advancement in electronic components continue to consume power at an exponentially increasing rate. Per-rack power requirements constrain the number of racks a data center can support, resulting in data centers that are out of capacity even though there is plenty of unused space.
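
The way a power budget, rather than floor space, can cap capacity is easy to see with a little arithmetic. The following is a minimal sketch; the function names and the figures (a 1000 kW budget, 400 rack positions, 8 kW per rack) are hypothetical and chosen only for illustration.

```python
def racks_supported(power_budget_kw: float, kw_per_rack: float) -> int:
    """Number of racks the facility power budget can feed."""
    if kw_per_rack <= 0:
        raise ValueError("kw_per_rack must be positive")
    return int(power_budget_kw // kw_per_rack)


def stranded_positions(floor_positions: int, power_budget_kw: float,
                       kw_per_rack: float) -> int:
    """Rack positions left empty because power runs out before space does."""
    usable = min(floor_positions, racks_supported(power_budget_kw, kw_per_rack))
    return floor_positions - usable


# Hypothetical facility: 1000 kW budget, 400 rack positions, 8 kW per rack.
print(racks_supported(1000, 8))          # 125
print(stranded_positions(400, 1000, 8))  # 275
```

In this made-up case the room is "full" at 125 racks even though 275 positions remain physically empty.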

Several metrics exist today that can help determine how efficient a data center operation is. These metrics apply differently to different types of systems, for example, facilities, network, server, and storage systems. For example, Cisco IT uses a measure of power per work unit performed instead of a measure of power per port because the latter approach does not account for certain use cases: the availability, power capacity, and density profile of mail, file, and print services will be very different from those of mission-critical web and security services. Furthermore, Cisco IT recognizes that just a measure of the network is not indicative of the entire data center operation. This is one of several reasons why Cisco has joined The Green Grid (www.thegreengrid.org), which focuses on developing data center–wide metrics for power efficiency. The power usage effectiveness (PUE) and data center efficiency (DCE) metrics detailed in the document "The Green Grid Metrics: Describing Data Center Power Efficiency" are ways to start addressing this challenge. Typically, the largest consumer of power and the most inefficient system in the data center is the Computer Room Air Conditioning (CRAC). At the time of this writing, state-of-the-art data centers have PUE values in the region of 1.1 to 1.2, whereas typical values would be in the range of 1.8–2.5. (For further reading on data center facilities, check out the book Build the Best Data Center Facility for Your Business, by Douglas Alger from Cisco Press.)
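
As a quick illustration of the two metrics: PUE is the ratio of total facility power to IT equipment power, and DCE is its reciprocal. The sketch below simply encodes those two ratios; the function names and the sample readings (1800 kW total, 1000 kW IT load) are assumptions made for the example, not measurements from any real facility.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("it_equipment_kw must be positive")
    return total_facility_kw / it_equipment_kw


def dce(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Data center efficiency: IT equipment power / total facility power."""
    if total_facility_kw <= 0:
        raise ValueError("total_facility_kw must be positive")
    return it_equipment_kw / total_facility_kw


# Hypothetical readings: 1800 kW drawn by the facility, 1000 kW by IT equipment.
print(round(pue(1800, 1000), 2))  # 1.8
print(round(dce(1800, 1000), 3))  # 0.556
```

A PUE of 1.8 means that for every kilowatt delivered to IT equipment, another 0.8 kW goes to cooling, power distribution, and other overhead.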

Cabling also represents a significant portion of a typical data center budget. Cable sprawl can limit data center deployments by obstructing airflows and requiring complex cooling system solutions. IT departments around the world are looking for innovative solutions that will enable them to keep up with this rapid growth with increased efficiency and low cost. We will discuss Unified Fabric (enabled by virtualization of network I/O) later in this chapter.

Current Barriers to Cloud/Utility Computing/ITaaS

It's clear that a lack of trust in current cloud offerings is the main barrier to broader adoption of cloud computing. Without trust, the economics and increased flexibility of cloud computing make little difference. For example, from a workload placement perspective, how does a customer make a cost-versus-risk (Governance, Risk, Compliance [GRC]) assessment without transparency of the information being provided? Transparency requires well-defined notations of service definition, audit, and accountancy. Multiple industry surveys attest to this. For example, as shown in Figure 3-3, Colt Technology Services' CIO Cloud Survey 2011 shows that most CIOs consider security as a barrier to cloud service adoption, and this is ahead of standing up the service (integration issues)! So how should we respond to these concerns?

Figure 3-3 CTS' CIO Cloud Survey 2011 (www.colt.net/cio-research)

Trust in the cloud, Cisco believes, centers on five core concepts. These challenges keep business leaders and IT professionals alike up at night, and Cisco is working to address them with our partners:

  • Security: Are there sufficient information assurance (IA) processes and tools to enforce confidentiality, integrity, and availability of the corporate data assets? Fears around multitenancy, the ability to monitor and record effectively, and the transparency of security events are foremost in customers' minds.
  • Control: Can IT maintain direct control to decide how and where data and software are deployed, used, and destroyed in a multitenant and virtual, morphing infrastructure?
  • Service-level management: Is it reliable? That is, can the appropriate Resource Usage Records (RUR) be obtained and measured appropriately for accurate billing? What if there's an outage? Can each application get the necessary resources and priority needed to run predictably in the cloud (capacity planning and business continuance planning)?
  • Compliance: Will my cloud environment conform with mandated regulatory, legal, and general industry requirements (for example, PCI DSS, HIPAA, and Sarbanes-Oxley)?
  • Interoperability: Will there be a vendor lock-in given the proprietary nature of today's public clouds? The Internet today has proven popular to enterprise businesses in part because of the ability to reduce risk through "multihoming" network connectivity to multiple Internet service providers that have diverse and distinct physical infrastructures.

For cloud solutions to be truly secure and trusted, Cisco believes they need an underlying network that can be relied upon to support cloud workloads.

To solve some of these fundamental challenges in the data center, many organizations are undertaking a journey. Figure 3-4 represents the general direction in which the IT industry is heading. The figure maps the operational phases (Consolidation, Virtualization, Automation, and so on) to enabling technology phases (Unified Fabric, Unified Computing, and so on).

Figure 3-4 Operational and Technological Evolution Stages of IT

Organizations that are moving toward the adoption and utilization of cloud services tend to follow these technological phases:

  1. Adoption of a broad IP WAN that is highly available (either through an ISP or self-built over dark fiber) enables centralization and consolidation of IT services. Application-aware services are layered on top of the WAN to intelligently manage application performance.
  2. Executing on a virtualization strategy for server, storage, networking, and networking services (session load balancing, security apps, and so on) gives greater flexibility in where services are physically substantiated, so that services can be arranged to optimize infrastructure utilization.
  3. Service automation enables greater operational efficiencies related to change control, ultimately paving the way to an economically viable on-demand service consumption model. In other words, building the "service factory."
  4. A utility computing model includes the ability to meter, charge back, and bill customers on a pay-as-you-use (PAYU) basis. Showback is also a popular service: the ability to show current, real-time service and quota usage/consumption, including future trending. This allows customers to understand and control their IT consumption. Showback is a fundamental requirement of service transparency.
  5. Market creation through a common framework incorporating governance with a service ontology that facilitates the act of arbitrating between different service offerings and service providers.

Phase 1: The Adoption of a Broad IP WAN That Is Highly Available

This connectivity between remote locations allows IT services that were previously distributed (both from a geographic and organizational sense) to now be centralized, providing better operational control over those IT assets.

The constraint of this phase is that many applications were written to operate over a LAN and not a WAN environment. Rather than rewriting applications, the optimal economic path forward is to utilize application-aware, network-deployed services to enable a consistent Quality of Experience (QoE) to the end consumer of the service. These services tend to fall under the banner of Application Performance Management (APM) (www.cisco.com/go/apm). APM includes capabilities such as visibility into application response times, analysis of which applications and branch offices use how much bandwidth, and the ability to prioritize mission-critical applications, such as those from Oracle and SAP, as well as collaboration applications such as Microsoft SharePoint and Citrix.

Specific capabilities to deliver APM are as follows:

  • Performance monitoring: Both in the network (transactions) and in the data center (application processing).
  • Reporting: For example, application SLA reporting requires service contextualization of monitoring data to understand the data in relation to its expected or requested performance parameters. These parameters are gleaned from who the service owner is and the terms of the service contract.
  • Application visibility and control: Application control gives service providers dynamic and adaptive tools to monitor and assure application performance.
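
The reporting capability above, contextualizing raw monitoring data against the parameters of a service contract, can be sketched as follows. This is an illustrative assumption of how such a check might look, not the behavior of any particular APM product; the `ServiceContract` fields, the `sla_report` function, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class ServiceContract:
    owner: str                # who the service owner is
    application: str
    max_response_ms: float    # contracted response-time ceiling
    target_compliance: float  # required fraction of samples within the ceiling


def sla_report(contract: ServiceContract, samples_ms: List[float]) -> Dict[str, object]:
    """Interpret raw response-time samples in the context of the contract."""
    if not samples_ms:
        raise ValueError("no monitoring samples supplied")
    within = sum(1 for s in samples_ms if s <= contract.max_response_ms)
    compliance = within / len(samples_ms)
    return {
        "owner": contract.owner,
        "application": contract.application,
        "mean_ms": mean(samples_ms),
        "compliance": compliance,
        "sla_met": compliance >= contract.target_compliance,
    }


# Hypothetical contract and monitoring samples.
contract = ServiceContract("finance", "erp", max_response_ms=200, target_compliance=0.95)
report = sla_report(contract, [120, 180, 250, 90])
print(report["compliance"], report["sla_met"])  # 0.75 False
```

The same four samples would be meaningless without the contract: only the owner's agreed ceiling and compliance target turn them into an SLA statement.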

Phase 2: Executing on a Virtualization Strategy for Server, Storage, Networking, and Networking Services

There are many solutions available on the market to enable server virtualization. Virtualization is the concept of creating a "sandbox" environment, where the computer hardware is abstracted from the operating system. The operating system is presented with generic hardware devices, and the virtualization software passes messages between those devices and the physical hardware such as CPUs, memory, disks, and networking devices. These sandbox environments, also known as virtual machines (VM), include the operating system, the applications, and the configurations of a physical server. VMs are hardware independent, making them very portable so that they can run on any server that hosts a compatible hypervisor.

Virtualization technology can also be applicable to many different areas such as networking and storage. LAN switching, for example, has the concept of a virtual LAN (VLAN) and routing with Virtual Routing and Forwarding (VRF) tables; storage-area networks have something similar in terms of virtual storage-area networks (VSAN), vFiler for NFS storage virtualization, and so on.

However, there is a price to pay for all this virtualization: management complexity. As virtual resources become abstracted from physical resources, existing management tools and methodologies start to break down in regard to their control effectiveness, particularly when one starts adding scale into the equation. New management capabilities, either implicit within infrastructure components or explicit in external management tools, are required to provide the visibility and control that service operations teams require to manage the risk to the business.

Unified Fabric based on IEEE Data Center Bridging (DCB) standards (covered later) is another form of abstraction, this time by virtualizing Ethernet. This technology unifies the way that servers and storage resources are connected, how application delivery and core data center services are provisioned, how servers and data center resources are interconnected to scale, and how server and network virtualization is orchestrated.

To complement the usage of VMs, virtual applications (vApp) have also been brought into the data center architecture to provide policy enforcement within the new virtual infrastructure, again to help manage risk. Virtual machine-aware network services such as VMware's vShield and Virtual Network Services from Cisco allow administrators to provide services that are aware of tenant ownership of VMs and enforce service domain isolation (that is, the DMZ). The Cisco Virtual Network Services solution is also aware of the location of VMs. Ultimately, this technology allows the administrator to tie together service policy to location and ownership of an application residing within a VM container.

The Cisco Nexus 1000V vPath technology allows policy-based traffic steering to "invoke" vApp services (also known as policy enforcement points [PEP]), even if they reside on a separate physical ESX host. This is the start of Intelligent Service Fabrics (ISF), where the traditional IP or MAC-based forwarding behavior is "policy hijacked" to substantiate service chain–based forwarding behavior.

Server and network virtualization have been driven primarily by the economic benefits of consolidation and higher utilization of physical server and network assets. vApps and ISF change the economics through efficiency gains of providing network-residing services that can be invoked on demand and dimensioned to need rather than to the design constraints of the traditional traffic steering methods.

Virtualization, or rather the act of abstraction from the underlying physical infrastructure, provides the basis of new types of IT services that potentially can be more dynamic in nature, as illustrated in Figure 3-5.

Figure 3-5 IT Service Enablement Through Abstraction/Virtualization of IT Domains

Phase 3: Service Automation

Service automation, working hand in hand with a virtualized infrastructure, is a key enabler in delivering dynamic services. From an IaaS perspective, this phase means the policy-driven provisioning of IT services through the use of automated task workflow, whether that involves business tasks (also known as Business Process Operations Management [BPOM]) or IT tasks (also known as IT Orchestration).

Traditionally, this has been too costly to be economically effective because of the reliance on script-based automation tooling. Scripting is linear in nature, which makes rollback challenging; more importantly, it tightly couples workflow to process execution logic to assets. In other words, if an architect wants or needs to change an IT asset (for example, a server type/supplier) or change the workflow or process execution logic within a workflow step/node in response to a business need, a lot of new scripting is required. It's like building a LEGO brick wall with all the bricks glued together. More often than not, a new wall is cheaper and easier to develop than trying to replace or change individual blocks.

Two main developments have now made service automation a more economically viable option:

  • Standards-based web APIs and protocols (for example, SOAP and REST) have helped reduce integration complexity and costs through the ability to reuse.
  • Programmatic-based workflow tools have helped to decouple/abstract workflow from process execution logic from assets. Contemporary IT orchestration tools, such as Enterprise Orchestrator from Cisco and BMC's Atrium Orchestrator, allow system designers to make changes to the workflow (including invoking and managing parallel tasks) or to insert new workflow steps or change assets through reusable adaptors without having to start from scratch. Using the LEGO wall analogy, individual bricks of the wall can be relatively easily interchanged without having to build a new wall.
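
The decoupling described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Enterprise Orchestrator or Atrium Orchestrator is implemented: the adaptor interface, the two vendor classes, and the `run_workflow` engine are all hypothetical names. The point is that the workflow definition refers only to an adaptor interface, so an asset can be swapped without touching the workflow, and each step carries its own undo action so rollback does not need separate scripting.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


class ComputeAdaptor(Protocol):
    """Reusable adaptor interface; the workflow never sees vendor specifics."""
    def provision(self, name: str) -> str: ...
    def release(self, name: str) -> str: ...


class VendorAServer:  # hypothetical asset type
    def provision(self, name: str) -> str:
        return f"vendor-a: provisioned {name}"

    def release(self, name: str) -> str:
        return f"vendor-a: released {name}"


class VendorBServer:  # a different supplier behind the same interface
    def provision(self, name: str) -> str:
        return f"vendor-b: provisioned {name}"

    def release(self, name: str) -> str:
        return f"vendor-b: released {name}"


@dataclass
class Step:
    name: str
    do: Callable[[], str]
    undo: Callable[[], str]


def run_workflow(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        try:
            log.append(step.do())
            completed.append(step)
        except Exception as exc:
            log.append(f"{step.name} failed: {exc}")
            for done in reversed(completed):
                log.append(done.undo())
            break
    return log


def attach_network(name: str, healthy: bool) -> str:
    if not healthy:
        raise RuntimeError("no free VLAN")
    return f"network: attached {name}"


def build_steps(server: ComputeAdaptor, name: str,
                network_healthy: bool = True) -> List[Step]:
    """The workflow definition: the same steps whatever the server supplier."""
    return [
        Step("compute", lambda: server.provision(name), lambda: server.release(name)),
        Step("network", lambda: attach_network(name, network_healthy),
             lambda: f"network: detached {name}"),
    ]


print(run_workflow(build_steps(VendorAServer(), "web01")))
print(run_workflow(build_steps(VendorBServer(), "web01", network_healthy=False)))
```

Changing the server supplier is a one-argument change to `build_steps`, which is the "interchangeable brick" of the LEGO analogy; in a glued-together script, the same change would ripple through every step that touches the server.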

Note that a third component is necessary to make programmatic service automation a success, namely, an intelligent infrastructure by which the complexity of the low-level device configuration syntax is abstracted from the northbound system's management tools. This means higher-level management tools only need to know the policy semantics. In other words, an orchestration system need only ask for a chocolate cake and the element manager, now based on a well-defined (programmatic) object-based data model, will translate that request into the required ingredients and, furthermore, how those ingredients should be mixed together and in what quantities.
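
The "chocolate cake" idea can be made concrete with a toy element manager. The sketch below is an assumption-laden illustration only: the policy names, the setting keys, and the `ElementManager` class are invented for this example and do not correspond to any vendor's data model or API. The orchestration layer names a policy; the element manager owns the translation into low-level settings.

```python
from typing import Dict, List

# Hypothetical object-based data model: policy name -> low-level settings.
POLICY_MODEL: Dict[str, Dict[str, str]] = {
    "web-tier-small": {
        "bios.hyperthreading": "enabled",
        "nic.vlan": "110",
        "boot.order": "san,lan",
    },
    "db-tier-large": {
        "bios.hyperthreading": "disabled",
        "nic.vlan": "220",
        "boot.order": "san",
    },
}


class ElementManager:
    """Translates policy semantics into device-level configuration actions."""

    def __init__(self, model: Dict[str, Dict[str, str]]) -> None:
        self._model = model

    def apply_policy(self, target: str, policy: str) -> List[str]:
        try:
            settings = self._model[policy]
        except KeyError:
            raise ValueError(f"unknown policy: {policy}") from None
        # The caller never sees these details; it only named the policy.
        return [f"{target}: set {key} = {value}"
                for key, value in sorted(settings.items())]


# The orchestrator only "asks for the cake".
manager = ElementManager(POLICY_MODEL)
for action in manager.apply_policy("blade-1", "web-tier-small"):
    print(action)
```

If the low-level syntax for a setting changes, only the data model behind the element manager changes; the orchestrator's request stays the same.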

A practical example is the Cisco Unified Computing System (UCS) with its single data model exposed through a single, transactional, rich XML API (other APIs are also supported). This allows policy-driven consumption of the physical compute layer. To do this, UCS provides a layer of abstraction between its XML data model and the underlying hardware through application gateways that do the translation of the policy semantics as necessary to execute state change of a hardware component (such as BIOS settings).

Phase 4: Utility Computing Model

This phase involves the ability to monitor, meter, and track resource usage for chargeback billing. The goal is for self-service provisioning (on-demand allocation of compute resources), in essence turning IT into a utility service.

In any IT environment, it is crucial to maintain knowledge of allocation and utilization of resources. Metering and performance analysis of these resources enable cost efficiency, service consistency, and subsequently the capabilities IT needs for trending, capacity management, threshold management (service-level agreements [SLA]), and pay-for-use chargeback.

In many IT environments today, dedicated physical servers and their associated applications, as well as maintenance and licensing costs, can be mapped to the department using them, making the billing relatively straightforward for such resources. In a shared virtual environment, however, the task of calculating the IT operational cost for each consumer in real time is a challenging problem to solve.

Pay for use, where the end customers are charged based on their usage and consumption of a service, has long been used by such businesses as utilities and wireless phone providers. Increasingly, pay-per-use has gained acceptance in enterprise computing as IT works in parallel to lower costs across infrastructures, applications, and services.

One of the top concerns of IT leadership teams implementing a utility platform is this: If the promise of pay-per-use is driving service adoption in a cloud, how do the providers of the service track the service usage and bill for it accordingly?

IT providers have typically struggled with billing solution metrics that do not adequately represent all the resources consumed as part of a given service. Any chargeback solution requires consistent visibility into the infrastructure to meter resource usage per customer and the cost to serve for a given service. Today, this often requires cobbling together multiple solutions or even developing custom solutions for metering.

This creates not only up-front costs, but longer-term inefficiencies. IT providers quickly become overwhelmed building new functionality into the metering system every time they add a service or infrastructure component.

The dynamic nature of a virtual converged infrastructure and its associated layers of abstraction benefit the IT operation, but they also increase metering complexity. An optimal chargeback solution provides businesses with the true allocation breakdown of costs and services delivered in a converged infrastructure.

The business goals for metering and chargeback typically include the following:

  • Reporting on allocation and utilization of resources by business unit or customer
  • Developing an accurate cost-to-serve model, where utilization can be applied to each user
  • Providing a method for managing IT demand, facilitating capacity planning, forecasting, and budgeting
  • Reporting on relevant SLA performance

Chargeback and billing require three main steps:

  1. Data collection
  2. Chargeback mediation (correlating and aggregating data collected from the various system components into a billing record for the service owner customer)
  3. Billing and reporting (applying the pricing model to the collected data and generating a periodic billing report)
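
The three steps above can be sketched end to end. This is a deliberately small illustration, assuming made-up customers, resource names, quantities, and unit prices; a real metering system would collect from many components and apply a far richer pricing model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class UsageSample:
    customer: str
    resource: str    # for example "vcpu_hours" (hypothetical resource names)
    quantity: float


def collect() -> List[UsageSample]:
    """Step 1: data collection (hard-coded, hypothetical samples)."""
    return [
        UsageSample("finance", "vcpu_hours", 100),
        UsageSample("finance", "vcpu_hours", 50),
        UsageSample("finance", "storage_gb_days", 300),
        UsageSample("hr", "vcpu_hours", 40),
    ]


def mediate(samples: List[UsageSample]) -> Dict[str, Dict[str, float]]:
    """Step 2: correlate and aggregate samples into one record per customer."""
    records: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for s in samples:
        records[s.customer][s.resource] += s.quantity
    return {customer: dict(usage) for customer, usage in records.items()}


def bill(records: Dict[str, Dict[str, float]],
         price_list: Dict[str, float]) -> Dict[str, float]:
    """Step 3: apply the pricing model to produce a charge per customer."""
    return {
        customer: round(sum(qty * price_list[res] for res, qty in usage.items()), 2)
        for customer, usage in records.items()
    }


# Hypothetical unit prices.
prices = {"vcpu_hours": 0.05, "storage_gb_days": 0.01}
records = mediate(collect())
print(records["finance"])     # {'vcpu_hours': 150.0, 'storage_gb_days': 300.0}
print(bill(records, prices))  # {'finance': 10.5, 'hr': 2.0}
```

Keeping mediation separate from pricing means a new service or infrastructure component only has to emit usage samples; the pricing model can change without touching collection.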

Phase 5: Market

In mainstream economics, the concept of a market is any structure that allows buyers and sellers to exchange any type of goods, services, and information. The exchange of goods or services for money (an agreed-upon medium of exchange) is a transaction.

For a marketplace to be built to exchange IT services as an exchangeable commodity, the participants in that market need to agree on common service definitions or have an ontology that aligns not only technology but also business definitions. The alignment of process and governance among the market participants is desirable, particularly when "mashing up" service components from different providers/authors to deliver an end-to-end service.

To be more detailed, a service has two aspects, a business aspect required for the marketplace and a technical aspect required for exchange and delivery:

  • Business: The business aspect needs product definition, relationships (ontology), collateral, pricing, and so on.
  • Technical: The technical aspect needs fulfillment, assurance, and governance.

In the marketplace, there will be various players/participants who take on a variety and/or combination of roles. There would be exchange providers (also known as service aggregators or cloud service brokers), service developers, product manufacturers, service providers, service resellers, service integrators, and finally consumers (or even prosumers).