Explicit business and operational requirements for any Internet business solution are key to its success. A surprisingly high number of software application projects start, and sometimes launch, without addressing the question of requirements.
You can learn your basic application requirements by conducting a needs assessment, which not only dictates the metrics upon which a project is judged a success, but also helps create a realistic project plan and schedule. The features and functionality requirements within a needs assessment must be detailed enough that feasibility can be evaluated before you begin the planning and testing phases.
The discussion in this chapter is divided into the following sections:
- Basic Application Requirements
- Integration Concepts
Basic Application Requirements
Most applications created to provide Internet business solutions share a set of basic features for specific, nonfunctional requirements. These minimum requirements exist, regardless of the application's specific purpose or functionality. The basic, nonfunctional application requirements are
Each of these requirements is an important criterion to consider when you create an application needs assessment. Each is discussed in the following sections.
Security comprises the measures and controls taken to protect economic or other important assets. The key to achieving a secure application is to ensure that you include security considerations during the application's design. Too often, the need to implement Internet business applications quickly to shorten time-to-market schedules leads developers to neglect application security. Unfortunately, this increases the likelihood of security flaws and vulnerabilities in the application, and of further costs.
As security problems emerge during production use, immediate patches and modifications are required to correct the problems. Such a reactive approach can cause reduced confidence in an application's reliability and integrity. Furthermore, the approach is a negative reflection on those who were responsible for the application's design and creation. Ultimately, though, for those who rely on the application, questions about security center on the lack of assurance that their personal information is adequately protected, a lack that in many cases results in actual loss.
This section describes the fundamental security elements that require attention during an application's design. Carefully integrating these basic security functions with the application can offer secure and reliable use of the application. Research and analyze your options so that you make the right decisions based on a complete risk assessment.
A thorough discussion of security is beyond the scope of this study guide. For references to several texts that address the subject of application security in great depth, see Appendix A, "Recommended Further Reading."
Security is also concerned with three broad topics: policy, procedures, and compliance. Policy details what you are protecting; procedure details how you plan to maintain your security policies; and compliance details how well you manage to execute your security policies. Your security procedures are the implementations of your security policies. The amount of effort required to secure your application and systems depends on the nature of the application. For example, applications that require no customer information require less rigorous configuration than applications that handle online banking or other financial transactions.
To help you understand the depth of this topic, the discussion is divided into the following sections:
- Security Policy
- Security Procedures
- Security Compliance
- Authentication and Trust
- Authorization Procedures
- Access Control Mechanisms
- Data Confidentiality
- Data Integrity
- Examples of Security Technologies
A security policy is a document used as a guide to ensure that you adequately formulate and consistently implement security practices. A security policy document is often called a "living document" because it is never finished; it changes and evolves as conditions and requirements change. A security policy is also a mechanism by which to generate consensus among a group of people, allowing various levels of management and operations to express their views and goals. A security policy document broadly defines relevant security responsibilities, terms, and standards and provides a basis for more specific security documents, such as remote access policies, acceptable use policies, nondisclosure policies, corporate and departmental security policies, web site security policies, and privacy policies.
Initially, web sites did not make overt efforts to develop and document security policies, but as web sites have matured and grown in sophistication, publishing web content or providing online services now involves a legal approval phase that comes after the technical QA and testing phase. The legal approval phase makes sure that content and services are compliant with stated and published policies. This is not necessarily a bad thing, but sign-off from the legal department or oversight committee has introduced another step in the release management process.
Within closed environments or internal business applications, explicit security policies were not considered absolutely necessary in all cases because the systems were considered private, and access to them seemed sufficiently restricted. However, with the introduction of VPNs, IRC, and remote access technologies, it is no longer valid to assume that a private or internal application operates without risk of exposure to security compromise. If you are remotely connected to an internal application, you can bridge security zones without knowing that you are making the internal application vulnerable. For example, users greatly increased the spread of the "ILOVEYOU" virus by accessing personal e-mail accounts while at work, which infected private and internal file systems.
A security procedure is a defined list of steps to follow to maintain security or to respond to various situations where security has been breached. Security procedures are grouped as administrative or technical. CISS professionals should participate in documenting administrative procedures for creating and removing accounts, gaining physical access to computer systems, and handling problem escalation. Administrative security procedures govern such things as remote client access to application services. For example, create an access request form to keep an accurate list of who is authorized to access various features of an application in a controlled fashion. If you need to change remote client access, contact information is necessary so that communicating changes to passwords is graceful and without confusion.
Technical security procedures cover application operations, active and passive intrusion detection, and emergency response. Requesting remote access to a staging environment web instance is an example of a technical security procedure. Rather than have open access to staging, create firewall conduits and rules that limit access to only authorized remote IP addresses and address blocks. Firewall conduits must be created to uphold security policy, while still accommodating remote access needs. Within the Cisco PIX Firewall setup, you should not only apply IP names so that security logs and security audits are easier to read and check for anomalies, but also apply comments. An example of Cisco PIX firewall conduits and comments to allow a class C network for a San Francisco office and a client proxy server to access a staging web site is shown here:
conduit permit tcp host stage.web.site eq 80 sanfrancisco-office 255.255.255.0
conduit permit tcp host stage.web.site eq 443 sanfrancisco-office 255.255.255.0
:the above 2 conduits allow the SF office http and https access to staging site
conduit permit tcp host stage.web.site eq 80 host proxy.client.server
conduit permit tcp host stage.web.site eq 443 host proxy.client.server
:the above 2 conduits allow the client http and https access to staging site
Although creating firewall conduits instead of allowing world access to a staging site's services might mean more work for administrators, it is generally considered worth the effort. Only if the requestor comes from a dial-up environment with hundreds of possible IP address assignments would username and password authentication be preferable to firewall-restricted access procedures. Firewall conduits, however, are only one part of a good security solution. Simply creating firewall conduits is considered passive intrusion detection because an administrator must actually check the firewall logs to discover an intrusion. Active intrusion-detection tools should also be used because they are configured to automatically trigger alerts and notifications as they monitor server connections and firewall logs.
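The difference between passive and active detection can be illustrated with a minimal Python sketch. It assumes a hypothetical, simplified log format of "action source-ip port" (not the actual PIX log format), a hypothetical alert threshold of three denied connections, and placeholder addresses from the documentation ranges. The alert function merely builds a message where a real tool would send a notification.

```python
from collections import Counter

# Hypothetical, simplified log lines: "<action> <source-ip> <dest-port>".
# This is an illustrative format, not real firewall output.
SAMPLE_LOG = """\
deny 203.0.113.9 80
permit 192.0.2.15 443
deny 203.0.113.9 443
deny 203.0.113.9 23
deny 198.51.100.4 80
"""


def find_suspects(log_text, threshold=3):
    """Return source IPs with at least `threshold` denied connections."""
    denied = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "deny":
            denied[parts[1]] += 1
    return sorted(ip for ip, count in denied.items() if count >= threshold)


def alert(ip):
    """Stand-in for a real notification (e-mail, pager, and so on)."""
    return f"ALERT: repeated denied connections from {ip}"


# An active tool runs this check automatically instead of waiting
# for an administrator to read the log.
alerts = [alert(ip) for ip in find_suspects(SAMPLE_LOG)]
```

Running the same check on a schedule, rather than on demand, is what moves the procedure from passive review toward active detection.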
Compliance is not a one-time security check but an ongoing effort to verify that security precautions are being maintained. CISS professionals are expected to keep abreast of security exploits by reading the CERT Coordination Center advisories and browsing other security forums. Awareness of vulnerabilities is important to maintain security policy compliance.
CERT/CC and FIRST
CERT/CC (CERT originally stood for Computer Emergency Response Team) is a service provided by Carnegie Mellon University in Pittsburgh, Pennsylvania. CERT/CC (http://www.cert.org) was started in 1988 to facilitate rapid information flow about security activities and is primarily funded by the U.S. Department of Defense.
The Forum of Incident Response and Security Teams (FIRST, at http://www.first.org) was created to help grow and evolve the many incident response and security teams to reduce differences caused by their individual purpose, funding, reporting requirements, and audiences.
Do not delay applying patches or software upgrades after learning of relevant vulnerabilities. Security compliance means the difference between falling victim to a virus crafted by the latest attacker and being able to do business as usual.
Periodically audit the organization's compliance with the security policy, using both internal and third-party auditing services.
Authentication and Trust
Authentication verifies that an entity is truly who or what it claims to be, and trust rests on either direct knowledge of each communicating party or the assertion of a trusted third party. Before an application can grant access to the resources it manages, the application must be made aware of who or what is requesting the access. In other words, requestors for access must identify themselves positively before they are trusted to access the application. Authentication takes three basic forms: who you are (fingerprint, retina scan, and so on), what you know (username, password, and so on), and what you have (smart card and so on). The best security schemes combine more than one form of authentication.
Identification is a basic security requirement of any application. All requesting entities, whether a person, device, or process expected to access the application, must be assigned a unique identifier (UID). This unique identifier, such as a user ID or username, forms the basis for establishing what can be accessed and can also be used for auditing purposes. Additionally, with audit logging enabled, the unique identifier's owner can be held accountable for all accesses made by that identifier.
Consequently, the identifier's owner should be the only one able to use that identifier to access an application. This requirement is where authentication becomes involved. For example, to authenticate a remote host or person, a local host requests a username and password (from the other host or person) and verifies that the username and password are valid by comparing them to values stored on an authentication service or server. Authentication determines a user identity and then verifies that information. The username and password example relies on the fact that the password is only known by the owner of the username.
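The username and password check described above can be sketched in Python as follows. This is a minimal illustration, not a production design: the in-memory CREDENTIALS dictionary stands in for an authentication server, the account name and password are hypothetical, and the use of salted PBKDF2 hashing is an assumption added here so that the stored value is not the password itself.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=100_000):
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for a new password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, digest


# Hypothetical credential store standing in for an authentication server.
CREDENTIALS = {"alice": hash_password("correct horse")}


def authenticate(username, password):
    """Return True only if the username exists and the password matches."""
    record = CREDENTIALS.get(username)
    if record is None:
        return False
    salt, stored_digest = record
    _, candidate = hash_password(password, salt)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(candidate, stored_digest)
```

The sketch only works as authentication if, as the text notes, the password is known solely to the owner of the username.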
When considering authentication requirements for an application, remember that rather than build an entirely new authentication structure in the application, using external authentication services that are available to the application might be more desirable. For example, applications can call upon authentication services that are available at the operating system level, such as Windows NT domain login or UNIX user accounts. An advantage of choosing this option is that it offers a Single Sign-On (SSO) approach, meaning the requestor has one less username to remember, and one less logon procedure to encounter.
A trust relationship is required for two or more discrete security zones to share user information or device access. Explicit trust relationships must exist between multiple authentication domains or realms, and are central to sound security practice.
Hierarchical trust models are implicitly easier to manage than flat trust models. With the advent of Microsoft Active Directory Services, Windows trust domains have moved away from a flat trust model to a hierarchical trust model. Hierarchical chains of trust link digital certificates from a certificate authority (CA) or software application during secure, encrypted communications used in e-commerce transactions.
An authorization procedure defines a formal request process to grant access privileges. Authorization procedures occur prior to using access controls. Ask the following questions when establishing authorization procedures:
Who is allowed to request or approve access? Can end users request access for themselves? Depending on the privilege level of the access being requested, another entity will most likely have to approve the request. For example, access requests for end users should be submitted and approved by the responsible supervisor/manager or sponsor.
How can an access request be validated? A customer requesting access to services should have a way to register and apply for the access. Sufficient information should be collected during the registration process to allow the requestor's identity to be authenticated, and for the requestor's entitled privileges to be verified (customer entitlement).
How can the proper use of accessed resources be ensured? To ensure proper use of authorized access, authorized users must understand their responsibilities. Approval does not mean that authorized access can be exploited. End users must agree to usage conditions, as prescribed by surrounding business policies. Appropriate audit and system logs must be maintained and reviewed to ensure responsible usage. End users will be tempted to share their usernames and passwords when access problems occur. Warn against this, because it undermines the validity of security audit trails and tracking.
Access Control Mechanisms
Access control restricts access privileges to the resources available to an application. Access control mechanisms determine who can access what resources and perform what actions. This makes access control requirements a crucial part of ensuring that an application functions securely.
Access control mechanisms should do the following:
Define various privilege levels to different user security profiles: For example, the select privilege to data in a table can be granted to Security Profile A; the select and update privileges can be granted to Security Profile B, and the delete privilege can be granted to Security Profile C. Typically, each security profile represents the access required by a group of end users with similar roles or functions. In the case of databases, create customized views of the original data tables being queried to support such roles or functions.
Assign a security profile to each end user: Continuing with the previous example, users X, Y, and Z can be assigned Security Profile A; users K, L, and M can be assigned Security Profile B; and so on. Instead of granting access to individual user IDs, access is granted to each security profile, reducing the amount of effort required to maintain user accounts. Generally, where database security is concerned, a group ID or secondary authorized ID is granted the appropriate privilege to the database resource. For most other protected resources, an access control list (ACL) is maintained for each resource. In the ACL, each security profile is granted the appropriate privilege to that resource.
Manage user accounts and policies: A facility must be available for managing user accounts. This facility must include the administrative functions of adding, modifying, deleting, and revoking or suspending user accounts. The following is a checklist of common security policies for password-based user accounts:
Account lockout must occur after five unsuccessful logon attempts.
Assigned passwords for new or reactivated accounts must be generated randomly.
New or reactivated accounts must require a new password to be entered at initial logon by the account's owner.
Passwords must have a minimum length of six alphanumeric characters.
Passwords must not be displayed when being entered but should be entered twice for verification.
Passwords must be different from previously used passwords.
Passwords must expire after 60 days.
Passwords must be stored in protected form using a one-way (hash) algorithm.
Password reset functions must be kept separate from full administrative rights, for help desk purposes.
Encrypt stored and transmitted data: Where necessary, an application must have the capability to encrypt sensitive or confidential data. However, encrypting large amounts of data can cause severe overhead to occur during data retrieval. If encryption is required, a public-key encryption algorithm, such as RSA, or a symmetric-key encryption algorithm, such as Data Encryption Standard (DES), triple DES, RC4/5/6, or International Data Encryption Algorithm (IDEA), should be selected. Select an algorithm that performs well and is an industry standard (to ensure interoperability). As an alternative to software encryption, hardware-based solutions, such as encryption cards for web servers, can offload the processing demands of applications requiring strong encryption of large amounts of data.
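Several items in the password-account checklist above translate directly into simple checks. The following Python sketch covers only the lockout, minimum length, and expiry rules. The function names are hypothetical, and the interpretation that lockout takes effect at the fifth failed attempt is an assumption.

```python
from datetime import date, timedelta

MAX_FAILED_LOGONS = 5                   # lockout threshold from the checklist
MIN_PASSWORD_LENGTH = 6                 # minimum length from the checklist
PASSWORD_MAX_AGE = timedelta(days=60)   # expiry period from the checklist


def is_locked_out(failed_attempts):
    """Lock the account once the failed-logon count reaches the threshold."""
    return failed_attempts >= MAX_FAILED_LOGONS


def meets_length(password):
    """Check only the minimum-length rule."""
    return len(password) >= MIN_PASSWORD_LENGTH


def is_expired(last_changed, today):
    """A password expires once it is more than 60 days old."""
    return today - last_changed > PASSWORD_MAX_AGE
```

Keeping the thresholds as named constants makes it easier to adjust them when the business policy changes.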
Access control mechanisms should also apply to managing administrative or super-user accounts. Furthermore, such highly privileged accounts should be constrained to a limited number of personnel who are directly responsible for administering the user accounts, applications, or systems.
How these access control mechanisms are actually implemented and managed depends largely on the business policies that define an application's use and its access requirements. The more granular the access requirements are, the more complex the access control structure is, and the more resources are required to maintain the structure.
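The security profile example from this section (Profiles A, B, and C, and users X, Y, Z, K, L, and M) can be sketched as two lookup tables in Python. The table and function names are hypothetical. A real application would more likely rely on database grants or per-resource ACLs, as described above.

```python
# Privileges granted to each security profile, as in the example above.
PROFILE_PRIVILEGES = {
    "A": {"select"},
    "B": {"select", "update"},
    "C": {"delete"},
}

# Each end user is assigned one security profile.
USER_PROFILE = {
    "X": "A", "Y": "A", "Z": "A",
    "K": "B", "L": "B", "M": "B",
}


def is_allowed(user, action):
    """Return True if the user's profile grants the requested action."""
    profile = USER_PROFILE.get(user)
    # Unknown users have no profile and therefore no privileges.
    return action in PROFILE_PRIVILEGES.get(profile, set())
```

Because privileges attach to profiles rather than to individual users, changing what a role may do requires editing one profile entry instead of many user accounts.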
Confidentiality is the characteristic of information being disclosed only to authorized persons, entities, and processes at authorized times and following authorized protocol. Whether an application manages customer data or internal company data, a business organization is responsible for protecting the privacy of such data.
Customers believe they possess the right to have their private information protected, and this confidentiality might very well be a legal requirement. A business would do well to maintain a trustworthy relationship with its customers by respecting their right to privacy.
Company proprietary information that is sensitive in nature also deserves to have its confidentiality protected. Only authorized parties must be granted access to information that has been identified as confidential. Data encryption within the application also supports the data confidentiality goal.
Integrity refers to the assurance that data is not modified in an unauthorized or unintended manner. In other words, integrity preserves the accuracy and completeness of the data. Even for data that may be classified as nonconfidential, applications must be designed to preserve the integrity of the stored data.
Integrity maintenance begins with the application. Applications must be designed to collect accurate and complete data. For example, data entry fields must contain logic to check for valid values. Edit checks include numeric-only values, date checks and formats, ranges of values, drop-down lists, and so on. Logical checks should ensure that interdependent data elements relate to each other sensibly. For example, it would not make sense for a person to indicate a 1970 birth year, and subsequently specify an age range of 19 to 25 (in the year 2002).
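A logical check of interdependent fields, such as the birth year and age range in the example above, might look like the following Python sketch. The function name, the error messages, and the lower bound of 1900 for a plausible birth year are assumptions made for illustration.

```python
def validate_entry(birth_year, age_low, age_high, current_year):
    """Return a list of error messages; an empty list means the entry is valid."""
    errors = []
    # Range check: 1900 is an assumed lower bound, not a rule from the text.
    if not (1900 <= birth_year <= current_year):
        errors.append("birth year out of range")
    if age_low > age_high:
        errors.append("age range is reversed")
    # The person's age is current_year - birth_year, or one less if this
    # year's birthday has not yet occurred.
    age_max = current_year - birth_year
    age_min = age_max - 1
    if age_max < age_low or age_min > age_high:
        errors.append("age range does not match birth year")
    return errors
```

For a 1970 birth year checked in 2002, an age range of 30 to 35 passes, while an age range that excludes both 31 and 32 is reported as inconsistent.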
As with confidentiality, data integrity's goal is further achieved by properly implementing access control mechanisms. Applications must be designed to handle different privileged functions that affect data integrity, such as add, delete, or modify. Defining separate data entry panels or forms for each of these functions might be necessary.
The purpose of auditing an application is to ensure that all functions available in the application operate and are being used as intended. Information assets must be controlled and monitored with an accompanying audit log to report any modification, addition, or deletion to the information assets. These logs must report the user or process that performed the actions. Affixing the date, time, and responsibility to an individual must be possible for all significant events. Audit logs must also be protected from alteration and destruction. To accomplish this protection, audit logs can be offloaded to an isolated log server where access is more restricted.
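As a minimal sketch of the audit record described above, the following Python code captures who performed an action, what the action was, what it affected, and when. The field names and the JSON-lines format are assumptions. Forwarding the lines to an isolated log server is left out.

```python
import json
from datetime import datetime, timezone


def audit_record(user, action, target, when=None):
    """Build one audit entry recording who did what, to what, and when."""
    when = when or datetime.now(timezone.utc)
    return {
        "timestamp": when.isoformat(),
        "user": user,
        "action": action,   # for example "add", "modify", or "delete"
        "target": target,
    }


def append_audit(log_lines, record):
    """Append the record as one JSON line; earlier lines are never rewritten."""
    log_lines.append(json.dumps(record, sort_keys=True))
```

A hypothetical call such as audit_record("jsmith", "delete", "orders/1001") yields an entry that ties the deletion to one identifier and one moment in time.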
The outcome of auditing an application might include the following findings:
Auditing an application might uncover inadequate separation of duties and responsibilities, as provided by the existing functional design in the application. Modifications in the application or security profiles might be necessary to correct this.
An audit might reveal inappropriate assignment of privileges to end users because of a lack of well-defined authorization procedures.
An audit might discover inadequacies or deficiencies in user account administration, such as the existence of accounts that must be deleted. Deficiencies can also include poor password management policies: a minimum password length of two characters, passwords that do not expire, passwords that are stored in clear text, and so on.
An application audit might further discover inappropriate access made by an authorized person.
In CISS, accountability means the ability to uniquely trace an individual's or an institution's actions to that individual or institution. The ability to audit the actions of all parties and processes that interact with information leads to accountability. Roles and responsibilities should be clearly defined, identified, and authorized at a level commensurate with the information's sensitivity and criticality. The relationship between all parties, processes, and information must be clearly defined, documented, and acknowledged by all parties. All parties must have responsibilities for which they are held accountable.
With the advent of digital certificates and digital signatures, accountability can be enforced through nonrepudiation. Nonrepudiation essentially provides proof that an event or transaction has taken place. With nonrepudiation abilities in place, the initiator or author of an event or transaction cannot deny responsibility for that event or transaction.
To enforce accountability, you can require client-side certificates for secure access to important resources or transactions. In this situation, a server provides a digital certificate that the client identifies as trusted, and the client machine contains a digital certificate that the server acknowledges as coming from an appropriate trusted CA. This level of authentication, when combined with usernames, passwords, and firewall conduits, allows the utmost in trusted access control.
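One way to require client-side certificates is sketched below using Python's standard ssl module, which implements TLS, the successor to SSL. The function name and the file path parameters are hypothetical placeholders. No certificate files are loaded unless paths are supplied, so the sketch shows only the configuration step.

```python
import ssl


def make_server_context(ca_file=None, cert_file=None, key_file=None):
    """Build a server-side context that rejects clients without a certificate."""
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Require every connecting client to present a certificate.
    context.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        # CA certificate(s) trusted to have issued the client certificates.
        context.load_verify_locations(cafile=ca_file)
    if cert_file:
        # The server's own certificate, which the client checks in turn.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context
```

With both sides presenting certificates, each party can verify the other against a trusted CA before any username or password is exchanged.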
Examples of Security Technologies
This section examines examples of security technologies. See Appendix A for further security references, as this section only begins to discuss the complex issues involved in application security.
Security technologies are technical standards, tools, and services designed to provide security functions that enable secure applications. Security technologies serve as technical safeguards and countermeasures that protect information assets from compromise, loss, or destruction. Additionally, security technologies can manage complex security functions, such as security profile management.
Security technologies extend an application's functionality beyond its basic design. The final result is that the application is made more accessible and available through security. For example, a web-based application that conducts business involving sensitive information over the Internet would not be usable if its sessions were not encrypted with the Secure Sockets Layer (SSL) protocol and technology. Using SSL enables the application to be more readily accepted and used.
A security service accepts and processes requests for security privileges. Security services include the following:
GSSAPI: Applications use the Generic Security Service Application Program Interface (GSSAPI) to take advantage of centralized security services. The GSSAPI enables an application to see its users' authenticated identity, check their privileges, and record their activities. Because the GSSAPI is product independent, the application does not need to know which product provides the services. The GSSAPI can use one security product in one context and a different one in another.
Single Sign-On (SSO): SSO's purpose is twofold. First, it centrally manages a multitude of user accounts that exist across multiple platforms, systems, and applications. Second, it allows the end users to sign on only once for authentication purposes. Their authorized rights to all systems and applications will then be accessible. SSO is ideal for an environment with multiple applications that possess their own user account security structures. The recommended approach to SSO is to ensure that you use stronger authentication methods with it (for example, digital certificates, smart cards, and biometrics). Kerberos and Lightweight Directory Access Protocol (LDAP) are common mechanisms deployed to achieve single sign-on.
A directory service provides access to users and applications by identifying resources on a network.
An example of directory services is the X.500 standard. The standardized infrastructure of the Open Systems Interconnection (OSI) application layer includes the Directory, a specialized database system used by other OSI applications, and by people, to obtain information about objects of interest in the OSI environment. Typical X.500 Directory objects correspond to systems, services, and people. Information found in the Directory includes telephone numbers, e-mail addresses, postal addresses, network node addresses, public-key identity certificates, and encrypted passwords.
You should know about the following technologies (each is discussed in turn in the following paragraphs):
- Public Key Infrastructure
- Intrusion-Detection Systems
- Virtual Private Networks
Public key infrastructure: Enabling public key infrastructure (PKI) in an application means that the application can employ digital certificates and digital signatures for encryption, identification, authentication, access control, authorization, accountability, and nonrepudiation. This functionality positions the application to function securely and scale well into the future.
Specific terminology associated with public key infrastructure includes public key, private key, certificate authority, secure shell (ssh), Pretty Good Privacy (PGP), and keyserver. See the Glossary for a description of public key cryptography and encryption terminology, and Appendix A for several public key infrastructure references.
Intrusion-detection systems: A host-based intrusion-detection system (IDS) is an effective tool that monitors events in an application log for attacks, anomalous activities, abnormal resource utilization, and impaired availability. A network-based IDS can monitor for attacks against, and unusual behavior of, the application across network segments. An IDS furthers the application's availability by alerting for appropriate responses to potential threats before any major damage occurs. Specific terminology related to an IDS includes tripwire, honey pot, and packet sniffer. See the Glossary for a description of IDS and Appendix A for references to books about security.
Virtual Private Networks: Virtual Private Networks (VPNs) make applications securely accessible over the public Internet. For example, business partners and employees can access the company intranet using a VPN tunnel across the Internet. VPNs essentially extend the accessibility and availability of internal application resources securely over the Internet. Many VPN connections use IPSec, which encapsulates packets bound for remote VPN targets in encrypted packets. Some firewalls, however, do not accept IPSec packets as legitimate, so other encryption methods must be used in these situations.
Prior to the development of VPN technologies, dialup connections provided secure remote access to company intranet applications. This involved multiple phone lines, both for the company and for the remote users. VPNs take advantage of pre-existing network connections, avoiding the expense of modems, phone lines, and long distance telephone connection charges.
Reliability is the probability that a system or a system's capability functions without failure for a specified time. Reliability is also defined as the probability that an item will perform its intended function for a specified interval under stated conditions, and is measured by time-to-failure, in hours, cycles, miles, missions, and so on.
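To make the idea of reliability as a probability over a time interval concrete, the following Python sketch uses the common exponential model, which assumes a constant failure rate. That model, and the mean time between failures (MTBF) parameter, are assumptions added here for illustration and are not taken from the text.

```python
import math


def reliability(hours, mtbf_hours):
    """Probability of running `hours` without failure.

    Assumes a constant failure rate of 1 / mtbf_hours (exponential model).
    """
    return math.exp(-hours / mtbf_hours)
```

Under this model, a system with an MTBF of 10,000 hours has roughly a 42 percent chance of running a full year (8,760 hours) without a failure.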
Consider reliability under the following criteria:
- Hardware Reliability
- Software Reliability
Hardware is reaching reliability levels today that were unheard of a few years ago. Hardware reliability systematically reduces, eliminates, and controls system failures that adversely affect a device's performance. In cases where failures cannot be eliminated or controlled because of cost or design limitations, reliability engineering provides data for overall risk assessment.
When a complex hardware system begins its life, it often has a high failure rate in terms of defects per unit of time. As the defects are worked out, the failure rate drops to an acceptable level at which it can remain for many years. Eventually, because of their physical attributes and the forces of nature, components begin to wear out and the failure rate begins to climb. Although the failure rate's rise and fall varies from one hardware system to another, and the time frame associated with the system's useful life can vary by many years, the "bathtub-shaped" trend shown in Figure 4-1 is typical.
Figure 4-1 Bathtub Curve
Although hardware is becoming more reliable, software is, on average, becoming less reliable. Almost without exception, the main culprit is application software. Operating system facilities, which are exploited by many more users, are, consequently, better tested than most applications.
As applications offer more features to take advantage of the rapid growth in system resources, new code must be added. More code means more possibilities for bugs, leading to an increased risk of application software failure.
The decreasing reliability of application software does not mean that operating systems are totally without fault, rather that they contribute relatively few failures to the overall software failure number. Code analysis has determined that for each 1000 lines of code, one or two bugs exist. This would not be such a problem if software grew in increments of thousands of lines of code, but because of constant demands for greater functionality, short development windows, and emphasis on speed-to-market, software grows by millions of lines of code between major releases. Combating this trend is one of the software industry's greatest challenges.
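The arithmetic behind this concern is simple, as the following Python sketch shows. It applies the rate of one or two bugs per 1000 lines of code quoted above. The function name and the two-million-line example are hypothetical.

```python
def estimated_bugs(lines_of_code, bugs_per_kloc_low=1, bugs_per_kloc_high=2):
    """Return the (low, high) estimated bug count for a code base."""
    kloc = lines_of_code / 1000  # thousands of lines of code
    return kloc * bugs_per_kloc_low, kloc * bugs_per_kloc_high


# A hypothetical release that adds two million lines of code.
low, high = estimated_bugs(2_000_000)
```

At that rate, two million new lines of code imply somewhere between 2,000 and 4,000 new bugs in a single release.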
Availability is the probability at any given time that a system or a system's capability functions satisfactorily in a specified environment. If you are given an average downtime per failure, availability implies a certain degree of reliability. Failure intensity, used particularly in the software reliability engineering field, is the number of failures per natural unit or per time unit. It is an alternate way of expressing reliability.
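The relationship between reliability, downtime per failure, and availability can be sketched numerically. The following is a minimal illustration, assuming the common steady-state formula (mean time between failures divided by the sum of mean time between failures and mean time to repair); the function names and the 8760-hour year are choices made for this example, not terms from the text.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up.

    mtbf_hours is the mean time between failures (a reliability measure);
    mttr_hours is the average downtime per failure.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


def failure_intensity(failures: int, hours: float) -> float:
    """Failures per time unit, an alternate way of expressing reliability."""
    return failures / hours


def annual_downtime_hours(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected downtime per year implied by a given availability."""
    return (1.0 - avail) * hours_per_year
```

For example, a system that fails on average once every 999 hours and takes 1 hour to restore has an availability of 0.999, which implies roughly 8.76 hours of downtime per year.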
Enterprises implement high-availability networks for the following reasons:
Prevent financial loss
Prevent lost productivity
Improve user satisfaction
Improve customer satisfaction and loyalty
Reduce IT support costs to increase IT productivity
Minimize financial loss resulting from network unavailability
Minimize lost productivity resulting from network unavailability
Availability is probably the most important property of your computer system. If a system is not available to run the intended workload or perform the tasks vital to your business, the system's speed or memory capacity won't matter. For example, consider the fiercely competitive arena of Internet business. If your web site is not available, or if you can't process an order because your backend server is down, you might lose a potential customer or damage the relationship you have with an existing one. All businesses, not just Internet businesses, need increased availability to stay competitive.
The following are factors within availability:
Ensuring availability through Service Level Agreements and failover methodologies
The text discusses each in turn.
Don't discount user error from discussions and plans regarding reliability. For systems with high availability requirements, mistyping one character can sometimes cause several hours of downtime, expending several years' worth of the unplanned downtime allowance. Rather than hoping that user error will not surface during the lifetime of your project, take it as a given and deal with it like any other requirement or feature.
Expected user errors can be handled and managed as well-structured problems, with well-structured solutions. Computer-human interface research has identified many important design elements associated with user behavior and their likely errors. Identify and highlight end user error, as well as administrative user error, as significant areas for development and testing criteria.
Software developers often feel that users should be responsible for, and aware of, everything about an application's functions, but this bias comes from their own extensive familiarity with the application they are developing. Any developer who demands more intelligent users instead of writing more intelligent software will be sorely disappointed when asked to rewrite code to accommodate end user behaviors.
Outages are those periods of time when the system is not available to perform useful work. Whether your customers are external to your business or users of your own computer systems within your business, a computer system outage represents a major problem for an increasing number of computer users. Any period of computer system outage can directly translate into lost revenue for the business. Typical outages cost an average company 10,000 dollars per minute. This requirement for continuous access to computer systems spans both external customers and internal users. The need for computer availability has never been greater.
Major causes of outages include site failures, cut cables, power outages, fires, and floods, or large-scale disasters, such as earthquakes, hurricanes, and tornadoes. Although companies continue to increase their servers' availability, they find that site failures or disasters pose a very real threat to their competitive edge. Companies should take disaster survivability seriously when considering their availability strategy.
Ensuring Availability through Service Level Agreements and Failover Methodologies
Corporate availability strategies are usually covered by a disaster recovery plan that takes the offsite storage of critical data into account, but it can also include two alternative methods for application availability: Service Level Agreements, and failover methodologies.
Service Level Agreements: Integrating process improvement, best practices, technical expertise, and availability management tools provides the necessary ingredients for a healthy and highly available network. For customers committing to this rigorous process, most service providers provide health and availability assurances in the form of Service Level Agreements (SLAs).
The penalties for missed SLAs include additional high-availability technical resources to fix the availability issue, increased escalation, and financial penalties.
SLAs are the company's means to ensure the following:
Clear performance goals
Heightened awareness, attention, and accountability
Strong linkage between network performance and availability goals and the customer's business requirements
Failover methodologies: A failover is completed by maintaining an up-to-date copy of a database, codebase, or network device on an alternate system for backup. The alternate system takes over if the primary system becomes unusable. This model utilizes duplicate network and server hardware configurations in which one device or server has the active role, and the other is a backup that monitors the active device or server's state. When the back-up device or server detects a hardware or software failure on the active server, it takes over the active server's role and identity. Components are configured to address the "virtual" device name or address, which is alternately hosted or served by the active or back-up server.
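The active/backup relationship described here can be sketched as a small monitor that takes over when the active server goes silent. This is a simplified illustration only: the class name, the heartbeat mechanism, and the timeout value are assumptions made for the example, and a real product would also have to move the virtual address and guard against both servers believing they are active.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Backup server that watches heartbeats from the active server (hypothetical)."""

    timeout_s: float               # silence longer than this triggers takeover
    last_heartbeat_s: float = 0.0  # time of the most recent heartbeat seen
    active: bool = False           # True once this server has taken over

    def record_heartbeat(self, now_s: float) -> None:
        """Called whenever the active server reports that it is healthy."""
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> bool:
        """Assume the active role if the primary has been silent too long."""
        if not self.active and now_s - self.last_heartbeat_s > self.timeout_s:
            self.active = True  # from here on, this server answers for the virtual address
        return self.active
```

With a 3-second timeout and a last heartbeat at time 10.0, a check at 12.0 leaves the backup passive, while a check at 13.5 causes it to take over.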
You can provide failover with a cluster. A cluster is a group of computers, usually referred to as nodes, that are interconnected to provide a single computing resource. Clusters offer much higher availability than a single system and make it possible to avoid both planned and unplanned outages. High availability also refers to being able to service a component in the system without shutting down the entire operation. Given these requirements, a cluster is an ideal solution for customers.
To address the planned outage component, workloads can be moved to another cluster node while administrative changes occur, allowing changes to be made while still maintaining a useful service. When the maintenance activities are completed, the workload can be moved back to its original location, if desired.
Redundant components handle unplanned outages. Redundancy in the cluster hardware (whether redundant servers, networks, storage, or adapters) allows work to continue transparently when one or more hardware or software components fail. A provided service continues by simply switching to other cluster components. If a node fails, service can be moved to another node. If a disk fails, service continues from another disk containing the same data, and so on. A cluster acts as a single, continuously available system in this respect. System resource availability is one of the greatest advantages of clustering.
Clusters provide easy management of your computing resources without restricting the systems' capabilities. You should be able to work with the individual components in addition to the cluster as a whole. Many people think that a cluster is beyond the scope of their requirements, or that it is only for large systems. This belief is unfounded. Any system that is important to a business qualifies for clustering.
Whether for a mission-critical enterprise server or a server for a small workgroup, clustering provides availability. The smallest workgroup servers can have enterprise-class availability characteristics through clustering. The ability to build clusters with small, cheap servers, or by reusing servers that are surplus to requirements elsewhere within the business, allows significant improvements in availability.
Cluster performance needs to be carefully planned and tested. Simulated failures must be tested regularly with clustering technologies because even minor configuration differences can affect whether failover is triggered as desired. A major problem with clustered environments is for a failover to be triggered, but for the alternative node or nodes to not pick up the active state. Expressing cluster configurations as strict statements of propositional logic is often necessary to build reliable cluster configurations. These configurations need to progress from the general to the specific. For example, cluster database triggers might first test for the existence of an active database application process ID before testing a particular tablespace's contents within that database. Appendix A contains the titles of several books that include extensive treatments of highly available systems design.
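The idea of progressing from general to specific tests can be sketched as an ordered list of checks that stops at the first failure. The function and check names below are hypothetical and stand in for whatever probes a particular cluster product provides.

```python
from typing import Callable, List, Optional, Tuple

# A check is a (name, probe) pair; the probe returns True when healthy.
Check = Tuple[str, Callable[[], bool]]


def first_failed_check(checks: List[Check]) -> Optional[str]:
    """Run checks in order, most general first.

    Returns the name of the first failing check, or None if all pass.
    Later, more specific checks are skipped once a general one fails,
    because their results would be meaningless.
    """
    for name, probe in checks:
        if not probe():
            return name
    return None
```

A failover rule could then list a process-existence probe ahead of a tablespace probe, so that the tablespace is queried only when the database process is known to be running.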
You can provide failover using a mirrored site. A mirrored site is a web site that is a replica of an existing site. The mirrored site is used to reduce network traffic, decrease hits on a server, or improve the original site's availability. Mirror sites are useful when the original site generates too much traffic for a single server to support. Such sites also increase the speed with which files or web sites can be accessed.
Users can download files more quickly from a server that is geographically closer to them. For example, if a busy New York-based web site sets up a mirror site in England, users in Europe can access the mirror site faster than the original site in New York. Sites, such as Netscape, that offer copies or updates of popular software often set up mirror sites to handle the large demand that a single site might not be able to handle. Keep in mind, however, that geographic proximity does not necessarily mean faster network access speeds. A T3 connecting New York with Miami is faster than a T1 from New York to Washington, DC.
You can also provide failover using a backup. Backup types are based on the files selected for backup:
Full backup: Backs up all selected files.
Differential backup: Backs up selected files that have been changed. This backup is used when only the latest version of a file is required.
Incremental backup: Backs up selected files that have been changed. If a file has been changed for a second or subsequent time since the last full backup, the file does not replace the already backed-up file; it is appended to the backup medium. This backup is used when each file revision must be maintained.
Delta backup: Backs up only the actual data in the selected files that has changed, not the files themselves. This backup is similar to an incremental backup.
Recovery from a backup occurs when the need arises because of a planned business situation or an unplanned event. Recovery times vary greatly depending on the backup scheme and the types of backups performed. Daily full backups generally result in the shortest recovery time. Weekly full backups with daily incrementals require application of the full backup, and then sequential application of each incremental backup. If this involves the request and retrieval of offsite tapes and recovery media, an outage requiring a full recovery can take several days instead of several hours.
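The restore sequence for a scheme of full backups plus incrementals can be sketched as follows. This is an illustrative model only: the tuple layout and the function name are assumptions for the example, and it treats each incremental as the changes since the previous backup.

```python
from typing import List, Tuple

# A backup is (day number, kind), where kind is "full" or "incremental".
Backup = Tuple[int, str]


def restore_chain(backups: List[Backup], target_day: int) -> List[Backup]:
    """Return the backups to apply, in order, to recover the state at target_day."""
    usable = sorted(b for b in backups if b[0] <= target_day)
    # Locate the most recent full backup taken on or before the target day.
    last_full = max(
        (i for i, b in enumerate(usable) if b[1] == "full"), default=None
    )
    if last_full is None:
        raise ValueError("no full backup available on or before the target day")
    # Apply that full backup, then every incremental taken after it.
    return usable[last_full:]
```

The length of the returned list is a rough indicator of recovery effort: with daily full backups it is always one, whereas late in a weekly cycle it can be a full backup followed by six incrementals.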
Rapid recovery options are expensive and must be carefully planned, and as data usage grows, they must be modified to accommodate growing data requirements and recovery demands. An important part of any disaster and recovery plan is to test recovery procedures periodically. These procedures must be tested for data integrity, as well as completeness. Finding out that the data can be recovered but that the configurations are unavailable can result in a situation where no viable recovered state can be attained.
Using fault tolerance can also provide failover. Fault tolerance is the ability to continue to perform a specified task even when hardware failure occurs. Fault tolerance improves reliability, availability, safety, performance, maintainability, and testability. A fault-tolerant system is designed by building two or more components, such as CPUs, disks, memories, and power supplies, into the same computer. In the event one component fails, another component takes over immediately. Many fault-tolerant computer systems mirror all operations. For example, each operation is performed on two or more duplicate systems so that if one fails the other takes over.
Many systems are designed to recover from a failure by detecting the failed component and switching to another computer system. These systems, although sometimes called fault-tolerant, are more widely known as high-availability (HA) systems. They require the software to resubmit a job when the second system is available.
A fault-tolerant system recovers from a fault in three steps:
- The fault is detected.
- The fault is isolated.
- The fault is corrected.
True fault-tolerant systems are costly because redundant hardware is wasted if no failure occurs in the system. On the other hand, fault-tolerant systems provide the same processing capacity both before and after a failure, whereas high-availability systems often provide reduced capacity after a failure.
Hardware-based fault tolerance can react to component failures instantaneously, without the need for software-based recovery. Cluster-based solutions require anywhere from several seconds to many minutes to recover disk-based data and restart applications. Recovery time in a high-availability cluster varies based on application parameters, such as database size, transaction rate, and type of workload. A system's hardware-based fault tolerance is insensitive to these application parameters and provides uninterrupted service in the event of component failure.
Checkpoints can provide failover. Checkpointing is a simple technique for rollback error recovery, in which an executing program's state is periodically saved to a disk file from which it can be recovered after a failure. In the event of a failure, the last checkpoint serves as a recovery point. When the problem has been fixed, the restart program copies the last checkpoint into memory, resets all the hardware registers, and starts the computer from that point. Any transactions made in memory after the last checkpoint was taken are lost.
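At the application level, the same idea can be sketched in a few lines. The example below is a simplified illustration, not the system-level mechanism the text describes: it saves a small state dictionary to a JSON file at intervals and resumes from it after a simulated failure. All function names, the file format, and the `fail_at` parameter are assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state to a temporary file, then swap it into place atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers never see a half-written checkpoint


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the last saved state, or a copy of default if none exists."""
    if not os.path.exists(path):
        return dict(default)
    with open(path) as f:
        return json.load(f)


def process(items, path, checkpoint_every=3, fail_at=None):
    """Sum the items, checkpointing periodically; fail_at simulates a crash."""
    state = load_checkpoint(path, {"next": 0, "total": 0})
    for i in range(state["next"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += items[i]
        state["next"] = i + 1
        if state["next"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the run fails between checkpoints, the work done since the last checkpoint is lost and is repeated on restart, which mirrors the lost in-memory transactions mentioned above.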
Programming abstraction provides a means to implement fault tolerance at various levels. At the application level, checkpointing can be done with C++ code using a preprocessor that inserts most of the checkpointing calls automatically. This saves the programmer time and work.
Another option to accomplish fault tolerance is to use a low-level checkpointing package to minimize the user's work. The problem with this option is that determining a consistent state is a major issue. Journaling file systems also provide this level of reliability at the disk level and reduce downtime greatly by not having to run lengthy disk integrity checks upon restart.
Another failover option is using hot swappable disks. A hot swap is the replacement of a CPU, hard drive, CD-ROM drive, power supply, or other device with a similar device, while the computer system using it remains in operation. Replacement can be necessitated by a device failure or, for storage devices, a need to substitute other data.
Hot swapping provides a rack or enclosure for the device that presents an appearance to the computer's bus or I/O controller that the device is intact while it is being replaced with another device. A hot swap arrangement where multiple devices are shared on a local area network is sometimes provided. Hot swap arrangements are sold for both Small Computer Systems Interface (SCSI) and Integrated Drive Electronics (IDE) drives. Hot swap versions of a redundant array of independent disks are also available.
Redundant Array of Independent Disks (RAID) provides failover by storing the same data in different places on multiple physical disks. By placing data on multiple disks, I/O operations can overlap in a balanced way, improving performance. Because an array of several disks is more likely than a single disk to experience a drive failure, storing data redundantly is what gives the array its fault tolerance.
A RAID appears to the operating system as a single, logical volume. Some RAID configurations employ striping, which involves partitioning each drive's storage space into units ranging in size from one sector (512 bytes) to several megabytes. The stripes of all the disks are interleaved and addressed in order.
In a single-user system where large records, such as medical or other scientific images, are stored, the stripes are typically set up to be small (perhaps 512 bytes) so that a single record spans all disks and can be accessed quickly by reading all disks at the same time.
In a multi-user system, better performance requires establishing a stripe wide enough to hold standard-sized or large records. This allows overlapped disk I/O across drives.
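The effect of stripe size can be seen by mapping logical offsets to disks. The sketch below assumes a simple round-robin interleave with no parity; the function names and the sizes used are chosen for illustration only.

```python
from typing import Set, Tuple


def locate(byte_offset: int, stripe_unit: int, num_disks: int) -> Tuple[int, int]:
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    unit_index = byte_offset // stripe_unit  # which stripe unit overall
    disk = unit_index % num_disks            # units are interleaved round-robin
    row = unit_index // num_disks            # how many units precede it on that disk
    return disk, row * stripe_unit + byte_offset % stripe_unit


def disks_touched(start: int, length: int, stripe_unit: int, num_disks: int) -> Set[int]:
    """Return the set of disks involved in reading one record."""
    first = start // stripe_unit
    last = (start + length - 1) // stripe_unit
    return {unit % num_disks for unit in range(first, last + 1)}
```

With four disks, a 2048-byte record spans all four disks when the stripe unit is 512 bytes, but sits on a single disk when the stripe unit is 64 KB, leaving the other disks free to serve other users' requests at the same time.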
The following are types of RAID:
RAID-0: This technique has striping but no data redundancy. It offers the best performance but no fault tolerance.
RAID-1: This type is also known as disk mirroring and consists of at least two drives that duplicate data storage. There is no striping. Read performance is improved because both disks can be read at the same time. Write performance is the same as for single-disk storage. RAID-1 provides the best performance and the best fault tolerance in a multi-user system.
RAID-2: This type uses striping across disks with some disks storing error checking and correcting (ECC) information. It has no advantage over RAID-3.
RAID-3: This type uses striping and dedicates one drive to storing parity information. The embedded error checking (ECC) information detects errors. Data recovery is accomplished by calculating the exclusive OR (XOR, a Boolean operation that is true when exactly one of two conditions is true) of the information recorded on the other drives. Because an I/O operation addresses all drives at the same time, RAID-3 cannot overlap I/O. For this reason, RAID-3 is best for single-user systems with long record applications.
RAID-4: This type uses large stripes. You can read records from any single drive. This allows you to take advantage of overlapped I/O for read operations. Because all write operations have to update the parity drive, no I/O overlapping is possible for writes. RAID-4 offers no advantage over RAID-5.
RAID-5: This type includes a rotating parity array that addresses the write limitation in RAID-4. All read and write operations can be overlapped. RAID-5 stores parity information rather than a duplicate copy of the data; the parity information can be used to reconstruct data. RAID-5 requires at least three, and usually five, disks for the array. It is most suitable for multi-user systems in which performance is not critical or few write operations are performed.
RAID-6: This type is similar to RAID-5 but includes a second parity scheme that is distributed across different drives and offers extremely high fault- and drive-failure tolerance. Few, if any, RAID-6 commercial examples currently exist.
RAID-7: This type includes a real-time, embedded operating system as a controller, caching through a high-speed bus, and other characteristics of a stand-alone computer.
RAID-10: This type offers an array of stripes in which each stripe is a RAID-1 array of drives. This offers a higher performance than RAID-1 but at a much higher cost.
RAID-53: This type offers an array of stripes in which each stripe is a RAID-3 array of disks. This offers a higher performance than RAID-3 but at a much higher cost.
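The XOR-based recovery used by the parity RAID levels in the list above can be demonstrated in a few lines. This is a toy sketch on in-memory byte strings, with made-up block contents; a real controller works on disk sectors and handles many details omitted here.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data drives and one parity drive (contents are illustrative).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second drive: XOR the survivors with the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, combining the surviving data blocks with the parity block yields the missing block, so `rebuilt` equals the contents of the lost drive.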
Scalability is the ability to change size or configuration to suit changing conditions. For example, a company that plans to set up a client/server network might want to have a system that not only works with the number of people who will immediately use the system, but also with the number who might be using it in ten years.
Scalability also refers to how well a hardware or software system can adapt to increased demands. A scalable network system can start with just a few nodes but can expand to thousands of nodes.
Scalability is an important feature because it means that you can invest in a system with confidence that you will not outgrow it. Scalability is the ease with which a system or component can be modified to fit a problem area. Achieving scalability is usually a combination of good database management (caching, connection pooling), shifting to multiple copies of the application server, using failover capacity, and utilizing load balancing.
This section discusses software scalability, network scalability, application architecture, and hardware scalability.
Scalable application components can support more than a single instance of an application and retain context, which means retaining the conditions, parameters, and arguments used in a previous operation or query. Scalable application components also rely on the software's capability to employ threads while it is in operation. A thread is an independent path of execution within a process.
Writing multithreaded applications enables more efficient use of resources and more scalable applications. Software components that can be distributed across multiple platforms are a characteristic of scalable software solutions. This allows hardware additions and resource expenditures to be strategically matched to the principal growth parameters that affect system performance and growth. Multiprocessing and multithreading are software mechanisms that can achieve the performance and scalability potential offered by Symmetric Multiprocessing (SMP) and Massively Parallel Processor (MPP) systems.
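As a small illustration of multithreading, the following sketch handles several requests concurrently with a thread pool from the Python standard library. The `handle_request` function and its short sleep are placeholders for real work that spends time waiting on I/O.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(request_id: int) -> int:
    """Placeholder for a request that mostly waits on I/O."""
    time.sleep(0.05)  # stands in for a database or network call
    return request_id * 2


def serve(requests, workers: int):
    """Process requests on a pool of worker threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_request, requests))
```

While one thread waits, others can make progress, so raising `workers` shortens the total elapsed time for I/O-bound requests without changing the results.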
When a developer creates an application, much of the effort goes into defining the scalability of the application. Questions involving scalability include the following:
How many records can the system hold?
How many users can use the system at one time?
How long will retrieving a piece of data from the system take?
How well does the application grow?
In a traditional application, the developer provides answers to these questions when the system is being built. If the developer fails to address them properly, the application will not scale. As more users begin to access and place a greater load on the system, failures begin to occur. The system will buckle under the load of more users than it was designed to handle.
In a transaction-processing environment, the preceding questions are answered without the developer even having to think about them. The developer creates components that interact with the system as if they had exclusive access to it and had its full resources available to them. The environment efficiently manages the resources of the multiple objects that could be active at any given time so that each one runs successfully. The environment can reuse existing objects when they are needed rather than create new ones, and can destroy objects after their work is completed. By managing the objects' lifetime, the environment ensures that as more objects are needed when more users access the system, the system will be able to support them.
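Object reuse of this kind is often implemented as a pool. The sketch below is a minimal, hypothetical version: `Connection` stands in for any object that is expensive to create, and the pool hands out existing instances instead of constructing new ones for each caller.

```python
import queue


class Connection:
    """Stand-in for an object that is expensive to create."""

    created = 0  # counts how many instances have ever been constructed

    def __init__(self):
        Connection.created += 1


class Pool:
    """Fixed-size pool that reuses objects rather than creating new ones."""

    def __init__(self, size: int):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(Connection())

    def acquire(self, timeout: float = 1.0) -> Connection:
        # Waits up to `timeout` seconds if every object is in use,
        # then raises queue.Empty.
        return self._idle.get(timeout=timeout)

    def release(self, obj: Connection) -> None:
        self._idle.put(obj)
```

However many times callers acquire and release, the number of objects created stays at the pool size, which keeps resource use bounded as the user population grows.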
Mission-critical activity over the network is on the rise. A scalable network is more important than ever on both the Internet and on corporate intranets because new network applications, often with web browser interfaces, make networks easier to use. The result is that more people are on the network, generating more traffic per user, at an explosive rate. Many new products, protocols, and services have been created to help address this growing need for highly scalable network solutions.
Cisco Systems, Inc. provides scalability services beyond the conventional hardware approach of increasing bandwidth and port density. Advanced Cisco IOS technologies help achieve performance capacity from network infrastructure with minimal upgrade requirements using Cisco IOS scalability services, protecting your investment for years to come. Such services include the following:
Tag Switching: This highly scalable technology improves performance over large enterprise networks or over Internet service provider networks by speeding delivery.
NetFlow Switching: This service streamlines the way packets are processed and features are applied. NetFlow switching provides high performance for switching and for higher-layer services, such as quality of service (QoS), security, and traffic management.
Express Forwarding: This service is a new technique from Cisco for scaling Internet backbones through distributed packet forwarding. Express forwarding is one of the fundamental capabilities required for modern Internet routers and next-generation Internet routers to handle the increased load, dynamic traffic patterns, and new applications of the Internet.
Network Address Translation (NAT): This service allows enterprises and service providers to conserve valuable IP addresses by hiding internal IP addresses from public networks, such as the Internet. NAT reduces time and costs by easing IP address management.
Software-Based Compression: This service increases performance by reducing the amount of traffic over expensive WAN lines.
DistributedDirector: This service provides dynamic, transparent, and scalable Internet traffic load distribution of all IP traffic between multiple servers across topological distances.
Application architecture plays an important part in an application's scalability. During the design phase, anticipated growth and other influences on an application's operational environment must be taken into account. Remember the following concepts while designing a solution to ensure a better level of scalability:
Use servers and storage devices in your design that can be upgraded or added to without having a negative cost or time impact on your application. This design includes server clusters or servers that can easily add processors and memory.
Consider using Operational Service Providers to provide some of your application architecture's operational components.
Application Service Providers should also be considered during the design phase when appropriate.
Operational Service Providers and Application Service Providers are options that can give you a pay-as-you-go implementation, helping minimize cost and reduce risk. Using a three-tier client/server architecture with Transaction Processing monitor (TP monitor) technology results in an environment that is typically more scalable than a two-tier architecture with a direct client-to-server connection.
From a hardware perspective, processor architecture can impact scalability. Symmetric Multiprocessing (SMP) and Massively Parallel Processing, or Massively Parallel Processor (MPP), architectures offer the potential for parallel execution.
Operating systems and databases can execute tasks on independent processes or threads, which increases the system's or database's performance and scalability. However, the work performed within individual processes (or threads) is constrained by the speed and resources of a single processor.
The number of processors your server supports also influences an Internet business application's scalability. Adding processors to your system might not always produce the results you expect. The reason for the poor performance is contention: lock, bus, or cache line contention. In contention, the processors fight over the ownership of shared resources instead of doing productive work. Having system monitoring tools in place and analyzing the data is essential for understanding scalability problems.
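One simple way to see why added processors can disappoint is Amdahl's law, which the text does not name but which models a related effect: any fraction of the work that must run serially (for example, while holding a shared lock) limits the achievable speedup. The sketch below assumes a fixed serial fraction and ignores bus and cache effects.

```python
def speedup(processors: int, serial_fraction: float) -> float:
    """Amdahl's law: the best-case speedup when part of the work cannot run in parallel."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)
```

With 10 percent of the work serialized, eight processors yield a speedup of about 4.7 rather than 8, and no number of processors can exceed a speedup of 10.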
Storage Area Networks (SANs) are one hardware scalability solution. SANs are back-end networks of storage devices connected through a variety of standard peripheral channels and Fibre Channel. You can implement SANs in two ways: centralized or decentralized. A centralized SAN ties multiple hosts into a single storage system, which is a Redundant Array of Independent Disks (RAID) with large amounts of cache and redundant power supplies. The cabling distances allow local, campus-wide, and even metropolitan-wide hookups over peripheral channels rather than over overburdened networks. Storage networks enable a new approach to widespread sharing of large volumes of storage and, by implication, large amounts of data.
Application performance involves several components that can affect an Internet business solution's performance. User productivity and perception are key measures of the success of Internet business computer applications.
Unlike traditional system and network management, good performance management focuses on the area between an application being up and an application being down. Most organizations know when one of their applications is up or down, but they don't know what is happening in terms of application performance from an end-user perspective, or whether a significant new application will perform acceptably after it is deployed into the network.
The following performance issues are discussed:
- Architectural design strategies
- Key performance factors
- Custom-developed code
- Operating systems
- Application clients
Baselining is a critical technique in determining application performance. Performance baselining involves taking measurements at regular intervals, over a long enough period of time to encompass the normal and busy times. For example, you might find high utilization at certain times every day, but month-end business activity increases utilization to a new, higher overall level. The baseline needs to identify these critical activity peaks.
The frequency at which you take samples and the ways in which you interpret the raw data directly influence the end result of a baseline measurement. For the baselining effort to yield meaningful results, you must do the following:
Collect only the data that is pertinent.
Collect data at an acceptable frequency. Network and application usage patterns tend to vary.
Collect the data for an acceptably long period of time to ensure you have enough data to work with.
Determine the level of service. Identify metrics and acceptable performance levels. These are typically used in a Service Level Agreement and can include file download times, network latency measures, and application or server response times.
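The baselining steps above can be sketched as a small aggregation over collected samples. The sample layout (hour of day paired with a utilization percentage), the function names, and the threshold are assumptions made for this illustration.

```python
from collections import defaultdict
from statistics import mean


def baseline_by_hour(samples):
    """Average utilization per hour of day.

    samples is an iterable of (hour_of_day, utilization_percent) pairs
    collected at regular intervals over the baselining period.
    """
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


def peak_hours(baseline, threshold):
    """Hours whose average utilization meets or exceeds the threshold."""
    return [hour for hour, avg in baseline.items() if avg >= threshold]
```

Grouping by hour of day is only one possible interpretation of the raw data; grouping by day of month instead would expose month-end peaks of the kind described earlier.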
Applications are continually being added and upgraded, and user community responsibilities and their locations are dynamic. Based on this, some form of baselining activity is needed full-time to design and properly maintain a viable infrastructure.
Architectural Design Strategies
Your enterprise's application performance is also affected by design strategies and implementation choices. These architectural options include the following:
Logical packaging: The foundation of a high-performance application always includes good, logical packaging. Logical packaging groups related application services into common components.
Physical deployment: Another important design strategy is to physically deploy the components on the network for optimum efficiency. Application components that have a lot of interaction should be deployed as close to one another as physically possible. For example, place your application servers on the same network segment as your database servers. If database servers cannot be located on the same network segment, data replication or multiple database servers can be employed to improve overall database performance.
Multiple instances and reuse: Multiple instances of a program mean that the program has been loaded into memory several times. This provides increased throughput by allowing the application to work on multiple requests.
A system's components are often made up of custom-developed, or coded software. Poorly executed custom-developed application code falls into four major categories:
Excessive requests for system services, typically I/O
Waiting for system service requests to complete
Implementing the following four concepts reduces the impact created by the problem areas described:
Remove unnecessary or unexecuted code.
Perform processing only when required.
Perform required tasks online and move processing to batch process when appropriate.
Perform required processing efficiently. "Tuned-up" reusable modules can improve quality and execution, and reduce development time.
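As one possible illustration of "perform processing only when required," the sketch below caches the result of a lookup so that repeated requests for the same input do not repeat the work. The function `shipping_rate`, its rates, and the `call_count` counter are hypothetical and exist only to make the effect visible.

```python
from functools import lru_cache

call_count = 0  # counts how often the underlying work actually runs

@lru_cache(maxsize=128)
def shipping_rate(region):
    """Stand-in for an expensive lookup, such as a database query."""
    global call_count
    call_count += 1
    rates = {"domestic": 5.0, "international": 20.0}
    return rates.get(region, 10.0)

requests = ["domestic", "domestic", "international", "domestic"]
totals = [shipping_rate(region) for region in requests]
```

Four requests are answered, but the lookup body runs only twice, once per distinct region. Caching is appropriate only when the underlying data changes rarely or the cache can be invalidated when it does.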
Key Performance Factors
When you explore performance tuning and application optimization, you might think there is an endless array of possible configurations and parameters. Although this is essentially true, the following key performance factors account for the majority of performance gains that you can achieve:
- Workload growth
- Stored procedures and triggers
Each is discussed in the following paragraphs.
Compiler: Be aware of the compiler options available and how they are used. Turning off debug and removing all extraneous logging can greatly increase performance. Make sure that the correct version and appropriate libraries are included, especially operating system libraries and libraries for other supporting devices (such as video drivers). The execution modules' distribution and how the processing is dispersed can affect system performance.
Workload growth: Different kinds of workload growth affect application performance, including the following:
User population: More users mean increased transactions, more component access, increased CPU consumption, more network traffic, and additional database access. For example, when the user population doubles, the network and database workloads probably double as well.
Database changes: Databases grow in many ways, including data complexity, stored data volume, and database usage.
Transaction complexity: Applications and their transactions tend to become more complicated. Newly added cross-application interfaces and their associated transaction coordination can create unnoticed resource consumption or blockage.
Component allocation: With distributed components, on-demand component allocation consumes server computer resources. Without service queuing and object pooling, the server computer eventually loses efficiency and could fail.
Application population: As applications become easier and less expensive to build, organizations use computers in new ways. These new applications increase the load on existing databases, server computers, and networks.
The workload your application experiences over time is generally predictable. After your application is up and running and establishes baseline performance statistics, you can identify workload growth trends and patterns using various performance analysis tools.
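A simple way to look for a workload growth trend is to fit a straight line to the baseline statistics. The sketch below assumes evenly spaced observations and roughly linear growth; the monthly figures, the capacity limit, and the function names are hypothetical, and a real performance analysis tool would typically offer richer models.

```python
def linear_trend(values):
    """Least-squares slope and intercept for evenly spaced observations."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def periods_until(values, capacity):
    """Periods after the last observation until the trend reaches capacity.

    Returns None when the workload is flat or shrinking.
    """
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    periods_from_start = (capacity - intercept) / slope
    return max(0.0, periods_from_start - (len(values) - 1))

# Hypothetical peak concurrent users recorded once per month.
monthly_peak_users = [1000, 1100, 1200, 1300, 1400]
slope, intercept = linear_trend(monthly_peak_users)
months_left = periods_until(monthly_peak_users, capacity=2000)
```

With these figures the trend grows by about 100 users per month, so a 2000-user capacity limit would be reached roughly six months after the last observation.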
Database: Database access often imposes the largest performance penalty on your application. This is especially a concern with distributed applications where multiple clients simultaneously access common tables and rows. Although choosing the right data access technology solves an important part of your high-performance requirement, most of your application's database access speed comes from careful data-structure modeling, query optimization, and careful handling of multi-user concurrency situations.
Indexing: Indexing improves the time it takes to access data in the database. Indexes affect not only query performance but also the performance of update, insert, and delete statements. The proper or improper use of indexing on columns within a database can greatly affect the performance of a database. Typically, the primary key for each table is indexed. Foreign keys used with other tables are often indexed to improve performance.
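The effect of indexing a foreign key can be observed by asking the database how it plans to run a query. This sketch uses SQLite through Python's built-in `sqlite3` module only because it needs no setup; the table names, the index name, and the query are hypothetical, and other database products expose query plans through their own commands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "total REAL)"
)

def query_plan(sql):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

lookup = "SELECT total FROM orders WHERE customer_id = 42"

plan_before = query_plan(lookup)  # no index on the foreign key yet
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
plan_after = query_plan(lookup)   # the planner can now use the index
```

Before the index exists, the plan describes a scan of the whole `orders` table; afterward it names `idx_orders_customer`. The same index adds a small cost to every insert, update, and delete on `orders`, which is why indexes are chosen rather than added to every column.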
Normalization: Database normalization organizes the contents of the tables for transactional databases and data warehouses. Normalization is used after the initial data objects have been identified in a database and usually duplicates data by creating additional tables. This data duplication should not be confused with redundant data, which is the unnecessary duplication of data. Normalization is part of a successful database design. Without normalization, database systems can be inaccurate, slow, and inefficient, and might not produce the results you expect.
A poorly normalized database and poorly normalized tables cause problems ranging from excessive disk input/output (I/O) and subsequent poor system performance to inaccurate data. An improperly normalized condition can result in extensive data redundancy, putting a burden on all programs that modify the data.
Businesses with poorly normalized databases experience poorly operating systems and inaccurate, incorrect, or missing data. Applying normalization techniques to Online Transaction Processing (OLTP) database design creates efficient systems that produce accurate data and reliable information.
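The burden that redundant data places on programs that modify it can be shown with a small comparison. The sketch below uses SQLite through Python's `sqlite3` module; the tables, the customer, and the city change are hypothetical examples, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Unnormalized: customer details are repeated on every order row.
CREATE TABLE orders_flat (
    order_id INTEGER PRIMARY KEY,
    customer_name TEXT,
    customer_city TEXT,
    total REAL
);
INSERT INTO orders_flat VALUES
    (1, 'Acme', 'Austin', 10.0),
    (2, 'Acme', 'Austin', 25.0),
    (3, 'Acme', 'Austin', 7.5);

-- Normalized: customer details are stored once and referenced by key.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    city TEXT
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    total REAL
);
INSERT INTO customers VALUES (1, 'Acme', 'Austin');
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 25.0), (3, 1, 7.5);
""")

# The customer moves: count how many rows each design must rewrite.
flat_rows_changed = conn.execute(
    "UPDATE orders_flat SET customer_city = 'Dallas' "
    "WHERE customer_name = 'Acme'"
).rowcount
normalized_rows_changed = conn.execute(
    "UPDATE customers SET city = 'Dallas' WHERE customer_id = 1"
).rowcount

cities = conn.execute(
    "SELECT DISTINCT c.city FROM orders o "
    "JOIN customers c ON c.customer_id = o.customer_id"
).fetchall()
```

The unnormalized table needs one write per order, and any row that is missed leaves the data inconsistent. The normalized design changes a single row, at the cost of a join when orders and customer details are read together.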
Stored procedures and triggers: Stored procedures are pieces of application code that reside in the database. Their advantage is that they cut down on the number of messages that are passed between the application process and the database server. Significant performance increases can be realized if the network is slow, or if several SQL statements are grouped together in a stored procedure.
You can also set triggers within the database to respond to certain events. Stored procedures and triggers together make up a powerful development tool.
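A trigger can be sketched with SQLite through Python's `sqlite3` module. SQLite supports triggers but not stored procedures, so only the trigger half of the pairing is shown here; the tables, the trigger name `reduce_stock`, and the quantities are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER NOT NULL);
CREATE TABLE order_lines (sku TEXT, quantity INTEGER);

-- Runs inside the database each time an order line is inserted,
-- so the application does not send a second statement.
CREATE TRIGGER reduce_stock AFTER INSERT ON order_lines
BEGIN
    UPDATE inventory
    SET quantity = quantity - NEW.quantity
    WHERE sku = NEW.sku;
END;

INSERT INTO inventory VALUES ('WIDGET', 100);
""")

# The application issues a single insert; the trigger updates the stock.
conn.execute("INSERT INTO order_lines VALUES ('WIDGET', 3)")
remaining = conn.execute(
    "SELECT quantity FROM inventory WHERE sku = 'WIDGET'"
).fetchone()[0]
```

The application sends one statement and the database performs the follow-up update itself, which is the same message-saving idea the text describes for stored procedures.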
Configuring and tuning the operating system is critical to application performance. The operating system is the master control program that runs the computer. It is the first program loaded when the computer is turned on, and its main part, known as the kernel, resides in memory at all times. It can be developed by the computer's manufacturer or by a third party.
All programs must communicate with the operating system to function on the computer. The operating system usually offers some level of system resources configuration so you can set up the computer to serve specific functions efficiently.
Operating systems can be classified into the following categories:
Multiuser: Allows two or more users to run programs at the same time. Some operating systems permit hundreds or thousands of concurrent users.
Multiprocessing: Allows you to run a program on more than one CPU.
Multitasking: Allows multiple programs to run concurrently.
Multithreading: Allows different parts of a single program to run concurrently.
Real time: Responds to input instantly. General-purpose operating systems, such as DOS and UNIX, are not real time.
The operating system is a foundation component of an Internet business application's performance. Resources configured and managed by the operating system are often involved with a performance issue. You can often tune or configure the operating system to support specific system requirements.
Most UNIX variants include some type of kernel-level tuning, along with basic tools to monitor CPU, disk, and memory usage. UNIX kernels tend to have many configurable parameters that you can fine-tune for specific applications.
A widespread misconception is that the Windows NT kernel is not configurable. The Windows NT kernel is largely self-tuning. The virtual memory, thread scheduling, and I/O subsystems all dynamically adjust their resource usage and priority to maximize throughput.
When benchmarking the UNIX and Windows NT operating systems, the differences between them are evident. The UNIX approach is to tweak kernel parameters for maximum advantage in the benchmark. The Windows NT approach is to let the kernel tune itself for whatever load is placed on it.
A variety of operating systems are available on the market. From a performance perspective, you should understand how the operating system is tuned, whether self-tuning or configurable, and realize that your Internet business application's performance can be greatly affected if it is not set up properly. Also consider Total Cost of Ownership (TCO). Service and support contracts often equal or surpass the cost of the operating system license.
The TCO for commercial operating systems is becoming increasingly expensive and, as a result, open-source Linux distributions are rapidly gaining market share. International Data Corporation estimates that Linux will have 32 percent of the server market by the end of 2002 and garner 9 percent of corporate IT budget spending. With over 60 percent of active servers running Apache according to Netcraft.com's long-running survey, a free web server running almost entirely on Linux operating systems is an attractive and popular combination. Linux service and support options have also arrived, making the decision to run Linux servers even easier for both small businesses and large corporations.
A growing workload affects your application's performance by directly increasing hardware resource consumption. For example, if you have twice as many users as normal after a marketing campaign, you probably also have twice as many database accesses and double the amount of related network traffic.
Ultimately, your application's performance potential is largely constrained by the available hardware. It does little good to optimize your application if the hardware infrastructure is slow and inadequate. The following hardware factors affect performance:
CPU consumption: High CPU use makes every task take longer.
Memory allocation: Inadequate memory causes slow paging to disk.
Disk input/output (I/O) subsystem: Too much disk input or output slows your application.
Network hardware: Poor-quality network hardware limits throughput.
In a typical enterprise environment, the hardware available to your application is only upgraded occasionally. For most of your application's lifecycle, it will run in the same hardware environment. You should periodically monitor hardware resource consumption to ensure the current and future application performance you expect.
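Periodic monitoring can be as simple as comparing each utilization sample against agreed limits. The sketch below assumes the samples come from whatever monitoring tool you already use; the metric names, the threshold values, and the function `over_threshold` are hypothetical.

```python
# Hypothetical alert limits, as percentages of each resource's capacity.
THRESHOLDS = {
    "cpu_pct": 85.0,
    "memory_pct": 90.0,
    "disk_busy_pct": 80.0,
    "network_pct": 70.0,
}

def over_threshold(sample, thresholds=THRESHOLDS):
    """Return the names of the resources whose use exceeds its limit."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if sample.get(name, 0.0) > limit
    )

# One hypothetical sample taken from a server during peak load.
sample = {
    "cpu_pct": 35.0,
    "memory_pct": 62.0,
    "disk_busy_pct": 93.0,
    "network_pct": 20.0,
}
alerts = over_threshold(sample)
```

In this sample only the disk subsystem exceeds its limit while CPU and network use stay low, which points the investigation toward the disks rather than toward a larger server.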
Facts about hardware and its effect on performance follow:
Hard disk: Optimizing the hard disk can relieve server performance problems. Using RAID technology in combination with caching on the drive will improve hard drive performance. Increasing the amount of RAM can increase disk-bound application performance, allowing more data to be held in RAM. If you find that the percentage of time spent hitting the disks is high but the CPU and network utilization is low, look at your hard drives for issues.
Memory: RAM configuration on the server can affect performance. If your server does not have enough memory, it will run slower as it swaps information to exchange files on its hard disk. This memory swapping to hard disk is called virtual memory. The performance impact occurs because your hard drive operates much slower than direct RAM access. Not only do the disk swaps interrupt the CPU from processing data, they also prevent the disk from accessing files and data needed by you and your users. If the amount of RAM in your system is so low that your computer spends most of its time reading and writing to its virtual memory, your system RAM is not going to be very useful. If you notice a high number of virtual memory page read faults when monitoring your system, this might indicate that more memory can help increase performance.
CPU: CPU performance is the most important factor when you want to get the most work done in the shortest possible time. If performing a particular operation takes five seconds, is it worth spending an extra $10,000 to do the operation in three seconds? However, if the operation takes five hours and you could reduce the time to one or two hours, the additional expense might be worthwhile. Computational tasks are most affected by CPU performance. Be sure to consider investment protection. The CPU that seems adequate today might not meet your needs in the near future. Hardware development's rapid pace makes existing systems obsolete in a short period of time.
Network capacity: Network capacity is reached when the network is saturated, and the performance will degrade regardless of the server's size. Remember the following capacity issues:
Be aware of the saturation of the network card. If you are running on a low-capacity card, you will saturate it quickly.
On applications that are accessed across a WAN link (virtually all Internet applications), the WAN link is the slowest network segment involved. Upgrading your server environment to Gigabit Ethernet won't do much good if your bottleneck is the T1 to the Internet. Your application design should take the expected client connection speeds into account. For example, many people will access a public Internet application through 56 Kbps modems.
Limiting connections is an excellent way to plan network capacity. Most browsers open up to four simultaneous connections to download the text and graphics for a web page.
Set connection timeout values for cases in which connections do not close on their own. With HTTP Keep-Alives, the connection between the client and server stays established for the duration of a browser session instead of closing after each request.
A new feature, called HTTP Compression, actually compresses data before it goes out on the wire. There's a trade-off, of course, because more CPU resources are required for the compression.
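The capacity points above can be made concrete with some rough arithmetic. The sketch below estimates transfer time over a 56 Kbps modem and a T1 link (1.544 Mbps) and shows how compressing a page changes the estimate. The page content is hypothetical and deliberately repetitive, so it compresses far better than a typical page would, and the calculation ignores protocol overhead and latency.

```python
import gzip

def transfer_seconds(size_bytes, link_bits_per_second):
    """Idealized transfer time: payload bits divided by link speed."""
    return size_bytes * 8 / link_bits_per_second

MODEM_BPS = 56_000   # 56 Kbps dial-up modem
T1_BPS = 1_544_000   # T1 line

# A hypothetical, highly repetitive HTML page.
page = ("<tr><td>item</td><td>price</td></tr>\n" * 2000).encode("utf-8")
compressed = gzip.compress(page)

plain_modem = transfer_seconds(len(page), MODEM_BPS)
gzip_modem = transfer_seconds(len(compressed), MODEM_BPS)
plain_t1 = transfer_seconds(len(page), T1_BPS)
```

The compressed page reaches a modem user sooner, but the server spends CPU time on every compression, so the trade-off is most attractive when the client link, not the server CPU, is the bottleneck.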
An integral part of the Cisco end-to-end quality of service (QoS) and intelligent network services, Cisco QoS Policy Manager (QPM) allows network administrators to protect business-critical application performance. By leveraging QoS mechanisms in LAN and WAN switching equipment along with application recognition technologies delivered through advanced Cisco IOS Software, QPM provides enterprise networks with centralized policy control and automated policy deployment.
A full-featured QoS policy system, QPM enables differentiated services for web-based applications, voice traffic, Internet appliances, and business-critical processes, ensuring QoS for network-intensive, critical applications. Relying on differentiated services to enforce QoS end-to-end, QPM delivers the key benefits of enabling advanced differentiated services across LAN and WAN policy domains, automating QoS configuration and deployment, and improving multiservice performance. Using QPM, network managers can quickly apply a mix of QoS policy objectives to protect business-critical application performance.
Now available as an add-in module to QPM is QPM-COPS. An integral part of the Cisco end-to-end QoS and content-aware network initiative, QPM-COPS provides intelligent traffic enforcement through application and user-aware, directory-enabled policy control.
An important consideration for Internet business applications is the application client. (See Chapter 5, "Build or Buy," for a definition of application clients and the client-server model.) The client can vary in types of connectivity, hardware, operating systems, and a variety of other factors that can affect performance. Using the client's native system monitoring tools can be a great method to determine where performance issues exist on the client. Good application and code distribution methodologies can ensure that the proper software and software versions are available on the client.
Understanding the target client in the production environment is critical when upgrading or deploying an application. Memory, disk, CPUs, and applications that are expected to run concurrently must be taken into account. A variety of design techniques can be implemented to minimize the adverse performance impacts on the client in the overall performance equation.
The client is typically a shared component in any operating environment; decisions made by others often negatively affect your application in the environment. For this reason, a good monitoring methodology must be in place to support the client portion of any Internet business strategy.
Software maintenance is the modification of a software product to improve performance or other attributes, or to adapt the product to a modified environment. An Internet business application's architecture can greatly affect maintainability and manageability. New technologies employed to meet increasing business demands often negatively affect an application's manageability. This consequence can occur when areas supporting a new architectural component force software or hardware upgrades.
Distribution and management of an Internet business application's elements need to be taken into account. The level and volume of application distribution need to be considered when the application's architecture is selected. Better methodologies are being developed every day to manage the changes required to support an Internet business application. Multiple reasons exist for performing software maintenance:
Adapting to change in the operational environment
Correcting errors in the software
Performing preventive maintenance in anticipation of problems
Performing software enhancements and making changes to accommodate other changes in the operational environment are inevitable, and usually desired. Error correction and preventive maintenance are areas that can always be improved upon. Commonly, 40 to 70 percent of a software application's budget is devoted to maintenance activity over the life of the software.
Many potential reasons exist for having maintenance performed on your software, including a lack of proper software testing and validation when changes are performed, or when new software is introduced into the production environment. The testing phase of software development is often compromised to meet project goals because it falls near the end of the delivery cycle.
Lack of realistic test scripts and data can create the need to maintain software. Nonproduction-like test environments can also create a need for software maintenance. Replicating the complete production environment in which software is expected to run is often impractical because of the large number of variables existing between the production environment and the test environment.
Improper, outdated, or missing documentation relating to the software can increase maintenance activity and cost. Documentation, such as requirements and design documents, is not always available when a maintenance opportunity arises. When documentation is available, it is often outdated and unreliable because it was not properly maintained when the software was modified.
Using standards and tools within an organization can help reduce the cost of software maintenance. Standards and tools provide known quantities and information about software entities that might be candidates for maintenance. A successful software project creates a coding style guide and enforces common coding practices across teams and developers to ensure high-quality work.
You should create software repositories, so that each developer is working with the exact same set of tools and utilities. Standard build environments are also important to keeping a project manageable, and for anticipating the need for a structured approach to environment variable naming and referencing. For example, templates for application start/stop scripts must be created so that software services and components can be killed and reset cleanly and consistently. Such scripts need to be created for web services, application services, and database services. All such scripts and environment configurations should be checked into version control software so that changes are tracked and matched with the corresponding code base.
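A start/stop template of the kind described above might look like the following sketch. The class name `ManagedService`, the five-second stop timeout, and the demo command are hypothetical; a production script would also handle logging, PID files, and the platform's own service manager.

```python
import subprocess
import sys

class ManagedService:
    """Start and stop one service process in a consistent way."""

    def __init__(self, name, command, stop_timeout=5.0):
        self.name = name
        self.command = command
        self.stop_timeout = stop_timeout
        self.process = None

    def is_running(self):
        return self.process is not None and self.process.poll() is None

    def start(self):
        """Launch the service unless it is already running."""
        if not self.is_running():
            self.process = subprocess.Popen(self.command)

    def stop(self):
        """Ask the service to exit, then force it if it does not comply."""
        if not self.is_running():
            return
        self.process.terminate()
        try:
            self.process.wait(timeout=self.stop_timeout)
        except subprocess.TimeoutExpired:
            self.process.kill()
            self.process.wait()

# Hypothetical stand-in for a web, application, or database service.
demo = ManagedService(
    "demo-worker",
    [sys.executable, "-c", "import time; time.sleep(60)"],
)
demo.start()
was_running = demo.is_running()
demo.stop()
is_stopped = not demo.is_running()
```

Because every service is stopped through the same terminate-then-kill sequence, operators get the same behavior regardless of which component they restart, and the template itself can be kept under version control alongside the code.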
Failing to manage or address software maintenance can cause an Internet business application to stagnate. Because of the large amount of maintenance activity required to keep the system up and running, there are no resources to pursue other development opportunities. Poor application maintenance practices increase the risk and lower the quality of an application by requiring more frequent changes in a less-controlled manner.
Manageability addresses not only software maintenance and management, but also the idea of designing applications and networked applications to be easily managed and maintained. This approach to application requirements for manageability expands to include several different areas. The manageability discussion in this section covers the following areas:
- Application Services Administration
- Application Integration Tools
- Network Manageability
- Licensing and Maintenance Agreements
- IT Operations Management
- Use of External Service Providers
- Environment Architecture Considerations
- Version Control
Application Services Administration
Application services administration covers a large area related to the configuration and maintenance of applications and services. With the introduction of new methodologies for implementing systems, many activities that once required writing code in an application are now performed by modifying a package or service's configuration.
Database administration (DBA) associated with an application can create major issues if not planned and implemented properly. Database administration can consume a large amount of time and increase an application's cost because of changes to data attributes, data volume, backup, recovery, and data distribution.
Typical DBA responsibilities include the following:
Installing, configuring, and upgrading database server software and related products
Evaluating database features and database-related products
Establishing and maintaining sound backup and recovery policies and procedures
Taking care of the database design and implementation
Implementing and maintaining database security (creating and maintaining users and roles, and assigning privileges)
Tuning and monitoring database performance
Tuning and monitoring application performance
Setting up and maintaining documentation and standards
Planning growth and changes (capacity planning)
Working as part of a team and providing 24-hour support when required
Performing general technical troubleshooting and interfacing with the database provider for technical support
Facilitating sharing common data by overseeing proper key management and data dictionary maintenance
Application Integration Tools
Application integration tools can create manageability concerns for the Internet business application. Tools such as electronic data interchange (EDI), Common Object Request Broker Architecture (CORBA), and Component Object Model (COM) require a considerable amount of configuration and management to work properly and to keep pace with the dynamics of a business operational environment. The configuration files for these tools are often not maintained under proper configuration management practices and are poorly documented. To guard against some of the pitfalls associated with these tools, place the configuration files and their associated parameters under configuration management. In addition, record the parameters, their values, and the justification for the chosen values.
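One lightweight way to keep parameters, values, and justifications together is to record them as structured data that can be checked automatically. The sketch below is an assumption about how that might be done; the class `ConfigParameter`, the parameter names, and their values are hypothetical and do not correspond to real EDI, CORBA, or COM settings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConfigParameter:
    """One integration setting together with the reason for its value."""
    name: str
    value: str
    justification: str

def undocumented(parameters):
    """Return the names of parameters that lack a justification."""
    return [p.name for p in parameters if not p.justification.strip()]

# Hypothetical settings for two integration tools.
params = [
    ConfigParameter(
        "orb.request_timeout_ms",
        "3000",
        "Matches the response-time target agreed with operations",
    ),
    ConfigParameter("edi.batch_size", "500", ""),
]
missing = undocumented(params)
```

A check like this can run whenever the configuration is committed, so that a value without a recorded justification is caught before it reaches production.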
Network Manageability
A system's network component is typically in a state of change as new applications become available and others are removed. Network management requires the creation of several different kinds of diagrams: physical, logical, system, security, and data flow. Physical network diagrams also include rack elevations, which detail the components that are placed within each rack in a server room. Network management involves using monitoring tools to assess the network's performance. These tools, along with a solid network management and change control process, can minimize the impact of network administration.
The maintenance of elements, such as routing tables, firewalls, and connections, can create problems in the production environment if not diagrammed, managed, and communicated properly. The following list details typical network management activities:
Assessing network management functions (personnel, organizational, procedures, systems, and so on)
Identifying, creating, and implementing automation and behavior models to improve environment management and reduce manual processes
Updating the documentation for network management application configurations
Capacity planning and optimization
Controlling network management system configurations
Collecting system usage statistics and generating reports
Maintaining and configuring backup network management systems
Defining and generating network performance reports
Analyzing system and network performance trends
Identifying and resolving network faults for network uptime, stability, and customer satisfaction
Optimizing complex network management applications
Ensuring operational consistency
The following discussion provides examples of network diagrams for a basic e-commerce web site.
Figure 4-2 depicts an e-commerce network from the physical perspective. You should create and maintain a physical network diagram so that all administrative and support teams can readily identify the components and systems with which they work.
Figure 4-2 e-commerce Physical Network Diagram
Figure 4-3 is a logical network diagram; it is necessary to understand the underlying structure of a set of servers used to create an Internet business solution, such as an e-commerce web site.
Figure 4-3 e-commerce Logical Network Diagram
Figure 4-4 shows information about the systems that comprise the example e-commerce web site. This systems inventory and basic system configuration information is necessary to know how each system is loaded, and whether there is additional capacity to add more CPUs, RAM, or disk space before needing to purchase additional systems.
Figure 4-4 e-commerce Systems Information
Figure 4-5 contains information about the data flow in the e-commerce production environment. A data flow diagram depicts the essential transaction directions between components.
Figure 4-5 e-commerce Data Flow Network Diagram
Figure 4-6 is a rack elevation diagram of the e-commerce web site. Rack elevation diagrams detail the use of available server room rack space and make sure that physical systems have sufficient space before installation occurs.
Figure 4-6 e-commerce Rack Elevation Diagram
Licensing and Maintenance Agreements
A clear understanding of licensing associated with the tools provided by external vendors is important when considering changes and upgrades to an Internet business application.
Software licensing uses many different models, some of which are based on the number of physical machines involved, and others on the number and speed of CPUs. Being aware of the licensing model for key software, and of how upgrades and changes to accommodate growth affect your project costs, is an important factor in choosing your software products. Database and middleware application licensing can cost as much as $100,000 per CPU.
You should also take the maintenance agreements associated with external software into account. A maintenance agreement is typically an annual agreement whereby the vendor agrees to provide specific support for its product, often including upgrades. A maintenance agreement's cost is typically between 5 and 25 percent of the total cost of the software package.
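The cost effect of a per-CPU licensing model can be worked through with simple arithmetic. The sketch below uses the figures quoted above ($100,000 per CPU and a maintenance rate of 5 to 25 percent); the CPU counts and the function names are hypothetical.

```python
def license_cost(cpus, per_cpu_price):
    """Total license cost under a per-CPU licensing model."""
    return cpus * per_cpu_price

def annual_maintenance(total_license_cost, rate_percent):
    """Yearly maintenance fee as a percentage of the license cost."""
    if not 0 <= rate_percent <= 100:
        raise ValueError("rate_percent must be between 0 and 100")
    return total_license_cost * rate_percent / 100

PER_CPU_PRICE = 100_000

current = license_cost(cpus=4, per_cpu_price=PER_CPU_PRICE)
after_upgrade = license_cost(cpus=8, per_cpu_price=PER_CPU_PRICE)
upgrade_delta = after_upgrade - current

maintenance_low = annual_maintenance(current, 5)
maintenance_high = annual_maintenance(current, 25)
```

Under these assumptions, doubling a four-CPU server adds $400,000 in license fees before any hardware is purchased, and annual maintenance on the original license alone ranges from $20,000 to $100,000.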
Maintenance agreements typically provide one of three levels of support, with "platinum" support being the highest. At this level, maintenance support is provided within one hour of a reported problem. Critical production systems generally require platinum-level support even if they are equipped with clustered or fault-tolerant components. Staging and development systems are usually maintained with "gold" or "silver" maintenance packages where service and support work is performed the next business day, or as available.
IT Operations Management
A specific team often manages the day-to-day operation of an enterprise Internet business application production environment. The operations management team typically provides the following services:
- 24-hour, 365-day coverage
- Console operations
- Tape handling, library, and storage
- Onsite maintenance staff
- Help desk services
- Technical support
- Backup and disaster recovery
- Application/component distribution
This group typically delivers the services related to the Service Level Agreement (SLA). Take the SLA into consideration from the initiation of any modification to the operating environment; otherwise, meeting the expectations detailed in the SLA might be difficult.
Use of External Service Providers
One method of handling an Internet business application's manageability is to use external service providers. The service providers have expertise in the areas in which they provide service. An Application Service Provider (ASP) can help a business handle the manageability issues associated with an application, allowing the business to focus on its customers. Remember, however, that risk is involved with allowing an external entity access to certain information about your business's core functionality.
Environment Architecture Considerations
During an Internet business application lifecycle, you must take an application that is currently in production, maintain that version, and create a subsequent "enhanced" version. You must consider the component management associated with maintaining a proper relationship among these environments.
Version Control
Version control is crucial to performing staged software code releases. A release schedule is the most often-used technique for supporting an application's manageability. Version control assists in the management of parallel environments. Enterprise-wide systems typically involve a number of components, and when upgrades are desired, they are sometimes difficult to develop and test in the upgraded environment while the environments needed to support the version in production are maintained.
A release schedule allows different areas of an enterprise to introduce enhancements and new versions, and test interfaces along with shared systems components that can be in the process of being upgraded at the same time. As technology advances, new methods for dealing with the multiple environment issue are becoming available. Creating environments in which to perform stress and volume testing presents a challenge because of the expense and resources involved in reproducing a production-like environment where proper conditions are present during testing and validation.