Wiley-IEEE Press

Reliability and Availability of Cloud Computing

ISBN: 978-1-118-17701-3
352 pages
September 2012, Wiley-IEEE Press

Description

A holistic approach to service reliability and availability of cloud computing

Reliability and Availability of Cloud Computing provides IS/IT system and solution architects, developers, and engineers with the knowledge needed to assess the impact of virtualization and cloud computing on service reliability and availability. It reveals how to select the most appropriate design for reliability diligence to assure that user expectations are met.

Organized in three parts (basics, risk analysis, and recommendations), this resource is accessible to readers of diverse backgrounds and experience levels. Numerous examples and more than 100 figures help readers visualize problems and better understand the topic, and the authors present risks and options in bulleted lists that can be applied directly to specific applications and problems.

Special features of this book include:

  • Rigorous analysis of the reliability and availability risks that are inherent in cloud computing
  • Simple formulas that explain the quantitative aspects of reliability and availability
  • Enlightening discussions of the ways in which virtualized applications and cloud deployments differ from traditional system implementations and deployments
  • Specific recommendations for developing reliable virtualized applications and cloud-based solutions

Reliability and Availability of Cloud Computing is the guide for IS/IT staff in business, government, academia, and non-governmental organizations who are moving their applications to the cloud. It is also an important reference for professionals in technical sales, product management, and quality management, as well as software and quality engineers looking to broaden their expertise.

Table of Contents

Figures

Tables

Equations

Introduction

I BASICS

1 CLOUD COMPUTING

1.1 Essential Cloud Characteristics

1.2 Common Cloud Characteristics

1.3 But What, Exactly, Is Cloud Computing?

1.4 Service Models

1.5 Cloud Deployment Models

1.6 Roles in Cloud Computing

1.7 Benefits of Cloud Computing

1.8 Risks of Cloud Computing

2 VIRTUALIZATION

2.1 Background

2.2 What Is Virtualization?

2.3 Server Virtualization

2.4 VM Lifecycle

2.5 Reliability and Availability Risks of Virtualization

3 SERVICE RELIABILITY AND SERVICE AVAILABILITY

3.1 Errors and Failures

3.2 Eight-Ingredient Framework

3.3 Service Availability

3.4 Service Reliability

3.5 Service Latency

3.6 Redundancy and High Availability

3.7 High Availability and Disaster Recovery

3.8 Streaming Services

3.9 Reliability and Availability Risks of Cloud Computing

II ANALYSIS

4 ANALYZING CLOUD RELIABILITY AND AVAILABILITY

4.1 Expectations for Service Reliability and Availability

4.2 Risks of Essential Cloud Characteristics

4.3 Impacts of Common Cloud Characteristics

4.4 Risks of Service Models

4.5 IT Service Management and Availability Risks

4.6 Outage Risks by Process Area

4.7 Failure Detection Considerations

4.8 Risks of Deployment Models

4.9 Expectations of IaaS Data Centers

5 RELIABILITY ANALYSIS OF VIRTUALIZATION

5.1 Reliability Analysis Techniques

5.2 Reliability Analysis of Virtualization Techniques

5.3 Software Failure Rate Analysis

5.4 Recovery Models

5.5 Application Architecture Strategies

5.6 Availability Modeling of Virtualized Recovery Options

6 HARDWARE RELIABILITY, VIRTUALIZATION, AND SERVICE AVAILABILITY

6.1 Hardware Downtime Expectations

6.2 Hardware Failures

6.3 Hardware Failure Rate

6.4 Hardware Failure Detection

6.5 Hardware Failure Containment

6.6 Hardware Failure Mitigation

6.7 Mitigating Hardware Failures via Virtualization

6.8 Virtualized Networks

6.9 MTTR of Virtualized Hardware

6.10 Discussion

7 CAPACITY AND ELASTICITY

7.1 System Load Basics

7.2 Overload, Service Reliability, and Service Availability

7.3 Traditional Capacity Planning

7.4 Cloud and Capacity

7.5 Managing Online Capacity

7.6 Capacity-Related Service Risks

7.7 Capacity Management Risks

7.8 Security and Service Availability

7.9 Architecting for Elastic Growth and Degrowth

8 SERVICE ORCHESTRATION ANALYSIS

8.1 Service Orchestration Definition

8.2 Policy-Based Management

8.3 Cloud Management

8.4 Service Orchestration’s Role in Risk Mitigation

9 GEOGRAPHIC DISTRIBUTION, GEOREDUNDANCY, AND DISASTER RECOVERY

9.1 Geographic Distribution versus Georedundancy

9.2 Traditional Disaster Recovery

9.3 Virtualization and Disaster Recovery

9.4 Cloud Computing and Disaster Recovery

9.5 Georedundancy Recovery Models

9.6 Cloud and Traditional Collateral Benefits of Georedundancy

9.7 Discussion

III RECOMMENDATIONS

10 APPLICATIONS, SOLUTIONS, AND ACCOUNTABILITY

10.1 Application Configuration Scenarios

10.2 Application Deployment Scenario

10.3 System Downtime Budgets

10.4 End-to-End Solutions Considerations

10.5 Attributability for Service Impairments

10.6 Solution Service Measurement

10.7 Managing Reliability and Service of Cloud Computing

11 RECOMMENDATIONS FOR ARCHITECTING A RELIABLE SYSTEM

11.1 Architecting for Virtualization and Cloud

11.2 Disaster Recovery

11.3 IT Service Management Considerations

11.4 Many Distributed Clouds versus Fewer Huge Clouds

11.5 Minimizing Hardware-Attributed Downtime

11.6 Architectural Optimizations

12 DESIGN FOR RELIABILITY OF VIRTUALIZED APPLICATIONS

12.1 Design for Reliability

12.2 Tailoring DfR for Virtualized Applications

12.3 Reliability Requirements

12.4 Qualitative Reliability Analysis

12.5 Quantitative Reliability Budgeting and Modeling

12.6 Robustness Testing

12.7 Stability Testing

12.8 Field Performance Analysis

12.9 Reliability Roadmap

12.10 Hardware Reliability

13 DESIGN FOR RELIABILITY OF CLOUD SOLUTIONS

13.1 Solution Design for Reliability

13.2 Solution Scope and Expectations

13.3 Reliability Requirements

13.4 Solution Modeling and Analysis

13.5 Element Reliability Diligence

13.6 Solution Testing and Validation

13.7 Track and Analyze Field Performance

13.8 Other Solution Reliability Diligence Topics

14 SUMMARY

14.1 Service Reliability and Service Availability

14.2 Failure Accountability and Cloud Computing

14.3 Factoring Service Downtime

14.4 Service Availability Measurement Points

14.5 Cloud Capacity and Elasticity Considerations

14.6 Maximizing Service Availability

14.7 Reliability Diligence

14.8 Concluding Remarks

Abbreviations

References

About the Authors

Index

Author Information

ERIC BAUER is a reliability engineering manager in the Software, Solutions and Services Group of Alcatel-Lucent. The holder of more than a dozen U.S. patents, he is the author of Design for Reliability: Information and Computer-Based Systems; Beyond Redundancy: How Geographic Redundancy Can Improve Service Availability and Reliability of Computer-Based Systems; and Practical System Reliability, also available from Wiley-IEEE Press.

RANDEE ADAMS is a consulting member of technical staff in the Software, Solutions and Services Group of Alcatel-Lucent and the coauthor of Beyond Redundancy: How Geographic Redundancy Can Improve Service Availability and Reliability of Computer-Based Systems.

Reviews

“For sure, specialists responsible for recommending, providing, or managing cloud platforms for either private or public cloud will profit with having this work on their shelf. I would also like to highly recommend this position for people new to the considered concepts of cloud computing or computer systems reliability as it provides an excellent background for the both areas.”  (IEEE Communications Magazine, 1 October 2013)

“Therefore, it will probably only be of real interest to those who are directly involved in improving or implementing their own systems in a cloud platform.”  (Computing Reviews, 30 November 2012)

 
