Practical System Reliability

Eric Bauer, Xuemei Zhang, Douglas A. Kimber

ISBN: 978-0-470-45538-8

Mar 2009, Wiley-IEEE Press

300 pages



Learn how to model, predict, and manage system reliability/availability throughout the development life cycle

Written by a panel of authors with a wealth of industry experience, this book presents methods and concepts that give readers a solid understanding of how to model and manage system and software availability and reliability during the development of real applications and products. The modeling and prediction techniques and tools are customer-focused and data-driven, and are aligned with industry standards (Telcordia, TL 9000, ISO, etc.). Readers will gain a clear understanding of what real-world reliability and availability mean through step-by-step discussions of:

  • System availability
  • Conceptual model of reliability and availability
  • Why availability varies between customers
  • Modeling availability
  • Estimating parameters and availability from field data
  • Estimating input parameters from laboratory data
  • Estimating input parameters in the architecture/design stage
  • Prediction accuracy
  • Connecting the dots

This book can be used by system architects, engineers, and developers to better understand and manage the reliability/availability of their products; quality engineers to grasp how software and hardware quality relate to system availability; and engineering students as part of a short course on system availability and software reliability.



1 Introduction.

2 System Availability.

2.1 Availability, Service and Elements.

2.2 Classical View.

2.3 Customers’ View.

2.4 Standards View.

3 Conceptual Model of Reliability and Availability.

3.1 Concept of Highly Available Systems.

3.2 Conceptual Model of System Availability.

3.3 Failures.

3.4 Outage Resolution.

3.5 Downtime Budgets.

4 Why Availability Varies Between Customers.

4.1 Causes of Variation in Outage Event Reporting.

4.2 Causes of Variation in Outage Duration.

5 Modeling Availability.

5.1 Overview of Modeling Techniques.

5.2 Modeling Definitions.

5.3 Practical Modeling.

5.4 Widget Example.

5.5 Alignment with Industry Standards.

6 Estimating Parameters and Availability from Field Data.

6.1 Self-Maintaining Customers.

6.2 Analyzing Field Outage Data.

6.3 Analyzing Performance and Alarm Data.

6.4 Coverage Factor and Failure Rate.

6.5 Uncovered Failure Recovery Time.

6.6 Covered Failure Detection and Recovery Time.

7 Estimating Input Parameters from Lab Data.

7.1 Hardware Failure Rate.

7.2 Software Failure Rate.

7.3 Coverage Factors.

7.4 Timing Parameters.

7.5 System-Level Parameters.

8 Estimating Input Parameters in the Architecture/Design Stage.

8.1 Hardware Parameters.

8.2 System-Level Parameters.

8.3 Sensitivity Analysis.

9 Prediction Accuracy.

9.1 How Much Field Data Is Enough?

9.2 How Does One Measure Sampling and Prediction Errors?

9.3 What Causes Prediction Errors?

10 Connecting the Dots.

10.1 Set Availability Requirements.

10.2 Incorporate Architectural and Design Techniques.

10.3 Modeling to Verify Feasibility.

10.4 Testing.

10.5 Update Availability Prediction.

10.6 Periodic Field Validation and Model Update.

10.7 Building an Availability Roadmap.

10.8 Reliability Report.

11 Summary.

Appendix A System Reliability Report Outline.

1 Executive Summary.

2 Reliability Requirements.

3 Unplanned Downtime Model and Results.

Annex A Reliability Definitions.

Annex B References.

Annex C Markov Model State-Transition Diagrams.

Appendix B Reliability and Availability Theory.

1 Reliability and Availability Definitions.

2 Probability Distributions in Reliability Evaluation.

3 Estimation of Confidence Intervals.

Appendix C Software Reliability Growth Models.

1 Software Characteristic Models.

2 Nonhomogeneous Poisson Process Models.

Appendix D Acronyms and Abbreviations.

Appendix E Bibliography.


About the Authors.