Practical System Reliability
March 2009, Wiley-IEEE Press
Written by a panel of authors with a wealth of industry experience, this book presents methods and concepts that give readers a solid understanding of modeling and managing system and software availability and reliability through the development of real applications and products. The modeling and prediction techniques and tools are customer-focused and data-driven, and are also aligned with industry standards (Telcordia, TL 9000, ISO, etc.). Readers will gain a clear understanding of what real-world reliability and availability mean through step-by-step discussions of:
- System availability
- Conceptual model of reliability and availability
- Why availability varies between customers
- Modeling availability
- Estimating parameters and availability from field data
- Estimating input parameters from laboratory data
- Estimating input parameters in the architecture/design stage
- Prediction accuracy
- Connecting the dots
This book can be used by system architects, engineers, and developers to better understand and manage the reliability/availability of their products; quality engineers to grasp how software and hardware quality relate to system availability; and engineering students as part of a short course on system availability and software reliability.
2 System Availability.
2.1 Availability, Service and Elements.
2.2 Classical View.
2.3 Customers’ View.
2.4 Standards View.
3 Conceptual Model of Reliability and Availability.
3.1 Concept of Highly Available Systems.
3.2 Conceptual Model of System Availability.
3.4 Outage Resolution.
3.5 Downtime Budgets.
4 Why Availability Varies Between Customers.
4.1 Causes of Variation in Outage Event Reporting.
4.2 Causes of Variation in Outage Duration.
5 Modeling Availability.
5.1 Overview of Modeling Techniques.
5.2 Modeling Definitions.
5.3 Practical Modeling.
5.4 Widget Example.
5.5 Alignment with Industry Standards.
6 Estimating Parameters and Availability from Field Data.
6.1 Self-Maintaining Customers.
6.2 Analyzing Field Outage Data.
6.3 Analyzing Performance and Alarm Data.
6.4 Coverage Factor and Failure Rate.
6.5 Uncovered Failure Recovery Time.
6.6 Covered Failure Detection and Recovery Time.
7 Estimating Input Parameters from Lab Data.
7.1 Hardware Failure Rate.
7.2 Software Failure Rate.
7.3 Coverage Factors.
7.4 Timing Parameters.
7.5 System-Level Parameters.
8 Estimating Input Parameters in the Architecture/Design Stage.
8.1 Hardware Parameters.
8.2 System-Level Parameters.
8.3 Sensitivity Analysis.
9 Prediction Accuracy.
9.1 How Much Field Data Is Enough?
9.2 How Does One Measure Sampling and Prediction Errors?
9.3 What Causes Prediction Errors?
10 Connecting the Dots.
10.1 Set Availability Requirements.
10.2 Incorporate Architectural and Design Techniques.
10.3 Modeling to Verify Feasibility.
10.5 Update Availability Prediction.
10.6 Periodic Field Validation and Model Update.
10.7 Building an Availability Roadmap.
10.8 Reliability Report.
Appendix A System Reliability Report Outline.
1 Executive Summary.
2 Reliability Requirements.
3 Unplanned Downtime Model and Results.
Annex A Reliability Definitions.
Annex B References.
Annex C Markov Model State-Transition Diagrams.
Appendix B Reliability and Availability Theory.
1 Reliability and Availability Definitions.
2 Probability Distributions in Reliability Evaluation.
3 Estimation of Confidence Intervals.
Appendix C Software Reliability Growth Models.
1 Software Characteristic Models.
2 Nonhomogeneous Poisson Process Models.
Appendix D Acronyms and Abbreviations.
Appendix E Bibliography.
About the Authors.
Xuemei Zhang, PhD, is a principal member of the technical staff in the Network Design and Performance Analysis Department at AT&T Labs. She has been working on reliability and performance analysis of wireline and wireless communications systems and networks. Her major work and research areas are in system and architectural reliability and performance, product and solution reliability and performance modeling, and software reliability.
Douglas A. Kimber retired from Alcatel-Lucent as a staff reliability engineer. Throughout his career at Bell Labs, Lucent Technologies, and Alcatel-Lucent, he developed high-reliability hardware and software platforms, applications, and systems, and then transitioned to reliability engineering, where he performed reliability modeling and analysis.