Tutorial on Hardware and Software Reliability, Maintainability and Availability
Computer systems, whether hardware or software, are subject to failure. Precisely, what is a failure? It is defined as: The inability of a system or system component to perform a required function within specified limits. Afailure may be produced when a fault is encountered and a loss of the expected service to the user results [IEEE/AIAA P1633]. This brings us to the question of what is a fault? A fault is defect in the hardware or computer code that can be the cause of one or more failures. Software-based systems have become the dominant player in the computer systems world. Since it is imperative that computer systems operate reliably, considering the criticality of software, particularly in safety critical systems, the IEEE and AIAA commissioned the development of the Recommended Practice on Software Reliability. This tutorial serves as a companion document with the purpose of elaborating on key software reliability process practices in more detail than can be specified in the Recommended Practice. However, since other subjects like maintainability and availability are also covered, the tutorial can be used as a stand-alone document. While the focus of the Recommended Practice is software reliability, software and hardware do not operate in a vacuum. Therefore, both software and hardware are addressed in this tutorial in an integrated fashion. The narrative of the tutorial is augmented with illustrative solved problems. The recommended practice [IEEE P1633] is a composite of models and tools and describes the "what and how" of software reliability engineering. It is important for an organization to have a disciplined process if it is to produce high reliability software. This process uses a life cycle approach to software reliability that takes into account the risk to reliability due to requirements changes. A requirements change may induce ambiguity and uncertainty in the development process that cause errors in implementing the changes. Subsequently, these errors may propagate through later phases of development and maintenance. In view of the life cycle ramifications of the software reliability process, maintenance is included in this tutorial. Furthermore, because reliability and maintainability determine availability, the latter is also included.
Dr. Norman F. Schneidewind is Professor Emeritus of Information Sciences in the Department of Information Sciences and the Software Engineering Group at the Naval Postgraduate School. He is now doing research and publishing in software reliability and metrics with his consulting company Computer Research. Dr. Schneidewind is a Fellow of the IEEE, elected in 1992 for contributions to software measurement models in reliability and metrics, and for leadership in advancing the field of software maintenance. In 2001, he received the IEEE Reliability Engineer of the Year award from the IEEE Reliability Society. In 1993 and 1999, he received awards for Outstanding Research Achievement by the Naval Postgraduate School. Dr. Schneidewind was selected for an IEEE USA Congressional Fellowship for 2005 and worked with the Committee on Homeland Security and Government Affairs, United States Senate, focusing on homeland security and cyber security.