Reliability modeling for safety-critical software
The Therac-25 tragedies [Therac 25], in which a computer-controlled radiation-therapy machine harmed patients because its software failed to detect a race condition, warn us that it is dangerous to abandon our old but well-understood mechanical safety controls and surrender our lives completely to software-controlled safety mechanisms.
Software can make decisions, but it can be just as unreliable as human beings. The British destroyer Sheffield was sunk because the radar system identified an incoming missile as "friendly".
Software can also harbor small, unnoticeable errors or drifts that culminate in disaster. On February 25, during the Gulf War, a tiny clock-truncation ("chopping") error in the Patriot missile-defense system, accumulated over many hours of continuous operation, caused it to fail to intercept an incoming Scud missile; 28 soldiers were killed.
Fixing problems does not necessarily make software more reliable; on the contrary, new and serious problems may arise. After three lines of code were changed in a signaling program containing millions of lines of code, the local telephone systems in California and along the Eastern seaboard came to a stop.
After the success of the Ariane 4 rocket, the maiden flight of Ariane 5 ended in flames when design defects in the control software were exposed by the new rocket's faster horizontal drift speed. There are many more such stories. This makes us wonder whether software is reliable at all, and whether we should use software in safety-critical embedded applications.
With processors and software permeating the safety-critical embedded world, the reliability of software is simply a matter of life and death. Are we embedding potential disasters as we embed software into systems? According to ANSI, Software Reliability is defined as: the probability of failure-free software operation for a specified period of time in a specified environment. Electronic and mechanical parts may become "old" and wear out with time and usage, but software does not rust or wear out during its life cycle.
Software will not change over time unless it is intentionally changed or upgraded. Software Reliability is an important attribute of software quality, together with functionality, usability, performance, serviceability, capability, installability, maintainability, and documentation. Software Reliability is hard to achieve because the complexity of software tends to be high.
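The ANSI definition given above can be made concrete with the simplest possible sketch, which assumes a constant failure rate (an assumption the definition itself does not make; the rate and mission length below are purely illustrative):

```python
import math

def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability of failure-free operation for `hours` of use,
    assuming failures arrive at a constant rate (exponential model)."""
    return math.exp(-failure_rate_per_hour * hours)

# A hypothetical system that fails on average once per 1000 hours,
# evaluated over a 100-hour mission:
r = reliability(1e-3, 100)
print(round(r, 4))  # ≈ 0.9048
```

Under this sketch, halving the mission time or halving the failure rate raises the probability of failure-free operation by the same factor, which is why both "period of time" and "environment" appear in the definition.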
Any system with a high degree of complexity, including software, is hard to bring to a given level of reliability; yet system developers tend to push complexity into the software layer, driven by the rapid growth of system size and the ease of doing so by upgrading the software. For example, large next-generation aircraft will have over one million source lines of software on board; next-generation air traffic control systems will contain between one and two million lines; the upcoming International Space Station will have over two million lines on board and over ten million lines of ground support software; several major life-critical defense systems will have over five million source lines of software.
Emphasizing these features tends to add more complexity to software. Software failures may be due to errors, ambiguities, oversights, or misinterpretation of the specification that the software is supposed to satisfy; carelessness or incompetence in writing code; inadequate testing; incorrect or unexpected usage of the software; or other unforeseen problems.
Hardware faults are mostly physical faults, while software faults are design faults, which are harder to visualize, classify, detect, and correct. Design faults may also exist in hardware, but physical faults usually dominate.
In software, there is hardly a strict counterpart to the hardware manufacturing process, unless the simple act of uploading software modules into place counts.
Therefore, the quality of software will not change once it is uploaded into storage and starts running. Trying to achieve higher reliability by simply duplicating the same software modules will not work, because design faults cannot be masked off by voting. A partial list of the distinct characteristics of software compared to hardware is given in [Keene94].
Over time, hardware exhibits the failure characteristics shown in Figure 1, known as the bathtub curve. Periods A, B, and C stand for the burn-in phase, the useful-life phase, and the end-of-life phase. A detailed discussion of the curve can be found in the topic Traditional Reliability. Software reliability, however, does not show the same characteristics as hardware. A possible curve, obtained by projecting software reliability onto the same axes, is shown in Figure 2.
One difference is that in the last phase, software does not have an increasing failure rate as hardware does. In this phase, software is approaching obsolescence; there is no motivation for any upgrades or changes, so the failure rate does not change. The second difference is that in the useful-life phase, software experiences a drastic increase in failure rate each time an upgrade is made.
The failure rate levels off gradually, partly because defects found after the upgrades are fixed.

Figure 2. Revised bathtub curve for software reliability.

The upgrades in Figure 2 imply feature upgrades, not upgrades for reliability. With feature upgrades, the complexity of software is likely to increase, since the functionality of the software is enhanced. Even bug fixes may be a cause of more software failures, if a fix induces other defects into the software.
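The shape of the revised bathtub curve can be sketched numerically. The toy model below (all constants are invented for illustration) has a baseline failure rate that decays as defects are fixed, plus a spike at each feature upgrade that then decays in turn:

```python
import math

def failure_rate(t, upgrades=(100.0, 200.0), base=1.0, decay=0.02, jump=0.5):
    """Toy sketch of Figure 2: the baseline failure rate decays as
    defects are found and fixed, but each feature upgrade at time u
    adds a spike of size `jump` that decays again afterwards."""
    rate = base * math.exp(-decay * t)
    for u in upgrades:
        if t >= u:
            rate += jump * math.exp(-decay * (t - u))
    return rate

# Failure rate just before and just after the first upgrade at t = 100:
before, after = failure_rate(99.9), failure_rate(100.0)
print(f"before upgrade: {before:.3f}, after upgrade: {after:.3f}")
```

The jump at each upgrade and the gradual leveling-off afterwards reproduce the qualitative behavior described above, without claiming anything about real failure-rate magnitudes.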
For reliability upgrades, it is possible to see a drop in the software failure rate, if the goal of the upgrade is enhancing software reliability, such as a redesign or reimplementation of some modules using better engineering approaches such as the clean-room method. Evidence can be found in the results of the Ballista project, robustness testing of off-the-shelf software components.
Since software robustness is one aspect of software reliability, this result indicates that the upgrades of the systems shown in Figure 3 should have incorporated reliability upgrades. Since Software Reliability is one of the most important aspects of software quality, reliability engineering approaches are practiced in the software field as well.
Software Reliability Engineering (SRE) is the quantitative study of the operational behavior of software-based systems with respect to user requirements concerning reliability [IEEE95]. A proliferation of software reliability models has emerged as people try to understand how and why software fails, and try to quantify software reliability.
A great number of models have been developed since the early 1970s, but how to quantify software reliability still remains largely unsolved; interested readers may refer to [RAC96] and [Lyu95]. As many models as there are, and with more emerging, none can capture a satisfying amount of the complexity of software; constraints and assumptions have to be made for the quantification process. Therefore, there is no single model that can be used in all situations.
No model is complete or even representative. One model may work well for one set of software but be completely off track for other kinds of problems. Most software reliability models contain the following parts: assumptions, factors, and a mathematical function that relates reliability to the factors. The mathematical function is usually higher-order exponential or logarithmic.
Software reliability modeling techniques can be divided into two subcategories: prediction modeling and estimation modeling. The major differences between the two are shown in Table 1.

Table 1. Differences between software reliability prediction models and software reliability estimation models.

Using prediction models, software reliability can be predicted early in the development phase, and enhancements can be initiated to improve the reliability.
Representative estimation models include exponential distribution models, the Weibull distribution model, Thompson and Chelson's model, and others.
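As an illustrative sketch (not a rendering of any one published model), the reliability functions behind the exponential and Weibull families mentioned above can be written down directly; the parameter values used here are invented:

```python
import math

def exponential_reliability(t: float, mttf: float) -> float:
    """R(t) = exp(-t / MTTF): the exponential model's constant failure rate."""
    return math.exp(-t / mttf)

def weibull_reliability(t: float, scale: float, shape: float) -> float:
    """R(t) = exp(-(t/scale)**shape). Shape > 1 models an increasing
    failure rate, shape < 1 a decreasing one; shape == 1 reduces to
    the exponential model with MTTF = scale."""
    return math.exp(-((t / scale) ** shape))

print(exponential_reliability(50, 1000))   # ≈ 0.9512
print(weibull_reliability(50, 1000, 0.8))  # decreasing-rate case
```

The extra shape parameter is what lets the Weibull family represent both early-life and wear-out-like behavior with a single functional form.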
The field has matured to the point that software reliability models can be applied in practical situations and give meaningful results; at the same time, there is no one model that is best in all situations, and only a limited number of factors can be taken into consideration.
Defining and Analyzing Operational Profiles.

The reliability of software is strongly tied to the operational usage of an application, much more strongly than the reliability of hardware is. A software fault leads to a system failure only if that fault is encountered during operational usage.
If a fault is not exercised in a specific operational mode, it will not cause failures at all. It will cause failures more often if it is located in code that is part of a frequently used "operation". (An operation is defined as a major logical task, usually repeated multiple times within an hour of application usage.)
Therefore, in software reliability engineering we focus on the operational profile of the software, which weights the occurrence probabilities of each operation. Unless safety requirements indicate a modification of this approach, we prioritize our testing according to this profile. ALD will work with your system and software engineers to complete the tasks required to generate a usable operational profile.
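The core idea of operational-profile-driven testing can be sketched in a few lines. The operations and probabilities below are hypothetical, invented only to show the mechanism of weighting test selection by usage:

```python
import random

# Hypothetical operational profile: each operation with its occurrence
# probability; the probabilities must sum to 1.
profile = {
    "process_transaction": 0.60,
    "generate_report":     0.25,
    "update_account":      0.10,
    "recover_from_crash":  0.05,
}

def pick_operation(rng=random):
    """Select the next operation to test, weighted by usage probability,
    so that test effort mirrors how the software is actually used."""
    ops, weights = zip(*profile.items())
    return rng.choices(ops, weights=weights, k=1)[0]

# Over a long test run, process_transaction is exercised roughly
# 60% of the time, recover_from_crash only 5%.
```

Note that this is exactly where safety requirements can override the profile: a rare operation such as crash recovery may deserve far more test effort than its 5% occurrence probability alone would justify.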
Test Preparation and Plan.

Test preparation is a crucial step in the implementation of an effective software reliability program. A test plan that is based on the operational profile on the one hand, and subject to the reliability allocation constraints on the other, will be effective in achieving the program's reliability goals in the least amount of time and at the least cost.
Software Reliability Engineering is concerned not only with feature and regression testing, but also with load and performance testing. All of these should be planned based on the activities outlined above, and the reliability program will inform, and often determine, the test preparation activities.

Software reliability engineering is often identified with reliability models, in particular reliability growth models.
These models, when applied correctly, are successful at providing guidance for management decisions. The application of reliability models to software testing results allows us to infer the rate at which failures are encountered, depending on the usage profile, and, more importantly, the changes in this rate (reliability growth). The ability to make these inferences depends critically on the quality of the test results. It is essential that testing be performed in such a way that each failure incident is accurately reported.
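As a minimal sketch of what such an inference looks like, the code below fits one well-known reliability growth model, the Goel-Okumoto NHPP with mean value function m(t) = a(1 - e^(-bt)), to cumulative failure counts using a crude grid-search least-squares fit. The failure data are invented, and production tools would use maximum-likelihood estimation rather than this grid search:

```python
import math

# Hypothetical cumulative failure counts at the end of each test week.
weeks    = [1, 2, 3, 4, 5, 6, 7, 8]
failures = [12, 21, 28, 33, 37, 40, 42, 43]

def mean_value(t, a, b):
    """Goel-Okumoto NHPP: expected cumulative failures by time t,
    where a is the total expected fault count and b the detection rate."""
    return a * (1.0 - math.exp(-b * t))

def fit(weeks, failures):
    """Crude grid-search least-squares fit over plausible (a, b) values."""
    best = None
    for a in range(40, 81):
        for b100 in range(1, 101):
            b = b100 / 100.0
            err = sum((mean_value(t, a, b) - y) ** 2
                      for t, y in zip(weeks, failures))
            if best is None or err < best[0]:
                best = (err, a, b)
    return best[1], best[2]

a, b = fit(weeks, failures)
remaining = a - failures[-1]
print(f"estimated total faults ~{a}, still latent ~{remaining}")
```

The estimate of latent faults, and of how fast the failure rate is falling, is exactly the kind of quantity used to support a release decision; its trustworthiness rests on the accuracy of the underlying failure reports.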
ALD's software reliability engineers will work with developers, testers, and program management to apply an appropriate model to your failure data. For the model's predictions to be useful, we must ensure that the assumptions and structure of the model coincide with the underlying coding and testing process.
It is not sufficient to find a mathematical function that best fits the data. To infer future failure behavior, it is crucial that the underlying assumptions of the model be understood in terms of program management and progress toward release. This requires experience with software reliability models as well as an understanding of latent issues in the development and testing process that may affect the test data.

Software Safety.
As systems and products become more and more dependent on software components, it is no longer realistic to develop a system safety program that does not include the software elements. Does software fail? We tend to believe that well-written, well-tested safety-critical software would never fail.
Experience proves otherwise: software makes headlines when it actually does fail, sometimes critically. Software does not fail the same way hardware does, and the failure behaviors we are accustomed to from the world of hardware are often not applicable to software. Nevertheless, software does fail, and when it does, the consequences can be just as catastrophic as hardware failures.
Safety-critical software.

Safety-critical software is very different from both non-critical software and safety-critical hardware.
The difference lies in the massive testing program that such software undergoes. What are "software failure modes"? Software, especially in critical systems, tends to fail where least expected. We are usually extremely good at setting up test plans for the main-line code of the program, and these sections usually do run flawlessly. Software does not "break", but it must be able to deal with "broken" inputs and conditions, and these are what often cause the "software failures".
Setting up a test plan and exhaustive test cases for the exception code is, by definition, difficult and somewhat subjective. Often the conditions hardest to predict are multiple, coinciding, irregular inputs and conditions. Safety-critical software is usually tested to the point that no new critical failures are observed.
This of course does not mean that the software is fault-free at this point, only that failures are no longer observed in test.
Why are the faults leading to these types of failures overlooked in test? They are faults that are simply not tested for.

Reliability modeling for safety-critical software.

Abstract: Software reliability predictions can increase trust in the reliability of safety-critical software such as the NASA Space Shuttle Primary Avionics Software System (the Shuttle flight software).
This objective was achieved using a novel approach that integrates software-safety criteria, risk analysis, reliability prediction, and stopping rules for testing. The approach applies to other safety-critical software as well. The authors cover only the safety of the software elements in a safety-critical system.