Monday, June 11, 2007

Software Reliability

Primary reliability factors for software have been reported by Sally Dudley to be [1]:

1. Defects in the code
2. Defects in the interfaces with other code
3. Operational defects which cause changes to defect-free code

We can add sustainability to this list. Sustainability is an attribute that describes the lifetime of a software system and its continuity. Once a software system becomes obsolete, its reliability could be greatly impacted. This makes sense: when software runs on obsolete hardware that its manufacturer may no longer support, the reliability of the software suffers.

Software systems can lose sustainability for various reasons. Sandborn and Plunkett summarized three main causes of software obsolescence [2]: functional, technological and logistical factors.


References:

[1] W. Ireson and C. Coombs, "Handbook of Reliability Engineering and Management", McGraw-Hill, 1988.

[2] P. Sandborn and G. Plunkett, "The Other Half of the DMSMS Problem - Software Obsolescence", DMSMS Knowledge Sharing Portal Newsletter, vol. 4, issue 4, pp. 3 and 11, June 2006.

Wednesday, July 26, 2006

Reliability Engineering - Social Application

Engineering is the application of the concepts of science to solve problems. Reliability engineering applies the concepts and methods of reliability to continuously improve a system's reliability throughout its life cycle.

A practical example:
Let's consider a worshipper who recognizes that he has committed a sin. His conscience tells him he needs to improve his commitment to worship. One way of doing so is to improve his reliability (i.e. reduce the number of times he sins per unit time, let's say per month). He decides to reduce the amount of time he spends chatting with others, so that the probability that he talks about someone behind their back is reduced and hence his reliability as a worshipper improves.

What the worshipper did above was change a process he is involved in: the process of chatting.

Similarly, in engineering systems, reliability engineering is an integral part of the system's design process. Some of the methods of improving reliability are:

A. During the Design Phase

1. Design changes
2. Level of authority and constraints on the design discipline
3. Design reviews carried out by designers, reliability engineers, quality engineers, production engineers, customers, marketing department, test teams
4. Redesign
5. Prototype testing for a sufficient amount of time
6. Perform reliability estimation, prediction and growth plans
7. Document reliability specifications for system components
8. Document causes of reliability degradation

B. During the Production Phase
1. Reliability and quality production testing
2. Verification and Validation
3. Perform statistical analysis of test data
4. Provide assistance to production, quality assurance and purchasing

C. During the Support Phase
1. Field failure report systems

MIL-STD-785 provides a description of a reliability program.

Definitions

Reliability is most widely defined as "the ability or capability of a system to perform the specified function in the designated environment for a specified length of time".

As a result, the life of a system cannot be determined except by running or operating it for the desired time.

Let's look at a real example:
A client of mine who owns a travel business called me this morning. He was disappointed that a new hire in his firm was 2 hours late yesterday and did not show up today. This disappointment is basically my client's feeling that the new hire is not reliable. In the relatively short time since the employee joined the firm, he failed to meet an expected capability (showing up on time). These two events dramatically undermined the employee's reliability. If the employee had first been late a year after joining my client's firm instead of a few weeks, his reliability rating would have been much higher.

Conclusion:
That is why first impressions count most, and should be given consideration.

In engineering terms, this is what we call Mean Time to Failure, i.e. the amount of time it takes, on average, to fail from a non-failed state.

Mean Time Between Failures is another important characteristic. It tells us how much time it takes, on average, to fail again after the previous failure.

Failure rate indicates how frequently a system fails per unit time. It is typically denoted by the Greek letter lambda.
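
To make these three measures concrete, here is a small sketch in Python. The function names and the sample figures (4 failures over 2000 operating hours) are made up for illustration. The reliability function also assumes a constant failure rate (the exponential model), which is a common simplification and not something established in this post.

```python
import math


def mean_time_between_failures(operating_hours, failure_count):
    """Estimate MTBF as total operating time divided by observed failures."""
    if failure_count <= 0:
        raise ValueError("need at least one observed failure")
    return operating_hours / failure_count


def failure_rate(operating_hours, failure_count):
    """Estimate lambda as failures per unit of operating time."""
    if operating_hours <= 0:
        raise ValueError("operating time must be positive")
    return failure_count / operating_hours


def reliability(t, lam):
    """Probability of running for time t without failure.

    Assumes a constant failure rate (exponential model): R(t) = exp(-lam * t).
    """
    return math.exp(-lam * t)


# Hypothetical observation: 4 failures over 2000 operating hours.
mtbf = mean_time_between_failures(2000.0, 4)  # hours per failure
lam = failure_rate(2000.0, 4)                 # failures per hour
r_100 = reliability(100.0, lam)               # chance of surviving 100 hours

print(f"MTBF = {mtbf} hours")
print(f"lambda = {lam} failures per hour")
print(f"R(100 h) = {r_100:.4f}")
```

Under the constant failure rate assumption, lambda is simply the reciprocal of MTBF, so a system that fails less often per unit time runs longer between failures.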

System Reliability

System reliability plays a significant role in the system's value proposition. Reliability is one of the several main variables I consider when applying the 3DVP method I developed back in 2003.

Reliability continues to be one of the key elements in competition on all levels: business competition, international competition, or otherwise.

This blog will explore reliability in more detail. It is open to everyone interested in reliability to pitch in. If you are involved in the reliability of software applications, hardware systems, transportation, security, healthcare, education, employees or any other area, you are welcome.