Identifying Factors Influencing Reliability of Professional Systems

Aravindan Balasubramanian, Technische Universiteit Eindhoven; Kostas Kevrekidis, Technische Universiteit Eindhoven; Peter Sonnemans, Technische Universiteit Eindhoven; Martin Newby, City University London

Key Words: Reliability Estimation, Prediction, Reliability Factors, Failure Analysis, System Performance Improvement

SUMMARY & CONCLUSIONS

Modern product development strategies call for a more proactive approach to fight intense global competition in terms of technological innovation, shorter time to market, quality and reliability, and competitive price. From a reliability engineering perspective, development managers would like to estimate as early as possible how reliably the product is going to behave in the field, so that they can focus on system reliability improvement. To steer such a reliability-driven development process, one of the important aspects in predicting the reliability behavior of a new product is to know the factors that may influence its field performance. In this paper, two methods are proposed for identifying reliability factors and their significance in influencing the reliability of the product.

1 INTRODUCTION

In today's highly automated, technology-driven world, products have become more complex, with numerous hardware and software components put together to perform tasks with speed and accuracy. As a result, the business processes for developing such products are becoming more complex, with increasing pressure for shorter time to market and customer demands on product cost, quality, and reliability. In such a product development environment, estimating and predicting reliability accurately during development is a major challenge, since there is often a mismatch between the reliability estimated by the development team and the reliability that is finally observed in the field [1]. This poor "prediction" may be due to many issues, for example: testing conditions that are not representative of the actual field conditions [2], inappropriate assumptions about the field conditions and modeling parameters, inappropriate prediction techniques [1], and failure to take into account relevant factors [2] that determine the reliability performance.

This paper deals with the issue of identifying factors which determine the reliability performance of professional systems. Knowing these "reliability factors" is important for development since:

• Incorporating the effect of relevant reliability factors into a reliability estimation model enables a more realistic reliability assessment of the product, i.e., one closer to its actual field performance.

• Reliability improvement efforts during development can be focused more effectively based on the effect of the most relevant factors.

Several studies [2][3][4][5] have suggested and demonstrated how important these factors are in assessing product reliability. Ascher et al. [2] and Wong [3] also point out the shortcomings of ignoring these "real world" explanatory factors [4][5] in probabilistic modeling. Research has also been done in the area of software reliability engineering to identify influential reliability factors, related to software development processes and their environment, that should be incorporated in software reliability estimation [6]. This paper attempts to identify reliability factors, in terms of the field environment, system configuration (both software and hardware), and operational profiles of the system, that influence reliability. The problem addressed here is how to identify those reliability factors. The next section introduces the case study used in this paper. Section 3 demonstrates the methods used to identify relevant factors and their significance, while limitations and future work are discussed in Section 4.

2 CASE STUDY

This research is carried out at a company that develops so-called pick-and-place systems, which are specially designed for placing electronic components on printed circuit boards (PCBs). The company develops different series of products based on middle-end and high-end customer requirements; for this study, only the high-end series is considered. Systems in this series consist of a mounting base, feeders, pick-and-place robots, PCB transport, and a number of computers and software to control the system and its functions. The robots pick the components from the feeders, check the alignment of the components using either laser vision or camera vision, and place the components on the PCBs that are fed through the system. The systems in the field are regularly updated with new software releases, once or twice a year. The reliability performance of the system is measured by testing (Beta testing) the systems at customer sites (reference sites). In-house testing of the system says very little about the reliability performance in the field, since system performance depends very much on the operating environment of the system, such as production changeovers, environmental conditions on the production floor, system configuration, and system usage, which are not feasible to replicate.

Since the reliability of the system is determined by the failures that occur in the system, it is important to define which malfunctions (errors) are considered failures from the company's point of view. A malfunction is considered a failure if it concerns an event that causes the system to shut down or forces the operator to reboot the system (excluding normal shutdowns and power interrupts). Hence the last state that the machine was in before it was shut down is important, for it gives an indication of whether or not the shutdown was caused by a failure. The states of the system are continuously monitored and recorded by the system itself. The different kinds of failures that cause the shutdowns are startup error, startup crash, startup hang-up, standby crash, productive crash, hang-up, unrecoverable error, and repeatable error. These failures are classified into failure type groups, which are explained in Table 1.

Table 1 – Failure Types

Startup (i.e., startup error, startup crash & startup hang-up): The machine has to be shut down unintentionally before the startup process is finished.

Crash (i.e., stand-by crash & productive crash): The monitoring process detects a software process crash resulting in a stop of the application, which causes the machine to shut down.

Hang-up: A process crashes but the monitoring process does not detect it, so another process keeps waiting for a signal that never arrives; nothing responds anymore.

Failure error (i.e., unrecoverable error & repeatable error): An error occurs when a certain action that is necessary for production cannot be performed, or an error keeps occurring repeatedly.

Unknown: The machine is shut down in the state Standby, non-scheduled time, or scheduled downtime, but is turned on again within 10 minutes; the reason for the shutdown is unknown but it is suspected to have been necessary.
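The classification rule implied by Table 1 can be illustrated with a short sketch. This is a hypothetical reconstruction, assuming the monitor logs one final state or error label per shutdown together with the time until restart; the label names and the classify_shutdown function are illustrative, not the company's actual logging interface.

```python
from typing import Optional

# Hypothetical mapping from logged shutdown events to the failure type
# groups of Table 1. The event labels are assumed, not the real log format.
FAILURE_TYPE_GROUPS = {
    "startup_error": "Startup",
    "startup_crash": "Startup",
    "startup_hangup": "Startup",
    "standby_crash": "Crash",
    "productive_crash": "Crash",
    "hangup": "Hang-up",
    "unrecoverable_error": "Failure error",
    "repeatable_error": "Failure error",
}

def classify_shutdown(last_state: str, minutes_until_restart: float) -> Optional[str]:
    """Return the failure type group for a shutdown, or None for a normal
    shutdown (which is not counted as a failure)."""
    if last_state in FAILURE_TYPE_GROUPS:
        return FAILURE_TYPE_GROUPS[last_state]
    # Shutdown from Standby / (non-)scheduled time with a restart within
    # 10 minutes: reason unknown, counted as the 'Unknown' failure type.
    if last_state in {"standby", "non_scheduled", "scheduled_downtime"} \
            and minutes_until_restart < 10:
        return "Unknown"
    return None  # normal shutdown or power interrupt
```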

At the company, the reliability performance is measured as the Mean Time Between Failures (MTBF), calculated on the basis of the productive time of the system. A preliminary graphical analysis of the field failure data of 16 systems from 5 different customers, all running the same software version, shows (see Figure 1) that the reliability performance of the systems varies considerably. The MTBFs with 90% confidence intervals, based on the exponential distribution (since the system is considered stable and the times between failures have been shown to be exponentially distributed), are also shown in Figure 1. The difference in performance among the systems is found to be significant by an analysis of variance (ANOVA). This performance variation makes it harder to predict the reliability of systems during development, since the performance of a system depends on how it is used. Hence it becomes imperative to know what causes these differences in reliability performance, and how the causal factors can be taken into account in the reliability estimation procedure to make a realistic assessment that reflects what is observed in the field. It is this problem that drives the objective of this paper, i.e., how to identify relevant factors and their impact on system reliability performance. The following section explains how these factors and their significance can be identified by analyzing the variation in reliability performance.

Figure 1 – MTBFs of the systems with 90% confidence limits & ANOVA results
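For a single system, the exponential MTBF point estimate and its two-sided confidence interval can be computed from the total productive time and the failure count using the chi-square distribution. The sketch below is a minimal illustration of this standard calculation, not the company's tooling; the example numbers are made up.

```python
from scipy.stats import chi2

def mtbf_confidence_interval(total_time: float, n_failures: int,
                             confidence: float = 0.90):
    """Point estimate and two-sided CI for the MTBF under an exponential
    (constant failure rate) model, for failure-truncated data."""
    alpha = 1.0 - confidence
    mtbf = total_time / n_failures
    lower = 2.0 * total_time / chi2.ppf(1.0 - alpha / 2.0, 2 * n_failures)
    upper = 2.0 * total_time / chi2.ppf(alpha / 2.0, 2 * n_failures)
    return mtbf, lower, upper

# Illustrative numbers only: 1,200 h of productive time, 15 failures.
print(mtbf_confidence_interval(1200.0, 15))
```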

3 METHODOLOGY

As a first step in finding these reliability factors, a brainstorming session is organized with experts from different functional groups (e.g., development, software architecture & development, customer support, maintenance, reliability). The participants are briefed about the objective of the brainstorm and asked to name any possible reliability factors with respect to the field environment, system configuration (both software and hardware), operational profiles, and operational performance of the system. The identified factors are grouped into four major groups (I to IV) based on their characteristics, as shown in Figure 2. Due to confidentiality, not all factors are shown. From this list, the factors that significantly influence reliability have to be determined. Two methods are applied to identify the significant factors: a questionnaire and a failure analysis. Both methods are discussed in the next subsections.


Figure 2 – List of Factors from Brainstorm

3.1 Questionnaire

In this method, a questionnaire is developed that enables the respondents to rate the factors for their significance in influencing the reliability of the system. A covering letter is attached to the questionnaire, explaining the objective of the research and of the questionnaire, to help the respondents understand its importance and purpose. The questionnaire contains altogether 48 factors (identified from the brainstorm) to be rated. It uses a 7-point Likert scale (1 indicates a low influence and 7 a high influence) to capture how these factors influence the system reliability according to the experts. Factors with an average score close to 7 are taken as highly influencing factors, as opposed to factors with an average score close to 1. The questionnaires are sent to employees in the above-mentioned functional groups; they are asked to fill in their designation, their experience with the systems on development- and reliability-related issues, and their level of knowledge and awareness of the reliability of the system. To remove any bias in the scores, such as respondents scoring factors from their own functional view or giving uniformly high or low scores to all factors, the Relative Weight Method [6] is used to obtain the final ranking of the factors. This method normalizes the rating r_ij that the jth respondent gives to the ith factor by dividing it by the sum of that respondent's ratings over all n factors; this normalization removes the bias explained above.

i.e.,

N_{ij} = \frac{r_{ij}}{\sum_{i=1}^{n} r_{ij}}

The overall rating for each factor is obtained by taking the average of the normalized ratings.

N_i = \frac{1}{l} \sum_{j=1}^{l} N_{ij},

where l is the total number of responses.

The higher the value of N_i, the more significant the factor is in influencing the reliability of the systems.
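A minimal sketch of this normalization and ranking, assuming the responses are collected in a respondents-by-factors matrix (the variable names and the small example matrix are illustrative, not real questionnaire data):

```python
import numpy as np

# ratings[j, i] = rating of factor i by respondent j on the 7-point scale.
# Illustrative 3-respondent x 4-factor example.
ratings = np.array([[7, 6, 2, 1],
                    [5, 5, 3, 2],
                    [7, 7, 1, 1]], dtype=float)

# Relative Weight Method: normalize each respondent's ratings by the sum
# of that respondent's ratings over all factors (removes per-person bias).
N = ratings / ratings.sum(axis=1, keepdims=True)   # N[j, i] = N_ij

# Overall rating per factor: average of the normalized ratings over the
# l respondents; higher values indicate more influential factors.
overall = N.mean(axis=0)                            # overall[i] = N_i
ranking = np.argsort(overall)[::-1]                 # factor indices, best first
print(overall, ranking)
```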

These rated influential factors are analyzed individually, by a series of hypothesis tests, to see whether they significantly influence the failures and hence the reliability. The hypotheses are derived in a subsequent group discussion with experts and the customer support group, based on their experience with these systems, their failure behavior, and the corrective actions taken on those failures. Example hypotheses are: whether the number of failure errors is related to the number of manual stops; whether the number of crashes is influenced by the presence of a camera vision system; and whether the startup failures are influenced by the number of robots in the system. Furthermore, hypotheses expressing supposed relationships between differences in reliability performance and reliability factors can also be derived by gaining more insight into the failure behavior, for example, how the different failure types are distributed among systems from different customers. This insight can be achieved with a simple graphical analysis. For example, Figure 3 shows the startup failure ratio of the systems across different customers. There it can be seen that systems B2, B3, C2, C3, D2, and D3 have higher startup failure ratios than the rest of the systems. Hence it would be interesting to know what distinguishes these systems from the rest in terms of the reliability factors.

Figure 3 – Startup Failure Ratio of the systems

Further enquiry revealed an obvious difference in the number of robots installed on the systems, which can also be seen in Figure 3, where the number of robots is indicated on top of the bars. Systems with a large number of robots have a large startup failure ratio.

[Figure 3 is a bar chart: x-axis machines B1–D4, y-axis start-up failures/number of start-ups (ratio in %), with the number of robots per system printed above each bar.]


The data from these systems are used to investigate further whether the number of robots actually influences the number of startup failures.
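One simple way to probe this hypothesis is a rank correlation between the number of robots and the startup failure ratio across machines. The sketch below illustrates the idea; the per-machine values are invented placeholders, not the confidential fleet data.

```python
from scipy.stats import spearmanr

# Placeholder per-machine values (one entry per machine B1..D4).
robots        = [11, 18, 20, 5, 5, 18, 20, 6, 4, 20, 20, 6]
startup_ratio = [0.02, 0.09, 0.11, 0.01, 0.02, 0.08,
                 0.10, 0.02, 0.01, 0.09, 0.12, 0.03]

# Spearman rank correlation: robust to the unknown functional form of the
# relationship; a small p-value supports "more robots, more startup failures".
rho, p_value = spearmanr(robots, startup_ratio)
print(f"rho = {rho:.2f}, p = {p_value:.3f}")
```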

The basic goal of these hypotheses is to determine how significantly a certain factor is related to the failures, and how significantly a certain factor is related to another factor (dependency). As already mentioned, it is not feasible to test these factors in an experimental set-up, since the factors can have various levels and testing them systematically would be too expensive and time consuming. Hence some of these hypotheses are tested using the existing Beta testing data and the available historical field failure data. As an illustration, one of the hypotheses tested is presented here. Hypothesis: the number of crashes is positively correlated with the presence of Camera Vision (CV) in the system. Systems use laser vision for measuring the size of the components and aligning them on the PCB; some systems also use cameras for aligning the components. To investigate whether the presence of cameras causes more crashes, crash data are collected on systems with and without cameras installed. To compare different systems on the number of crashes, the Mean Time Between Crashes (MTBC) is a good indicator, since this parameter is related to the productive time of the system: systems with a lower MTBC have more crashes per unit of productive time than systems with a higher MTBC. Thus, for the above hypothesis, it has to be tested whether the MTBC of the systems with CV (μwCV) is significantly lower than the MTBC of the systems without CV (μwoCV), i.e., H0: μwCV = μwoCV, Ha: μwCV < μwoCV.

The data from the systems with CV are grouped together, and likewise for the systems without. A one-way ANOVA is performed to find out whether there is any difference in MTBC within and between the groups. The results are shown in Figure 4: the between-groups variation is found to be significant, which implies that there is a significant difference in MTBC between the systems with and without CV. A 90% confidence interval for the MTBC, based on the exponential distribution, is shown in Figure 5. There it can clearly be seen that the intervals do not overlap, which again strongly confirms the significant difference in MTBC between the systems with and without CV. Furthermore, Figure 5 also shows that systems without CV have higher MTBCs than systems with CV. Therefore, we conclude that more crashes occur on systems with CV installed than on systems without. However, this does not prove a causal relation between the CVs and the crashes; other factors could also be causing the crashes. This can only be revealed when all the factors and their correlations with other factors are analyzed.
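A sketch of such a comparison, using a one-way ANOVA on per-system MTBC values (the two groups below are invented placeholder numbers in hours, not the actual field data):

```python
from scipy.stats import f_oneway

# Placeholder per-system MTBC values in hours.
mtbc_with_cv    = [110.0, 95.0, 130.0, 105.0, 120.0]
mtbc_without_cv = [210.0, 250.0, 190.0, 230.0, 205.0]

# One-way ANOVA compares between-group to within-group variation; a small
# p-value indicates a significant difference in mean MTBC between groups.
f_stat, p_value = f_oneway(mtbc_with_cv, mtbc_without_cv)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```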

In this way, the significance of the reliability factors derived from the questionnaire can be validated by hypothesis testing using field data. The correlations between these factors are verified by a correlation analysis. If factors are correlated, only the main independent factors are taken into account in the model for reliability estimation; the remaining factors that depend on these main factors are left out, as their effect is already captured by including the main factors.
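A minimal sketch of this screening step, assuming the per-system factor observations sit in a data frame (the column names, the generated data, and the 0.8 cut-off are illustrative choices, not from the paper):

```python
import numpy as np
import pandas as pd

# Hypothetical per-system factor observations for 16 systems.
rng = np.random.default_rng(0)
robots = rng.integers(4, 21, size=16)
df = pd.DataFrame({
    "num_robots": robots,
    "num_feeders": robots * 3 + rng.integers(0, 4, size=16),  # dependent factor
    "changeovers_per_day": rng.integers(1, 12, size=16),
})

corr = df.corr().abs()
threshold = 0.8  # arbitrary cut-off for "strongly correlated"

# Greedy screening: keep a factor only if it is not strongly correlated
# with a factor that has already been kept.
kept = []
for col in df.columns:
    if all(corr.loc[col, k] < threshold for k in kept):
        kept.append(col)
print(kept)  # e.g. ['num_robots', 'changeovers_per_day']
```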

Figure 4 – ANOVA results

Figure 5 – MTBCs with 90% confidence limits for systems with and without Camera Vision

3.2 Failure Analysis

Significant reliability factors are also identified by analyzing the root causes of the failures. Whenever there is a failure, the system produces an error message with an error (failure) code, one or two likely causes of the error, and the respective corrective action to be taken. These error messages are pre-programmed in the software, and a particular error message is displayed whenever the associated failure occurs. The information from these error messages is used for the failure analysis. The likely causes are not necessarily the root causes of the failure, but they serve as a compass for finding the root cause during troubleshooting. To derive the relevant factors, the failures of the systems are first collected from the customer sites and grouped into the previously introduced failure types. Their respective error (failure) codes and likely causes are listed for each failure. For each likely cause, the experts are then asked to list the possible root causes, along with the probability that each cause is the real cause of the failure. The experts consulted here are mainly from the customer support groups and software developers, as they are the creators of those error messages and are mainly involved in failure diagnosis. For each probable root cause, the average probability (Avg. p, see Figure 6) is taken for further analysis, and the reliability factors that are relevant to the root cause are identified from Figure 2. The number of identified factors can be one, two, or many, depending on the root cause. The factors are then rated on a 7-point Likert scale by the same expert group to denote how much impact (relevance) each factor has on the root cause (1 indicates a low impact and 7 a high impact), and finally the average rating is taken (see Figure 6) for further analysis.

Figure 6 – Details of failures with likely causes & scoring of the factors

The significance of the factors is indicated by constructing a score for each factor. The score for each factor is obtained by multiplying the average probability of occurrence of the failure (root) cause by the average impact rating of the factor. The total score for each distinct factor is obtained by summing the scores of that factor over all its instances across all failures, failure types, and customers. The higher the score, the more significant the factor is in influencing the reliability of the systems. In this way, the influential reliability factors can also be identified. For illustration, one failure type is shown in Figure 6 with the details of the failure, the probable root causes, and the scorings.
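A compact sketch of this scoring scheme, with hypothetical root causes, probabilities, and impact ratings (all names and numbers below are invented for illustration):

```python
from collections import defaultdict

# Each record: (root cause, avg. probability that it is the real cause,
#               {reliability factor: avg. impact rating on a 1-7 scale}).
root_causes = [
    ("feeder misalignment",   0.6, {"num_changeovers": 6, "operator_skill": 4}),
    ("camera calibration",    0.3, {"camera_vision": 7}),
    ("software race at boot", 0.4, {"num_robots": 5, "software_version": 6}),
]

# Score per instance = avg. probability x avg. impact rating; total score
# per factor = sum over all instances across failures, types, and customers.
total_score = defaultdict(float)
for _cause, avg_p, factor_impacts in root_causes:
    for factor, avg_rating in factor_impacts.items():
        total_score[factor] += avg_p * avg_rating

# Higher total score = more significant factor.
for factor, score in sorted(total_score.items(), key=lambda kv: -kv[1]):
    print(f"{factor}: {score:.1f}")
```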

This method helps to validate the results obtained by the first method (the questionnaire) and also helps to identify factors that were missed or left out earlier. Furthermore, this failure analysis can be helpful in deciding on the significance of a factor when the hypothesis testing in the previous section remains inconclusive. Hence this method can act as a complement to the questionnaire. However, the method has limitations. Sometimes it is not easy to track down the exact root cause of a failure, because the failure could have been caused by one of the last errors or warnings just before it, or by a combination of such errors or warnings. Another reason could be the occurrence of a new failure for which no information is programmed. Failures might also have been caused by hardware failures initiated by software failures, or vice versa, in which case it is hard to find the real root cause. Moreover, for crash failure types this failure analysis method cannot be applied, because when the system crashes it does not give an error code or suggested likely causes.

4 LIMITATIONS & FUTURE WORK

The scope of this paper is limited to identifying relevant reliability factors related to the field environment, system configuration (both software and hardware), and operational profiles of the system. Nevertheless, the methodology followed in this paper can also be applied to an expanded scope of reliability factors, such as factors related to the development process (design & manufacturing) and the software development process.

Having identified the reliability factors, the next challenge is how to incorporate the effect of these relevant reliability factors into the reliability estimation and prediction procedures. Current research work is directed towards including the effects of these reliability factors in appropriate reliability modeling and prediction techniques to improve reliability estimation and prediction. Recent work [7][8] shows promising results in using neural networks as a prediction tool in reliability engineering. Neural networks are more flexible in terms of learning from data and require fewer assumptions than traditional analytical models. These characteristics, coupled with their ability to deal with incomplete or noisy data, make them interesting to explore for this reliability prediction. Hence, future work will be aimed in this direction.

ACKNOWLEDGMENT

The authors would like to thank Mr. J.C. Kerstens for his contribution in data collection and analysis.

REFERENCES

1. Yadav, O.P., Singh, N., Goel, P.S., Campbell, R.I., "A Framework for Reliability Prediction during Product Development Process Incorporating Engineering Judgments," Quality Engineering, vol. 15, no. 4, 2003, pp. 649-662.

2. Ascher, H., Feingold, H., Repairable Systems Reliability: Modeling, Inference, Misconceptions and Their Causes, Marcel Dekker, Inc., 1987.

3. Wong, K.L., "What is wrong with the existing reliability prediction methods?," Quality and Reliability Engineering International, vol. 6, 1990, pp. 251-257.

4. O'Connor, P., "Testing for reliability," Quality and Reliability Engineering International, vol. 19, issue 1, 2003, pp. 73-84.

5. Denson, W., "PRISM – A Tutorial," The Journal of the Reliability Analysis Center, Third Quarter, 1999.

6. Zhang, Z., Pham, H., "An analysis of factors affecting software reliability," The Journal of Systems and Software, vol. 50, 2000, pp. 43-56.

7. Rajpal, P.S., Shishodia, K.S., Sekhon, G.S., "An artificial neural network for modeling reliability, availability and maintainability of a repairable system," Reliability Engineering and System Safety, vol. 91, 2006, pp. 809-819.

8. Marseguerra, M., Zio, E., Ammaturo, M., Fontana, V., "Predicting Reliability Via Neural Networks," Proc. Ann. Reliability & Maintainability Symp., Jan. 2003, pp. 196-201.

BIOGRAPHIES

Aravindan Balasubramanian, MSc, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, THE NETHERLANDS

e-mail: [email protected]

Aravindan Balasubramanian is a PhD student at Eindhoven University of Technology, in the faculty of Technology Management, where he is doing his research in the field of Quality & Reliability Engineering. His research focuses on predicting product reliability during the product development process for professional systems.

Kostas Kevrekidis, MSc, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, THE NETHERLANDS

e-mail: [email protected]

Kostas Kevrekidis is a PhD student at Eindhoven University of Technology, in the faculty of Technology Management, where he is doing his research in the field of Quality & Reliability Engineering. His research focuses on field monitoring techniques for capital goods.

Peter J.M. Sonnemans, PhD, MSc, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, THE NETHERLANDS

e-mail: [email protected]

Peter Sonnemans is an assistant professor at Eindhoven University of Technology, in the faculty of Industrial Design, where he is responsible for research and education in the field of Business Development. He is also connected to Philips Electronics as a senior consultant in the same field.

Martin J. Newby, PhD, City University, London EC1V 0HB, United Kingdom

e-mail: [email protected]

Martin Newby is Professor of Statistical Science at City University, London. He has worked in industry and was previously lecturer in Industrial Technology at the University of Bradford and Associate Professor of Industrial Engineering at Eindhoven University of Technology. He graduated from Sussex University. He is the author of many papers on reliability and related topics. His research interests include Bayesian statistics, reliability, and maintenance problems.