Reliability Edge Newsletter

Volume 10, Issue 1


Different Analysis Scenarios for Examining Repairable Systems

 

This article reviews the different analysis scenarios that can be used when examining the reliability of repairable systems. Five different methods will be reviewed, specifically:

  1. Using the mean value of the times between failure (TBFs) at the system level.
  2. Using the distribution of the times between failure (TBFs) at the system level.
  3. Using the distributions of the components of the system and creating a reliability block diagram (RBD).
  4. Using the system level failure data and fitting the non-homogeneous Poisson process (NHPP) model.
  5. Using the system level failure data and fitting the general renewal process (GRP) model.

The following procedure will be used to create failure data for a hypothetical system that can be analyzed with all five of these methods in order to compare the analysis results.

  • A hypothetical system will be created using a reliability block diagram with known component failure distributions. This will represent a system where the "true" reliability and expected failures are known.
  • The repair duration will be considered negligible for the purpose of this comparison.
  • A simulation will be performed on this system for a defined mission duration and the failure events during the simulation will be obtained (including the time and the component responsible for each failure).
  • Two more such simulations will be performed, for a total of three simulations with different mission durations. This would represent three fielded systems of different operational times/ages.
  • The results from these three simulations will be analyzed with all five analysis methods so the results can be compared.

Simulation

For simplicity, we have chosen to use a race car as our hypothetical system. The system is broken down into two major subsystems:

The Front Assembly is composed of:

  • Front Brakes
  • Front Suspension

The Rear Assembly is composed of:

  • Rear Brakes
  • Rear Suspension
  • Engine
  • Transmission

Table 1 shows the failure distributions and corresponding parameters that were set to represent the "true" reliability behavior of these components. Figure 1 shows the RBDs for the system.

Table 1: "True" failure distributions for the components in the system

Figure 1: RBDs for the system

Three simulations were performed based on the RBD shown in Figure 1 and the distributions given in Table 1. The mission time of the first simulation was 2500 km, the second was 1976 km and the third was 800 km. To replicate a more realistic scenario of systems operating in the field, preventive maintenance was performed on the brakes every 305 km. The RBD is intended to represent an actual system that operates in the field, and each of the three simulations represents one fielded system. The distributions that generate the failures are known, but we will pretend that we do not know them and try to estimate the reliability using the five analysis methods mentioned in the introduction.

Table 2 shows the events obtained from the three simulations. As shown in this table, the event time was recorded along with the corresponding component that initiated the event. (Note that the event times are in a cumulative scale.) We will now assume that this is all we know about the design and that each simulation represents a system in operation, where System 1 has operated for 2500 km so far, System 2 for 1976 km and System 3 for 800 km.

Table 2: Simulation results

 

Approach #1: Using the Mean Value of the System's Times Between Failure (TBFs)

With the first analysis approach, the objective is to estimate the MTBF of the design (system) and, based on this estimation, possibly make predictions about future events. Under this model, we look only at the times between failure (TBFs) for each system, which are shown in Table 3. In this table, all of the preventive maintenance events have been removed, since they do not represent failures.

Table 3: Times between failures (TBFs) for each system

 

There are two ways to utilize the TBF data of Table 3 in order to obtain an MTBF. The first is to simply sum all the system ages and divide by the total number of events, or:

\[ \text{MTBF} = \frac{\sum_{i=1}^{S} T_i}{N} \tag{1} \]

 

where:

  • S = total number of systems.
  • Ti = age of the ith system.
  • N = total number of events from all systems.

The MTBF calculated using this approach for this data set is 329.75 km. However, this result could be very misleading, since it assumes random failure behavior (i.e. a constant rate of occurrence of events, also known as the failure rate). If the system exhibits an aging pattern (wearout) or an infant mortality pattern, this equation averages out all of the TBFs, and the mean could be overestimated in the case of wearout or underestimated in the case of infant mortality.
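As a quick check of Eqn. (1), the calculation can be sketched in a few lines of Python. The system ages are those stated for the three simulations. The total event count of 16 is not quoted in the text; it is an assumption inferred from the reported result (5276 km / 329.75 km = 16).

```python
# System ages in km, as stated for the three simulated systems.
system_ages_km = [2500.0, 1976.0, 800.0]

# Assumption: total number of failure events, inferred from the
# reported MTBF of 329.75 km (5276 / 329.75 = 16), not quoted directly.
total_events = 16


def mtbf_from_ages(ages, n_events):
    """Eqn. (1): sum of all system ages divided by the total event count."""
    if n_events <= 0:
        raise ValueError("at least one event is required")
    return sum(ages) / n_events


mtbf = mtbf_from_ages(system_ages_km, total_events)
print(f"MTBF = {mtbf:.2f} km")
```

Note that nothing in this calculation depends on when the failures occurred, which is exactly why it cannot distinguish wearout from infant mortality.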

Obtaining the MTBF from the distribution of the TBFs would provide a better estimator. Under this approach, a distribution is fitted to the TBFs that will represent their behavior. The data set for this example is entered in the Weibull++ software and Figure 2 shows the analysis results. Notice that there is one suspension for each system, which is the time between the last event and the current age of the system.

Figure 2: Analyzing the mean value of the TBFs in Weibull++

Under this analysis, the best-fit distribution is the Weibull with beta = 1.10 and eta = 337 km. Based on this distribution, the MTBF can be calculated with the Weibull++ QCP and is found to be MTBF = 324.5 km. This estimate is preferred over the one obtained using Eqn. (1) because the behavior of the TBFs is considered and the estimate is based on a best-fit model rather than assuming a constant rate of occurrence of failures (ROCOF). (Note that a constant ROCOF is similar to assuming an exponential distribution or, more correctly, a homogeneous Poisson process.)
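The mean of a two-parameter Weibull distribution is eta * Gamma(1 + 1/beta), so the MTBF reported above can be approximately reproduced outside of Weibull++. The sketch below uses the rounded parameters quoted in the text (beta = 1.10, eta = 337 km), so it should land close to, but not exactly on, the 324.5 km obtained from the unrounded fit.

```python
import math


def weibull_mean(beta, eta):
    """Mean of a two-parameter Weibull: eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)


# Rounded parameters as quoted in the article.
mtbf_weibull = weibull_mean(beta=1.10, eta=337.0)
print(f"MTBF from fitted Weibull = {mtbf_weibull:.1f} km")
```

With beta = 1 the Gamma term equals 1 and the mean reduces to eta, which is the exponential (constant ROCOF) case.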

However, caution must be used when selecting this approach. Even though this analysis gives a better estimate than the one given by Eqn. (1), it is still an average and, as will be shown later, averages work well for predicting future events only when sufficient data are present and the systems have reached a "steady state."

A second caution concerns the misuse of this approach. In many cases, analysts misinterpret this model as being the failure distribution of the system, and they perform additional estimations, such as reliability, BX calculations, etc. These types of results are incorrect, since the model simply describes the behavior of the TBFs and, in essence, is a model of the MTBF. In addition to the MTBF, we can also use this model to obtain, for example, the percentage of the TBFs that fall within a given range, but this does not represent a reliability/unreliability calculation. For example, if we compute P(t ≤ 200 km) for this model, we get 43%. This does not mean that the probability of failure of the system is 43%. Rather, it means that 43% of the TBFs are expected to be 200 km or less. Obviously, this is very different from a reliability/unreliability statement. Also notice that this statement is independent of the chronological order of the TBFs: the 43% of the TBFs that are less than or equal to 200 km could have occurred at the beginning of the life of the system, at the latter stages, or just randomly. All we can tell from this model is that we expect 43% of the TBFs to be less than or equal to 200 km.
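The 43% figure can be checked directly from the Weibull cumulative distribution function, F(t) = 1 - exp(-(t/eta)^beta), evaluated at t = 200 km with the rounded parameters quoted earlier (beta = 1.10, eta = 337 km). This is only a sketch of the arithmetic; the function name is illustrative.

```python
import math


def weibull_cdf(t, beta, eta):
    """Fraction of TBFs expected to be <= t under the fitted Weibull."""
    return 1.0 - math.exp(-((t / eta) ** beta))


fraction_le_200 = weibull_cdf(200.0, beta=1.10, eta=337.0)
print(f"Fraction of TBFs <= 200 km: {fraction_le_200:.0%}")
```

The value describes the population of intervals between failures, not the probability that a system fails before it accumulates 200 km.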

The following graphic demonstrates this point. It depicts a chronological order of the failure events of a system. In the graphic, Ti represents the cumulative time to event and ti represents the times between events. In addition, the vertical line represents the time when the system has accumulated 200 km of operation and all of the times between events that are less than or equal to 200 km are contained within the circle.

If we were to estimate the reliability at 200 km, then it would be defined as the probability that the system will operate for 200 km without a failure. It can easily be seen from the graph that this is different from the percentage of TBFs that are less than or equal to 200 km (circled events). The percentage of events included in this circle is what was calculated previously to be 43%. Therefore, we can conclude that reliability predictions are not valid with this model. However, the model can be used to predict the expected number of failures (ENOF) over time by:

\[ \text{ENOF}(t) = \frac{t}{\text{MTBF}} \]

Table 4 provides the estimated number of failures based on the above equation, at different system ages and using the calculated MTBF of 324.5 km. The estimate is compared to the "true" number of failures, which is determined from the original distributions and RBD.
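Assuming the expected number of failures is simply the system age divided by the MTBF, the estimates can be sketched as follows. The ages used here are the three system ages from the simulations and are chosen only for illustration; they are not necessarily the ages tabulated in Table 4.

```python
def expected_failures(age_km, mtbf_km):
    """ENOF under a constant-MTBF assumption: age divided by MTBF."""
    if mtbf_km <= 0:
        raise ValueError("MTBF must be positive")
    return age_km / mtbf_km


MTBF_KM = 324.5  # value obtained from the fitted TBF distribution

# Illustrative ages only (the three simulated system ages).
for age in (800.0, 1976.0, 2500.0):
    print(f"{age:7.0f} km -> ENOF = {expected_failures(age, MTBF_KM):.2f}")
```

Because the relationship is linear in age, this estimate cannot bend to follow an increasing or decreasing rate of occurrence of failures.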

Table 4: Expected failures, using MTBF

 

From this table, we can see that the agreement between the estimate and the "true" value improves at higher ages. This is expected, since averages become more suitable as the system approaches steady state (i.e. as time goes to infinity).

Approach #2: Using the Distribution of the System's Times Between Failure

The second analysis method that we will consider is based on the approach used previously, with the exception that it uses the actual distribution of the TBFs instead of the MTBF. This can be done easily using the BlockSim software. A single block is created in BlockSim with a failure distribution obtained from the TBFs. In this example, we obtained a Weibull distribution with beta = 1.10 and eta = 337 km. Since this is a repairable system, a repair distribution also is needed in BlockSim. Since we are ignoring the downtime in this example, the corrective maintenance duration is set to zero.

Under these settings, we run multiple BlockSim simulations for different system ages and we record the results. In this case, the metric of interest is the expected number of failures (ENOF), as shown in Table 5.
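The idea behind this simulation can be sketched without BlockSim as a simple renewal process: draw a TBF from the fitted Weibull, add it to the clock, count a failure if it falls within the mission, and repeat. The sketch below assumes zero repair time and the rounded parameters from the text; the function names, number of runs and seed are illustrative choices, and the result is not expected to match Table 5 exactly.

```python
import random


def simulate_failure_count(beta, eta, mission_km, rng):
    """Count failures in one simulated mission with instantaneous repair."""
    clock, count = 0.0, 0
    while True:
        # random.weibullvariate takes the scale (eta) first, then the shape.
        clock += rng.weibullvariate(eta, beta)
        if clock > mission_km:
            return count
        count += 1


def expected_failures_sim(beta, eta, mission_km, runs=5000, seed=1):
    """Average failure count over many simulated missions."""
    rng = random.Random(seed)
    total = sum(simulate_failure_count(beta, eta, mission_km, rng)
                for _ in range(runs))
    return total / runs


enof_2500 = expected_failures_sim(beta=1.10, eta=337.0, mission_km=2500.0)
print(f"Simulated ENOF at 2500 km: {enof_2500:.2f}")
```

Each draw is independent of the system's age, so this model, like the MTBF approach, treats every repair as a full renewal of a single block.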

Table 5: Expected failures, using TBF distribution in BlockSim

 

From this table, we can see that little improvement is achieved using this approach for this example. However, it is a useful approach in case we need to model large-scale systems composed of multiple repairable subsystems.

Approach #3: Using Component Distributions and RBDs

In this case, we determine the individual failure distributions of the components from the data. This is done by obtaining the times between failure for each individual component. For example, Table 6 gives the TBFs for the engine and Figure 3 shows the Weibull++ analysis to obtain the failure distribution of the engine based on this data set.

Table 6: Engine TBFs for each system

Figure 3: Using Weibull++ to obtain the TBF distribution for the engine

The distributions of all the components can be determined in a similar manner, and are given in Table 7.

Table 7: Component failure distributions

 

It should be noted that for the brakes, all of the preventive maintenance actions were considered as suspensions when building the model. In addition, the data from all the rear brakes were considered as one data set (regardless of the side), and similarly for the front brakes.

It can be seen that this method requires sufficient failure information at the component level. If component failures are scarce, then it would be difficult and possibly inaccurate to implement this method.

These distributions were entered in BlockSim and simulations were performed for different system ages. A preventive maintenance policy for the brakes was included in the model as well. The results are given in Table 8.
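A rough sketch of what such a simulation does for a series system is given below. All component parameters are hypothetical placeholders (the fitted values belong in Table 7 and are not reproduced here), and the sketch assumes that each failed component is replaced as good as new and that preventive maintenance renews the component on a fixed mileage schedule. Only the 305 km interval comes from the article.

```python
import random

# HYPOTHETICAL (beta, eta in km, PM interval in km or None) per component.
COMPONENTS = {
    "engine": (2.0, 3000.0, None),
    "transmission": (1.5, 4000.0, None),
    "front_brakes": (2.5, 600.0, 305.0),  # 305 km PM interval from the text
}


def component_failures(beta, eta, mission_km, pm_interval, rng):
    """Failures of one component; failure and PM both renew it."""
    clock, count = 0.0, 0
    next_pm = pm_interval if pm_interval else float("inf")
    while True:
        fail_at = clock + rng.weibullvariate(eta, beta)
        if fail_at < min(next_pm, mission_km):
            count += 1          # failure before the next PM or mission end
            clock = fail_at
        elif next_pm < mission_km:
            clock = next_pm     # PM replaces the component before it fails
            next_pm += pm_interval
        else:
            return count


def system_enof(components, mission_km, runs=2000, seed=1):
    """Series system: every component failure is a system failure."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        for beta, eta, pm in components.values():
            total += component_failures(beta, eta, mission_km, pm, rng)
    return total / runs


print(f"ENOF at 2500 km (hypothetical inputs): "
      f"{system_enof(COMPONENTS, 2500.0):.2f}")
```

The structure shows why this approach needs component-level data: every entry in the dictionary must be backed by its own fitted distribution.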

Table 8: Expected failures, using component distributions in BlockSim

 

Approach #4: Using the NHPP Model

With the fourth analysis approach, the individual cumulative times to event for each system are considered and the NHPP model with a power intensity function is fitted to the data. The model is given by the following equation:

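\[ \Pr[N(T) = n] = \frac{\left[\int_0^T \Lambda(t)\,dt\right]^n e^{-\int_0^T \Lambda(t)\,dt}}{n!} \tag{2} \]

where:

\[ \Lambda(T) = \lambda \beta T^{\beta - 1} \tag{3} \]

and:

  • Pr[N(T) = n] is the probability that n failures will be observed by time T.
  • Λ(T) is the failure intensity function (ROCOF or failure rate).
  • λ and β are the parameters of the power law intensity function.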

This model is an extension of the homogeneous Poisson process, in which the failure rate is assumed to be constant (i.e. exponentially distributed times between failures). In the case of the NHPP, however, the failure rate can be increasing, decreasing or constant (as in the Weibull distribution), based on the value of beta in Eqn. (3). The assumption of this model is that, after each failure, the system is restored to the same condition it was in just prior to the failure ("as-bad-as-old"). This assumption is reasonable when dealing with large systems; however, it becomes less applicable for smaller systems (fewer components), where a replacement has a significant impact on the system (renewal).
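For a power law intensity of the form lambda * beta * T^(beta - 1), the expected number of failures by time T is lambda * T^beta, and the number of failures observed by T follows a Poisson distribution with that mean. The sketch below evaluates both quantities; the lambda value used is a hypothetical placeholder, since only beta is reported in this article.

```python
import math


def power_law_mean(lam, beta, t):
    """Expected cumulative failures by time t: lam * t**beta."""
    return lam * t ** beta


def prob_n_failures(n, lam, beta, t):
    """Poisson probability of exactly n failures by time t."""
    mean = power_law_mean(lam, beta, t)
    return mean ** n * math.exp(-mean) / math.factorial(n)


# HYPOTHETICAL scale parameter, for illustration only.
LAM, BETA = 1e-5, 1.65
probs = [prob_n_failures(n, LAM, BETA, 500.0) for n in range(50)]
print(f"P(no failures by 500 km) = {probs[0]:.3f}")
```

Setting beta = 1 recovers the homogeneous Poisson process, where the expected count grows linearly with time.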

Using the RGA software, the NHPP with a power law intensity function is fitted to the cumulative failure times of each system (recorded as mileage in this example). As shown in Figure 4, the beta is 1.65, which indicates an increasing ROCOF for this system/design (i.e. wearout). In other words, as the systems age, more events are observed and the TBF intervals decrease. The expected number of failures at different ages can be computed based on the model and the results are given in Table 9.
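To illustrate how such a fit works, the sketch below applies the standard closed-form maximum likelihood estimates for the power law model to a single system observed up to a fixed end time: beta = n / sum(ln(T / x_i)) and lambda = n / T^beta. This is a simplification and not necessarily what RGA does, since the article's analysis combines three systems of different ages. The failure times are hypothetical, not the values from Table 2.

```python
import math


def fit_power_law(failure_times, end_time):
    """Single-system, time-terminated MLE for the power law NHPP."""
    n = len(failure_times)
    beta = n / sum(math.log(end_time / x) for x in failure_times)
    lam = n / end_time ** beta
    return lam, beta


def expected_failures_nhpp(lam, beta, t):
    """Expected cumulative number of failures by time t."""
    return lam * t ** beta


# HYPOTHETICAL cumulative failure times (km) for one system aged 2500 km.
times_km = [610.0, 1105.0, 1490.0, 1820.0, 2080.0, 2290.0, 2450.0]
lam_hat, beta_hat = fit_power_law(times_km, end_time=2500.0)
print(f"beta = {beta_hat:.2f}, lambda = {lam_hat:.3e}")
print(f"ENOF at 3000 km = {expected_failures_nhpp(lam_hat, beta_hat, 3000.0):.2f}")
```

By construction, the fitted model reproduces the observed number of failures at the end of the observation period; a beta greater than 1 then shows up as shrinking intervals between successive failures.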

Figure 4: Analyzing system level data with the NHPP model in RGA

Table 9: Expected failures, using the NHPP model

 

Approach #5: Using the General Renewal Process (GRP) Model

The last approach, using the General Renewal Process (GRP) model, is an improvement on the NHPP approach. As mentioned in the previous section, the NHPP model assumes that the system is "as-bad-as-old" after each failure (i.e. in the same condition as it was just prior to the failure). The GRP model relaxes this assumption by including an additional parameter, q, which is a measure of the degree of restoration (renewal) and is determined from the data. The data set used is the same as the one used with the NHPP approach, i.e. the cumulative times to event of each system.
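The role of q can be illustrated with a small simulation. The sketch below assumes a Kijima Type I virtual age model, in which the virtual age after each repair grows by q times the last interval, so that q = 0 means "as-good-as-new" and q = 1 means "as-bad-as-old". This choice of virtual age model, the Weibull-based conditional sampling and all parameter values are assumptions made for illustration; the article does not specify them.

```python
import math
import random


def next_interval(virtual_age, beta, eta, rng):
    """Sample time to next failure given the current virtual age."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    base = (virtual_age / eta) ** beta - math.log(u)
    return max(0.0, eta * base ** (1.0 / beta) - virtual_age)


def grp_failure_count(beta, eta, q, mission, rng):
    """Failures in one mission under Kijima Type I virtual age."""
    clock, virtual_age, count = 0.0, 0.0, 0
    while True:
        x = next_interval(virtual_age, beta, eta, rng)
        clock += x
        if clock > mission:
            return count
        count += 1
        virtual_age += q * x  # q = 0: full renewal; q = 1: no renewal


def grp_enof(beta, eta, q, mission, runs=4000, seed=1):
    """Average failure count over many simulated missions."""
    rng = random.Random(seed)
    return sum(grp_failure_count(beta, eta, q, mission, rng)
               for _ in range(runs)) / runs


# HYPOTHETICAL parameters chosen only to show the effect of q.
enof_renewal = grp_enof(beta=2.0, eta=500.0, q=0.0, mission=1000.0)
enof_minimal = grp_enof(beta=2.0, eta=500.0, q=1.0, mission=1000.0)
print(f"q=0: {enof_renewal:.2f} failures, q=1: {enof_minimal:.2f} failures")
```

With q = 1 the virtual age equals the cumulative operating time and the process behaves like the power law NHPP; intermediate values of q describe partial restoration.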

The GRP model is fitted to the data using Weibull++, as shown in Figure 5. The results are given in Table 10.

Figure 5: Analyzing system level data with the GRP model in Weibull++

Table 10: Expected failures, using the GRP model

Conclusions

In this article, five different analysis methods were used to model the failure behavior of a repairable system. The data were generated using simulation based on predefined failure distributions. The expected number of failures was used as a metric for comparing the results from each analysis approach to the "true" behavior of the system (which is known, since the generating failure distributions are known). The table and plot in Figure 6 compare the results of the five different methods.

Figure 6: Comparing the results of the five analysis methods

It can be seen that the RBD simulation approach offers the most realistic estimates in this example. Of course, the estimation always depends on the number of observed events, which is why the analysis method should be chosen based on the available information. If, for example, very few failure events had been observed, the RBD simulation approach based on component distributions would be very hard to adopt and would most likely be a poor estimator. The simulation using the system’s TBF distribution could offer a better estimator when dealing with few failures at the component level, but it typically becomes a good predictor only when extrapolating to longer system ages. In addition, this method cannot be used for reliability/unreliability calculations. The MTBF method is recommended only for quick, "back of the envelope" calculations, since the simulation based on the TBF distribution is similar and slightly more accurate. Finally, the GRP model is typically more accurate than the NHPP model (which is actually a special case of the GRP). Even though it is more complicated, the GRP is recommended over the NHPP. Therefore, the following recommendations can be made:

  1. When sufficient data are available, the RBD simulation based on the component distributions should be preferred.
  2. The GRP and the simulation based on the system’s TBF distribution should be the next options to consider. The choice between the two depends on the desired metrics. The advantage of the GRP model is that more metrics can be computed, as well as confidence bounds.

In addition to these recommendations, this article can be used to further understand the assumptions behind each analysis method, the data required and the type of results that can be obtained.
