What is MTBF?
MTBF (Mean Time Between Failures) is the average time that elapses between one failure of a device and the next under normal operating conditions. It is usually measured in hours and is used to evaluate equipment reliability. MTBF applies mainly to repairable equipment and is an important input when developing maintenance plans. Understanding MTBF helps you assess and improve the reliability and availability of equipment.
Reliability and availability
Reliability refers to the ability of a device to perform its expected function without failure under given conditions and within a predetermined time. For example, the main task of an airplane is to safely complete the flight mission and safely deliver passengers to their destination without serious malfunctions. Therefore, the reliability of an aircraft means that it can successfully complete tasks without any malfunctions.
Availability refers to the probability that a device can function properly when needed. Simply put, it is the ability of a device to perform its intended functions at any given time. Availability depends on both the reliability of the device and the speed of recovery after a failure. A device that is very reliable but takes a long time to repair may still have low availability.
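The relationship between reliability, repair time, and availability can be sketched with the commonly used steady-state formula Availability=MTBF/(MTBF+MTTR). MTTR (mean time to repair) is not defined in this article and is introduced here as an assumption, as are the function name and the example numbers.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a repairable device is usable: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers: the same 800-hour MTBF with a fast and a slow repair.
fast_repair = steady_state_availability(800, 8)
slow_repair = steady_state_availability(800, 200)
print(f"{fast_repair:.3f}")  # 0.990
print(f"{slow_repair:.3f}")  # 0.800
```

With the same MTBF, the slower repair lowers availability from about 99% to 80%.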
MTBF and reliability
MTBF is a fundamental indicator for measuring system reliability. Generally speaking, the higher the MTBF value, the better the reliability of the equipment. Reliability can be expressed by the following formula:
Reliability=e^(-Time/MTBF)
Here, e is the base of the natural logarithm (approximately 2.71828), "Time" is the length of the period of interest, and "MTBF" is the mean time between failures, expressed in the same units as Time. The formula assumes a constant failure rate.
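As a minimal sketch, the reliability formula can be evaluated directly. The function name and the example values (an 800-hour MTBF and a 100-hour period) are illustrative assumptions.

```python
import math


def reliability(time_hours: float, mtbf_hours: float) -> float:
    """Probability of running for time_hours without failure: e^(-Time/MTBF)."""
    if time_hours < 0:
        raise ValueError("time_hours must be non-negative")
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return math.exp(-time_hours / mtbf_hours)


# Hypothetical values: 800-hour MTBF, 100-hour period of interest.
print(round(reliability(100, 800), 3))  # 0.882
# When the period equals the MTBF, reliability is e^-1.
print(round(reliability(800, 800), 3))  # 0.368
```

The second result shows that, under this model, a device has only about a 37% chance of running for a full MTBF without failing.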
Variants of MTBF
In addition to the standard MTBF, there are several similar indicators that are used for more specific situations:
MTBSA (Mean Time Between System Aborts): the average time between failures that force the system to abort its mission or task.
MTBCF (Mean Time Between Critical Failures): the average time between critical failures (i.e. failures that prevent the system from performing its function).
MTBUR (Mean Time Between Unscheduled Removals): the average time between unplanned removals of a component from service, usually caused by sudden failures.
These variants help to more accurately describe the performance and reliability of the system, especially when distinguishing between different types of faults.
Calculation method of MTBF
MTBF is calculated by dividing the total operating time (time spent running normally) of the equipment by the number of failures that occurred during that period.
MTBF=total operating time/number of failures
Suppose you have a warehouse with 40 small components, each of which has been tested for 400 hours. The total testing time is 16000 hours (40 x 400=16000). During the test, the components failed a total of 20 times.
Calculate total running time: total running time of all components=16000 hours
Determine the number of failures: number of failures=20
Calculate MTBF: MTBF=total running time/number of failures, that is, 16000 hours/20 failures=800 hours
This means that, on average, this group of components experiences one failure for every 800 hours of component operation. MTBF does not predict the behavior of an individual component; it describes the behavior of a group of components.
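The component example can be reproduced with a few lines of code. This is only a sketch of the formula above; the function and variable names are illustrative.

```python
def mtbf(total_operating_hours: float, failures: int) -> float:
    """MTBF = total operating time / number of failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to compute MTBF")
    return total_operating_hours / failures


components = 40
hours_per_component = 400
failures = 20

total_hours = components * hours_per_component
print(total_hours)                  # 16000
print(mtbf(total_hours, failures))  # 800.0
```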
The "time" in an MTBF calculation is not always clock time; it is usually the actual running time of the system. For example, a machine that runs 8 hours a day may last three times as long in calendar terms as an identical machine that runs 24 hours a day. However, the MTBF of the two machines is still the same, because MTBF is measured in operating hours rather than calendar hours.
Let's take another example of an MTBF calculation. Suppose you have a bottling machine designed to run for 12 hours a day, and it fails once after 10 days of normal operation. In this example, MTBF is 120 hours.
MTBF=(12 hours/day x 10 days)/1 failure=120 hours
If there are several failures spread over a longer period, calculating MTBF takes more steps. For example, suppose this bottling machine runs for 12 hours a day and fails twice within 10 days (120 scheduled hours). The first failure occurred 20 hours after the start and took 2 hours to repair; the second failure occurred 60 hours after the start and took 3 hours to repair. To calculate MTBF, we first need to calculate the total normal operating time.
The total normal running time is:
20 hours (running time before the first failure)+38 hours (running time between the first repair and the second failure)+57 hours (running time after the second repair)=115 hours.
Therefore, the calculation of MTBF is as follows: MTBF=(20 hours+38 hours+57 hours)/2 failures=115 hours/2 failures=57.5 hours.
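The bookkeeping in this example can be sketched in code. The sketch assumes that each failure time is measured in hours from the start of the 120 scheduled hours and that repair time counts against that schedule; the function name and data layout are illustrative.

```python
def uptime_intervals(scheduled_hours, failure_events):
    """Split the schedule into running intervals.

    failure_events is a list of (hours_from_start, repair_hours) tuples,
    sorted by time. Repair time is excluded from the running intervals.
    """
    intervals = []
    resumed_at = 0
    for failed_at, repair_hours in failure_events:
        intervals.append(failed_at - resumed_at)
        resumed_at = failed_at + repair_hours
    intervals.append(scheduled_hours - resumed_at)
    return intervals


scheduled_hours = 12 * 10        # 12 hours/day x 10 days
events = [(20, 2), (60, 3)]      # (hours from start, repair hours)

intervals = uptime_intervals(scheduled_hours, events)
print(intervals)                     # [20, 38, 57]
print(sum(intervals) / len(events))  # 57.5
```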
Potential issues with MTBF
Understanding potential issues is crucial when conducting reliability analysis using MTBF. The calculation results of MTBF may vary due to different definitions of "failure" and "running time", and also depend on whether you are calculating the MTBF of a single device or the MTBF of the entire process.
1. MTBF assumes a constant failure rate
One of the basic assumptions of MTBF is that the failure rate of equipment remains constant throughout the entire operating time. However, this is often not the case in reality. The failure rate of equipment may vary over time and exhibit different stages:
Early failure period (infant mortality period): The equipment may experience a high failure rate due to design or manufacturing defects when it is first put into use.
Random failure period: After the early failure period, the equipment enters a stage in which the failure rate is low and relatively stable.
Wear-out failure period: After the equipment has been in use for a long time, the failure rate begins to increase due to component aging, wear, and similar causes.
Suggestion:
Use more complex reliability models, such as the Bathtub Curve, to more accurately describe changes in equipment failure rates.
Regularly maintain and inspect equipment to reduce the impact of early failure periods.
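One way to illustrate the three stages is the Weibull hazard function, whose shape parameter controls whether the failure rate falls, stays constant, or rises over time. The Weibull model is brought in here purely for illustration; the article itself only names the bathtub curve, and the parameter values below are hypothetical.

```python
def weibull_hazard(t_hours: float, shape: float, scale_hours: float) -> float:
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    if t_hours <= 0 or shape <= 0 or scale_hours <= 0:
        raise ValueError("all arguments must be positive")
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1)


# Hypothetical shape values chosen only to show the three trends.
phases = {"early failure": 0.5, "random failure": 1.0, "wear-out": 3.0}
for name, shape in phases.items():
    early = weibull_hazard(100, shape, 1000)
    late = weibull_hazard(400, shape, 1000)
    trend = "falling" if late < early else "rising" if late > early else "constant"
    print(f"{name}: failure rate is {trend}")
```

A single Weibull curve only moves in one direction, so a full bathtub curve would need the three stages combined. Only the shape=1 case matches the constant failure rate that MTBF assumes.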
2. Operating time can be defined in different ways
How operating time is defined affects the MTBF result. Different definitions may lead to significant differences:
Equipment running time: the time during which the equipment is actually working.
Downtime: the time during which the equipment is not working, such as maintenance, standby, or waiting for raw materials.
Suggestion:
Define operating time clearly and apply the definition consistently and transparently. When calculating MTBF, include only the time the equipment operates under normal working load. For example, for a vehicle, count only the time spent accelerating and driving at speed, not the time spent stopped at a red light.
3. Choice of monitored equipment (bad actors)
Which equipment you monitor affects the MTBF result. The impact of "bad actors" (the pieces of equipment that fail most often) is especially important:
Single device: Calculating MTBF for a single device reflects the reliability of that device more accurately.
The entire process: Calculating MTBF for the entire production process reflects the reliability of the whole system, but the result is strongly influenced by bad actors.
Suggestion:
If the purpose is to evaluate the reliability of a single device, measure it separately.
If the purpose is to evaluate the reliability of the entire production process, include all critical equipment and identify which devices are bad actors. This helps pinpoint the key links that limit the reliability of the entire system.
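The difference between per-device MTBF and whole-process MTBF can be sketched as follows. The device names, the 500 operating hours, and the failure counts are hypothetical, and the sketch assumes that every device ran for the same number of hours.

```python
from collections import Counter


def mtbf_per_device(operating_hours: float, failure_log: list[str]) -> dict[str, float]:
    """MTBF of each device named in the log, assuming equal operating hours."""
    counts = Counter(failure_log)
    return {device: operating_hours / n for device, n in counts.items()}


line_hours = 500
# One log entry per failure, naming the device that failed.
failure_log = ["capper"] * 10 + ["filler"] * 2 + ["labeler"]

per_device = mtbf_per_device(line_hours, failure_log)
line_mtbf = line_hours / len(failure_log)
worst = min(per_device, key=per_device.get)

print(per_device)           # {'capper': 50.0, 'filler': 250.0, 'labeler': 500.0}
print(round(line_mtbf, 1))  # 38.5
print(worst)                # capper
```

In this made-up data, one device accounts for most of the failures and pulls the process MTBF well below the MTBF of the other two devices.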
How to improve MTBF?
Improving MTBF (Mean Time Between Failures) is an important means to enhance equipment reliability and reduce maintenance costs. Here are some specific methods and strategies that can help you achieve this goal:
Improve the preventive maintenance process
Preventive maintenance is the key to improving MTBF. Regular inspections and maintenance allow problems to be identified and addressed before they lead to failures.
Specific steps:
Train maintenance personnel: Ensure that maintenance personnel receive sufficient training on how to perform maintenance tasks correctly. Provide detailed operation manuals and checklists.
Monitor maintenance effectiveness: Regularly evaluate whether preventive maintenance is achieving its goals. If certain maintenance measures are not delivering the expected results, adjust the plan promptly.
Conduct root cause analysis
Identifying the root cause of a malfunction can help you take effective preventive measures to prevent the same problem from happening again.
Specific steps:
Record fault information: Each time the equipment fails, record details such as the time of the fault, the fault type, and the symptoms observed.
Perform root cause analysis: Use tools such as fishbone diagrams and 5 Whys analysis to identify the root cause of the failure.
Take corrective measures: Based on the results of the root cause analysis, take corresponding corrective measures, such as replacing low-quality components or optimizing operating procedures.
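As a rough sketch of the recording step, each fault can be stored as a structured record so that recurring root causes are easy to count. The field names and sample records below are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FaultRecord:
    occurred_at: datetime
    fault_type: str
    symptom: str
    root_cause: str = "unknown"  # filled in once the analysis is complete


def top_causes(records, n=3):
    """Return the n most frequent root causes as (cause, count) pairs."""
    return Counter(r.root_cause for r in records).most_common(n)


# Made-up records for illustration only.
records = [
    FaultRecord(datetime(2024, 1, 5, 9, 30), "leak", "oil under pump", "worn seal"),
    FaultRecord(datetime(2024, 2, 11, 14, 0), "jam", "conveyor stopped", "misaligned guide"),
    FaultRecord(datetime(2024, 3, 2, 8, 15), "leak", "low pressure", "worn seal"),
]
print(top_causes(records, 1))  # [('worn seal', 2)]
```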
Establish condition-based maintenance
Condition-based maintenance is a method of predicting and preventing failures by monitoring the condition of equipment in real time.
Specific steps:
Install sensors and monitoring systems: Install sensors on critical equipment to monitor the real-time operation status of the equipment.
Set alarm threshold: Based on historical data and expert experience, set a reasonable alarm threshold. When the device parameters exceed the threshold, the system automatically issues an alarm.
Timely response: When receiving an alarm, arrange maintenance personnel to inspect and handle it in a timely manner to avoid malfunctions.
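The alarm-threshold step can be sketched as a simple comparison between the latest readings and the configured limits. The parameter names and threshold values are hypothetical; real limits would come from historical data and expert judgment, as described above.

```python
def find_alarms(readings: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of parameters whose reading exceeds its alarm threshold."""
    return [
        name
        for name, value in readings.items()
        if name in thresholds and value > thresholds[name]
    ]


# Hypothetical thresholds and one set of sensor readings.
thresholds = {"bearing_temp_c": 85.0, "vibration_mm_s": 7.1}
readings = {"bearing_temp_c": 91.5, "vibration_mm_s": 4.2}

print(find_alarms(readings, thresholds))  # ['bearing_temp_c']
```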
Implement Total Productive Maintenance (TPM)
TPM is a comprehensive equipment management method aimed at improving the overall efficiency and service life of equipment.
Specific steps:
Full participation: Encourage all employees to participate in equipment management and maintenance work, forming a culture of full participation in maintenance.
Continuous improvement: Use ongoing improvement activities to steadily raise the performance and reliability of equipment.
Regular evaluation: Regularly evaluate the implementation effectiveness of TPM and adjust and optimize maintenance plans based on the evaluation results.
Part of the content is quoted from "The Engineer and His Friends"