The root cause just isn’t that important

By Tor Idhammar • on August 27, 2009 • 23 Comments

Root cause analysis and root cause failure analysis are commonly used terms. I have always felt that these terms are somewhat misguided. I say so for some really simple reasons.

First, there is not really such a thing as a “root cause” to a problem. If you try to find a definition for “root cause”, you will find a mix of homegrown attempts, but all of them are general or unclear in nature. Here is an example: “A root cause is an initiating cause of a causal chain which leads to an outcome or effect of interest”. Aside from being wrong, it is quite a bunch of incomprehensible verbiage. The problem with definitions such as these is that, in the real world, it is never possible to prove that a single event solely initiates a whole chain of other events. That is because there are always other events before the so-called “root cause event”. This may seem like semantics, but for problem-solvers, it is important to keep in mind that there never is a silver-bullet answer.

Second, is the root cause really that important? In my opinion, the process we call root cause failure analysis should be used to implement solutions. That is the whole idea, isn’t it - to find and implement SOLUTIONS? If we think logically in reverse, we can ask: do we always have to know the root causes to find great solutions? Absolutely not!

An example: A mill has problems with failing bearings in most of its rotating equipment. After a quick look, we find out that equipment isn’t aligned, there are no lubrication routes set up, no clean oil storage, no sealed storage for spare bearings. Do we need to do a root cause analysis on each bearing and find out the exact root causes of each one? No. Let’s not spend time and money on root cause analysis of hundreds of bearings. Let’s work on solutions that we know will improve the problems. Sure, there may be other contributing factors, but the above will be the most pressing.

Many people who get excited about root cause analysis become too detail-oriented and lose sight of the big picture and the economics of things.

So, in summary, there is no such thing as “a root cause”, and plant/mill people need to focus more on implementing solutions based on a practical root cause analysis.

At IDCON, we call our approach root cause problem elimination to shift the focus more toward implementation of solutions based on root cause analysis studies.

What are your thoughts, comments and questions related to this article on the subject of root cause in general? Reply to this blog post and let me know!



Comments

(1)

By Ronald L. Hughes on August 28th, 2009 at 8:45 am

Root Cause Analysis is definitely a viable and very valuable tool for uncovering unresolved issues that lead to eventual failures. In the example given of recurring bearing failures, alignment issues, storage issues, and lubrication issues have been identified as the cause of the undesirable events – how interesting and how totally wrong. I agree that an analysis of each bearing in this case is worthless. However, I think I would want to do a Root Cause Analysis on why the poor maintenance and storage practices were allowed to exist in the first place. After all, poor alignment and lubrication are going to have a negative effect on a lot more than just bearings.

Taking it a bit further, if the negative practices are eliminated and the failures are still occurring, something else is going on. For example, there could be forces causing asymmetrical issues within the bearing itself. Where are these forces coming from? Has anything in the system or operations of the mill changed? Was a Management of Change done, and was it done properly?

Root Cause Analysis if done properly is a great way to uncover all the issues that are causing failure. This includes much more than just the physical level of the failure mechanism. If done correctly, RCA would also identify inappropriate human interventions as well as the reasons why these interventions were taken. This solves much more than just the bearing failures and pays for the analysis effort over-and-over again.

Ronald L. Hughes
Reliability Center, Inc.
Senior Reliability Consultant

(2)

By Bob Latino on August 28th, 2009 at 12:32 pm

Hi Tor. As a fellow provider I will agree that the use of the term “root cause” is so diluted in today’s environment it is meaningless. If we take the word to its literal sense then we would take all of our analyses back to Adam and Eve which we both agree is very non-value added. Given that, the question becomes where should our analyses stop?

To excerpt a quote from your post “An example: A mill has problems with failing bearings in most of its rotating equipment. After a quick look, we find out that equipment isn’t aligned, there are no lubrication routes set up, no clean oil storage, no sealed storage for spare bearings…Let’s work on solutions that we know will improve the problems.” In the short-term again, I would agree.

However, if we do not drill down deep enough to uncover the system deficiencies that contributed to the failure, then just fixing the physical side may not make the problem go away. If we go ahead and align the equipment properly, then if the person who misaligned it in the first place is not trained in how to do it right, they will do it wrong again. If the procedure in place was obsolete, someone else will follow it next time. If the alignment tools they are using are worn or not calibrated, it will be done wrong again. Where were the oversight systems (QA/QC [certifications], vibration monitoring results, etc.) that should have caught the failure before it occurred and we suffered component loss? Was the person aligning trained in how to do so properly or was it just on-the-job training? If it was a training issue, that is an underlying system that affects more than just alignment training.

On the other issues the same applies. Why were there no lubrication routes? Yes we can recognize they are not there and then put them in place, but the underlying system is still flawed. If the underlying system allowed that condition to exist without rectifying it, it will happen again (and likely is happening currently) somewhere else in the facility. This is like finding out that P&ID’s are not current when investigating a single failure. If we dug that deep to find the P&ID issue, you should check other P&IDs also as they may possess the same flaw. The system of not updating P&IDs is the problem, not just the set of drawings in the case of the one failure.

Unless we at least drill deep enough to uncover the system’s contributions to a failure, then the failure will likely recur somewhere else in the facility. I could care less what we label wherever we stop the analysis. Our goal as analysts should be to prevent recurrence by implementing effective and timely interventions as well as distributing lessons learned to others who could be faced with similar conditions and prevent the failure next time.

Robert J. (Bob) Latino
CEO
Reliability Center, Inc.
http://www.Reliability.com
http://www.Proactforhealthcare.com
804.458.0645 Tel
804.452.2119 Fax

(3)

By A K Chakravarti on August 29th, 2009 at 5:02 am

RCA is a tool for Proactive Maintenance, helping it to eliminate or discourage the continuance of the same defects. It is apparent that where defects are many and simultaneous, RCA may be futile. I wouldn’t recommend RCA for my shop if it is not already in the Proactive mode.

(4)

By Mech Wan on August 29th, 2009 at 8:33 pm

As a maintenance man for over 15 years, I find root cause analysis definitely important for identifying the source of a problem. However, as a man of action, I care less about the definition of the root cause itself; the action to solve the breakdown issue is more important. In my case, root cause analysis is normally carried out on abnormal failures of items that challenge our years of experience maintaining them and that, with good maintenance practice, usually last longer. Some bearings (the low-speed ones) last so long that you forget the bearing exists.

(5)

By BN Shivakumar on September 2nd, 2009 at 4:54 am

Dear Mr. Tor,
It is very interesting to read your opinion. It is fine to say RCA is not required when the failure reasons are very obvious and visible, as in the examples you have mentioned. Does it end there? The answer is no.
When the plant is new and the equipment is covered under warranty, failures are very few. Just after the warranty period, one would do an RCA for a failure. With an increased number of failures and the kind of observations you mentioned, deviating from RCA is just not right. I agree that one should not take a blind approach of RCA for everything, because we would end up having the same cause appearing again and again and start feeling that the exercise is a wasteful effort. Instead, one should start looking into the obvious reasons mentioned: why the equipment isn’t aligned, why there are no lubrication routes set up, no clean oil storage, no sealed storage for spare bearings, etc. Here the failure is not that single item, the “bearing”; it is the “managing systems” which have failed. A root cause analysis is required for the failure of the managing systems. Further, it is well defined by Mr. Bob and I can only endorse it. Best of luck.
Vice President - Plant Engg & Projects.

(6)

By Robert Schindler on September 3rd, 2009 at 4:36 am

Tor, I see your point. RCA can be time consuming and resource intensive so it needs to be used when it makes economic sense and probably after you have checked that the basics are being done first. Reliability professionals recognize many problems at sight as they walk about a client’s mill and these low-hanging fruit items should be addressed immediately. It takes time to gather the data and perform an in-depth RCA so we want to be attacking the obvious problems first. In this economy, you play the triage game so your resources stretch the farthest for the best return on the effort and investment. Once the simple is addressed, you free up the resources to focus on the complex.

(7)

By Bob Latino on September 3rd, 2009 at 4:47 am

Robert, I understand your intent and agree to a point. To me, what you describe is troubleshooting and not Root Cause Analysis (RCA). Only failures that have met the criteria for conducting an in-depth RCA should be exposed to its rigors based on the extent of their consequences.

Troubleshooting tends to be done by an individual with primary emphasis on getting back to stable operations ASAP. Troubleshooting does not tend to focus on identifying contributing factors, implementing interventions and preventing the recurrence. The 5-Whys is a common troubleshooting tool for this purpose.

Good troubleshooters tend to get that good because they get so much practice (fixing the same thing over and over again). Where it used to take them 4 hours to get the operation back on line it now takes them 1 hour and they are a hero.

A good RCA analyst will ensure the troubleshooters do not get so much practice in real time!

Robert J. (Bob) Latino


(8)

By John Yolton on September 3rd, 2009 at 10:13 am

Tor,

Excellent subject. Good analysis. Your comments remind me of an old saying… “analysis paralysis”. We love to analyze.

Unfortunately we humans get so caught up in the process (RCA), because it’s fun with little, if any, accountability except to keep score, that we forget about the product of the process (Elimination of cause), because it’s work with serious accountability.

fwiw,
John

(9)

By sense_maker on September 3rd, 2009 at 10:50 am

When I read a statement such as “…do we always have to know the root causes to find great solutions? Absolutely not!” it makes me sick.

The actual reason (using the author’s example) that we are sometimes able to suggest a solution is the fact that somebody before already DID an RCA and found out that one of the factors causing bearing failure is misalignment. Let’s proceed using the author’s example. If this condition is confirmed, then the user will just go ahead and align the shafts. But he/she may find out later that bearings are still failing. Then he/she needs to drill deeper and come up with a cause that no one has faced before - a real RCA process. So, an RCA process still may be needed at some point. Identification of common problems is part of RCA, and the user has to utilize the previous knowledge.

So, if the author does agree with the above, his statement is just an unfortunate combination of words.

Dave

(10)

By Joe Petersen on September 3rd, 2009 at 8:07 pm

I agree with the comments that looking at an individual bearing failure on an isolated machine is probably not a good use of scarce resources, unless that machine has had repeated bearing failures or the bearing had just recently been replaced.

Many sites do not accumulate the information to do root cause analyses when they are appropriate.

Case 1: Operations reports, “pump broke”. When the repair is made the technician writes in closing the work order, “fixed pump” rather than writing “replaced bearing” or better, using the failure codes in the EAM/CMMS and recording a bearing failure.

If repairs are coded as bearing failures, then over time, bearing failures can be explored as a group for root cause analysis which might trigger doing alignment.

There’s no doubt doing alignment at installation is the best practice, but it’s not done everywhere or followed consistently even where it is the best practice procedure.

Case 2: Another issue in making root cause investigations more efficient is not having standardized nomenclature, hierarchy, and classification of assets. It shouldn’t take much effort with an EAM/CMMS to find out how many failures have occurred on, for example, centrifugal pumps with some specification like flow rate, duty, etc.
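
As an illustration of the two cases above, here is a minimal sketch of what coded work orders make possible. The record fields (`asset_id`, `asset_class`, `failure_code`), the sample data, and both helper functions are hypothetical; they are not taken from any particular EAM/CMMS product and only assume that closed work orders carry an asset class and a failure code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkOrder:
    """A closed work order with a coded failure (hypothetical schema)."""
    asset_id: str
    asset_class: str
    failure_code: str


def failures_by_class_and_code(orders):
    """Count failures per (asset class, failure code) pair."""
    return Counter((o.asset_class, o.failure_code) for o in orders)


def repeat_offenders(orders, failure_code, min_count=2):
    """Return assets with at least min_count failures of the given code."""
    counts = Counter(o.asset_id for o in orders if o.failure_code == failure_code)
    return sorted(asset for asset, n in counts.items() if n >= min_count)


# Made-up sample data, for illustration only.
work_orders = [
    WorkOrder("P-101", "centrifugal pump", "BEARING"),
    WorkOrder("P-101", "centrifugal pump", "BEARING"),
    WorkOrder("P-102", "centrifugal pump", "SEAL"),
    WorkOrder("P-103", "centrifugal pump", "BEARING"),
    WorkOrder("F-201", "fan", "BEARING"),
]

summary = failures_by_class_and_code(work_orders)
bearing_repeats = repeat_offenders(work_orders, "BEARING")
```

With free-text closures such as “fixed pump”, neither count can be produced; with coded closures, the grouped bearing failures and the repeat offenders fall out of a simple query and can be used to decide where an analysis is worthwhile.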

(11)

By Sanjay Kulkarni on September 5th, 2009 at 11:41 pm

The language and tone of the article, I feel, is more political than technical. Politicians often use this tactic - write against something to garner support for it.

An honest debate could have been initiated with more straightforward language.

- Sanjay Kulkarni

(12)

By Jinsong Zhao on September 18th, 2009 at 7:10 am

I agree with Sanjay.

Root cause analysis is based on a system with a boundary. If your system is defined, then RCA cannot mislead, and it cannot go all the way back to Adam and Eve. In fact, it is a very important and useful methodology which has been proven through practice.

(13)

By Kassim El-Maawiy on September 24th, 2009 at 2:46 am

Dear All,

As a newcomer to the field of reliability engineering, I have read all the opinions above with interest. It is very interesting to see that there are a lot of different opinions from the same group of people. However, is RCA the right tool for all types of assets? I am working within a railway environment, and our system and sub-system assets are everywhere within the network. How can RCA work when you have assets from different batches which are installed at different periods (during breakdowns)? Sometimes you have the same type of asset failing, but with a different type of functional failure under a different condition.

(14)

By Bob Latino on September 24th, 2009 at 7:24 am

As a newcomer I can understand your confusion when you are inundated with so many impressions of what RCA is all about. Let me try and give you some hope that success is around the corner. Forget all the providers and labels that people put on RCA (even me as a provider!)

RCA is the same process a detective employs to understand how a crime occurred. No matter what you hear about RCA, the elements of an investigation are all the same.

1. Disciplined Collection of Evidence
2. Removing Bias from the Investigation Via the Team Members
3. Analyzing the Evidence with the Unbiased Team to Reconstruct the Chain of Events Via a Cause-And-Effect Depiction
4. Communicating Your Findings and Recommendations (Building a Solid Case for Court if you are a Detective and Prosecutor)
5. Ensure that the Event Does Not Recur by Monitoring the Effectiveness of Your Implemented Recommendations on the Bottom Line

You should not be worried about variability of batch products installed in different locations with regards to the application of RCA. This is because RCA is applied to each singular event and not across a spectrum of events. Common Cause Analysis (CCA) is a more appropriate tool for looking for trends across a population of similar events.

You do not see detectives grouping all grand larcenies into one investigation. Each larceny is unique in its occurrence and must be looked at individually. There may be a “serial larcenist” where some are tied together, but that cannot be determined until the individual investigations are completed and show similarities.

I hope this helps

Robert J. (Bob) Latino

(15)

By Thomas Heiserman on October 1st, 2009 at 12:46 am

Tor;

I heartily agree with you….. as a 30 year tradesman and now a MSP consultant, I have seen many RCA’s when everyone is missing the fundamentals. We have to do both, but one without the other is ludicrous. The gentleman who identified “systems failure” is right on target.

My experience with thirteen different companies over my career leads me to believe that a lot of facilities lack even the rudiments of proper systems in the maintenance department: lack of training and control. If you miss the fundamentals, it doesn’t do you much good - unless the RCA points that out and you repair the underlying structure.

Everyone wants to know what went wrong, but the corpse is still a corpse. You have to be on top of the maintenance systems unless you want to continue killing your machinery.

(16)

By Efren Recibe on October 1st, 2009 at 9:44 am

To prevent failure from recurring, root cause analysis is very important. Solutions can be easy if we know the root cause of the problem. Initially, it’s time-consuming, but it can ease your job later and save you a lot.

Efren
Lubrication Supervisor
Unisteel KUWAIT

(17)

By Tor Idhammar on October 1st, 2009 at 2:53 pm

Is this what one calls “opening a can of worms”? Thank you all, great comments, good stuff. Sorry to the people I agitated; it was partly on purpose, though! Sanjay, you have a sharp mind. Yes, I do write to stir up a discussion, not to be political. I do have a very hands-on point to my article that many of you missed. I exaggerated the example, as most of you are pointing out; that was to start the discussion. One of my points is the same as the one my friend John Yolton brought up (see above).

POINT 1. It is human nature to be comfortable with, and interested in, talking about and analyzing problems (that is what we are doing here, right?), but very few people like to actually do something about problems: take action and fix what we have already identified as problems. My son can sit on the couch and analyze why he won’t clean his room and apologize all day long, but when the rubber meets the road, and he has figured it all out, he is still not cleaning his room. It is the know-do gap.

POINT 2. I think RCA is a great tool; IDCON teaches it to reliability engineers! It should be used! It should be done with DETAIL and care when used! HOWEVER, it is not always the number 1 tool for all problems. If I don’t have clean lubrication storage, grease routes, and basic lubrication practices in my plant, I guarantee that 99.9% of the time, lubrication improvement would give better payback than implementing RCA!!!! I know many of you disagree because you sell this stuff (so do I), but we have to be honest as consultants and see the bigger picture. Another example? If you have the wrong people in the wrong jobs, it is ALWAYS more important to fix this problem before implementing RCA. RCA is not always the priority. Just like RCM and other time-consuming tools, it should be used, and it should be used with detail, but it is not always priority 1, nor always the best and most economical way to find answers to problems.

Keep the comments coming ;)

Tor

PS
Sense maker, you write that the statement “Do we always have to know the root causes to find great solutions? Absolutely not!” makes you sick. It may be a poor selection of words, but what I mean is that we can take ACTION before we know the root cause. Example: my knee hurts (problem), I stop playing hockey (action). I don’t know the root cause, but is it a smart thing to do? Motor is running at 250F (problem), start planning the change-out (action). Smart thing to do? DS

(18)

By Tor Idhammar on October 1st, 2009 at 4:02 pm

An additional note to my last comment (last few sentences above). After you stop skating or after you start planning the change out, root cause may be the right action.

But to believe that we will do RCA on all problems in a plant is naive, unrealistic, and misguided. We need to use RCA on the right problems, and when we decide to do it, we need to do it with detail and care.

(19)

By Kassim El-Maawiy on October 2nd, 2009 at 2:20 am

Dear Latino,

Thank you for shedding some light on what I didn’t know before. Regarding RCA as a reliable way of finding unforeseen reasons for failures, what is the best way of approaching the system to win the confidence of the few when there is a team culture of “we have been there before and it didn’t work”?

(20)

By seshadri on October 4th, 2009 at 11:05 pm

One fundamental nature of failure in human engineered systems is that it is LOCAL. Localization of failure is needed to understand how the failure is observed / experienced at the system level. And to understand the progressive (gradual or instantaneous) development of conditions leading to failed system behavior, we need an open, structured, trackable methodology. Call it RCA, FMEA or whatever. We need to come up with APPLICABLE and EFFECTIVE (see Nowlan and Heap on RCM - United Airlines - 1978) failure preventing measures that make technical and business sense!! That is what RELIABILITY is all about.

(21)

By Bob Latino on October 5th, 2009 at 4:39 am

Dear Seshadri

Thanks for your comment relating RCM into effective RCA. I do not disagree with the premise but I would not personally characterize RCA and RCM as being synonymous terms either. The failure modes derived from RCM usually are at a higher level and may go down as much as 2 or 3 subordinate levels (depending on the variation of RCM used) from the functional failure.

True RCA will drill deeper than 2 or 3 questioning levels and will penetrate decision systems. Once we can determine why someone made a poor decision at the time they did, we are not too far from uncovering the systems issues which provided them information that made them believe their decision was correct. It is these systems (i.e. - policies, procedures, training systems, procurement systems, etc.) that feed information to our workforce upon which they make decisions that affect production.

RCM is a very good tool for helping to quantify Criticality which in turn allows us to identify qualified candidates for RCA (prioritization). Tor’s comment earlier about our being naive if we think we can do true RCA (not superficial or shallow cause analysis) on all failures is correct, so we need such quantification tools to help direct our scarce resources for RCA.

RCM is more about identifying Criticality so we can customize a preventive and predictive strategy to improve our operational Reliability. This ensures that we detect a signal of an impending failure earlier and mitigate if not eliminate the potential consequences of the failure.

RCA complements this effort by finding out why the “signal” (i.e. - vibration, temperature, pressure, flow, noise, etc.) appears out of acceptable limits in the first place. In RCA we are not trying to put in an earlier detection system to catch the signal, but to learn why the alarming signal is occurring at all.

These are the reasons that RCA and RCM should work in concert with each other.

Regards
Robert J. (Bob) Latino

(22)

By Bob Latino on October 7th, 2009 at 6:06 am

Dear Kassim

Thanks for your post (#19). My only response is:

“We never seem to have the budget to do it right, but we always have the budget to do it again!”.

Regards
Robert J. (Bob) Latino

(23)

By V.Narayan on October 21st, 2009 at 4:10 am

Tor,
I believe that being provocative does not mean stating blatantly incorrect positions! Sanjay was spot on when he exposed your ‘political’ objectives. Your recent note confirms you knew you were stating an incorrect position from the start.
1. Bob Latino correctly identifies your initial examples as those of trouble shooting; the results of which are to bring the item back to where it was before the failure, NOT to prevent recurrence.
2. When doing a maintenance audit, the auditor keeps best practices in mind and compares the actual position with them. The gaps then need to be addressed. Most of your initial examples fall in this category, but I accept your view that reliability will improve as a result. That is how audits work!
3. The term “managing” is about the management of risks. It follows that we focus on high risk systems, events or failures. High risks can occur due to high consequences and/or high frequencies. Thus one big-bang may equate to multiple (very similar) small events. So chronic events merit respect too, when considering the use of risk reduction tools.
4. To echo what BobL said earlier, shallow analyses of the kind you mentioned are okay for low-risk events. If we want to ‘manage’ the business however, we must work on the 20% that contribute to 80% of our losses with appropriate tools.
5. RCA is one such tool. By analyzing a single event (or a set of chronic events) in depth, we uncover systemic defects. Changing a damaged bearing allows us to restart the equipment (trouble-shooting). Understanding its cause(s) allows us to tackle the wider problem of e.g., oil condition or misalignment, so the other (not analyzed) equipment also benefits. That makes the first step-change in reliability. Understanding why there was water in the oil (e.g., improper storage, handling), or why equipment is generally misaligned helps us resolve the systemic problems of storage, handling, alignment tools or competence attrition. This enables a company-wide improvement in reliability. To trivialize the RCA process (e.g., your rhetorical question of whether there is a single root cause) is an attempt to divert attention from the thrust of the argument.
6. RCM is another such tool, to help us find the best practice in respect of what maintenance to do proactively (by PM, PdM or testing), or reactively (run-to-failure), depending on the risk of failure to the business and the operating context of the equipment.
Vee
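
Points 3 and 4 above can be sketched as a simple ranking. The following is an illustration only: it assumes risk can be approximated as frequency times cost per event, the event names and numbers are made up, and `pareto_candidates` is a hypothetical helper, not part of any RCA or RCM standard.

```python
def pareto_candidates(events, share=0.80):
    """Pick the event types that together account for `share` of annual losses.

    `events` maps an event name to (events per year, cost per event).
    Annual loss is approximated as frequency * cost, so one big-bang event
    and many small chronic events can rank equally.
    """
    losses = {name: freq * cost for name, (freq, cost) in events.items()}
    total = sum(losses.values())
    ranked = sorted(losses.items(), key=lambda item: item[1], reverse=True)

    selected, running = [], 0.0
    for name, loss in ranked:
        if running >= share * total:
            break
        selected.append(name)
        running += loss
    return selected


# Made-up numbers, for illustration only.
events = {
    "compressor trip": (1, 400_000),        # rare, high consequence
    "pump bearing failure": (40, 10_000),   # chronic, low consequence each
    "conveyor belt tear": (4, 25_000),
    "gauge glass leak": (10, 2_000),
}

candidates = pareto_candidates(events)
```

In this made-up data set the single compressor trip and the forty bearing failures carry the same annual loss, which is the sense in which chronic events merit the same respect as one large event when choosing where to apply in-depth analysis.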
