IMHO, the hardest problems to solve are those that are intermittent in nature. We need to reproduce a problem in order to test our fixes. If the problem is not happening at the moment, we can't easily determine whether a fix will work.

We also need to validate ALL assumptions.

To illustrate the point, I will use an example from Networking, but the same principle applies in other technologies as well.

I may assume the cable modem is not the root cause of an Internet problem because, to my knowledge, nothing has changed with that modem, and in my experience there are a hundred other potential root causes. So it is not the first device I swap out when trying to find the root cause. However, we recently experienced a problem where the cable modem itself was causing intermittent slowness. This is where assumptions can delay problem resolution. Network troubleshooting commonly starts bottom-up: you start with the cable and work your way up the stack. You assume the cable modem is good, so you skip it and proceed to the next component. However, when everything else checks out, we need to go back to our assumptions and leave no stone unturned =)
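To make the idea concrete, here is a rough Python sketch of a bottom-up walk that remembers which components were skipped on an assumption and goes back to them once everything else checks out. The names (`Check`, `troubleshoot`, `assumed_ok`) and the component list are made up for illustration; in real life each `run` function would be whatever test you actually perform on that component.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    name: str
    run: Callable[[], bool]   # returns True when the component looks healthy
    assumed_ok: bool = False  # True when we skip it on the first pass (an assumption)


def troubleshoot(checks: List[Check]) -> str:
    """Walk the checks bottom-up and return the first failing component.

    Components marked assumed_ok are skipped on the first pass, then
    actually tested once everything else has passed.
    Returns "" when nothing fails.
    """
    skipped = []
    for check in checks:
        if check.assumed_ok:
            skipped.append(check)  # remember the assumption instead of forgetting it
            continue
        if not check.run():
            return check.name

    # Everything we verified is fine, so now validate the assumptions.
    for check in skipped:
        if not check.run():
            return check.name
    return ""
```

With a list such as cable, cable modem (assumed good), and router, a faulty modem is only found on the second pass, which is exactly the delay described above. Keeping the skipped items in a list at least guarantees they get revisited.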

Another example from Networking happened recently when web sites started publishing IPv6 DNS records. This caused those particular web sites to start rendering slowly, but only in certain web browsers on certain operating systems. We assume that web sites don't change their DNS records often, so we don't immediately start troubleshooting there. We check everything else first. This is why those problems are the trickiest to solve and take the longest time to fix. We spend time eliminating all other factors from the equation before we turn to something that has usually been very stable and predictable.
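One quick way to test that particular assumption is to look at which address families a host name currently resolves to. Below is a minimal Python sketch using the standard library's `socket.getaddrinfo`. The helper names `split_by_family` and `resolve` are my own, and what comes back depends on your resolver and operating system, so treat it as a starting point rather than a definitive diagnostic.

```python
import socket


def split_by_family(addrinfo):
    """Split getaddrinfo() results into (ipv4_addresses, ipv6_addresses)."""
    v4, v6 = [], []
    for family, _type, _proto, _canonname, sockaddr in addrinfo:
        address = sockaddr[0]  # first element is the IP address for both families
        if family == socket.AF_INET and address not in v4:
            v4.append(address)
        elif family == socket.AF_INET6 and address not in v6:
            v6.append(address)
    return v4, v6


def resolve(hostname):
    """Resolve a host name and report its IPv4 and IPv6 addresses separately."""
    results = socket.getaddrinfo(hostname, None, type=socket.SOCK_STREAM)
    return split_by_family(results)
```

Calling `resolve("example.com")` (a placeholder host name) returns two lists. If the second list is suddenly non-empty for a site that used to be IPv4-only, the DNS records have changed and are worth a closer look.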

What do you find are the hardest problems to solve? Please post a comment.