It’s funny at times how in sync the various folks in the community are. I was working with the OnCommand Plug-in from NetApp one day, and the next I ran across this article by Marnix: http://thoughtsonopsmgr.blogspot.com/2013/06/high-level-overview-netapp-san.html. So we were both working on the same management pack at the same time on different sides of the world. His article is solid, so I won’t re-cover anything he listed, but I will expand on something he stated:

Required: Tuning!
This MP is really good and really appreciated by many of my customers. However, many of the Monitors in this MP are set to zero, so those Monitors require some good tuning in order to get the best out of this MP.
Other Monitors use wrong thresholds. This isn’t a bug; it is done on purpose, forcing you to tune them according to your environment. When done, this MP will really deliver added value.

How do you tune this management pack? I hoped you would ask that question. Here’s how we tuned it in my client’s environment.


The high level process that we used for tuning this is as follows:

1) Install the management pack [Marnix covered step 1 in his blog in depth at http://thoughtsonopsmgr.blogspot.com/2013/06/high-level-overview-netapp-san.html]

2) Create an overrides management pack (NetApp_Overrides as an example) [this is done in the administration pane, right-click on management packs and create the new management pack]

3) Wait 24 hours [if large issues occur with the NetApp during this timeframe, 24 hours may not be sufficient; the goal of this window is to establish what is “normal” for the items monitored by this management pack]

4) Tune the alerts raised during the 24-hour timeframe, assessing each against a baseline of performance information [this is the focus of this blog post]

5) Go back to step 4 until each alert has been addressed (the loop sketched below)
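
To make the shape of steps 4 and 5 explicit, here is a minimal sketch in Python. It is purely illustrative: the alert names and the review_alert() helper are hypothetical stand-ins for work done by hand in the Operations Manager console, not calls into any real SCOM API.

```python
# Illustrative only: the iterative loop from steps 4 and 5.
# The alert names and review_alert() are hypothetical stand-ins for
# console work, not calls into any Operations Manager API.

def review_alert(alert_name):
    """Placeholder for the per-alert tuning described below: check the
    performance view or Health Explorer, pick thresholds, and save an
    override into the NetApp_Overrides management pack."""
    print(f"Tuned override for: {alert_name}")

# Hypothetical alerts raised during the 24-hour baseline window.
alerts_from_baseline_window = [
    "Volume latency threshold exceeded",
    "Aggregate nearly full",
    "LUN latency threshold exceeded",
]

# Step 5: keep working through step 4 until every alert has been addressed.
for alert in alerts_from_baseline_window:
    review_alert(alert)
```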


Tuning process details:

The goal of this process is really to establish our own baseline for what is normal in the environment so we can tune the alerts being raised by the management pack. As Marnix mentioned, there are many thresholds which are set to 0, and others which are at 5-10 ms. As a result, this management pack will generate a lot of alerts with the default configuration. The aim of the tuning process is to suppress alerts that reflect how the NetApp normally performs, while still alerting when something unusual is occurring.


  1. To determine what a “normal” value is for an item monitored by this management pack, create an “All Alerts” view in the “My Workspace” pane which shows all alerts that have occurred.
  2. From our testing it appears that most data is collected once an hour, and the product knowledge recommends not changing the rules which collect this data via overrides in Operations Manager.
  3. Highlight the alert, right-click on it, and open the performance view. After setting the time range (in the Tasks pane) to show data for the last day, the performance view shows what the data points have looked like over the last 24 hours. Using the example below, we can see that the performance counter stayed below 30 during the timeframe this counter was being monitored. The lower-level threshold should be set just above the values occurring for the performance counter during a normal business day, and the upper-level threshold should be higher than the lower-level threshold. In this case we doubled the value for the upper threshold, to 60.

[Screenshot: performance view of the alerting counter over the last 24 hours]

As a result we can configure an override for the object which generated the alert. For this example we set the LowerThreshold to 30 and the UpperThreshold to 60, as shown in the graphic below (the screenshot was taken after setting the UpperThreshold but before the LowerThreshold was configured). A short sketch of the threshold arithmetic follows the screenshot.

[Screenshot: override properties with the UpperThreshold set to 60]
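
To make the arithmetic behind these two values concrete, here is a minimal sketch in Python. It is purely illustrative: the hourly sample values, the small headroom factor, and the simple doubling rule for the upper threshold are assumptions based on the example above, not anything the management pack itself calculates.

```python
# Illustrative only: derive LowerThreshold/UpperThreshold override values
# from roughly 24 hourly samples of a counter (values are made up).

def suggest_thresholds(samples, headroom=1.10, upper_factor=2.0):
    """Return (lower, upper) thresholds from observed "normal" samples.

    lower: just above the highest value seen during the baseline window.
    upper: a multiple of the lower threshold (we simply doubled it).
    """
    observed_peak = max(samples)
    lower = round(observed_peak * headroom)   # just over the normal peak
    upper = round(lower * upper_factor)       # e.g. 30 -> 60 in our example
    return lower, upper

# Hypothetical hourly samples for the counter that raised the alert;
# every value stayed below 30 over the 24-hour window, peaking at 27.
hourly_samples = [12, 14, 11, 18, 22, 27, 25, 19, 16, 13, 12, 11,
                  10, 15, 21, 26, 27, 24, 20, 17, 14, 12, 11, 10]

lower, upper = suggest_thresholds(hourly_samples)
print(f"LowerThreshold override: {lower}")   # 30 with these sample values
print(f"UpperThreshold override: {upper}")   # 60
```

In the console the two values are simply entered into the override properties and saved to the NetApp_Overrides management pack; the sketch only captures the reasoning used to pick them.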

4. There are some alerts which may not have a corresponding performance counter shown in the performance view. For those alerts we can use Health Explorer to see what values were occurring as the monitor changed state from healthy to warning or critical. The screenshot below shows samples of these thresholds. By reviewing each of the different thresholds which occurred, we can identify the normal range for this counter even though we are not gathering the performance counter. In the sample below, we were able to determine that values of at least 16.99 had been occurring. This threshold is also configured by creating an override, which is stored in the NetApp_Overrides management pack we created earlier. A sketch of this approach follows the screenshot.

[Screenshot: Health Explorer state change events showing the sampled values]
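
Where there is no performance view, the same reasoning can be applied to the values captured in the state change events. Here is another minimal Python sketch; the timestamps and sample values are hypothetical, and the "just above the observed peak" rule is an assumption consistent with the approach used in step 3.

```python
# Illustrative only: derive an override threshold from the sample values
# shown in Health Explorer state change events when no performance counter
# is being collected (timestamps and values are made up).

state_change_samples = [
    ("2013-06-24 02:00", 9.4),
    ("2013-06-24 06:00", 12.1),
    ("2013-06-24 11:00", 16.99),
    ("2013-06-24 15:00", 14.3),
    ("2013-06-24 21:00", 10.8),
]

# Highest value observed while the monitor was changing state.
observed_peak = max(value for _, value in state_change_samples)

# Assumed rule: place the override threshold just above the normal peak.
suggested_threshold = round(observed_peak + 1)

print(f"Highest value observed:       {observed_peak}")
print(f"Suggested override threshold: {suggested_threshold}")  # 18 here
```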

5. Go back to step 3 and repeat for each alert (lather, rinse, repeat).

Insight from Marnix Wolf:

I emailed Marnix a draft of this blog post and he had an excellent additional insight. The following is a subset of his thoughts related to this post:

“…this MP requires ongoing tuning, simply because a SAN is the most crucial and basic part at the same time of any IT shop. And like any other IT shop, it goes through changes all the time which always have a performance (IO) and storage hit on the SAN. So whenever IT changes, or embraces new policies like thin provisioning many times resulting in overcommitting it, this management pack will notice and raise the corresponding alerts as well. Therefore this management pack not only monitors the technical aspect of your IT environment but covers some political and economic aspects as well (don’t buy a new SAN because we’re running out of storage, but integrate thin provisioning as an example).”

Additional links:

This link provided a starting point for performance thresholds to consider: https://kb.netapp.com/library/CUSTOMER/solutions/1013259/PerformanceAdvisor_Default_Threshods.pdf


Summary: By using the performance view and Health Explorer state change event details, we can determine what normal behavior looks like for the performance metrics gathered by the NetApp OnCommand management pack and tune the health model to match our environment’s normal values.