[Update 11/29/2017: This blog post series has been superseded by a solution built to visualize server and client information which is available at: http://blogs.catapultsystems.com/cfuller/archive/2017/11/28/updating-the-server-and-client-performance-solution-to-the-new-query-language/. Please note that query examples in this deprecated blog post are for the old query language and will not work in the current query language.]
This blog post is part of a series where we will look at Key Performance Indicators for servers and how OMS can be used to work with these KPIs to determine the health of a server. The blog post series includes:
In Operations Manager we determine the health of a system from a processor perspective based upon the processor utilization level and the processor queue length. Details on this are below from the article on Operations Manager Key Performance Indicators available from Windows IT Pro:
“Critical state occurs when processor utilization (Processor\% Processor Time\_Total) is higher than 95 percent for 6 minutes (after three samples on a 2-minute schedule) and when processor queue length (System\Processor Queue Length) is greater than 15 for 4 minutes (after two samples on a 2-minute schedule).”
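The rule quoted above can be sketched in a few lines of Python. This is only an illustration of the logic as quoted, not Operations Manager's actual implementation. The function names are hypothetical, and the samples are assumed to be listed oldest first on the 2-minute schedule.

```python
def last_n_exceed(samples, threshold, n):
    """Return True if at least n samples exist and the newest n all exceed threshold."""
    return len(samples) >= n and all(value > threshold for value in samples[-n:])


def processor_is_critical(cpu_samples, queue_samples):
    # Thresholds and sample counts come from the quoted KPI text:
    # utilization > 95 for three samples AND queue length > 15 for two samples.
    cpu_high = last_n_exceed(cpu_samples, 95, 3)
    queue_high = last_n_exceed(queue_samples, 15, 2)
    return cpu_high and queue_high
```

For example, `processor_is_critical([50, 96, 97, 99], [3, 16, 20])` evaluates to `True`, while a single sample dipping back under either threshold resets the condition.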
In OMS, it is simple to add both the processor utilization and processor queue length counters. To add these counters we go to the Settings page, select the Data tab and then open the Windows Performance Counters section.
The screenshot above shows both of the counters which Operations Manager uses to determine if a system’s processor is healthy are being collected by OMS.
To track these metrics at the level required for the KPI described at the top of this article, we need to decrease the sample interval to 2 minutes (120 seconds) so that the data is gathered more frequently.
(Note: the purple lines to the left of each counter indicate the information needs to be saved. To save these changes click on the Save option in the top left corner of the UI):
Next we need to develop the query which we will use for the processor information. The example below shows both of the relevant metrics for each of the systems which they are being collected for in my labs within this OMS workspace.
Type=Perf (ObjectName=Processor OR ObjectName=System)
To simplify this query we will update it to only include this information for a specific system:
Type=Perf (ObjectName=Processor OR ObjectName=System) (Computer="AllInOneOMTP3")
From here we can also expand the graphics to show more detail on these objects:
Next we restrict the data to the appropriate time range. For our example we are going to watch these counters for a total of 8 minutes (giving us at least 3 performance samples since they are collected every 2 minutes).
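As a quick sanity check on the window size, the arithmetic can be written out as below. This is a sketch that assumes the samples arrive evenly spaced at the configured interval; the constant and function names are my own and not part of OMS.

```python
SAMPLE_INTERVAL_SECONDS = 120  # the 2-minute schedule configured earlier
WINDOW_MINUTES = 8             # the time range used in the queries
REQUIRED_SAMPLES = 3           # three samples are needed for the utilization KPI


def expected_samples(window_minutes, interval_seconds):
    """Number of whole sample intervals that fit in the window."""
    return (window_minutes * 60) // interval_seconds


samples = expected_samples(WINDOW_MINUTES, SAMPLE_INTERVAL_SECONDS)
print(samples, samples >= REQUIRED_SAMPLES)
```

An 8-minute window gives some headroom over the three samples required, which helps if a sample arrives late.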
Developing queries for the counters:
As we have in previous blog posts in this series, we need to build up queries for the two counters which we will look at related to processor health. However, as a shortcut we will build from the free disk space counters discussed in the last blog post.
Querying Processor Queue Length:
To see the last hour’s worth of values we start from a query we already had for another counter, such as one of those used in the blog post focused on free disk space.
Type=Perf ObjectName:LogicalDisk AND CounterName:"% Free Space" AND TimeGenerated>NOW-1HOUR | Measure Max(CounterValue) as Counter by Computer
To switch this to the new counter we just substitute the values for ObjectName and CounterName. The resulting query is:
Type=Perf ObjectName:System AND CounterName:"Processor Queue Length" AND TimeGenerated>NOW-1HOUR | Measure Max(CounterValue) as Counter by Computer
If we want this restricted to 8 minutes this would instead be:
Type=Perf ObjectName:System AND CounterName:"Processor Queue Length" AND TimeGenerated>NOW-8MINUTES | Measure Max(CounterValue) as Counter by Computer
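To make the aggregation step in these queries concrete, here is a rough Python equivalent of what the "Measure Max(CounterValue) as Counter by Computer" portion does: it groups the matching records by computer and keeps the highest value for each. The record layout (a list of dictionaries) and the second computer name in the example are assumptions for illustration only; this is not how OMS stores or processes the data internally.

```python
def max_by_computer(records):
    """Return {computer: highest CounterValue seen} for the given records."""
    result = {}
    for record in records:
        name = record["Computer"]
        value = record["CounterValue"]
        # Keep the value only if it is the first or the largest seen so far.
        if name not in result or value > result[name]:
            result[name] = value
    return result


# Hypothetical sample data standing in for the records the search returns.
sample_records = [
    {"Computer": "AllInOneOMTP3", "CounterValue": 2},
    {"Computer": "AllInOneOMTP3", "CounterValue": 17},
    {"Computer": "OtherServer", "CounterValue": 4},
]
print(max_by_computer(sample_records))
```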
Querying % Processor Time:
The same steps result in this query for the % processor time:
Type=Perf ObjectName:Processor AND CounterName:"% Processor Time" AND TimeGenerated>NOW-1HOUR | Measure Max(CounterValue) as Counter by Computer
If we want this restricted to 8 minutes this would instead be:
Type=Perf ObjectName:Processor AND CounterName:"% Processor Time" AND TimeGenerated>NOW-8MINUTES | Measure Max(CounterValue) as Counter by Computer
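The queries above differ only in the object name, the counter name and the time window. If you are generating them for several counters, a small helper along these lines may save some typing. It only assembles the legacy query text shown in this post; the function name and its parameters are hypothetical and nothing here calls OMS.

```python
def build_perf_query(object_name, counter_name, window):
    """Build a legacy OMS search string; window is e.g. '1HOUR' or '8MINUTES'."""
    return (
        f'Type=Perf ObjectName:{object_name} AND CounterName:"{counter_name}" '
        f"AND TimeGenerated>NOW-{window} "
        "| Measure Max(CounterValue) as Counter by Computer"
    )


print(build_perf_query("System", "Processor Queue Length", "8MINUTES"))
print(build_perf_query("Processor", "% Processor Time", "8MINUTES"))
```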
Now that we have our working queries, most of the hard work is done. We can now visualize this information in the My Dashboard page. To do this we save our queries under the “Server Health” category, which we will use throughout this blog series.
Adding dashboard items:
To add dashboard items, go to the top page for OMS and choose the My Dashboard option.
Click the Customize button to add a new dashboard item.
Add the query that you created:
A sample of the results for these dashboards is shown below:
We can also alert on these queries to send a notification when the server is unhealthy. Matching the Operations Manager approach would mean combining the results of both the % processor time and processor queue length metrics into a single alert that fires only when both conditions are true. As we found in the free disk space blog post, that’s not currently viable in any syntax that I have been able to identify. The best approach I have found is to create an alert condition for each of the two individual performance metrics and then alert if either of them breaches its threshold.
Processor Queue Length:
To change the above query we need to bake in the threshold that we want for the counter. The query below takes the query generated earlier in this blog post and adds the threshold for processor queue length as > 15.
Type=Perf ObjectName:System AND CounterName:"Processor Queue Length" AND CounterValue>15 AND TimeGenerated>NOW-1HOUR
% Processor Time:
The same process results in the query below to identify when % Processor Time is > 95.
Type=Perf ObjectName:Processor AND CounterName:"% Processor Time" AND CounterValue>95 AND TimeGenerated>NOW-1HOUR
(From an earlier blog post in this series: a quick way to test a query is to run it with the threshold set as above and, if that does not return data, reverse the > or < in the query.)
Combining the queries for an alert rule:
We can combine these queries to result in a single query which will alert on either condition. The syntax for this is now:
Type=Perf (ObjectName:System AND CounterName:"Processor Queue Length" AND CounterValue>15 AND TimeGenerated>NOW-1HOUR) OR (ObjectName:Processor AND CounterName:"% Processor Time" AND CounterValue>95 AND TimeGenerated>NOW-1HOUR)
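The either/or logic of the combined query can be read as the Python predicate below. This is a sketch for reasoning about which records match, under the assumption that each record is a dictionary with ObjectName, CounterName and CounterValue keys. The time restriction (TimeGenerated>NOW-1HOUR) is left out to keep the example short, and the function name is hypothetical.

```python
def breaches_threshold(record):
    """Return True if the record matches either branch of the combined query."""
    queue_breach = (
        record["ObjectName"] == "System"
        and record["CounterName"] == "Processor Queue Length"
        and record["CounterValue"] > 15
    )
    cpu_breach = (
        record["ObjectName"] == "Processor"
        and record["CounterName"] == "% Processor Time"
        and record["CounterValue"] > 95
    )
    # Unlike the Operations Manager rule, a breach of EITHER metric matches.
    return queue_breach or cpu_breach
```

Note that a record can only ever match one branch, since it belongs to a single counter. This is why the alert fires on either condition rather than on both together.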
We save this search as “Processor threshold exceeded – Alert”, as shown below.
We then activate the alert by creating an alert rule for the saved search, using the configuration shown below.
And we now have alerting for high processor conditions for our servers in OMS!
Testing the queries, dashboard and alerts:
So how do we test this to validate that the alert rules will fire? For my tests I used CPUStress to give it a bit of a push.
In OMS we can quickly see the results of the increased processor utilization on both counters.
After some time with high processor utilization we can see the results reflected in the dashboard items which were created.
And the email arrived notifying us that the processor threshold(s) had been breached (a subset is shown below).
Summary: The approach explained in this blog post shows an example of how OMS can be used to provide similar functionality in terms of server monitoring to what we have been working with in Operations Manager for processor threshold monitoring. Once the queries can be updated to reflect data gathered on a smaller time increment this will be extremely similar to the functionality which we are used to within Operations Manager.
Thank you to Tao, Stan and Pete for their knowledge of how near realtime performance counters work in OMS. This was invaluable to my work on this approach!
In the next part of this blog series I will be introducing a surprise “what-if” related to server health monitoring with OMS.