Using OMS to visualize server health based on free disk space | Quisitive

[Update 11/29/2017: This blog post series has been superseded by a solution built to visualize server and client information which is available at: https://www.catapultsystems.com/cfuller/archive/2017/11/28/updating-the-server-and-client-performance-solution-to-the-new-query-language/. Please note that query examples in this deprecated blog post are for the old query language and will not work in the current query language.]

This blog post is part of a series where we will look at Key Performance Indicators for servers and how OMS can be used to work with these KPIs to determine the health of a server. The blog post series includes:

The blog posts in this series are ordered by the complexity required to create each type of server health monitoring. Free disk space monitoring in Operations Manager is relatively complicated (as you will see in the blog post below). For less complicated examples see the Server up/down and Free memory blog posts referenced above before reviewing this blog post.

In Operations Manager we determine the health of a system from a free disk perspective based upon both the percentage of free space and the free space in MB. The thresholds also vary depending on whether the drive is a system drive or a nonsystem drive. Details on this are below from the article on Operations Manager Key Performance Indicators available from Windows IT Pro:

“System drives are defined as healthy or in a warning or error state based on the following conditions:

Nonsystem drives are defined as healthy or in a warning or error state based on the following conditions:

The counters that we are looking for relate to both the amount of free space in MB and the percentage of free space. This is made more complex by two factors (which is why I didn’t start with free disk space in this series):

  1. There are different thresholds based upon what type of drive it is (system or non-system)
  2. There are both warning and error thresholds which are defined for these drives 
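To make the two factors above concrete, here is a minimal Python sketch of how a drive's health could be evaluated. It assumes, as described in this post, that Operations Manager requires both the percentage and the megabyte value to fall below their thresholds. Only the system-drive warning pair (10 percent and 200 MB) is quoted in this post; every other number, and the names `Thresholds` and `drive_health`, are hypothetical placeholders for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    warn_pct: float   # warning threshold for % Free Space
    warn_mb: float    # warning threshold for Free Megabytes
    error_pct: float  # error threshold for % Free Space
    error_mb: float   # error threshold for Free Megabytes


# Only warn_pct=10 and warn_mb=200 for the system drive come from this post.
# All other values are placeholders and should be replaced with real thresholds.
SYSTEM = Thresholds(warn_pct=10, warn_mb=200, error_pct=5, error_mb=100)
NONSYSTEM = Thresholds(warn_pct=10, warn_mb=2000, error_pct=5, error_mb=1000)


def drive_health(pct_free: float, free_mb: float, t: Thresholds) -> str:
    """Return 'error', 'warning' or 'healthy' for one drive.

    A state is only raised when BOTH counters are below their thresholds.
    The error check runs first so a drive is never reported twice.
    """
    if pct_free < t.error_pct and free_mb < t.error_mb:
        return "error"
    if pct_free < t.warn_pct and free_mb < t.warn_mb:
        return "warning"
    return "healthy"


if __name__ == "__main__":
    print(drive_health(8, 150, SYSTEM))     # low on both counters
    print(drive_health(8, 5000, SYSTEM))    # low percentage, but plenty of MB
```

Checking the error thresholds before the warning thresholds is one way to avoid reporting the same drive in both states at once.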

Adding performance counters to OMS:

In OMS, it is simple to add both of the disk counters. To add these counters we go to the Settings page, select the Data tab and then open the Windows Performance counter section. In this case we need to find a counter which was not already added, so a quick trick is to type the beginning of the name of the performance counter to find what you are looking for. In this case we need LogicalDisk(*)\% Free Space.

When a new counter is added it is shown with the purple line on the left as shown below. Once this is saved (top left corner) the purple line will disappear and the change will be committed.

For health based on free disk space we want both: LogicalDisk(*)\% Free Space and LogicalDisk(*)\Free Megabytes.

Developing queries for the counters:

Next we need to develop the queries which we will use for the disk information. The example below shows both of the relevant metrics for each of the systems in my lab for which they are being collected within this OMS workspace.

Type=Perf (ObjectName=LogicalDisk)

To create a query which specifies only the two counters we can use the options on the left side (once data has populated so that they appear on the left under the CounterName section).

Once we choose metrics we can see the two counters (LogicalDisk(*)\% Free Space and LogicalDisk(*)\Free Megabytes). We can also see that counters from multiple drives are displayed (see the example below showing both the C and E drives).

Type=Perf (ObjectName=LogicalDisk) (CounterName="Free Megabytes" OR CounterName="% Free Space") | sort Computer

Querying LogicalDisk(*)\% Free Space:

The queries below give the % Free Space counter information for the last 8 minutes, using the highest value in the timeframe specified.

Type=Perf ((ObjectName:LogicalDisk AND CounterName:"% Free Space" )) AND TimeGenerated>NOW-8MINUTES | Measure Max(CounterValue) as Counter by Computer

This unfortunately (currently) does not result in any data being returned. Later in this article we will review the reason behind that situation (see the “Important Note” section of this blog post for details).

If we move the timeframe up to 1 hour we see the results that we would expect.

Type=Perf ((ObjectName:LogicalDisk AND CounterName:"% Free Space" )) AND TimeGenerated>NOW-1HOUR | Measure Max(CounterValue) as Counter by Computer
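For readers unfamiliar with the Measure clause, the following Python sketch mimics what "Measure Max(CounterValue) as Counter by Computer" computes: one value per computer, namely the highest counter value seen in the window. The record layout and the computer names are hypothetical sample data, not output from a real workspace.

```python
def max_by_computer(records):
    """Return {computer: highest CounterValue} for a list of Perf-like records."""
    result = {}
    for record in records:
        computer = record["Computer"]
        value = record["CounterValue"]
        # Keep the largest value seen so far for this computer.
        if computer not in result or value > result[computer]:
            result[computer] = value
    return result


# Hypothetical "% Free Space" samples from two servers.
sample = [
    {"Computer": "srv01", "InstanceName": "C:", "CounterValue": 12.5},
    {"Computer": "srv01", "InstanceName": "E:", "CounterValue": 80.0},
    {"Computer": "srv02", "InstanceName": "C:", "CounterValue": 45.0},
]

if __name__ == "__main__":
    print(max_by_computer(sample))
```

Because the grouping is by Computer only, a server with several drives is represented by whichever drive has the most free space. Grouping by drive as well would need the instance name in the grouping key.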

Querying LogicalDisk(*)\Free Megabytes:

The queries below give the Free Megabytes counter information for the last 8 minutes and for the last hour, using the highest value in the timeframe specified. With the 8 minute timeframe there is one value available, but it does not represent each of the systems in the environment. Later in this article we will review the reason behind that situation (see the “Important Note” section of this blog post for details).

Type=Perf ((ObjectName:LogicalDisk AND CounterName:"Free Megabytes" )) AND TimeGenerated>NOW-8MINUTES | Measure Max(CounterValue) as Counter by Computer

Type=Perf ((ObjectName:LogicalDisk AND CounterName:"Free Megabytes" )) AND TimeGenerated>NOW-1HOUR | Measure Max(CounterValue) as Counter by Computer

Saving the searches:

Now that we have our working queries, most of the hard work is done. We can now visualize this information in the My Dashboard page. To do this we save our queries under a category which we will use throughout this blog series “Server Health”.

Adding dashboard items:

To add dashboard items, go to the top page for OMS and choose the My Dashboard option.

Click the Customize button to add a new dashboard item.

Add the queries that you created as dashboard items. An example is below:

Click customize again to save the new dashboard items in place.

Creating alerts

We can also alert on these queries to provide a notification when the server is unhealthy. If we match the Operations Manager approach, this would be the following:

This means that we are looking at four different alert rules, each with relatively complex logic required to make them happen. Let’s see if we can build the first of these queries and make it work. We’ll start by breaking the first query down into two parts:

  1. Can we query free space less than 10 percent for a “C” drive for the % Free Space counter?
  2. Can we query free space less than 200 MB for a “C” drive for the Free Megabytes counter?

Alerting on % Free Space:

After that we will see if we can combine these queries. If we start with our original free space query and remove the Measure Max piece of it, we have the following:

Type=Perf ((ObjectName:LogicalDisk AND CounterName:"% Free Space" )) AND TimeGenerated>NOW-1HOUR

From there we need to restrict this down to only the C drive.

Type=Perf ((ObjectName:LogicalDisk AND CounterName:"% Free Space" )) AND TimeGenerated>NOW-1HOUR (InstanceName="C:")

And then we need to specify the value for the counter to be less than 10.

Type=Perf (ObjectName:LogicalDisk AND CounterName:"% Free Space" AND InstanceName="C:" AND CounterValue<10 AND TimeGenerated>NOW-1HOUR)

To verify this query, we can reverse the comparison so that instead of checking for a value of < 10 it is now looking for a value of > 10.

Type=Perf (ObjectName:LogicalDisk AND CounterName:"% Free Space" AND InstanceName="C:" AND CounterValue>10 AND TimeGenerated>NOW-1HOUR)

Alerting on Free MB:

The query for Free MB is just like the one for % Free Space, just updated to reflect the new counter name and new value.

Type=Perf (ObjectName:LogicalDisk AND CounterName:"Free Megabytes" AND InstanceName="C:" AND CounterValue<200 AND TimeGenerated>NOW-1HOUR)

As with the previous example, to test the above condition for alerting we can reverse the threshold so that it alerts in the reverse condition, such as:

Type=Perf (ObjectName:LogicalDisk AND CounterName:"Free Megabytes" AND InstanceName="C:" AND CounterValue>200 AND TimeGenerated>NOW-1HOUR)
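Since the normal and reversed queries differ only in the counter name, the threshold and the comparison sign, they can be generated from one template. The Python sketch below is only a convenience for building the query text in the old search syntax used in this post. The function name `threshold_query` and its parameters are hypothetical, and the resulting string still has to be pasted into OMS by hand.

```python
def threshold_query(counter, threshold, instance="C:", reverse=False, window="1HOUR"):
    """Build an old-syntax OMS search string for one LogicalDisk counter.

    reverse=True flips '<' to '>' so the query can be tested on healthy drives.
    """
    op = ">" if reverse else "<"
    return (
        f'Type=Perf (ObjectName:LogicalDisk AND CounterName:"{counter}" '
        f'AND InstanceName="{instance}" AND CounterValue{op}{threshold} '
        f"AND TimeGenerated>NOW-{window})"
    )


if __name__ == "__main__":
    # The alerting query and its reversed test variant for each counter.
    for counter, threshold in (("% Free Space", 10), ("Free Megabytes", 200)):
        print(threshold_query(counter, threshold))
        print(threshold_query(counter, threshold, reverse=True))
```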

Querying for the first of the four conditions identified:

The first of the four sets of queries for low disk space is listed below. These queries correspond to the condition “When free space is less than 10 percent and actual free space is less than 200 MB on the C drive, warning alert”:

Type=Perf (ObjectName:LogicalDisk AND CounterName:"Free Megabytes" AND InstanceName="C:" AND CounterValue<200 AND TimeGenerated>NOW-1HOUR)

Type=Perf (ObjectName:LogicalDisk AND CounterName:"% Free Space" AND InstanceName="C:" AND CounterValue<10 AND TimeGenerated>NOW-1HOUR)

The next step is to save the queries shown above with unique naming (to make them easier to find both when creating alert rules and to find them later when you need to update them). An example would be:

To alert on these conditions we can create an alert rule for each of the queries shown above such as:

We can also combine the two queries resulting in a single one which alerts if either condition occurs.

Type=Perf (ObjectName:LogicalDisk AND CounterName:"Free Megabytes" AND InstanceName="C:" AND CounterValue<200 AND TimeGenerated>NOW-1HOUR) OR (ObjectName:LogicalDisk AND CounterName:"% Free Space" AND InstanceName="C:" AND CounterValue<10 AND TimeGenerated>NOW-1HOUR)

(As a reminder, if no values currently match, the query can be tested by reversing the < and > signs in the query above, as in the example below):

Unfortunately, using an AND condition results in no values. This is logical but unfortunate. It is logical because each performance record carries only one counter name, so a single record can never match both the Free Megabytes clause and the % Free Space clause, and the AND condition would never occur. It is unfortunate because this approach is not quite the same as what we do within Operations Manager currently.

Type=Perf (ObjectName:LogicalDisk AND CounterName:"Free Megabytes" AND InstanceName="C:" AND CounterValue<200 AND TimeGenerated>NOW-1HOUR) AND (ObjectName:LogicalDisk AND CounterName:"% Free Space" AND InstanceName="C:" AND CounterValue<10 AND TimeGenerated>NOW-1HOUR)
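The difference between the OR and AND behavior can be illustrated outside of OMS. The Python sketch below uses hypothetical sample records to show that a per-record AND can never match, while correlating the two counters per computer gives the "both conditions" result that Operations Manager uses. This is not an OMS feature: it assumes the records have been exported from a search, and all names and values are made up for illustration.

```python
def record_matches_both(record):
    """Per-record AND, as in the query above: one record has one CounterName."""
    low_mb = record["CounterName"] == "Free Megabytes" and record["CounterValue"] < 200
    low_pct = record["CounterName"] == "% Free Space" and record["CounterValue"] < 10
    return low_mb and low_pct  # the two names differ, so this is always False


def computers_low_on_both(records, instance="C:", pct=10, mb=200):
    """Correlate per computer: low on % Free Space AND low on Free Megabytes."""
    low_pct, low_mb = set(), set()
    for r in records:
        if r["InstanceName"] != instance:
            continue
        if r["CounterName"] == "% Free Space" and r["CounterValue"] < pct:
            low_pct.add(r["Computer"])
        elif r["CounterName"] == "Free Megabytes" and r["CounterValue"] < mb:
            low_mb.add(r["Computer"])
    # Only computers present in both sets satisfy both conditions.
    return sorted(low_pct & low_mb)


# Hypothetical exported records: srv01 is low on both counters, srv02 on one.
sample = [
    {"Computer": "srv01", "InstanceName": "C:", "CounterName": "% Free Space", "CounterValue": 4.0},
    {"Computer": "srv01", "InstanceName": "C:", "CounterName": "Free Megabytes", "CounterValue": 150.0},
    {"Computer": "srv02", "InstanceName": "C:", "CounterName": "% Free Space", "CounterValue": 8.0},
    {"Computer": "srv02", "InstanceName": "C:", "CounterName": "Free Megabytes", "CounterValue": 9000.0},
]

if __name__ == "__main__":
    print(any(record_matches_both(r) for r in sample))
    print(computers_low_on_both(sample))
```

With the OR query, both sample servers would raise an alert. With the per-computer correlation, only the server that is low on both counters would.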

The saved search and an alert for the two conditions above look like this:

Sample alert (reversed values, created intentionally)

Addressing each of the four low disk space configurations:

Now that we have a working query for the first of these four conditions we can alter that query to represent the other three. The four conditions and their appropriate queries are listed below:

NOTE: The queries below have been updated so that they do not fire as duplicate conditions such as a low disk space condition where it is below both the warning and error thresholds.

The full set of alert rules is shown below (with a few others I am working on):

A sample email notification is shown below from when I ran tests with the condition reversed (i.e., there was too much free space, not too little).

Important note #1: Based upon my test results and my communications with a few of the other CDM focused MVPs it appears that while we can currently gather data in near real time (NRT), the queries are not able to be used to represent that data. The queries represent the data which is available once it has been indexed, which occurs every 30 minutes. So why did I set the query to every hour instead of every 8 minutes? This period of time provides at least two data points which will exist after aggregation (one every 30 minutes). So in reality the approach in this blog will provide notification when free disk space is insufficient for a server over a one hour period and it will notify you every 15 minutes when these conditions apply.
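The reasoning about the one hour window can be checked with a little arithmetic. The Python sketch below assumes, as described in the note above, that one data point per counter becomes queryable at each 30 minute indexing cycle. The function name is hypothetical, and the real indexing behavior of OMS may differ from this simplified model.

```python
import math


def indexed_points_in_window(window_minutes, index_interval_minutes=30):
    """Return (minimum, maximum) number of indexed data points in a query window.

    Points are assumed to land at regular intervals, so the count depends on
    how the window happens to line up with the indexing cycle.
    """
    minimum = window_minutes // index_interval_minutes
    maximum = math.ceil(window_minutes / index_interval_minutes)
    return minimum, maximum


if __name__ == "__main__":
    print(indexed_points_in_window(8))    # 8 minute window
    print(indexed_points_in_window(60))   # 1 hour window
```

Under this model an 8 minute window may contain no indexed data at all, which matches the empty results seen earlier in this post, while a one hour window always contains two data points.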

Important note #2: The queries which I have developed do not exactly match what we are working with in Operations Manager. In Operations Manager the alert is created when both conditions occur (% free space and free megabytes). In my examples, the alert occurs if either of these conditions occurs.

Important note #3: This approach will alert on the condition every hour as long as the condition is continuing to occur (i.e., if a drive is too low on disk space it will alert every hour indicating that it is still too low on disk space).

Summary: The approach explained in this blog post shows an example of how OMS can be used to provide similar functionality in terms of server monitoring to what we have been working with in Operations Manager from a free disk space perspective. Once the queries can be updated to reflect data gathered on a smaller time increment this will be extremely similar to the functionality which we are used to within Operations Manager. Additionally, once there is a way to query for when both conditions occur (or if someone can show me how) that will match very closely to the functionality which we are familiar with in Operations Manager.

Thank you to Tao, Stan and Pete for their knowledge of how near realtime performance counters work in OMS. This was invaluable to my work on this approach!

In the next part of this blog series I will be looking at server health from a processor perspective in OMS.