As part of my lab, I have one server that provides storage for the Hyper-V servers over SMB3. The server uses Windows Storage Spaces to provide storage for each of the nodes of my Hyper-V cluster. Recently I started experiencing severe performance problems with some of my virtual machines. After some digging I found indications that one of the disks in my primary storage space was having errors. The errors were event ID 7 from the source disk, as shown below. They were occurring on disks #12 and #5 in the system.
Finding the dying drive:
The next step was to use Server Manager to check the health of my storage pools. As shown below, Storage Pool X had an error on drive #5 (PhysicalDisk5).
From the disks view I could now see that the other disk number reporting issues (#12) was actually the Virtual Drives Storage Pool. So while the error showed on two disks (#5 and #12), the actual fault was only on PhysicalDisk5, since disk #12 represented the storage pool that the physical disk was used in.
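The reasoning above (collect the disk numbers named in the errors, then discard any that are really virtual disks) can be sketched in a few lines of Python. This is only an illustration: the message format is an assumption about how disk event ID 7 text typically reads, the sample messages are hypothetical, and the function names are made up for this example.

```python
import re

# Assumed message format for disk event ID 7; adjust the pattern if yours differs.
HARDDISK_PATTERN = re.compile(r"\\Device\\Harddisk(\d+)\\")


def disk_numbers_from_events(messages):
    """Return the set of disk numbers mentioned in the event messages."""
    numbers = set()
    for message in messages:
        match = HARDDISK_PATTERN.search(message)
        if match:
            numbers.add(int(match.group(1)))
    return numbers


def suspect_physical_disks(messages, virtual_disk_numbers):
    """Drop disk numbers that belong to virtual disks, leaving physical suspects."""
    reported = disk_numbers_from_events(messages)
    return sorted(reported - set(virtual_disk_numbers))


# Hypothetical sample data mirroring the situation described above:
# errors logged against disks 5 and 12, where 12 is the virtual disk.
sample_messages = [
    r"The device, \Device\Harddisk5\DR5, has a bad block.",
    r"The device, \Device\Harddisk12\DR12, has a bad block.",
    r"The device, \Device\Harddisk5\DR5, has a bad block.",
]

suspects = suspect_physical_disks(sample_messages, virtual_disk_numbers=[12])
print(suspects)
```

With the sample data, only disk 5 remains as a suspect, which matches what Server Manager showed for PhysicalDisk5.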
Attempts to repair the virtual disk were unsuccessful.
Based on my later experience, my recommendation would now be to add an additional drive to the existing storage space and then retire the non-functional drive from the system. The challenge was identifying which drive in the system was actually the one that was failing. By right-clicking on the drive, you can toggle the drive light if the hardware supports it. In my lab environment this isn’t an option.
I have multiple drives of the same make and model, so I needed another way to tell which drive had actually failed. The answer is the serial number, which appears in the properties of the drive as shown below. The serial number is unique to the drive.
To find the drive with this serial number, I shut down the server and checked each of the drives of this capacity until I found the matching serial number on the drive itself. (The screenshot below is from the drive that had failed; the one above shows the same type of drive, which is still functional in the system.)
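If you note the serial numbers from the drive labels as you pull each drive, the matching step can be sketched as below. This is a minimal illustration, not part of any Windows tooling: the serial numbers and bay names are hypothetical, and the normalization (ignoring case, spaces and punctuation) is an assumption, since the value shown in the drive properties may not be formatted exactly like the printed label.

```python
def normalize_serial(serial):
    """Uppercase the serial and keep only letters and digits."""
    return "".join(ch for ch in serial.upper() if ch.isalnum())


def find_bay(reported_serial, label_serials_by_bay):
    """Return the bay whose label matches the reported serial, or None."""
    target = normalize_serial(reported_serial)
    for bay, label_serial in label_serials_by_bay.items():
        if normalize_serial(label_serial) == target:
            return bay
    return None


# Hypothetical serial numbers as read from the labels of same-model drives.
labels = {
    "bay 1": "WX-1A2B3C4D0001",
    "bay 2": "WX-1A2B3C4D0002",
    "bay 3": "WX-1A2B3C4D0003",
}

# Hypothetical serial as shown in the properties of the failing drive.
failing_bay = find_bay("  wx1a2b3c4d0003 ", labels)
print(failing_bay)
```

Returning None when nothing matches is deliberate: it is a prompt to re-check the noted serial numbers rather than pull a drive on a guess.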
Knowing that the error is occurring:
Operations Manager may well alert on this condition as part of the base operating system management packs. If it does not, the alert can easily be added by creating a rule with the following configuration, targeted at the Windows Servers class.
The example below shows an alert rule based on the Windows NT event log. It monitors the System event log for event ID 7 from the event source disk, and generates an alert for this condition as shown below.
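The rule criteria described above amount to a simple three-part filter: log name, event source and event ID. The Python sketch below only illustrates that matching logic. It is not Operations Manager code, and the EventRecord type, the function names and the sample computer names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical, simplified view of a Windows event log entry."""
    log_name: str
    source: str
    event_id: int
    computer: str


def matches_disk_error_rule(event):
    """True for System log events with ID 7 from the source 'disk'."""
    return (
        event.log_name.lower() == "system"
        and event.source.lower() == "disk"
        and event.event_id == 7
    )


def alerts_for(events):
    """Build one alert description per event that matches the rule."""
    return [
        f"Disk error (event ID 7) on {event.computer}"
        for event in events
        if matches_disk_error_rule(event)
    ]


# Hypothetical sample events; only the first one meets all three criteria.
sample_events = [
    EventRecord("System", "disk", 7, "STORAGE01"),
    EventRecord("System", "disk", 11, "STORAGE01"),
    EventRecord("Application", "disk", 7, "STORAGE01"),
    EventRecord("System", "Ntfs", 7, "HV01"),
]

alerts = alerts_for(sample_events)
print(alerts)
```

All three criteria have to match together. Filtering on the event ID alone would also catch ID 7 events from unrelated sources.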
How to replace a disk in a storage space: http://social.technet.microsoft.com/wiki/contents/articles/11382.storage-spaces-frequently-asked-questions-faq.aspx#How_do_I_replace_a_physical_disk
Summary: If you have a dying drive in a storage pool and you cannot toggle the drive light, open the properties of the drive and write down its serial number. Match this serial number to the serial number on each physical drive to identify the failed drive in the system. Operations Manager can detect this condition and raise an alert whenever it occurs on a system in the environment.