Most SQL folk know that we need to run DBCC frequently and the lucky ones who have attended some Paul Randal training (www.sqlskils.com) will be well aware of the intricacies of DBCC CHECKDB and when/how the repair databases and data loss risks.
Well what if you also have a SAN and clustered Hyper-V environment instead of just SQL and you get some disk corruption on your non sql servers.
1) What is the SAN equivalent of DBCC?
Most SAN’s will have a background process that checks for a disk inconsistencies like bad sectors, differences between the mirror and and the partner disk in a RAID set and other things like that. On our MD3000 its turned on by default (see below)
2) How do I know when a get a problem?
The SAN can pick up a corrupted disk issue in to ways:
a) the media scan can find it on a periodic check
b) When reading IO the controller may get a physical error or a Logical error (data was found on a second attempt). This will appear in the SAN event log and if configured correctly the SAN can usually send out an email (see error below)
In out case we got the error and then thought “oh dear”, we have like 50 virtual servers running on that storage array with over 10 volumes. How do we narrow down the issue!
2) Narrowing Down the Issue to which disk and Hyper-V image
The MD3000 like most SAN’s can produce support information, and in this case we can get ZIP file with lots of config and diagnostics data in it. one of the files contains more detail on unreadable sectors
Looking into the “UnreadableSectors” text file we can now see which actual disk had the Physical Errors and what volume is affected:
On the down side in our case this volume is 2TB and runs 20 hyper-v images, and we really don’t want to attempt to restore all of them, so we want narrow down the problem to which hyper-v image and which virtual host.
3) What is the Clustered Shared Volume version of DBCC ?
Now that we know which volumes affected we can ran a “chkdsk” on it to look for bad sectors and/or attempt a repair. We can choose to attempt the repair or just find out which images are affected.
One minor issue is in our case we use “Clustered Shared Volumes” so the normal chkdsk can not be used, we need to use PowerShell to locate the offending image (see below)
Ok, so that explains why the mail server just started blue screening , but good news is that VHD file is just the operating system disk. We could restore that or see if the damage is fixable.
2) Narrowing Down to which files “inside” the Hyper-V image. the disk version of DBCC
Now that the parent volume is “repaired” we can go into the guest and run a chkdsk to see what the damage is and if it is “fixable”. in this case a bunch of windows DLL’s are going to be deleted, some re to do with wireless networking (no big deal on a server), others not too sure…
3) Should we restore backup or repair now we know what the problem is and can delete the offending DLL’s ?
Tricky one, I think like DBCC CHECKDB REPAIR_ALLOW_DATA_LOSS the general guidance is to restore to last known good backup, but In our case we are due to upgrade in a ahem “week or so” to a new operating system/email server and we don’t have any SLA on our email server, so we opted to just fix the errors and run it for another short while then go with the upgrade to exchange 2010.
If the corruption had have been in the exchange data files we would have had to use the “eseutil” which is exchange servers version of “DBCC”, but as these volumes were not affected we were quite lucky.
4) Lessons Learned?
There are may tools that do the DBCC thing for different products. Learn how one works and they kind of all work in the same way
– SQL Server had DBCC CHECKDB
– The SAN has media scan settings in its GUI or CLI
– Clustered Shared Volumes use PowerShell “Repair-ClusteredSharedVolume” cmdLet
– normal windows volumes use “chkdsk”
– exchange server uses “eseutil.exe”