sysadmin-tips-and-tricks: Solaris - ZFS disk failure reporting

Wednesday, 12 December 2012

Solaris - ZFS disk failure reporting

Our previous servers ran Solaris 9. When we replaced the hardware we also moved to Solaris 10.

The old servers had an SVM (Solaris Volume Manager) script, run regularly from crontab, that monitored disk status. Any failure messages were emailed to a shared, monitored mailbox.

After a bit of digging around, it didn't make sense to run the same script on the new servers, because we had also switched from SVM to ZFS. Instead, we can monitor the state of the ZFS pools (the disks are in mirrored pairs) with the zpool status command.

Steps:

1. Create script.
2. Configure mail relay.
3. Schedule script frequency in crontab.

1. Create script

# vi /usr/local/zfscheck

#!/usr/bin/ksh
# Alert by email unless every pool reports healthy.
LOG=/var/tmp/zfscheck.log

# Discard grep's output so cron does not mail the match; only the exit status is used.
zpool status -x | grep 'all pools are healthy' > /dev/null
if [ $? -ne 0 ]; then
    # Build the report: timestamp, host name, then details of the unhealthy pools.
    date > $LOG
    echo >> $LOG
    hostname >> $LOG
    echo >> $LOG
    zpool status -xv >> $LOG
    mail -s "Disk failure in server : `hostname`" name@mailaddress < $LOG
fi

(save and exit)

# chmod +x /usr/local/zfscheck
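
The health test in the script can be tried out without a real pool (or a real failure) by feeding it text. This is only a minimal sketch: the function name pool_is_healthy and the sample degraded text are made up for illustration and are not part of the original script.

```shell
#!/bin/sh
# Sketch: the health test from zfscheck as a function that reads
# "zpool status -x" style output on stdin.

# Returns 0 when the text contains the healthy message, 1 otherwise.
pool_is_healthy() {
    # Output is discarded; only grep's exit status matters.
    grep 'all pools are healthy' > /dev/null
}

# Illustrative sample of an unhealthy report (not captured from a real system).
sample_degraded='  pool: rpool
 state: DEGRADED'

if printf '%s\n' "$sample_degraded" | pool_is_healthy; then
    echo "healthy"
else
    echo "would send alert"
fi
```

On a live system the same function could be fed real output, for example: zpool status -x | pool_is_healthy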

2. Configure mail relay

Add your mail relay to /etc/mail/sendmail.cf. Find the smart relay host line, which is empty in our file:
# "Smart" relay host (may be null)
DS

Open the file and append your relay host name directly after DS, with no space:
# vi /etc/mail/sendmail.cf

# "Smart" relay host (may be null)
DSmailrelay.yourdomain.com

(On our Exchange Front Ends we edited the SMTP node, adding the server's IP address, to enable relaying. By default Exchange is set to deny relaying.)

Restart the sendmail service:
# svcadm restart sendmail
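
Before restarting, it can be worth confirming that the DS line really holds the relay name. The following is a rough sketch, not something from our original setup: the helper name relay_host is hypothetical, and it runs against a temporary sample file rather than the real /etc/mail/sendmail.cf.

```shell
#!/bin/sh
# Sketch: print whatever follows "DS" in a sendmail.cf-style file.
relay_host() {
    # -n suppresses normal output; p prints only lines where DS was stripped.
    sed -n 's/^DS//p' "$1"
}

# Build a small sample file so the example is self-contained.
tmp=${TMPDIR:-/tmp}/sendmail.cf.$$
printf '%s\n' '# "Smart" relay host (may be null)' 'DSmailrelay.yourdomain.com' > "$tmp"

relay_host "$tmp"    # prints mailrelay.yourdomain.com
rm -f "$tmp"
```

Against the real file this would be: relay_host /etc/mail/sendmail.cf. An empty line of output means no relay host is set.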

3. Schedule the task to run every 30 minutes

To set an editor for crontab:
# bash
# export EDITOR=vi
# crontab -e

NOTE: If you stay in the default shell (Bourne), "export EDITOR=vi" does not work in one step. Set and export the variable separately before running crontab -e:
# EDITOR=vi
# export EDITOR

Add the following entry (minutes 0 and 30 of every hour), then save and exit:

# ZFS pool check
0,30 * * * * /usr/local/zfscheck

Check that the entry has been saved:
# crontab -l
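
The same check can be scripted, which is handy when setting this up on several servers. This is a sketch only: the function name has_zfscheck_entry is made up, and the example runs against sample text instead of a real crontab.

```shell
#!/bin/sh
# Sketch: succeed if a crontab listing on stdin contains an active
# (non-comment) line that runs /usr/local/zfscheck.
has_zfscheck_entry() {
    # Drop comment lines first, then look for the script path.
    grep -v '^#' | grep '/usr/local/zfscheck' > /dev/null
}

# Sample listing matching the entry above.
sample='# ZFS pool check
0,30 * * * * /usr/local/zfscheck'

if printf '%s\n' "$sample" | has_zfscheck_entry; then
    echo "entry present"
else
    echo "entry missing"
fi
```

On the server itself the listing would come from crontab: crontab -l | has_zfscheck_entry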

I checked that the whole thing worked by setting it all up on a test server and then pulling a drive out. After a few minutes I got an email!
