How to Monitor a Software RAID device

by zenoss last modified 2007-06-08 14:47

Instructions from the mailing list on how to write a check script, include its result in SNMP output, and add the datapoint to Zenoss. Generally useful!

I needed to monitor the state of a Linux md device (/dev/md0, a software RAID array) because the server it was on had no access to an e-mail server, so the usual approach of configuring mdmonitor to send alerts would not work. I also wanted this monitoring to be tied into Zenoss. A week earlier I had been unpleasantly surprised to discover that one cause of server lag was a broken md0 device that no one had noticed.

Usually the state of md devices is evaluated manually, simply by looking at the file /proc/mdstat and judging from there.
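That manual judgement usually comes down to the member-status field at the end of each array's "blocks" line: [UU] for a healthy two-disk mirror, with an underscore marking a missing or failed member. Here is a minimal sketch of automating that inspection, assuming the usual /proc/mdstat layout. The function name mdstat_degraded and its optional file argument are made up for illustration, and this is a different technique from the snapshot comparison used in the steps below.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original how-to): list md arrays whose
# member-status field, e.g. [UU], contains "_" (a missing or failed member).
# Usage: mdstat_degraded [FILE]   (FILE defaults to /proc/mdstat)
# Returns 0 and prints the array names if any array is degraded, 1 otherwise.
mdstat_degraded() {
    local file="${1:-/proc/mdstat}"
    awk '
        /^md[0-9]+ :/ { dev = $1 }      # remember the current array name
        /\[[U_]+\]/ {                   # status field such as [UU] or [U_]
            if (match($0, /\[[U_]+\]/) && index(substr($0, RSTART, RLENGTH), "_")) {
                print dev
                found = 1
            }
        }
        END { exit(found ? 0 : 1) }
    ' "$file"
}
```

Unlike a byte-for-byte comparison, a check of this kind is not tripped by the progress line that /proc/mdstat shows during a resync, but it also notices fewer kinds of change.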

After pondering and wondering how I could do it with Zenoss, I came to the following solution.

  1. I verified that md0 was in the correct state and then recorded that state in the file /etc/snmp/mdstat_correct by simply doing:

cat /proc/mdstat > /etc/snmp/mdstat_correct
  2. I created a short script, /etc/snmp/check_md.sh, to compare the current state to the recorded state:
#!/bin/bash
# Compare the current /proc/mdstat with the previously recorded known-good copy.
# Exit status: 0 if they match, non-zero otherwise.
# On a mismatch, print the current state so that it shows up in the SNMP output.
diff /proc/mdstat /etc/snmp/mdstat_correct > /dev/null
RET=$?
if [ "$RET" -ne 0 ]; then
    cat /proc/mdstat
fi
exit "$RET"
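To rehearse the check without touching the live baseline file, the same comparison can be wrapped in a function that takes both files as arguments. This is only a sketch: the function name check_md and its parameters are illustrative, and it swaps in cmp -s for diff as a silent byte-for-byte comparison.

```shell
#!/bin/bash
# Hypothetical variant of check_md.sh: same logic, but the file paths are
# parameters so it can be exercised against copies instead of the live files.
# Usage: check_md [CURRENT] [BASELINE]
check_md() {
    local current="${1:-/proc/mdstat}"
    local baseline="${2:-/etc/snmp/mdstat_correct}"
    local ret=0
    # cmp -s writes nothing for differing files; status 0 means identical
    cmp -s "$current" "$baseline" || ret=$?
    if [ "$ret" -ne 0 ]; then
        cat "$current"    # show the current state on a mismatch
    fi
    return "$ret"
}
```

Called with two scratch files, it returns 0 when they are identical; otherwise it returns non-zero and prints the first file.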
  3. After checking that the script's permissions were correct and that it worked, I configured the Linux SNMP daemon (snmpd) to include my new script in its output. To do that, I added the following line to /etc/snmp/snmpd.conf:
exec md_check /etc/snmp/check_md.sh
  4. To check that the script also produced the right result over SNMP, I ran:
snmpwalk -On -v 2c -c public servername 1.3.6.1.4.1.2021.8

The result in case of an incorrect state was something like the following (I altered the recorded state file to simulate the error, so the check would fail even though the md0 device was OK):

.1.3.6.1.4.1.2021.8.1.1.1 = INTEGER: 1
.1.3.6.1.4.1.2021.8.1.2.1 = STRING: md_check
.1.3.6.1.4.1.2021.8.1.3.1 = STRING: /etc/snmp/check_md.sh
.1.3.6.1.4.1.2021.8.1.100.1 = INTEGER: 1
.1.3.6.1.4.1.2021.8.1.101.1 = STRING: Personalities : [raid1]
md0 : active raid1 sda7[0] sdb7[1]
150046976 blocks [2/2] [UU]

unused devices: <none>
.1.3.6.1.4.1.2021.8.1.102.1 = INTEGER: 0
.1.3.6.1.4.1.2021.8.1.103.1 = STRING:

Great, exactly what it should be. When there was no error, the output looked like this:

.1.3.6.1.4.1.2021.8.1.1.1 = INTEGER: 1
.1.3.6.1.4.1.2021.8.1.2.1 = STRING: md_check
.1.3.6.1.4.1.2021.8.1.3.1 = STRING: /etc/snmp/check_md.sh
.1.3.6.1.4.1.2021.8.1.100.1 = INTEGER: 0
.1.3.6.1.4.1.2021.8.1.101.1 = STRING:
.1.3.6.1.4.1.2021.8.1.102.1 = INTEGER: 0
.1.3.6.1.4.1.2021.8.1.103.1 = STRING:
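Of all those rows, only the one ending in .100.1 (the script's exit status) matters for alerting. Here is a sketch that pulls that value out of snmpwalk -On output read from standard input. The function name ext_result is hypothetical, and it assumes the numeric output format shown above.

```shell
#!/bin/bash
# Hypothetical helper: print the exit status reported for exec entry N
# (default 1) from "snmpwalk -On" output supplied on standard input.
# Returns 1 if no matching INTEGER row is found.
ext_result() {
    local index="${1:-1}"
    awk -v oid=".1.3.6.1.4.1.2021.8.1.100.${index}" '
        $1 == oid && $3 == "INTEGER:" { print $4; found = 1; exit }
        END { exit(found ? 0 : 1) }
    '
}
```

It could be fed directly from the command used earlier, for example: snmpwalk -On -v 2c -c public servername 1.3.6.1.4.1.2021.8 | ext_result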
  5. From there I needed to see whether it could somehow be attached to Zenoss and raise an event whenever the state was wrong. The only thing I could find that was sufficiently configurable was the perfconf section.

    So I added a new Datasource named "md_dev_OK" and set its OID to 1.3.6.1.4.1.2021.8.1.100.1 (source type=SNMP, Enabled=true). Under md_dev_OK I added a DataPoints entry named "md_check" and left it with the defaults.

  6. Now the only thing left to do was to configure a Threshold under perfconf. Since my datasource would only ever provide 0 or 1 (see the snmpwalk results above), it was pretty straightforward: I selected "md_dev_OK_md_check" as the data source and set Max Value to 0, so that any value above 0 raises an event (Min Value left empty, Event Class changed to /Perf/Filesystem, severity=Error, Enabled=true).

And... voilà! It started working. As soon as I simulated the error, I got the event in Zenoss. After changing the recorded state back to the real one, everything was error-free again in Zenoss as well.

--Kaido Lepisto <kaidol at hot dot ee>
