How to Monitor a Software RAID device
by zenoss, last modified 2007-06-08 14:47
Instructions from the mailing list on how to write a script, include it in snmp output, and add the datapoint to Zenoss. Generally useful!
I needed to monitor the state of a Linux mdX device (/dev/md0, a
software RAID array) because the server it was on had no access to an
e-mail server, so the normal way of configuring mdmonitor to send
alerts would not work. I also wanted this aspect of monitoring to be
connected to Zenoss. A week earlier I had had the nasty surprise of
discovering that one of the causes of server lag was a broken md0 device
that no one had noticed.
Usually the state evaluation of md-devices is done manually by simply looking at the file /proc/mdstat and judging from there.
After pondering and wondering how I could do it with Zenoss, I came to the following solution.
- I verified that md0 was in the correct state and then recorded that state in the file
/etc/snmp/mdstat_correct by simply doing:
cat /proc/mdstat > /etc/snmp/mdstat_correct
- I created a short script (saved as /etc/snmp/check_md.sh) to compare the current state to the recorded state:
#!/bin/bash
## Compares /proc/mdstat to the previously recorded version.
## Exits 0 on a match; on a mismatch, prints the current state and exits non-zero.
diff /proc/mdstat /etc/snmp/mdstat_correct > /dev/null
RET=$?
if [ "$RET" -ne 0 ]; then
    cat /proc/mdstat
fi
exit "$RET"
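One limitation of comparing against a recorded snapshot is that any change in /proc/mdstat, such as the progress line shown during a resync, also counts as a mismatch. As an alternative (not the method used in this how-to), the sketch below looks only for an underscore in the member status field, for example [U_] instead of [UU]. The function names md_degraded and md_check_main and the optional file argument are illustrative assumptions, added so the check can be tried against a saved copy of mdstat.

```shell
#!/bin/bash
# Sketch: detect a degraded md array from an mdstat-format file.
# A healthy two-disk mirror shows [UU]; a missing member shows as "_".

md_degraded() {
    local mdstat_file="${1:-/proc/mdstat}"
    # Match a bracketed run of U/_ characters that contains at least one "_".
    grep -Eq '\[[U_]*_[U_]*\]' "$mdstat_file"
}

# Same convention as the diff-based script: return 0 when healthy,
# otherwise print the current state and return 1.
md_check_main() {
    local mdstat_file="${1:-/proc/mdstat}"
    if md_degraded "$mdstat_file"; then
        cat "$mdstat_file"
        return 1
    fi
    return 0
}
```

Used as an snmpd exec script, the file would end with a line such as `md_check_main; exit $?` so that the return value becomes the script's exit code.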
- After checking that the script permissions were correct and that it
worked as expected, I configured the Linux snmp daemon to include my
new script in its output. For that I added a line to
/etc/snmp/snmpd.conf as follows:
exec md_check /etc/snmp/check_md.sh
- To check that the script also produced the right result over SNMP, I ran:
snmpwalk -On -v 2c -c public servername 1.3.6.1.4.1.2021.8
The result in the case of an incorrect state was something like the
following. (I altered the recorded state file to simulate the error, so
the check would fail even though the md0 device was OK.)
.1.3.6.1.4.1.2021.8.1.1.1 = INTEGER: 1
.1.3.6.1.4.1.2021.8.1.2.1 = STRING: md_check
.1.3.6.1.4.1.2021.8.1.3.1 = STRING: /etc/snmp/check_md.sh
.1.3.6.1.4.1.2021.8.1.100.1 = INTEGER: 1
.1.3.6.1.4.1.2021.8.1.101.1 = STRING: Personalities : [raid1]
md0 : active raid1 sda7[0] sdb7[1]
150046976 blocks [2/2] [UU]
unused devices: <none>
.1.3.6.1.4.1.2021.8.1.102.1 = INTEGER: 0
.1.3.6.1.4.1.2021.8.1.103.1 = STRING:
Great, exactly what it should be. When there was no error, the output looked like this:
.1.3.6.1.4.1.2021.8.1.1.1 = INTEGER: 1
.1.3.6.1.4.1.2021.8.1.2.1 = STRING: md_check
.1.3.6.1.4.1.2021.8.1.3.1 = STRING: /etc/snmp/check_md.sh
.1.3.6.1.4.1.2021.8.1.100.1 = INTEGER: 0
.1.3.6.1.4.1.2021.8.1.101.1 = STRING:
.1.3.6.1.4.1.2021.8.1.102.1 = INTEGER: 0
.1.3.6.1.4.1.2021.8.1.103.1 = STRING:
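The value that matters in both outputs is .1.3.6.1.4.1.2021.8.1.100.1, the exit code of the script (0 for a match, non-zero for a mismatch). A minimal sketch of fetching and interpreting just that value from the command line follows. The function names and the status words OK, MISMATCH and UNKNOWN are illustrative assumptions; the host name and community string are placeholders.

```shell
#!/bin/bash
# Sketch: read the exit code of the first exec entry and map it to a status.
EXT_RESULT_OID="1.3.6.1.4.1.2021.8.1.100.1"

# Map the integer returned for the OID to a short status word.
interpret_ext_result() {
    case "$1" in
        0) echo "OK" ;;
        ''|*[!0-9]*) echo "UNKNOWN"; return 2 ;;   # empty or non-numeric reply
        *) echo "MISMATCH"; return 1 ;;            # any non-zero exit code
    esac
}

# Query one host. -Oqv makes snmpget print only the value.
check_host() {
    local host="$1" community="${2:-public}" value
    value=$(snmpget -v 2c -c "$community" -Oqv "$host" "$EXT_RESULT_OID") \
        || { echo "UNKNOWN"; return 2; }
    interpret_ext_result "$value"
}
```

For example, `check_host servername public` would print OK while the recorded and current states match.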
- From there on I needed to see if it could somehow be attached to
Zenoss and become an event whenever a wrong state occurred. The only
thing that I could find sufficiently configurable was the perfconf
section.
So I added a new Datasource, naming it "md_dev_OK" and setting its
OID to 1.3.6.1.4.1.2021.8.1.100.1 (source type=SNMP, Enabled=true).
Under md_dev_OK I added a DataPoints entry named "md_check" and left it with the defaults.
- Now the only thing left to do was to configure a Threshold under
perfconf. Since my datasource would only ever provide a 0 or a 1
(see the snmpwalk results above), it was pretty straightforward. I
marked "md_dev_OK_md_check" as the data source and set Max Value to 0
(Min Value left empty, Event Class changed to /Perf/Filesystem,
severity=Error, Enabled=true).
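The effect of that threshold can be sketched in a few lines, assuming that an event is raised whenever the collected value is greater than Max Value. The function name threshold_breached is hypothetical and only illustrates the rule; it is not part of Zenoss.

```shell
#!/bin/bash
# Sketch of the threshold rule: with Max Value 0 and no Min Value,
# a datapoint value above 0 counts as a breach.
threshold_breached() {
    local value="$1" max="${2:-0}"
    [ "$value" -gt "$max" ]
}

# Datapoint 0 (states match): no breach. Datapoint 1 (mismatch): breach.
for value in 0 1; do
    if threshold_breached "$value"; then
        echo "value=$value: event raised"
    else
        echo "value=$value: no event"
    fi
done
```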
And... voila! It started working. As soon as I simulated the error, I got the event in Zenoss.
After changing the recorded state back to the real one, everything was error-free again in Zenoss as well.
--Kaido Lepisto <kaidol at hot dot ee>