The Capacity ZenPack adds support for managing CPU, memory, network, and storage capacity across any managed resources.
The following features are available to help you understand historical, current, and projected future capacity usage across any systems managed by Zenoss. They are intended to address the following use cases.
The Capacity Usage report found in the Capacity Planning folder, located in the Reports section of Zenoss, provides top 10 capacity usage across the entire environment.
The top 10 are selected based on one of the following summary functions (Sort By).
Note: Projected Breach and Projected Maximum are based on the daily maximum values over the last 90 days.
The top 10 can be filtered by the following criteria.
The Capacity Usage View is added for all devices and components in the system. It looks similar to the Capacity Usage Report, but its function is different.
The only configuration available on the view is to decide whether you want to look at daily average, maximum, or minimum values. There's no need to choose the capacity type because all available types of capacity data for the device or component will be shown. Similarly, there is no need to specify a sort or search because there can be at most one line on the chart for each type of capacity.
Note: The device-level view only shows capacity data for the device, and not for any of the device's components. For this reason you won't see anything on the device-level view on device types like vCenters or UCS Managers that only have component-level capacity data.
A new CapacityThreshold threshold type is what allows capacity-specific information to be identified in data already being monitored by Zenoss.
The CapacityThreshold type of threshold has the following configuration options in common with most other types of thresholds.
The following additional configuration options must also be set.
Once configured, capacity thresholds can create two different types of events: current breaches, and projected breaches.
Current breach events can be created every time the threshold's configured datapoints are collected. This is comparable to how most other types of thresholds work. No historical or projected future data is considered. An event will be created if the current used value is a higher percentage of the total value than the configured percent threshold.
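The current breach check described above can be sketched in a few lines of Python. This is only an illustration of the comparison, not the ZenPack's code. The function name is_current_breach and the handling of a zero total are assumptions made for this example.

```python
def is_current_breach(used, total, percent_threshold):
    """Return True when used is a higher percentage of total than the threshold."""
    if total <= 0:
        # Assumption: a missing or zero total cannot be evaluated, so report no breach.
        return False
    used_percent = used * 100.0 / total
    return used_percent > percent_threshold


GIB = 1024 ** 3
# 46 GiB used of 50 GiB is 92%, which is above a 90% threshold.
print(is_current_breach(46 * GIB, 50 * GIB, 90.0))
# 40 GiB used of 50 GiB is 80%, which is below a 90% threshold.
print(is_current_breach(40 * GIB, 50 * GIB, 90.0))
```

Only the most recently collected used and total values are needed, which is why this kind of event works even when no history exists yet.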
Current breach events will have the following standard event fields set as follows.
Current breach events will also have the following additional details.
Projected breach events are created once per day (at midnight UTC) by the Capacity service. The last 90 days of historical values are used to project when the current trend will breach the configured threshold. A projected breach event will be created if the projected maximum value exceeds the configured threshold within the next 90 days.
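To make the idea of a projected breach concrete, here is a minimal sketch that fits a straight line to daily maximum values and looks for the first day in the next 90 on which the line exceeds the threshold. It uses an ordinary least-squares fit, which is an assumption; the ZenPack's own projection algorithm is custom and may differ. The names fit_line and days_until_breach are hypothetical.

```python
def fit_line(values):
    """Least-squares line through the points (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / float(n)
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_breach(daily_max_percent, percent_threshold, horizon_days=90):
    """Days after the latest sample until the trend exceeds the threshold, or None."""
    if len(daily_max_percent) < 2:
        return None  # a line cannot be fitted to fewer than two points
    slope, intercept = fit_line(daily_max_percent)
    last_day = len(daily_max_percent) - 1
    for ahead in range(1, horizon_days + 1):
        projected = slope * (last_day + ahead) + intercept
        if projected > percent_threshold:
            return ahead
    return None


# Hypothetical history: usage starts at 40% and grows 0.3 points per day for 90 days.
history = [40.0 + 0.3 * day for day in range(90)]
print(days_until_breach(history, 90.0))
```

A flat or falling history returns None in this sketch, which corresponds to no projected breach event being created.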
Projected breach events will have the following standard event fields set.
Projected breach events will also have the following additional details.
This section covers how to use the Capacity ZenPack.
After installing the Capacity ZenPack there are a few things you should know.
If you attempt to use the Capacity Usage Report or Capacity Usage View immediately after installing the ZenPack, you will likely receive a warning of No Capacity Data Available. This is normal, and it is likely occurring because you need up to 24 hours of capacity data in the system before these views can work.
If you still see this warning after the ZenPack has been installed for 24 hours, there are some other reasons for it.
Once some capacity data is available, it is possible that you will submit a configuration for which no results exist.
This "Expecting Data?" result appears only on the Capacity Usage Report, not on any Capacity Usage View, because it's possible that the criteria configured on the report don't match any resources for which capacity data exists.
Some specific reasons why you may get this result.
Within the first five days of using Capacity you will find that the chart on the report and views will appear, but there will be no projections.
You will also see that the 90th Day Projection column in the table indicates Insufficient Data. This is because there is a built-in limit that prevents projecting any future values until at least five days of historical data exist.
Note: The amount of history available depends only on when capacity thresholds were added to datapoints. Having 90 days of CPU, memory, network, or storage datapoints prior to installing the Capacity ZenPack, or prior to adding capacity thresholds for those datapoints, doesn't count. The reason is the data normalization performed by the Total Expression and Used Expression of the threshold. Previously existing datapoint values are not known to be normalized into used and total values in the correct native units, or to a used percentage of total.
One last unexpected thing you may see in the report and views is Unpredictable Data.
To avoid making useless projections that are extremely unlikely to come to pass, we have added a variance limit. If the past 90 days of historical data vary too drastically, we will not attempt to project future values. You will see this on the charts as lines that don't extend beyond Now, and it is noted in the 90th Day Projection column of the table as Unpredictable Data.
Note: As of version 1.0.2, variance limits are disabled.
Above we have covered several reasons why you won't see any data, or any projected data on the report and views. You may be wondering whether or not you will receive capacity events when these conditions exist.
The answer is that you will get events for current capacity threshold breaches in all of these situations. However, you will not receive events for projected breaches in the No Data, Insufficient Data, or Unpredictable Data cases.
Most of a threshold's configuration fields, such as datapoints, severity, and event class, are the same as for any other type of threshold. So let's focus on what's different in a CapacityThreshold.
Capacity-specific threshold configuration properties.
Capacity Type must be one of: cpu, memory, network, or storage. Each of these types has one specific native unit.
Native units by capacity type.
Used Expression is a Python expression that can use any datapoint selected in the DataPoints field, and any modeled properties on the resource (here) to return the current amount of used capacity in the capacity type's native unit.
CPU (vSphere Host)
cpuUsage_cpuUsage * 0.01
In this case we have to multiply vSphere's cpuUsage datapoint by 0.01 to convert it to the CPU capacity type's native unit of percentage. vSphere's cpuUsage datapoint will be 0 for completely idle, and 10,000 for completely busy.
This is the simplest possible example. The mem_MemUsed datapoint is already collected in the correct native units for the memory capacity type: bytes.
Network Interface (SNMP)
max(ifHCInOctets_ifHCInOctets, ifHCOutOctets_ifHCOutOctets) * 8
We take the maximum of two datapoints because we're assuming a full-duplex interface that can receive and transmit simultaneously at up to the interface's maximum speed. We then multiply that maximum by 8 to convert it from the bytes/sec collected by those datapoints to the bits/sec required by the network capacity type's native unit.
Storage (EMC VNX File Storage Pool)
FileStoragePool_usedSize * 1048576
We must multiply the FileStoragePool_usedSize datapoint value by 1,048,576 (1,024 * 1,024) to convert it from the collected megabytes units to bytes.
Total Expression is a Python expression that can use any datapoint selected in the DataPoints field, and any modeled properties on the resource (here) to return the current amount of total capacity in the capacity type's native unit.
A simple example that is almost always the case when it comes to CPU thresholds. Since CPU's native type is a 0-100 percentage, the total should always be 100.
Servers have their total amount of memory modeled in the here.hw.totalMemory property. It's already in the proper native units: bytes. So we can use it directly as the total expression.
here.speed or 1e9
We use the modeled speed property of the interface which happens to already be in the correct units of bits/sec. For safety's sake we add or 1e9 to default to a 1Gbps value in cases where the interface doesn't have its speed property modeled.
FileStoragePool_size * 1048576
We must multiply the FileStoragePool_size datapoint value by 1,048,576 (1,024 * 1,024) to convert it from the collected megabytes units to bytes.
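The following sketch shows how the network used and total expressions above behave when evaluated as plain Python. It is not how Zenoss evaluates threshold expressions internally. The evaluate helper, the sample datapoint values, and the stand-in interface object are all hypothetical and exist only to show the arithmetic and the "or 1e9" fallback.

```python
from types import SimpleNamespace


def evaluate(expression, datapoints, here):
    """Evaluate a threshold-style expression against datapoint values and a resource."""
    namespace = dict(datapoints)
    namespace["here"] = here
    namespace["max"] = max
    namespace["min"] = min
    # eval() is used here only for illustration with trusted, hard-coded expressions.
    return eval(expression, {"__builtins__": {}}, namespace)


# Hypothetical interface whose speed property was not modeled.
interface = SimpleNamespace(speed=None)
datapoints = {
    "ifHCInOctets_ifHCInOctets": 25000000.0,    # bytes/sec received
    "ifHCOutOctets_ifHCOutOctets": 40000000.0,  # bytes/sec transmitted
}

used = evaluate(
    "max(ifHCInOctets_ifHCInOctets, ifHCOutOctets_ifHCOutOctets) * 8",
    datapoints,
    interface,
)
total = evaluate("here.speed or 1e9", datapoints, interface)

# Both results are in bits/sec, so they can be compared as a percentage.
print(used, total, used * 100.0 / total)
```

Because the stand-in interface has no speed, the total falls back to 1e9 bits/sec, and 40 MB/sec of traffic works out to 32 percent of a 1Gbps link.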
See Installed Thresholds for more examples of used and total expressions.
Percent Threshold is a percentage value between 0.0 and 100.0. When the result of Used Expression becomes more than this percent of the result of Total Expression, a threshold exceeded event will be created. This is the threshold used both for current and projected threshold breaches.
Note: As of version 1.0.2, polynomial projections are disabled. Only linear trends are supported.
This ZenPack uses a custom projection algorithm both for projecting when capacity thresholds will be exceeded for eventing purposes, and for plotting the projected capacity usage on the Capacity Usage Report and Capacity Usage View.
The projection algorithm used will vary for every datapoint in the system to which a capacity threshold is applied. Each datapoint's historical values will be analyzed to determine whether a linear or polynomial function best predicted historical values. That "best fit" algorithm will then be used to project future values.
An example of a case where a linear function will be used is storage capacity of a large storage pool that grows in a predictable fashion at around 100GB per day. A linear algorithm will project this 100GB per day growth to continue into the future.
A polynomial example would be a more realistic case where the same storage pool starts out growing at 100GB per day, but over time the growth accelerates until 150GB, 200GB, or 500GB are added per day. A properly fit polynomial algorithm will be able to take the acceleration of growth into account and project the capacity to be exceeded at a sooner, more accurate date.
The Capacity Usage Report - CPU screenshot above shows an example where the red line was best fit by a polynomial function, and all other lines were best fit by a linear function.
Note: A linear function will always be used when there are fewer than 12 days of history for a datapoint. This is because polynomial functions typically require more history to provide better projections. No projection will be attempted when there are fewer than 5 days of history for a datapoint.
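The history requirements in the note above can be summarized as a small decision function. This is a sketch of the stated rules only, with hypothetical names. It does not perform any fitting, and it ignores the fact that polynomial projections are disabled as of version 1.0.2.

```python
MIN_DAYS_FOR_PROJECTION = 5   # fewer days than this: no projection is attempted
MIN_DAYS_FOR_POLYNOMIAL = 12  # fewer days than this: only a linear function is used


def candidate_models(days_of_history):
    """Return the projection functions that may be considered for a datapoint."""
    if days_of_history < MIN_DAYS_FOR_PROJECTION:
        return []
    if days_of_history < MIN_DAYS_FOR_POLYNOMIAL:
        return ["linear"]
    # With enough history, whichever function best fits the past would be chosen.
    return ["linear", "polynomial"]


for days in (3, 5, 11, 12, 90):
    print(days, candidate_models(days))
```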
Zenoss systems with many capacity threshold instances may benefit from some specific tuning.
The Capacity service does almost all of its work once per day, just after midnight UTC. During this time the previous 90 days of history are queried, the index that allows the Capacity Usage report to work is built, and projected threshold exceeded events are sent.
If you see that this work is taking longer than desired, it can be expedited by increasing the number of instances for the Capacity service. Generally you should find that doubling the instances cuts processing time in half, doubles the CPU usage for the duration, and marginally increases the total memory usage of the Capacity service.
In some cases you may see OpenTSDB or HBase timeout warnings and errors in the Capacity service's logs, or potentially in the UI when viewing capacity reports and views. These can be avoided by configuring a longer HBase scanner timeout.
To configure a longer HBase scanner timeout, you must add the following property to the /etc/hbase-site.xml configuration file of the HMaster and RegionServer services.
This raises the timeout to 3 minutes from its default value of 1 minute.
This ZenPack has the following known limitations.
There is a maximum limit of approximately 1,048,576 capacity threshold instances in a single Zenoss system. A threshold instance is one configured threshold applied to one device or component.
As an example, let's say you have the following capacity thresholds configured.
Now let's say that you have the following counts of these resources in your system.
Now we do the math to figure out how many capacity thresholds instances this will be.
When creating capacity thresholds it is most important to think about cases where they are applied to components. As you can see, increasing the number of filesystems from 4 to 20 per device would have a much larger impact on the total number of threshold instances than increasing the total devices from 11,000 to 55,000.
Note: The 1,048,576 limit is a function of the partitioning of capacity metrics in OpenTSDB, and OpenTSDB's configured tsd.query.filter.expansion_limit value. The Capacity ZenPack hard-codes the number of partitions to 256 for query performance reasons, and the default value of tsd.query.filter.expansion_limit is 4,096. If more than 1,048,576 capacity threshold instances are required, it is possible to increase the limit by increasing OpenTSDB's tsd.query.filter.expansion_limit value. Doubling the value doubles the limit. However, this will result in queries that will take OpenTSDB longer to process.
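The arithmetic behind the limit in the note above is simply the number of partitions multiplied by OpenTSDB's expansion limit. A short sketch, using a hypothetical function name, shows the default and the effect of doubling the setting.

```python
CAPACITY_PARTITIONS = 256  # hard-coded by the Capacity ZenPack, per the note above


def max_threshold_instances(expansion_limit=4096):
    """Approximate instance limit for a given tsd.query.filter.expansion_limit."""
    return CAPACITY_PARTITIONS * expansion_limit


print(max_threshold_instances())      # 1048576 with the default of 4,096
print(max_threshold_instances(8192))  # 2097152 after doubling the setting
```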
What can be done when something unexpected or undesirable happens?
You may notice that values shown on the Capacity Usage report, Capacity Usage View, and in capacity threshold events don't seem to match what should be their corresponding values on the device and component graphs of CPU, memory, network throughput, or storage usage.
These discrepancies are typically the result of the aggregation, or "downsampling" performed by the Capacity ZenPack. All values shown on capacity charts, tables, or in projected capacity exceeded events have been "downsampled" into 24 hour buckets that begin at 00:00:00 UTC time, and end at 23:59:59 UTC time each day. For each one of these buckets of time, Capacity will track the average, minimum, and maximum value. This differs from the approach taken by the normal device and component graphs in Zenoss. Those graphs use varying periods for downsampling depending on how far you're zoomed in or out.
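The daily downsampling described above can be illustrated with a short sketch that groups raw samples into UTC days and keeps the average, minimum, and maximum for each day. The function name and the sample values are hypothetical, and this is not the Capacity service's actual implementation.

```python
from collections import defaultdict
from datetime import datetime, timezone


def downsample_daily(samples):
    """Group (unix_timestamp, value) samples into UTC days with avg, min, and max."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Each bucket covers 00:00:00 through 23:59:59 UTC of one calendar day.
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
        buckets[day].append(value)
    return {
        day: {"avg": sum(vals) / len(vals), "min": min(vals), "max": max(vals)}
        for day, vals in sorted(buckets.items())
    }


start = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp()
samples = [
    (start, 10.0),          # 00:00:00 UTC on the first day
    (start + 43200, 30.0),  # 12:00:00 UTC on the first day
    (start + 86399, 20.0),  # 23:59:59 UTC on the first day
    (start + 86400, 50.0),  # 00:00:00 UTC on the second day
]
for day, stats in downsample_daily(samples).items():
    print(day, stats)
```

A graph zoomed in to a single hour would show the individual samples, while the capacity views would show only one average, minimum, or maximum per day, which is why the two can differ and both be correct.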
In general you should find that the values are roughly similar, and follow a shared shape and magnitude. It may seem strange, but it is likely the case that even though the values can be different, they're both correct.
It is recommended to use capacity management features primarily for longer-term planning when minute-by-minute changes in values are irrelevant, and drill down into the standard graphs when more precise information is required.
The following items will be installed by this ZenPack. Unless otherwise specified, these will also be removed from the system if the ZenPack is removed.
A capacity facade is added for internal Python access, and a corresponding CapacityRouter (capacity_router) endpoint is added for external access. Both make the following methods available.
This ZenPack installs the following event classes. None of the following event classes will be removed from the system if the ZenPack is removed.
This ZenPack installs the following reports.
This ZenPack installs the following services.
This ZenPack installs some, or all, of the following thresholds depending on whether or not the associated ZenPack is also installed on the system. This is done to kick-start your ability to do capacity management without having to wait for future versions of the associated ZenPacks to come with their own capacity thresholds pre-configured.
All of these thresholds are of the CapacityThreshold type, and have their Percent Threshold set to 90.
max(interface_eth_inbytes, interface_eth_outbytes) * 8
int(here.operSpeed or 0) or 1e10
max(etherRxStats_totalBytes, etherTxStats_totalBytes) * 8
max(fcStats_bytesRx, fcStats_bytesTx) * 8
memConsumed_memConsumed * 1024
(here.linkSpeed or 1000) * 1000000
max(nicRx_nicRx, nicTx_nicTx) * 8192
max(ifNet_ifInOctets, ifNet_ifOutOctets) * 8
PrimordialStoragePool_TotalManagedSpace - PrimordialStoragePool_RemainingManagedSpace
DeviceStoragePool_TotalManagedSpace - DeviceStoragePool_RemainingManagedSpace
int(here.totalManagedSpace) - UnifiedStoragePool_RemainingManagedSpace
VirtualProvisioningPool_TotalManagedSpace - VirtualProvisioningPool_RemainingManagedSpace
This ZenPack installs the following threshold types.
This ZenPack is developed and supported by Zenoss Inc. Commercial ZenPacks are available to Zenoss commercial customers only. Contact Zenoss to request more information regarding this or any other ZenPacks. Click here to view all available Zenoss Commercial ZenPacks.