This blog post is a summary of the white paper “Exploring AIOps: Cluster Analysis for Events.” You can download the full white paper here.
AIOps, i.e., artificial intelligence for IT operations, has become the latest strategy du jour in the IT operations management space to help address and better manage the growing complexity and extreme scale of modern IT environments. AIOps enables some unique and new capabilities on this front, though it is more nuanced than the panacea it is sometimes made out to be. However, the underlying AI and machine learning (ML) concepts do help complement, supplement and, in particular cases, even supplant more traditional approaches to handling typical IT Ops scenarios at scale. Our recent blog post, "The Truth About AIOps," articulates the Zenoss perspective on the applicability of AIOps in terms of the foundational capabilities that are typically relevant to an infrastructure and operations (I&O) management portfolio.
So, how does one apply and leverage AIOps to better manage I&O environments and activities in practice? Gartner, in their Market Guide for AIOps Platforms, recommends an incremental approach. Namely, start small with less critical applications, and apply the more straightforward AIOps aspects, such as categorization, correlation and anomaly detection, to start deriving value and to drive better business outcomes for the use cases under consideration.
An AIOps platform has to ingest and deal with multiple types of data to develop a comprehensive understanding of the state of the managed domain(s) and to better discern the push and pull of diverse trends in the environment, both overt and subtle, that may destabilize critical business outcomes. In this blog post, we will take a look at an AIOps approach to handling one of the fundamental data types: events.
An event is a record of a notable change in state in the environment. This could be a new service or device coming online, the measured state of a component or resource breaching a threshold, an application consistently failing to connect to an external service — the possibilities are endless. Also, even though some events are one-offs, most events are re-emitted on the event source's evaluation cycle for as long as the underlying state persists. This translates to a large volume of events from even a small managed environment and potentially orders of magnitude more from data center and cloud-scale environments.
Cluster analysis, or clustering, is a core unsupervised learning technique that attempts to partition the data points in a dataset into groups, or clusters, such that data points in the same cluster are similar to one another while being sufficiently different from data points in any other cluster.
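To make the idea concrete, here is a minimal sketch of one classic clustering algorithm, k-means, in pure Python. This is an illustration of the general technique only — the white paper does not prescribe a specific algorithm, and the function and data below are hypothetical:

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """A minimal k-means sketch: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its members,
    repeating for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(
                    sum(c) / len(members) for c in zip(*members)
                )
    return clusters

# Two well-separated groups of 2-D points.
points = [(0.9, 1.1), (1.0, 0.9), (1.1, 1.0),
          (9.0, 9.1), (9.2, 8.9), (8.8, 9.0)]
clusters = kmeans(points, k=2)
```

On data like this, the two clusters recover the two original groups; real event data is higher-dimensional and messier, but the "similar within, different between" intuition is the same.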
Cluster analysis can be applied to a number of different event use cases. A simple case is an alternative to, or an augmentation of, the traditional event deduplication approach, wherein individual instances of the same duration-based event are grouped into the same cluster. In this case, the features used for the analysis include the event summary and other context fields that help identify the underlying condition, as well as the time stamp. The clusters represent event durations, and the members represent event instances recording updates for that particular episode. A summary event for a cluster can represent the event episode referenced by all of the instances in that cluster, helping to reduce event volume.
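The deduplication idea above can be sketched as a simple time-windowed grouping: instances that share the identifying context fields and arrive within a gap threshold of the previous instance join the same episode. The `Event` fields, the `window` parameter, and the summarization format here are all illustrative assumptions, not the white paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Event:
    device: str       # context field identifying the source (assumed)
    summary: str      # event summary text (assumed)
    timestamp: float  # seconds since epoch

def cluster_events(events, window=300.0):
    """Group event instances into episode clusters. Instances with the
    same (device, summary) key whose timestamps fall within `window`
    seconds of the cluster's latest member join that cluster; a larger
    gap starts a new episode."""
    clusters = []
    last_open = {}  # (device, summary) -> index of most recent cluster
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.device, ev.summary)
        idx = last_open.get(key)
        if idx is not None and ev.timestamp - clusters[idx][-1].timestamp <= window:
            clusters[idx].append(ev)
        else:
            clusters.append([ev])
            last_open[key] = len(clusters) - 1
    return clusters

def summarize(cluster):
    """Emit one summary line per episode, collapsing repeated instances."""
    first, last = cluster[0], cluster[-1]
    return (f"{first.device}: {first.summary} "
            f"({len(cluster)} instances, t={first.timestamp:.0f}-{last.timestamp:.0f})")

events = [
    Event("router1", "CPU high", 0),
    Event("router1", "CPU high", 60),
    Event("router1", "CPU high", 120),   # same episode as the first two
    Event("router1", "CPU high", 1000),  # gap > window: a new episode
    Event("switch2", "link down", 30),   # different condition entirely
]
episodes = cluster_events(events)
summaries = [summarize(c) for c in episodes]
```

Five raw instances collapse into three summary events, one per episode — the volume-reduction effect the deduplication use case is after. A production implementation would cluster on richer features (and fuzzier similarity) than an exact key match, but the episode/instance structure is the same.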
To read more, download the full white paper: "Exploring AIOps: Cluster Analysis for Events."