Artificial Intelligence for IT Operations – AIOPS

By Pandian Ramaiah February 27, 2024

We collect metrics, events, logs and traces (MELT) to attain observability. When we get an event, we assign it to its rightful owner, who in turn takes a decision to resolve it.

Let’s take a simple operation to understand this. Assume a server is running at 99% CPU.

Metrics collection identifies the high CPU usage.
The observability platform sends a notification to the IT Service Management solution (or emails the user directly).
Let’s assume the problem is assigned to an application owner, on the assumption that it is an application problem.
The application owner checks the observability platform for the root cause. It reports that the high CPU utilization comes from the web server.
The owner restarts the service and, let’s assume, the problem is resolved.
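The steps above can be pictured as a small detect-and-route pipeline. The following is only a minimal sketch: the 95% threshold, the `Event` fields, the host name `web-01` and the process-to-owner mapping are all hypothetical values chosen for illustration, not part of any particular observability product.

```python
from dataclasses import dataclass

CPU_THRESHOLD = 95.0  # assumed static threshold, in percent


@dataclass
class Event:
    host: str
    metric: str
    value: float
    owner: str = "unassigned"


def detect(host, cpu_percent):
    """Return an Event when CPU usage crosses the threshold, else None."""
    if cpu_percent >= CPU_THRESHOLD:
        return Event(host=host, metric="cpu_percent", value=cpu_percent)
    return None


def assign(event, top_process):
    """Route the event to an owner based on the top CPU consumer.

    The mapping below is a made-up example of a routing rule.
    """
    owners = {"httpd": "application-owner", "kernel": "infrastructure-owner"}
    event.owner = owners.get(top_process, "operations")
    return event


event = detect("web-01", 99.0)
if event is not None:
    event = assign(event, top_process="httpd")
    print(f"{event.host}: {event.metric}={event.value} -> {event.owner}")
```

In a real setup the routing rule would live in the ITSM tool, and the owner would still decide what to do with the event.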

AIOps leverages Machine Learning and Artificial Intelligence to automate and optimize such IT operations. The example above is a simple use case. In the real world, event correlation and automated or manual root cause analysis are involved in deciding on a remediation.

An effective AIOps platform may perform the following duties:

A holistic AIOps platform may collect the MELT data on its own. Other platforms may ingest data from monitoring, log aggregation and security tools.
Send notifications about problems. Problems are detected based on thresholds, which may be static or dynamic.
Analyze the root cause: determine whether the problem is due to the network, the infrastructure, the application or something else. This is a crucial task. A faulty root cause may lead the AIOps tool to an incorrect decision.
Based on the root cause, AIOps tools are supposed to provide recommendations and automation. How efficiently and accurately an AIOps platform can decide remains a matter of debate. Based on such analysis, it may take corrective actions such as adding resources, kick-starting auto-scaling, restarting the necessary service or rebooting the whole server.
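To make the difference between static and dynamic thresholds concrete, here is a small sketch. It assumes one common way of building a dynamic threshold (mean plus a multiple of the standard deviation of recent samples); actual AIOps products may use quite different baselining methods. The function names, the 90% limit and the multiplier `k` are illustrative assumptions.

```python
from statistics import mean, stdev


def static_breach(value, limit=90.0):
    """Static threshold: breach when the value exceeds a fixed limit."""
    return value > limit


def dynamic_breach(history, value, k=3.0):
    """Dynamic threshold: breach when the value exceeds
    mean + k * standard deviation of the recent history."""
    if len(history) < 2:
        return False  # not enough data to build a baseline
    return value > mean(history) + k * stdev(history)


# A server that normally idles at around 20% CPU (made-up samples).
history = [20.0, 22.0, 19.0, 21.0, 20.0, 23.0]

print(static_breach(60.0))            # below the fixed 90% limit
print(dynamic_breach(history, 60.0))  # far above this server's usual baseline
```

A jump to 60% would not trip the fixed limit, yet it is unusual for this particular server, which is the kind of case a dynamic threshold is meant to catch.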

Sometimes, I may decide to restart the service. Sometimes, I may decide to kick off auto-scaling to distribute the load. Sometimes, I may decide to reboot the machine. These decisions are based on real-world considerations. How can an AIOps platform make such a decision on its own?
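One simple way to picture such a decision is a rule table that encodes a few of those considerations. The sketch below is purely hypothetical: the inputs (`root_cause`, `restarts_last_hour`, `load_is_genuine`) and the cut-off of three restarts are assumptions made for illustration, and a real platform would weigh far more signals.

```python
def recommend(root_cause, restarts_last_hour, load_is_genuine):
    """Suggest a level 1 remediation, or hand over to a human.

    All rules here are illustrative assumptions, not a product's logic.
    """
    if root_cause != "application":
        return "escalate-to-human"   # outside the scope of these rules
    if load_is_genuine:
        return "start-auto-scaling"  # real traffic: spread the load
    if restarts_last_hour == 0:
        return "restart-service"     # cheapest action first
    if restarts_last_hour < 3:
        return "reboot-server"       # a restart did not hold
    return "escalate-to-human"       # repeated failures need a person


print(recommend("application", restarts_last_hour=0, load_is_genuine=False))
print(recommend("application", restarts_last_hour=1, load_is_genuine=True))
print(recommend("network", restarts_last_hour=0, load_is_genuine=False))
```

Even in this toy version, every branch depends on the root cause being right, and the fallback is always a human.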

While AIOps tools are constantly evolving and gaining capabilities, it is debatable whether they are currently designed to fully replace human decision-making in critical situations. But such a tool would certainly help analyze data patterns and suggest level 1 remediation tasks such as restarting the service, kick-starting auto-scaling or rebooting the whole machine.

—

This post is written as part of #WriteAPageADay campaign of BlogChatter

Last updated on February 27, 2024

Pandian Ramaiah

Pandian is an IT professional with expertise in SRE, application performance monitoring, application development, application modernization, and performance tuning. He works with his clients on site reliability, performance tuning, monitoring strategy and operations management. He writes open source code in his free time. Otherwise, he goes out to read / run / swim.
