
Course Outline

Introduction to AIOps

  • What AIOps is and why it matters.
  • Traditional monitoring versus AIOps-driven observability.
  • AIOps architecture and key components.

Collecting and Normalising Operational Data

  • Types of observability data: metrics, logs, and traces.
  • Ingesting data from multiple sources (servers, containers, cloud).
  • Using agents and exporters (Prometheus, Beats, Fluentd).
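
As an illustration of what normalising means in practice, the sketch below maps records from two sources onto one common event shape with UTC timestamps. The field names (`ts`, `msg`, `name`, `value`) and the `normalise` helper are assumptions made for this example only; they are not the output format of Prometheus, Beats, or Fluentd.

```python
from datetime import datetime, timezone


def normalise(record, source):
    """Map a raw record onto a common event shape (hypothetical schema)."""
    raw_ts = record["ts"]
    if isinstance(raw_ts, (int, float)):
        # Assumption: numeric timestamps are Unix epoch seconds.
        ts = datetime.fromtimestamp(raw_ts, tz=timezone.utc)
    else:
        # Assumption: string timestamps are ISO 8601 with an explicit offset.
        ts = datetime.fromisoformat(raw_ts).astimezone(timezone.utc)

    # Records carrying a numeric "value" are treated as metrics, the rest as logs.
    kind = "metric" if "value" in record else "log"
    body = record["value"] if kind == "metric" else record.get("msg", "")
    return {
        "timestamp": ts,
        "source": source,
        "kind": kind,
        "name": record.get("name", ""),
        "body": body,
    }


log_event = normalise(
    {"ts": "2024-01-01T01:00:00+01:00", "msg": "disk usage high"}, "app-server"
)
metric_event = normalise(
    {"ts": 1704067200, "name": "cpu_usage", "value": 0.93}, "node-exporter"
)
```

Once both events share a UTC timestamp and the same keys, they can be sorted and correlated on one timeline regardless of where they came from.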

Data Correlation and Anomaly Detection

  • Time series correlation and statistical methods.
  • Using machine learning models for anomaly detection.
  • Detecting incidents across distributed systems.
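
One of the simplest statistical methods for anomaly detection is a rolling z-score: a point is flagged when it lies more than a chosen number of standard deviations from the mean of the preceding window. The sketch below is a minimal example of that idea only; the function name, window size, threshold, and sample data are assumptions for illustration, not the method of any particular AIOps product.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices of points that deviate strongly from the trailing window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as an anomaly.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds, with one spike at index 7.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]
print(rolling_zscore_anomalies(latency_ms))  # -> [7]
```

The points after the spike are not flagged because the spike inflates the standard deviation of the windows that contain it, which is one reason production systems often prefer more robust statistics or learned models.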

Alerting and Noise Reduction

  • Designing intelligent alert rules and thresholds.
  • Suppression, deduplication, and alert grouping.
  • Integration with Alertmanager, Slack, PagerDuty, or Opsgenie.
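
Deduplication and grouping can be sketched in a few lines: identical alerts are dropped, and the remainder are bucketed by a set of shared labels so that one notification covers many related alerts. The label names and the `group_alerts` helper below are assumptions for illustration; this is not the Alertmanager implementation or its API.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("service", "alertname")):
    """Drop exact duplicates, then group alerts by the given labels."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        # An alert's full label set acts as its fingerprint.
        fingerprint = tuple(sorted(alert.items()))
        if fingerprint in seen:
            continue  # exact duplicate: suppress it
        seen.add(fingerprint)
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: the second entry duplicates the first.
alerts = [
    {"alertname": "HighLatency", "service": "checkout", "instance": "web-1"},
    {"alertname": "HighLatency", "service": "checkout", "instance": "web-1"},
    {"alertname": "HighLatency", "service": "checkout", "instance": "web-2"},
    {"alertname": "DiskFull", "service": "db", "instance": "db-1"},
]
grouped = group_alerts(alerts)
for key, members in grouped.items():
    print(key, len(members))
```

Here four incoming alerts collapse into two notifications, one per service and alert name, which is the noise reduction this module is about.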

Root Cause Analysis and Visualisation

  • Using dashboards to visualise metrics and detect trends.
  • Exploring events and timelines for root cause analysis (RCA).
  • Tracing issues across layers with distributed tracing tools.

Automation and Remediation

  • Triggering automated scripts or workflows from incidents.
  • Integration with ITSM systems (ServiceNow, Jira).
  • Use cases: self-healing, scaling, traffic rerouting.

Open Source and Commercial AIOps Platforms

  • Overview of tools: Prometheus, Grafana, ELK, Moogsoft, Dynatrace.
  • Evaluation criteria for selecting an AIOps platform.
  • Demonstration and hands-on practice with a selected stack.

Summary and Next Steps

Requirements

  • A solid understanding of IT operations and system monitoring concepts.
  • Experience with monitoring tools or dashboards.
  • Familiarity with basic log and metric formats.

Audience

  • Operations teams responsible for infrastructure and applications.
  • Site Reliability Engineers (SREs).
  • IT monitoring and observability teams.

Duration

  • 14 hours
