Course Outline
Introduction to AIOps
- What is AIOps and why it matters.
- Traditional monitoring versus AIOps-driven observability.
- AIOps architecture and key components.
Collecting and Normalising Operational Data
- Types of observability data: metrics, logs, and traces.
- Ingesting data from multiple sources (servers, containers, cloud).
- Using agents and exporters (Prometheus, Beats, Fluentd).
Data Correlation and Anomaly Detection
- Time series correlation and statistical methods.
- Using machine learning models for anomaly detection.
- Detecting incidents across distributed systems.
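As a rough illustration of the statistical methods mentioned in this module, the sketch below flags points that deviate sharply from a trailing window of recent values (a rolling z-score). The function name `zscore_anomalies`, the window size, the threshold, and the sample latency values are all assumptions made for this example, not course material.

```python
from statistics import mean, stdev


def zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the mean of the preceding
    `window` points by more than `threshold` sample standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds, with one spike.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]
print(zscore_anomalies(latency_ms))  # [7]
```

A fixed threshold on a trailing window is only a baseline; it ignores seasonality and trend, which is one reason the outline also lists machine learning models.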
Alerting and Noise Reduction
- Designing intelligent alert rules and thresholds.
- Suppression, deduplication, and alert grouping.
- Integration with Alertmanager, Slack, PagerDuty, or Opsgenie.
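To make deduplication and grouping concrete, here is a minimal sketch that drops alerts with identical label sets and then buckets the remainder by a chosen subset of labels. The alert shape (a dict with a `labels` mapping), the label names, and the function `group_alerts` are hypothetical and chosen for illustration; this is not the implementation used by Alertmanager or any other tool.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("service", "alertname")):
    """Deduplicate alerts by their full label set, then group them
    by the labels named in `group_by`."""
    groups = defaultdict(list)
    seen = set()
    for alert in alerts:
        # Two alerts with exactly the same labels count as duplicates.
        fingerprint = tuple(sorted(alert["labels"].items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Missing labels fall back to an empty string in the group key.
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: the first two are exact duplicates.
alerts = [
    {"labels": {"alertname": "HighLatency", "service": "api", "instance": "a1"}},
    {"labels": {"alertname": "HighLatency", "service": "api", "instance": "a1"}},
    {"labels": {"alertname": "HighLatency", "service": "api", "instance": "a2"}},
    {"labels": {"alertname": "DiskFull", "service": "db", "instance": "d1"}},
]

for key, members in group_alerts(alerts).items():
    print(key, len(members))
```

Each group could then produce a single notification instead of one per alert, which is the basic idea behind noise reduction.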
Root Cause Analysis and Visualisation
- Using dashboards to visualise metrics and detect trends.
- Exploring events and timelines for root cause analysis (RCA).
- Tracing issues across layers with distributed tracing tools.
Automation and Remediation
- Triggering automated scripts or workflows from incidents.
- Integration with ITSM systems (ServiceNow, Jira).
- Use cases: self-healing, scaling, traffic rerouting.
Open Source and Commercial AIOps Platforms
- Overview of tools: Prometheus, Grafana, ELK, Moogsoft, Dynatrace.
- Evaluation criteria for selecting an AIOps platform.
- Demo and hands-on with a selected stack.
Summary and Next Steps
Requirements
- A solid understanding of IT operations and system monitoring concepts.
- Experience with monitoring tools or dashboards.
- Familiarity with basic log and metric formats.
Audience
- Operations teams responsible for infrastructure and applications.
- Site Reliability Engineers (SREs).
- IT monitoring and observability teams.
Duration: 14 hours