Skip to content

AI Platforms and CNCF Ecosystem

Modern AI systems rely on a complex ecosystem of tools, platforms, and hardware that work together to manage data, train models, deploy services, and scale workloads.
This chapter provides a high-level understanding of the major components that make up the AI platform landscape.

Example Tools

Category Tool Purpose
Workflow orchestration Airflow, Argo Schedule and manage jobs
Pipeline & MLOps Kubeflow End-to-end ML pipelines
Experiment tracking MLflow Track runs, params, metrics
Distributed data/compute Apache Spark Large-scale data processing

CNCF MLOps Toolchains

The Cloud Native Computing Foundation (CNCF) ecosystem provides open-source tools that cover the full machine learning lifecycle. These tools help teams automate workflows, track experiments, orchestrate training jobs, and deploy models at scale.

Airflow – Workflow Orchestration

  • A powerful scheduler for managing end-to-end ML pipelines
  • Defines tasks as DAGs (Directed Acyclic Graphs)
  • Commonly used for data ingestion, preprocessing, and batch ML workflows
  • Integrates with cloud storage, databases, and compute services

    AirFlow

MLflow – Experiment Tracking & Model Registry

  • Tracks experiments, hyperparameters, metrics, and artifacts
  • Provides a standardized format for packaging ML models (MLflow Models)
  • Includes a model registry for versioning and promoting models to production
  • Framework-agnostic (works with PyTorch, TensorFlow, XGBoost, etc.)

    MLflow

Kubeflow – ML on Kubernetes

  • A Kubernetes-native platform for training, serving, and managing ML models
  • Key components:
  • Kubeflow Pipelines – CI/CD for ML workflows
  • Katib – Automated hyperparameter tuning
  • KFServing / KServe – High-performance model serving
  • Ideal for teams using Kubernetes for large-scale ML workloads

    Kubeflow

Apache Spark – Distributed Data & ML Processing

  • Distributed data processing engine optimized for large datasets
  • Supports SQL, streaming data, and MLlib for scalable machine learning
  • Widely used for feature engineering, ETL, and batch training pipelines
  • Integrates with Delta Lake, Hudi, Iceberg for large-scale data lakes

    Apache Spark