Kubernetes for AI Workloads: Best Practices

Building and deploying AI systems is complex, particularly when it comes to performance and scalability. Kubernetes is a preferred choice for achieving both. Scaling inference endpoints, managing GPU resources, and bringing automation and structure to otherwise chaotic machine learning pipelines can all be handled with Kubernetes.

With Kubernetes, infrastructure provisioning becomes easier: resources are allocated automatically and services scale on demand, which makes systems both efficient and resilient. With proper provisioning and deployment, Kubernetes provides a stable platform for AI development and deployment.

In this article we look at Kubernetes for AI workloads: what the Kubernetes architecture is, how it works for AI workloads, and what the best practices are for managing AI workloads in Kubernetes.


Kubernetes 

Kubernetes is a container orchestration platform that automates the deployment, scaling and management of applications in distributed environments. It was initially developed to streamline infrastructure operations, but it has since become a core platform for training and serving machine learning models at scale. Kubernetes abstracts the underlying hardware, so the concern is no longer which machine a workload runs on but how the system should behave. In the declarative mode, the desired state of the system is defined and Kubernetes works to maintain it; in the imperative mode, direct commands such as create or delete are issued.
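To make the declarative mode concrete, the sketch below builds a minimal Deployment manifest as a plain Python dictionary and prints it as JSON, a format that `kubectl apply -f` accepts alongside YAML. The name `model-server` and the image `example.com/model-server:1.0` are placeholders chosen for illustration, not part of any real setup.

```python
import json


def deployment_manifest(name: str, image: str, replicas: int) -> dict:
    """Return a minimal apps/v1 Deployment that describes a desired state."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            # Desired number of identical Pods; Kubernetes works to keep it true.
            "replicas": replicas,
            # The selector must match the labels in the Pod template below.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, used only for illustration.
    manifest = deployment_manifest("model-server", "example.com/model-server:1.0", 3)
    print(json.dumps(manifest, indent=2))
```

Saving the output to a file and applying it is the declarative route: the file states what should exist. An imperative counterpart would be a one-off command such as `kubectl scale deployment model-server --replicas=5`, which changes the cluster directly without updating any stored definition.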

Model training is only one part of an AI system. It also involves data pre-processing, evaluation and deployment, each with its own resource requirements, along with multi-GPU training, re-training, updates and versioning. Kubernetes handles this complexity well because it enables infrastructure as code. Containers isolate processes, so each pipeline stage runs independently without affecting the others. Kubernetes also makes it possible to schedule resources flexibly, run GPU-hungry jobs directly and scale inference workloads automatically.

Kubernetes Architecture

Kubernetes architecture has three elements – Clusters, Nodes and Pods. 

  • Kubernetes Clusters – A cluster is the full set of nodes managed by the Kubernetes control plane. The control plane takes care of system health checks and Pod scheduling. The API server is the main interface for user commands and automated tasks. The scheduler places Pods on the best available nodes according to current capacity and constraints. The controller manager looks for discrepancies between the desired and current state; if a Pod crashes or becomes unavailable, Kubernetes brings the system back to the desired state, creating or rescheduling Pods as required. Cluster data is stored in etcd, a distributed key-value store that serves as the single source of truth.
  • Kubernetes Nodes – Pods run on nodes, which can be physical or virtual machines in a cluster. The kubelet agent runs on each node, ensures that Pods are running as expected and reports back to the control plane. It starts, stops or restarts containers as required. For GPU-powered workloads, nodes use a device plugin to expose their available GPUs to Kubernetes.
  • Kubernetes Pods – A Pod hosts one or more containers that share the same network and storage, and is the smallest deployable unit in Kubernetes. Each Pod runs on a single node. In machine learning, a Pod typically runs one component of the pipeline, such as a model, a data pre-processing step or an API, so each component can be monitored and managed independently.
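The way a controller compares desired and current state can be illustrated with a toy reconciliation loop. This is an in-memory simulation written for this article, not the Kubernetes API; `ToyCluster` and its fields are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """In-memory stand-in for cluster state; not a real Kubernetes object."""

    desired_replicas: int
    running_pods: list = field(default_factory=list)
    _counter: int = 0

    def reconcile(self) -> list:
        """Move the current state toward the desired state; return actions taken."""
        actions = []
        # Too few Pods: create replacements until the count matches.
        while len(self.running_pods) < self.desired_replicas:
            self._counter += 1
            pod = f"pod-{self._counter}"
            self.running_pods.append(pod)
            actions.append(f"create {pod}")
        # Too many Pods: remove the most recently added ones.
        while len(self.running_pods) > self.desired_replicas:
            pod = self.running_pods.pop()
            actions.append(f"delete {pod}")
        return actions


if __name__ == "__main__":
    cluster = ToyCluster(desired_replicas=3)
    print(cluster.reconcile())  # ['create pod-1', 'create pod-2', 'create pod-3']
    cluster.running_pods.remove("pod-2")  # simulate a crashed Pod
    print(cluster.reconcile())  # ['create pod-4']
```

Real controllers follow the same observe, compare and act pattern, but they watch the API server for changes instead of being called by hand.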

How Kubernetes Works for AI Workloads

With Kubernetes, AI model training can be structured as a controlled workflow that is monitored and scaled as required. Each step in the pipeline is defined as either a Job or a Pod. A training run submitted as a Job is tracked by Kubernetes until it completes and is re-run if it fails.
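A training step defined as a Job might look like the sketch below, which builds a `batch/v1` Job manifest as a Python dictionary. The job name, image and command are hypothetical; `backoffLimit` and `restartPolicy` are the Job fields that control the retry behaviour described above.

```python
import json


def training_job(name: str, image: str, command: list, retries: int = 3) -> dict:
    """Return a minimal batch/v1 Job that retries a failed training run."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Number of retries before the Job is marked as failed.
            "backoffLimit": retries,
            "template": {
                "spec": {
                    # With "Never", a failed Pod is replaced by a new Pod
                    # instead of being restarted in place.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "trainer", "image": image, "command": command}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image and entry point, for illustration only.
    job = training_job(
        "train-model",
        "example.com/trainer:1.0",
        ["python", "train.py", "--epochs", "10"],
    )
    print(json.dumps(job, indent=2))
```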

Inference is managed as a Deployment. The Deployment maintains a set of model replicas behind an API service, and replicas are restored automatically if one fails. With autoscaling configured, Kubernetes monitors load and scales replicas up or down accordingly. When multiple versions of a model need to run side by side, canary deployments allow a smooth transition with minimal disruption.
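One common way to run a canary with plain Kubernetes objects, assumed here for illustration, is to run two Deployments (stable and canary) whose Pods share a label selected by a single Service, so traffic is split roughly in proportion to replica counts. The helper below only does the replica arithmetic; `canary_split` and the label values are hypothetical names, not a Kubernetes API.

```python
def canary_split(total_replicas: int, canary_percent: float) -> tuple:
    """Return (stable, canary) replica counts for an approximate traffic split."""
    if total_replicas < 2:
        raise ValueError("need at least 2 replicas to run stable and canary together")
    if not 0 < canary_percent < 100:
        raise ValueError("canary_percent must be between 0 and 100 (exclusive)")
    # At least one canary replica, and at least one stable replica.
    canary = max(1, round(total_replicas * canary_percent / 100))
    canary = min(canary, total_replicas - 1)
    return total_replicas - canary, canary


def track_labels(app: str, track: str) -> dict:
    """Pod labels: the Service selects on 'app' only, so both tracks get traffic."""
    return {"app": app, "track": track}


if __name__ == "__main__":
    stable, canary = canary_split(10, 10)
    print(stable, canary)  # 9 1
    print(track_labels("model-server", "canary"))
```

Replica-based splitting is coarse: with few replicas, small percentages cannot be represented exactly. Finer control usually needs an ingress controller or service mesh that supports weighted routing.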

For advanced workflows, Kubeflow defines a complete end-to-end pipeline, from pre-processing to training to deployment, in an orchestrated manner. Because each component runs in its own container image managed by Kubernetes, pipeline elements can be reused elsewhere.

When training is distributed across multiple machines, each training process runs in a separate Pod, and Kubernetes handles orchestration and fault tolerance.
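As a rough sketch of one Pod per training process, the example below uses an Indexed Job, in which Kubernetes gives each Pod a distinct completion index through the `JOB_COMPLETION_INDEX` environment variable. Treating that index as the worker rank is an assumption made for this example; the job name and image are hypothetical, and production setups often rely on a dedicated training operator that also handles worker discovery.

```python
import os


def indexed_training_job(name: str, image: str, workers: int) -> dict:
    """Return a batch/v1 Indexed Job that runs one Pod per training worker."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Each Pod receives a unique index from 0 to workers - 1.
            "completionMode": "Indexed",
            "completions": workers,
            # Run all workers at the same time rather than one after another.
            "parallelism": workers,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "worker", "image": image}],
                }
            },
        },
    }


def worker_rank(environ=os.environ) -> int:
    """Read this Pod's index inside the container; defaults to 0 outside a Job."""
    return int(environ.get("JOB_COMPLETION_INDEX", "0"))


if __name__ == "__main__":
    job = indexed_training_job("dist-train", "example.com/trainer:1.0", workers=4)
    print(job["spec"]["completions"], job["spec"]["parallelism"])  # 4 4
    print(worker_rank({"JOB_COMPLETION_INDEX": "2"}))  # 2
```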

Best Practices for Managing AI Workloads on Kubernetes

Running AI workloads on Kubernetes requires careful planning and execution with respect to resource distribution, observability, scalability and automation. A basic Kubernetes setup can be a launch pad for AI, but a production-ready setup requires the following best practices.

  • Comprehensive Observability – The load profile of ML workloads varies, which makes them unpredictable at times. Enabling logging alone is not enough; a full observability stack is needed. Prometheus can collect metrics from the kubelet, node-exporter, cAdvisor and similar sources, Grafana provides visualization and dashboards, and logging can be handled by the ELK (Elasticsearch, Logstash, Kibana) stack. The stack should also cover GPU monitoring.
  • Resource Management – Different AI models place different demands on infrastructure: some are GPU intensive, some are memory intensive and some require a stable network with minimal latency. Clearly defined resource requests and limits allow workloads to be scheduled efficiently. For GPU scheduling, the device plugin, node labels, taints and tolerations, and node affinity are essential.
  • Isolation and Fairness – The cluster needs to be designed with isolation and fairness in mind. Elements to consider include separate namespaces for teams and projects, priority for critical workloads over less critical ones, and quotas on CPU, memory, Pod count and storage volumes.
  • Scaling – For AI inference, the Horizontal Pod Autoscaler works well. Complex AI workloads may require custom Kubernetes controllers and custom metrics. It is good practice to separate static and scalable components in a pipeline to simplify orchestration logic and optimize resource use.
  • Robust CI/CD – Production maturity requires a robust CI/CD pipeline that automates container builds, AI model validation, staged rollouts and production deployments, with Helm charts and Kustomize used to manage configuration.
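Two of the practices above, resource management and scaling, can be sketched in a few lines. The first function builds a Pod spec that requests GPUs through the `nvidia.com/gpu` resource name advertised by the NVIDIA device plugin; the node label `accelerator: nvidia-gpu` and the taint key used in the toleration are assumptions that depend on how a given cluster is set up. The second function applies the replica formula documented for the Horizontal Pod Autoscaler, ignoring its tolerance band and stabilization behaviour.

```python
import math


def gpu_pod_spec(image: str, gpus: int, cpu: str, memory: str) -> dict:
    """Return a Pod spec with explicit requests/limits and GPU scheduling hints."""
    resources = {
        "requests": {"cpu": cpu, "memory": memory, "nvidia.com/gpu": gpus},
        # GPUs are set under limits; a GPU request, if given, must equal the limit.
        "limits": {"memory": memory, "nvidia.com/gpu": gpus},
    }
    return {
        # Hypothetical node label; use whatever label the GPU nodes carry.
        "nodeSelector": {"accelerator": "nvidia-gpu"},
        # Assumes GPU nodes are tainted with this key to keep other Pods away.
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
        ],
        "containers": [{"name": "model", "image": image, "resources": resources}],
    }


def hpa_desired_replicas(
    current: int, current_metric: float, target_metric: float, min_r: int, max_r: int
) -> int:
    """desired = ceil(current * currentMetric / targetMetric), clamped to bounds."""
    desired = math.ceil(current * current_metric / target_metric)
    return max(min_r, min(max_r, desired))


if __name__ == "__main__":
    spec = gpu_pod_spec("example.com/model-server:1.0", gpus=1, cpu="4", memory="16Gi")
    print(spec["containers"][0]["resources"]["limits"])
    # 4 replicas at 90% average utilization against a 60% target
    print(hpa_desired_replicas(4, 90, 60, min_r=2, max_r=10))  # 6
```

The quota and priority settings from the isolation bullet would be expressed the same way, as separate `ResourceQuota` and `PriorityClass` objects per namespace or workload tier.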
