
AI Kubernetes GPU On-Premises Cluster

Deployed and operated an on-premises GPU Kubernetes cluster for AI model training at ZEISS Meditec, enabling high-performance, cost-efficient, and secure AI research and experimentation.

1 August 2023
Empowered AI R&D teams to train, deploy, and experiment with deep learning models on dedicated GPU infrastructure
#Onpremise #Linux #MLOps #AI #Kubernetes #Data #Security #Automation #Monitoring

Technologies Used

Kubernetes
Docker
Ubuntu
NVIDIA CUDA
YAML
Shell
Git
kubeadm
htop
kubectl top
nvidia-smi
tmux
Ubuntu Server 18.04 LTS
Linux
SSH
RBAC

ZEISS AI Kubernetes GPU On-Premises Cluster

Overview

At ZEISS Meditec AG, within the AI Medical Technology R&D department, I architected and operated a dedicated on-premises GPU cluster to support resource-efficient and cost-optimized AI model development and experimentation.
The cluster, composed of five high-performance computing nodes, enabled data science and architecture teams to train, deploy, and iterate advanced AI models for medical imaging and diagnostics.


Role & Responsibilities

As Senior Cloud / MLOps Engineer, I was responsible for the end-to-end operation, configuration, and optimization of the on-premises GPU Kubernetes environment.

  • Operated and maintained GPU nodes, including hardware, OS, and software components
  • Configured Linux (Ubuntu) systems, networking, SSH, and enterprise security hardening
  • Deployed and managed Kubernetes and Docker for scalable AI workloads
  • Delivered internal enablement sessions such as “Train Your Model On-Prem” workshops
  • Monitored GPU utilization and system performance, and optimized resource allocation
  • Integrated machine learning pipelines into the GPU cluster infrastructure
  • Implemented RBAC policies and secure communication for multi-user environments
  • Ensured stability, scalability, and security of on-premises AI operations
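To illustrate the multi-user access control mentioned above, the following is a minimal sketch of a namespace-scoped Kubernetes RBAC setup. The namespace `ai-research`, role name `model-trainer`, and group `data-science` are hypothetical placeholders, not the actual ZEISS configuration:

```yaml
# Illustrative RBAC sketch -- names are placeholders, not production values.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: ai-research        # hypothetical team namespace
  name: model-trainer
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "pods/log", "deployments", "jobs"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: ai-research
  name: model-trainer-binding
subjects:
  - kind: Group
    name: data-science          # hypothetical user group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: model-trainer
  apiGroup: rbac.authorization.k8s.io
```

Scoping the Role to a namespace rather than using a ClusterRole keeps each team confined to its own workloads, which is the usual pattern for shared multi-user clusters.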

Applied Methods & Tools

  • Infrastructure Management: Provisioning and configuration of GPU-based systems
  • Containerization: Docker for reproducible machine learning environments
  • Orchestration: Kubernetes for scheduling, scaling, and workload distribution
  • Configuration & Scripting: YAML manifests, Dockerfiles, and shell scripting workflows
  • Monitoring: htop, kubectl top, nvidia-smi, and tmux for performance visibility
  • Knowledge Sharing: Conducted workshops and hands-on enablement sessions for R&D teams
  • Version Control: Git for collaboration, versioning, and reproducible builds
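The GPU scheduling workflow described above can be sketched as a pod manifest that requests a GPU through Kubernetes' extended resources. The image and names below are illustrative placeholders, not the actual training setup:

```yaml
# Illustrative training pod -- image, names, and command are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: train-job
  namespace: ai-research        # hypothetical namespace
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:23.10-py3   # example CUDA-enabled image
      command: ["python", "train.py"]
      resources:
        limits:
          nvidia.com/gpu: 1     # scheduler places the pod on a node with a free GPU
```

Requesting `nvidia.com/gpu` in the resource limits lets the Kubernetes scheduler handle GPU placement, so users never need to know which of the five nodes has capacity.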

Applied Technologies

  • Operating System: Ubuntu Server 18.04 LTS for GPU node management
  • Containerization: Docker 24.x for environment isolation
  • Orchestration: Kubernetes (kubeadm) for cluster deployment and control
  • GPU Frameworks: NVIDIA CUDA Toolkit for optimized deep learning workloads
  • Monitoring Tools: htop, kubectl top, nvidia-smi, tmux for real-time metrics
  • Networking & Security: Linux-based enterprise configuration with hardened SSH and RBAC
  • Version Control: Git for collaborative development and workflow management
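For GPUs to become schedulable resources on a kubeadm cluster, the NVIDIA device plugin runs as a DaemonSet on every GPU node. The sketch below is simplified from NVIDIA's published manifest, which additionally includes tolerations and security settings; the version tag is illustrative:

```yaml
# Simplified sketch of the NVIDIA device plugin DaemonSet.
# See NVIDIA's official k8s-device-plugin manifest for the full version.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      containers:
        - name: nvidia-device-plugin
          image: nvcr.io/nvidia/k8s-device-plugin:v0.14.5   # illustrative tag
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```

Once the plugin registers with the kubelet via the device-plugin socket directory, each node advertises its GPUs as `nvidia.com/gpu` capacity, visible with `kubectl describe node`.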

Impact

  • Enabled AI R&D teams to efficiently train and deploy deep learning models on dedicated hardware
  • Delivered a self-managed, high-performance GPU infrastructure with secure multi-user access
  • Improved resource utilization and cost efficiency through optimized GPU scheduling
  • Empowered data scientists via on-prem enablement workshops and hands-on guidance
  • Established a scalable foundation for future hybrid (on-prem and cloud) AI workflows

Summary

The ZEISS AI Kubernetes GPU Cluster project demonstrated the successful implementation of enterprise-grade MLOps practices within an on-premises GPU environment.
By combining Kubernetes orchestration, Docker containerization, and CUDA acceleration, the platform provided ZEISS R&D teams with a robust, secure, and efficient infrastructure for advancing AI-driven medical technologies.