
AI Kubernetes GPU On-Premises Cluster

Deployed and operated an on-premises GPU Kubernetes cluster for AI model training at ZEISS Meditec, enabling high-performance, cost-efficient, and secure AI research and experimentation.

1 August 2023
Empowered AI R&D teams to train, deploy, and experiment with deep learning models on dedicated GPU infrastructure
#Onpremise #Linux #MLOps #AI #Kubernetes #Data #Security #Automation #Monitoring

Technologies Used

Kubernetes
Docker
Ubuntu
NVIDIA CUDA
YAML
Shell
Git
kubeadm
htop
kubectl top
nvidia-smi
tmux
Ubuntu Server 18.04 LTS
Linux
SSH
RBAC

ZEISS AI Kubernetes GPU On-Premises Cluster

Overview

At ZEISS Meditec AG, within the AI Medical Technology R&D department, I architected and operated a dedicated on-premises GPU cluster to support resource-efficient and cost-optimized AI model development and experimentation.
The cluster, composed of five high-performance computing nodes, enabled data science and architecture teams to train, deploy, and iterate advanced AI models for medical imaging and diagnostics.


Role & Responsibilities

As Senior Cloud / MLOps Engineer, I was responsible for the end-to-end operation, configuration, and optimization of the on-premises GPU Kubernetes environment.

  • Operated and maintained GPU nodes, including hardware, OS, and software components
  • Configured Linux (Ubuntu) systems, networking, SSH, and enterprise security hardening
  • Deployed and managed Kubernetes and Docker for scalable AI workloads
  • Delivered internal enablement sessions such as “Train Your Model On-Prem” workshops
  • Monitored GPU utilization and system performance, and optimized resource allocation
  • Integrated machine learning pipelines into the GPU cluster infrastructure
  • Implemented RBAC policies and secure communication for multi-user environments
  • Ensured stability, scalability, and security of on-premises AI operations
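To illustrate the multi-user access control mentioned above, the following is a minimal sketch of a namespace-scoped Kubernetes RBAC setup. The namespace `ai-research`, role name `model-trainer`, and group `data-science` are hypothetical placeholders, not the actual ZEISS configuration:

```yaml
# Illustrative RBAC sketch -- names are placeholders, not production values.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: ai-research        # hypothetical team namespace
  name: model-trainer
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "pods/log", "deployments", "jobs"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: ai-research
  name: model-trainer-binding
subjects:
  - kind: Group
    name: data-science          # hypothetical user group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: model-trainer
  apiGroup: rbac.authorization.k8s.io
```

Scoping the Role to a namespace rather than using a ClusterRole keeps each team confined to its own workloads, which is the usual pattern for shared multi-user clusters.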

Applied Methods & Tools

  • Infrastructure Management: Provisioning and configuration of GPU-based systems
  • Containerization: Docker for reproducible machine learning environments
  • Orchestration: Kubernetes for scheduling, scaling, and workload distribution
  • Configuration & Scripting: YAML manifests, Dockerfiles, and shell scripting workflows
  • Monitoring: htop, kubectl top, nvidia-smi, and tmux for performance visibility
  • Knowledge Sharing: Conducted workshops and hands-on enablement sessions for R&D teams
  • Version Control: Git for collaboration, versioning, and reproducible builds
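The GPU scheduling workflow described above can be sketched as a pod manifest that requests a GPU through Kubernetes' extended resources. The image and names below are illustrative placeholders, not the actual training setup:

```yaml
# Illustrative training pod -- image, names, and command are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: train-job
  namespace: ai-research        # hypothetical namespace
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:23.10-py3   # example CUDA-enabled image
      command: ["python", "train.py"]
      resources:
        limits:
          nvidia.com/gpu: 1     # scheduler places the pod on a node with a free GPU
```

Requesting `nvidia.com/gpu` in the resource limits lets the Kubernetes scheduler handle GPU placement, so users never need to know which of the five nodes has capacity.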

Applied Technologies

  • Operating System: Ubuntu Server 18.04 LTS for GPU node management
  • Containerization: Docker 24.x for environment isolation
  • Orchestration: Kubernetes (kubeadm) for cluster deployment and control
  • GPU Frameworks: NVIDIA CUDA Toolkit for optimized deep learning workloads
  • Monitoring Tools: htop, kubectl top, nvidia-smi, tmux for real-time metrics
  • Networking & Security: Linux-based enterprise configuration with hardened SSH and RBAC
  • Version Control: Git for collaborative development and workflow management
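For GPUs to become schedulable resources on a kubeadm cluster, the NVIDIA device plugin runs as a DaemonSet on every GPU node. The sketch below is simplified from NVIDIA's published manifest, which additionally includes tolerations and security settings; the version tag is illustrative:

```yaml
# Simplified sketch of the NVIDIA device plugin DaemonSet.
# See NVIDIA's official k8s-device-plugin manifest for the full version.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      containers:
        - name: nvidia-device-plugin
          image: nvcr.io/nvidia/k8s-device-plugin:v0.14.5   # illustrative tag
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```

Once the plugin registers with the kubelet via the device-plugin socket directory, each node advertises its GPUs as `nvidia.com/gpu` capacity, visible with `kubectl describe node`.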

Impact

  • Enabled AI R&D teams to efficiently train and deploy deep learning models on dedicated hardware
  • Delivered a self-managed, high-performance GPU infrastructure with secure multi-user access
  • Improved resource utilization and cost efficiency through optimized GPU scheduling
  • Empowered data scientists via on-prem enablement workshops and hands-on guidance
  • Established a scalable foundation for future hybrid (on-prem and cloud) AI workflows

Summary

The ZEISS AI Kubernetes GPU Cluster project demonstrated the successful implementation of enterprise-grade MLOps practices within an on-premises GPU environment.
By combining Kubernetes orchestration, Docker containerization, and CUDA acceleration, the platform provided ZEISS R&D teams with a robust, secure, and efficient infrastructure for advancing AI-driven medical technologies.