ZEISS AI Kubernetes GPU On-Premises Cluster
Overview
At ZEISS Meditec AG, within the AI Medical Technology R&D department, I architected and operated a dedicated on-premises GPU cluster to support resource-efficient and cost-optimized AI model development and experimentation.
The cluster, composed of five high-performance computing nodes, enabled data science and architecture teams to train, deploy, and iterate advanced AI models for medical imaging and diagnostics.
Role & Responsibilities
As Senior Cloud / MLOps Engineer, I was responsible for the end-to-end operation, configuration, and optimization of the on-premises GPU Kubernetes environment.
- Operated and maintained GPU nodes, including hardware, OS, and software components
- Configured Linux (Ubuntu) systems, networking, and SSH, and applied enterprise security hardening
- Deployed and managed Kubernetes and Docker for scalable AI workloads
- Delivered internal enablement sessions such as “Train Your Model On-Prem” workshops
- Monitored GPU utilization and system performance, and optimized resource allocation
- Integrated machine learning pipelines into the GPU cluster infrastructure
- Implemented RBAC policies and secure communication for multi-user environments
- Ensured stability, scalability, and security of on-premises AI operations
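The GPU workload pattern behind these responsibilities can be illustrated with a minimal pod spec. This is a sketch, not the actual ZEISS configuration: the image, pod name, and namespace are placeholders, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is running on the cluster nodes.

```yaml
# Illustrative training pod requesting one GPU.
# Assumes the NVIDIA device plugin exposes the nvidia.com/gpu resource;
# image and namespace are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: train-imaging-model
  namespace: ai-research        # hypothetical team namespace
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: registry.example.com/imaging/train:latest  # placeholder image
      command: ["python", "train.py"]
      resources:
        limits:
          nvidia.com/gpu: 1     # schedule onto a node with a free GPU
```

Setting the GPU limit lets the Kubernetes scheduler place the pod only on nodes with an unallocated GPU, which is the mechanism behind the resource-allocation optimization described above.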
Applied Methods & Tools
- Infrastructure Management: Provisioning and configuration of GPU-based systems
- Containerization: Docker for reproducible machine learning environments
- Orchestration: Kubernetes for scheduling, scaling, and workload distribution
- Command-Line Operations: YAML manifests, Dockerfiles, and shell scripting workflows
- Monitoring: htop, kubectl top, nvidia-smi, and tmux for performance visibility
- Knowledge Sharing: Conducted workshops and hands-on enablement sessions for R&D teams
- Version Control: Git for collaboration, versioning, and reproducible builds
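A reproducible machine learning environment of the kind listed above typically starts from a CUDA base image. The sketch below is illustrative only: the base image tag, dependency files, and entrypoint are assumptions, not the actual project Dockerfile.

```dockerfile
# Illustrative CUDA training image; tags, files, and packages are placeholders.
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu18.04

# System Python for the training entrypoint
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

# Pin dependencies for reproducible builds (hypothetical requirements file)
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY train.py .
ENTRYPOINT ["python3", "train.py"]
```

Baking pinned dependencies into the image is what makes a training run reproducible across nodes: every pod scheduled from this image sees an identical environment.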
Applied Technologies
- Operating System: Ubuntu Server 18.04 LTS for GPU node management
- Containerization: Docker 24.x for environment isolation
- Orchestration: Kubernetes (bootstrapped with kubeadm) for cluster deployment and control
- GPU Frameworks: NVIDIA CUDA Toolkit for optimized deep learning workloads
- Monitoring Tools: htop, kubectl top, nvidia-smi, tmux for real-time metrics
- Networking & Security: Linux-based enterprise configuration with hardened SSH and RBAC
- Version Control: Git for collaborative development and workflow management
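The RBAC and multi-user security items above usually translate into namespace-scoped roles. The following is a minimal sketch under assumed names: the namespace, role, and group identifiers are hypothetical, not the real ZEISS policy.

```yaml
# Namespace-scoped RBAC sketch: a data-science group may manage pods and
# jobs in its own namespace, with no cluster-wide rights.
# All names below are hypothetical placeholders.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-developer
  namespace: ai-research          # hypothetical team namespace
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-developer-binding
  namespace: ai-research
subjects:
  - kind: Group
    name: data-science            # placeholder identity-provider group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: ml-developer
  apiGroup: rbac.authorization.k8s.io
```

Binding a Role rather than a ClusterRole keeps each team confined to its namespace, which is the usual way to give multiple research groups secure, isolated access to shared GPU hardware.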
Impact
- Enabled AI R&D teams to efficiently train and deploy deep learning models on dedicated hardware
- Delivered a self-managed, high-performance GPU infrastructure with secure multi-user access
- Improved resource utilization and cost efficiency through optimized GPU scheduling
- Empowered data scientists via on-prem enablement workshops and hands-on guidance
- Established a scalable foundation for future hybrid (on-prem and cloud) AI workflows
Summary
The ZEISS AI Kubernetes GPU Cluster project demonstrated the successful implementation of enterprise-grade MLOps practices within an on-premises GPU environment.
By combining Kubernetes orchestration, Docker containerization, and CUDA acceleration, the platform provided ZEISS R&D teams with a robust, secure, and efficient infrastructure for advancing AI-driven medical technologies.