Overview
The role involves designing, deploying, and operating large-scale GPU infrastructure for AI workloads.
The ideal candidate has 5+ years of experience in cloud-native development and strong Kubernetes experience.
remote, mid, full-time, English, Kubernetes, Docker, Prometheus, Grafana, AWS, GCP, Azure, Helm
Locations
Requirements
Bachelor's degree in relevant field
3+ years in system engineering or DevOps
5+ years in cloud-native development or AI engineering
2+ years in Kubernetes multi-cluster management
Familiarity with Kubernetes ecosystem
Proficient in Docker and containerization
Experience with monitoring tools like Prometheus and Grafana
Hands-on experience with cloud platforms like AWS, GCP, or Azure
Responsibilities
Build and operate large-scale GPU clusters
Conduct performance testing of GPU clusters
Deploy large models across multi-cluster environments
Participate in GPU cluster scheduling and optimization
Build a unified multi-cluster management system
Coordinate with IDC providers for GPU clusters
Benefits
Flexible remote work environment
Opportunity to work on cutting-edge technologies
Collaboration with experts from leading institutions
Visionary team aiming to redefine AI infrastructure