Overview
This role focuses on building and maintaining infrastructure for AI/ML development and production.
The ideal candidate has 5+ years of experience managing core infrastructure for large-scale systems.
Hybrid, senior, full-time
Kubernetes, Docker, GCP, AWS, Azure, Terraform, Python, Bash, Git, Prometheus, Grafana, PyTorch
Locations
United States, California, San Francisco
Requirements
- 5+ years experience in infrastructure
- Deep experience with Kubernetes and Docker
- Expertise in cloud platforms (GCP, AWS, Azure)
- Proficiency in Python and Bash scripting
- Familiarity with ML frameworks like PyTorch
- Experience with monitoring tools like Prometheus
Responsibilities
- Develop and manage CI/CD pipelines
- Set up and monitor logging and observability systems
- Operate and optimize Kubernetes clusters
- Manage containerization using Docker
- Ensure high availability of training environments
- Support MLOps infrastructure performance
- Troubleshoot complex systems
Benefits
- Comprehensive medical, dental, and vision insurance
- 12 weeks paid parental leave
- Fertility and family planning support
- Flexible spending accounts
- Annual stipends for home office setup
- Competitive salary and stock options