Platform Engineer, MLOps

Jobgether

Overview

This role focuses on building and maintaining the infrastructure for AI/ML development and production.

The ideal candidate has 5+ years of experience managing core infrastructure for large-scale systems.

Hybrid, Senior, Full-time, Kubernetes, Docker, GCP, AWS, Azure, Terraform, Python, Bash, Git, Prometheus, Grafana, PyTorch

Locations

United States, California, San Francisco

Requirements

5+ years of experience in infrastructure
Deep experience with Kubernetes and Docker
Expertise in cloud platforms (GCP, AWS, Azure)
Proficiency in Python and Bash scripting
Familiarity with ML frameworks like PyTorch
Experience with monitoring tools like Prometheus

Responsibilities

Develop and manage CI/CD pipelines
Set up and monitor logging and observability systems
Operate and optimize Kubernetes clusters
Manage containerization using Docker
Ensure high availability of training environments
Support the performance of MLOps infrastructure
Troubleshoot complex systems

Benefits

Generous paid time off
Comprehensive medical, dental, and vision insurance
12 weeks paid parental leave
Fertility and family planning support
Flexible spending accounts
Annual stipends for home office setup
Competitive salary and stock options
401(k) plan