Overview
The role involves designing and maintaining software for AI infrastructure and ML workloads.
The ideal candidate has experience training large-scale ML models and optimizing distributed training performance.
remote, full-time, PyTorch, JAX, Kubernetes, MLflow, Kubeflow
Locations
Requirements
Experience with ML models
Software development skills
Responsibilities
Design and maintain software solutions
Optimize job scheduling systems
Build software interfaces for cluster management
Implement data pipeline optimizations
Develop APIs and services
Write libraries for distributed training
Benefits