Site Reliability Engineer

FluidStack

Overview

The role involves ensuring the reliability and performance of GPU cloud infrastructure.

The ideal candidate has 2+ years of relevant experience and strong communication skills.

remote, mid, permanent, full-time, English, Kubernetes, Ansible, Terraform, Go, Python, Bash

Locations

United States, California, San Francisco
United Kingdom, England, London

Requirements

2+ years of SRE, DevOps, Sysadmin, or HPC experience
Experience deploying and operating Kubernetes and/or SLURM clusters
Strong engineering background in relevant fields

Responsibilities

Ensure reliability and performance of the GPU cloud
Deploy clusters of GPUs
Debug production issues
Build internal tooling for deployment
Participate in on-call rotation

Benefits

Competitive compensation package
Retirement plan
Health, dental, and vision insurance
Generous PTO policy
Access to WeWork for remote locations