Our Mission
Our mission is to solve the most important and fundamental challenges in AI and Robotics to enable future generations of intelligent machines that will help us all live better lives.
Machine Learning Operations (ML-Ops) Engineers build the infrastructure that supports the entire lifecycle of Machine Learning (ML) projects, from development through scaling to deployment. If you have a passion for building the foundation that enables robotics research and engineering, you will want to join us!
What You Will Do
- Design, develop, and maintain company-wide platforms and tooling that use Kubernetes infrastructure to enable machine learning and data processing applications
- Enable self-service access to ML compute on our on-prem and cloud compute clusters, including support for job scheduling, workload scalability, and workload fault tolerance
- Enhance observability across ML applications through integrations with tools and services such as Fluentd, Prometheus, Grafana, and Datadog
- Integrate ML applications with experiment tracking and management services like Weights & Biases
- Elevate code quality and champion best practices in our engineering processes
- Collaborate with Machine Learning Engineers, Data Engineers, DevOps engineers, and researchers to build scalable solutions that improve engineering and research velocity
What You Will Bring
- BS or MS in Computer Science, Engineering, or equivalent
- 3+ years of experience in an MLOps, DevOps, ML Engineering, or software engineering role
- Strong hands-on experience deploying and managing applications running on Kubernetes
- Experience developing MLOps platforms to manage the lifecycle of ML experiments, including one or more of data and artifact management, reproducibility, fault tolerance, experiment tracking, and model serving
- Experience with Docker and Python environment management tools such as pip, poetry, uv, or similar
- Proficiency in software practices such as version control (Git), CI/CD (GitHub Actions, ArgoCD), and Infrastructure as Code (Terraform)
Extra Skills We Value
- Experience with Kueue or similar job scheduling mechanisms
- Experience with workflow orchestration tools such as Airflow, Metaflow, Argo Workflows, or similar
- Hands-on experience deploying and managing cloud infrastructure on platforms like GCP and AWS
- Experience with hybrid-cloud compute and data environments
- Experience with Ray, PyTorch Lightning, or similar scalable AI/ML platforms
- Experience with application and system logging using tools and services like Fluentd, Prometheus, Grafana, Datadog, or similar
- Experience with the Bazel build tool or similar
- Experience with ML model serving frameworks such as TorchServe, ONNX Runtime, or similar
- Experience working with research teams in an academic or industrial environment
We provide equal employment opportunities to all employees and applicants for employment and prohibit discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws.