POSTED Sep 4

Senior Machine Learning Platform Engineer

at Stability AI, United States

(Location: Remote US)

About the role:

We are looking for a skilled Sr. ML Platform Engineer with a specialized focus on cloud infrastructure, including API development that enables seamless integration and interaction between cloud-based services and High-Performance Computing (HPC) environments. The successful candidate will play a pivotal role in designing and implementing APIs that enable efficient communication and data exchange between cloud platforms and HPC systems.

Responsibilities:

  • Design, develop, and maintain robust APIs that facilitate communication and data exchange between cloud-based services, particularly AWS, and HPC environments
  • Collaborate with cross-functional teams to understand the unique requirements of both cloud-based services and HPC systems, ensuring that the APIs developed meet the specific needs of these environments
  • Implement best practices for API design, including security, scalability, and performance optimization to ensure efficient interaction between cloud services and HPC clusters
  • Use services such as Cloudflare to enhance API performance, security, and reliability in cloud-to-HPC communication, optimizing for speed and resilience
  • Work closely with HPC engineers to identify and address integration challenges, striving for seamless connectivity between diverse systems and cloud-based platforms
  • Drive innovation by proposing and implementing new API strategies, enhancing the efficiency and functionality of data exchange between AWS, Cloudflare Workers, and on-premises HPC environments
  • Create comprehensive documentation and provide training to internal teams on the use and integration of developed APIs, focusing on AWS and Cloudflare environments
  • Monitor API performance and address issues related to data transfer, ensuring reliability and consistent operation between AWS, Cloudflare, and HPC systems (Slurm/AWS HyperPod)
  • Collaborate with the security team to ensure that the APIs comply with industry standards and best practices for data privacy and protection, especially in AWS and Cloudflare environments
  • Participate in incident management and root cause analysis to improve system reliability
  • Build containers with REST APIs for Gen AI functionality and host them on AWS and Azure

Requirements:

  • 8 years of experience in cloud computing and API development, with a deep understanding of High-Performance Computing environments, particularly in an AWS setting
  • Strong knowledge of HPC cluster management and job scheduling with Slurm and AWS HyperPod
  • Proficiency in programming languages such as Python and TypeScript, essential for API development and integration within AWS and/or Cloudflare Workers environments
  • Demonstrated expertise in API design, implementation, and maintenance, ensuring security and performance best practices within AWS and Cloudflare
  • Knowledge of containerization technologies (e.g., Docker, Kubernetes) for deployment of APIs within AWS, Cloudflare, and HPC systems
  • Experience with automating CI/CD pipelines
  • Familiarity with authentication and authorization protocols (e.g., OAuth, JWT) to ensure secure data exchange between AWS, Cloudflare, and HPC environments
  • Strong problem-solving skills and the ability to troubleshoot complex issues related to API integrations in a hybrid cloud-HPC setup, particularly in AWS and Cloudflare environments
  • Excellent communication and collaboration skills to work effectively with diverse teams and stakeholders in AWS and Cloudflare ecosystems

Compensation:

The salary range for this role is between $130,000 and $190,000. Individual pay within the range is based on factors like job-related skills and experience. Total compensation also includes stock options and benefits.

Equal Employment Opportunity:

We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or other legally protected statuses.
