(Location: Remote US)
About the role:
We are looking for a skilled Sr. ML Platform Engineer with a specialized focus on cloud infrastructure, including API development to facilitate seamless integration and interaction between cloud-based services and High-Performance Computing (HPC) environments. The successful candidate will play a pivotal role in designing and implementing APIs that enable efficient communication and data exchange between cloud platforms and HPC systems.
Responsibilities:
- Design, develop, and maintain robust APIs that facilitate communication and data exchange between cloud-based services, particularly AWS, and HPC environments.
- Collaborate with cross-functional teams to understand the unique requirements of both cloud-based services and HPC systems, ensuring that the APIs developed meet the specific needs of these environments.
- Implement best practices for API design, including security, scalability, and performance optimization to ensure efficient interaction between cloud services and HPC clusters.
- Use services such as Cloudflare to enhance API performance, security, and reliability in cloud-to-HPC communication, optimizing for speed and resilience.
- Work closely with HPC engineers to identify and address integration challenges, striving for seamless connectivity between diverse systems and cloud-based platforms.
- Drive innovation by proposing and implementing new API strategies, enhancing the efficiency and functionality of data exchange between AWS, Cloudflare Workers, and on-premises HPC environments.
- Create comprehensive documentation and provide training to internal teams on the use and integration of developed APIs, focusing on AWS and Cloudflare environments.
- Monitor API performance and address issues related to data transfer, ensuring reliability and consistent operation between AWS, Cloudflare, and HPC systems (Slurm/AWS HyperPod).
- Collaborate with the security team to ensure that the APIs comply with industry standards and best practices for data privacy and protection, especially in AWS and Cloudflare environments.
- Participate in incident management and root cause analysis to improve system reliability.
- Build containers with REST APIs for Gen AI functionality and host them on AWS and Azure.
Requirements:
- 8 years of experience in cloud computing and API development, with a deep understanding of High-Performance Computing environments, particularly in an AWS setting.
- Strong knowledge of HPC cluster management and job scheduling with Slurm and AWS HyperPod.
- Proficiency in programming languages such as Python and TypeScript, essential for API development and integration within AWS and/or Cloudflare Workers environments.
- Demonstrated expertise in API design, implementation, and maintenance, ensuring security and performance best practices within AWS and Cloudflare.
- Knowledge of containerization technologies (e.g., Docker, Kubernetes) for deployment of APIs within AWS, Cloudflare, and HPC systems.
- Experience with automating CI/CD pipelines.
- Familiarity with authentication and authorization protocols (e.g., OAuth, JWT) to ensure secure data exchange between AWS, Cloudflare, and HPC environments.
- Strong problem-solving skills and the ability to troubleshoot complex issues related to API integrations in a hybrid cloud-HPC setup, particularly in AWS and Cloudflare environments.
- Excellent communication and collaboration skills to work effectively with diverse teams and stakeholders in AWS and Cloudflare ecosystems.
Compensation:
The salary range for this role is between $130,000 and $190,000. Individual pay within the range is based on factors like job-related skills and experience. Total compensation also includes stock options and benefits.
Equal Employment Opportunity:
We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or other legally protected statuses.