About RapidFire AI

RapidFire AI is a cutting-edge deep tech startup specializing in scaling Machine Learning solutions. We are dedicated to empowering customers to effortlessly scale their AI workloads, ensuring they stay at the forefront of innovation in their industries.

About the Role

We are seeking a highly motivated and skilled Machine Learning Engineer to join our growing team. In this role, you will be responsible for developing distributed infrastructure for Deep Learning (DL) applications on the cloud, as well as contributing towards the design of newer features. You will collaborate closely with other developers and customer-facing personnel to ensure a seamless product experience.

Responsibilities:

Design, develop, deploy, and maintain large-scale DL infrastructure software, following best practices and software engineering guidelines
Contribute to designing efficient, modular, and fault-tolerant distributed systems that scale DL computations
Automate the setup, launch, and orchestration of end-to-end training and experimentation pipelines written with PyTorch, TensorFlow, or Keras
Use and extend libraries like FSDP, DDP, DeepSpeed, and GPipe to train DL models across multiple GPUs
Use tools like Pandas and Dask to handle large multimodal datasets
Troubleshoot code and fix bugs to ensure smooth functioning of the application
Monitor and troubleshoot cluster resource usage to ensure optimal performance
Conform to continuous integration and continuous delivery (CI/CD) pipeline standards for code deployment
Communicate effectively with the wider team to ensure successful application development and deployment
Collaborate with other developers to define and implement cloud infrastructure strategies for DL applications
Stay up-to-date with the latest advancements in DL and AI technologies and best practices

Required Skills:

4+ years programming experience with Python
Proven experience as an ML Engineer working on deploying production model training and/or inference
Excellent knowledge of the DL tools PyTorch and TensorFlow
Excellent knowledge of and experience with DL systems libraries such as FSDP, DDP, DeepSpeed, and GPipe
Familiarity with LLMs, finetuning, and associated concepts
Familiarity with operating systems concepts, memory management, networking, and cloud computing
Familiarity with AWS infrastructure components like EC2, S3, EBS, EFS, EKS, and Lambda
Basic experience with version control systems (e.g., Git) and collaborative development workflows
Understanding of CI/CD methodologies and tools
Excellent communication and collaboration skills
Ability to work independently and as part of a team
Strong problem-solving and analytical skills
A passion for learning and staying updated with the latest ML technologies

Nice to Have:

Familiarity with Docker and Kubernetes to integrate code with underlying layers of deployment
Experience working on AWS or other public cloud providers
Experience with ML usability tools such as MLflow, W&B, or Amazon SageMaker
