About RapidFire AI
RapidFire AI is a cutting-edge deep tech startup specializing in scaling Machine Learning solutions. We are dedicated to empowering customers to effortlessly scale their AI workloads, ensuring they stay at the forefront of innovation in their industries.
About the Role
We are seeking a highly motivated and skilled Machine Learning Engineer to join our growing team. In this role, you will develop distributed infrastructure for Deep Learning (DL) applications on the cloud and contribute to the design of new features. You will collaborate closely with other developers and customer-facing personnel to ensure a seamless product experience.
Responsibilities:
- Design, develop, deploy, and maintain large-scale DL infrastructure software, following best practices and software engineering guidelines
- Contribute to designing efficient, modular, and fault-tolerant distributed systems that can scale DL computations
- Automate the setup, launch, and orchestration of end-to-end training and experimentation pipelines written with PyTorch, TensorFlow, or Keras
- Use and extend libraries like FSDP, DDP, DeepSpeed, and GPipe to train DL models across multiple GPUs
- Use tools like Pandas and Dask to handle large multimodal datasets
- Troubleshoot code and fix bugs to ensure smooth functioning of the application
- Monitor and troubleshoot cluster resource usage to ensure optimal performance
- Conform to continuous integration and continuous delivery (CI/CD) pipeline standards for code deployment
- Communicate effectively with the wider team to ensure successful application development and deployment
- Collaborate with other developers to define and implement cloud infrastructure strategies for DL applications
- Stay up to date with the latest advancements in DL and AI technologies and best practices
Required Skills:
- 4+ years programming experience with Python
- Proven experience as an ML Engineer working on deploying production model training and/or inference
- Excellent knowledge of the DL tools PyTorch and TensorFlow
- Excellent knowledge of, and experience using, DL systems libraries such as FSDP, DDP, DeepSpeed, and GPipe
- Familiarity with LLMs, fine-tuning, and associated concepts
- Familiarity with operating systems concepts, memory management, networking, and cloud computing
- Familiarity with AWS infrastructure components like EC2, S3, EBS, EFS, EKS, and Lambda
- Basic experience with version control systems (e.g., Git) and collaborative development workflows
- Understanding of CI/CD methodologies and tools
- Excellent communication and collaboration skills
- Ability to work independently and as part of a team
- Strong problem-solving and analytical skills
- A passion for learning and staying updated with the latest ML technologies
Nice to Have:
- Familiarity with Docker and Kubernetes to integrate code with underlying layers of deployment
- Experience working with AWS or other public cloud providers
- Experience with ML usability tools such as MLflow, W&B, or Amazon SageMaker