4 days ago

Senior Software Engineer - Ceph

Toronto
Boson AI is a startup building large language tools for everyone to use. Our founders (Alex Smola and Mu Li) and a team of Deep Learning, Optimization, NLP, AutoML and Statistics scientists and engineers are working on high-quality generative AI models for language, audio, and entertainment.

About The Role

We are looking for a Senior Software Engineer with deep expertise in managing Ceph for our deep learning datacenter in Toronto. The ideal candidate has strong problem-solving skills and an ability to learn new tools. Experience with Slurm, MAAS, InfiniBand, NVIDIA DeepOps, Layer 3 networking and related tools is a big plus. You should be comfortable performing some amount of hardware configuration.

You will have the opportunity to work with NVIDIA H100 and A100 GPUs, over 25 PB of disk and over 5 PB of flash storage, terabit networking and hundreds of computers. You will be responsible for deploying and operating Ceph and integrating it with a broad range of infrastructure technologies and hardware systems.

You MUST have prior Ceph experience in order to qualify for the job. If you don't, please don't spam the ATS.

A day in the life:

  • Design, manage and maintain large storage arrays
  • Integrate them with Deep Learning infrastructure
  • Support troubleshooting for MAAS, Slurm and Kubernetes as needed
  • Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices
  • Learn about new tools and deploy them

You might be a great fit if you have:

  • Strong background in maintaining Ceph clusters
  • Experience with high-performance computing (highly desirable)
  • Experience with on-premises data center operations and technologies
  • Experience in managing a large hardware cluster
  • Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code
  • Experience managing firmware and system updates, e.g. on Supermicro hardware
  • Strong problem-solving ability and a willingness to learn new techniques
