Boson AI is a startup building large language tools for everyone to use. Our founders (Alex Smola and Mu Li) and a team of deep learning, optimization, NLP, AutoML, and statistics scientists and engineers are working on high-quality generative AI models for language, audio, and entertainment.

About The Role

We are looking for a Senior High Performance Computing Engineer to help us operate the GPUs, network, and filesystem in our datacenter deployment in Toronto. The ideal candidate has strong problem-solving skills and the ability to learn new tools. Experience with Slurm, MAAS, Ceph, InfiniBand, NVIDIA DeepOps, Ethernet networking, and related tools is a big plus. You should be comfortable performing some amount of hardware configuration.

You will have the opportunity to work with NVIDIA H100 and A100 GPUs, over 20 PB of storage, terabit networking, and hundreds of computers. You will be responsible for deploying and operating a broad range of infrastructure technologies and hardware systems.

A day in the life:

Manage large, private, high-end GPU clusters

Own the full lifecycle of physical systems, including deployment of new hardware, operations, triage, and troubleshooting

Configure and maintain network switches (Tomahawk Ethernet, Mellanox InfiniBand)

Configure and maintain MAAS, Ceph, Slurm and Kubernetes

Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices

Configure and maintain the network, e.g. Layer 3 networking

Learn about new tools and deploy them

You might be a great fit if you have:

Strong background in high performance computing

Experience with on-premises data center operations and technologies

Experience in managing a large hardware cluster

Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code

Experience in designing, deploying, and maintaining production-grade machine learning systems at scale

Familiarity with GPU utilization for machine learning workloads and optimization techniques

Experience managing firmware and system updates, e.g. on Supermicro hardware

The ability to solve problems and to learn new techniques is key.
