Boson AI is a startup building large language tools for everyone to use. Our founders, Alex Smola and Mu Li, and a team of Deep Learning, Optimization, NLP, AutoML and Statistics scientists and engineers are working on high-quality generative AI models for language and beyond.

About The Role

We are looking for a Senior Infrastructure Engineer / System Administrator to help us operate our datacenter deployment in Toronto. The ideal candidate has strong problem-solving skills and the ability to learn new tools. Experience with Slurm, MAAS, Ceph, OPNsense, networking and related tools is a big plus. You should be comfortable performing some hardware configuration.

You will have the opportunity to work with the latest NVIDIA H100 GPUs, many PB of storage, Terabit networking and hundreds of computers. You will be responsible for deploying and operating a broad range of infrastructure technologies and hardware systems.

A day in the life:

Manage large, private, high-end GPU clusters

Own the full lifecycle of physical systems, including deployment of new hardware, operations, triage and troubleshooting

Configure and maintain network switches (Tomahawk TH3, Mellanox InfiniBand)

Configure and maintain MAAS (metal as a service), Ceph, and Slurm

Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices

Configure and maintain network and security tools, including VPN, VLAN, DHCP, SSO, MFA

Learn about new tools and deploy them

You might be a great fit if you have:

Strong background in system operations, including Slurm, Ansible, MAAS, Ceph, OPNsense and Kubernetes

Experience with on-premises data center operations and technologies

Experience in managing a large hardware cluster

Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code

Experience in designing, deploying, and maintaining production-grade machine learning systems at scale

Familiarity with GPU utilization for machine learning workloads and optimization techniques

Experience managing firmware and system updates, e.g. on Supermicro systems

The ability to solve problems and to learn new techniques is key.
