At Convergence, we're transforming the way AI integrates into our daily lives. Our team is developing the next generation of AI agents that don't just process information but take actions, learn from experience, and collaborate with humans. By introducing Large Meta Learning Models (LMLMs) that integrate memory as a core component, we're enabling AI to improve continuously through user feedback and acquire new skills during real-time use.
We believe in freeing individuals and businesses from mundane, repetitive tasks, allowing them to focus on innovative and creative work that truly matters. Our personalised AI assistant, Proxy, collaborates with users to enhance productivity and creativity. With $12 million in pre-seed funding from Balderton Capital, Salesforce Ventures, and Shopify Ventures, we're poised to make a significant impact in the AI space. Join us in shaping the future of human-AI collaboration and be part of our mission to transform the AI landscape.
Responsibilities
Design, implement, and maintain our ML-focused cloud infrastructure on GCP using Infrastructure as Code (Terraform)
Build and manage HPC clusters with Slurm for distributed ML workloads, focusing on GPU/TPU utilisation and job scheduling
Develop and maintain ML pipeline automation tools and ML-specific CI/CD workflows in Python
Design and optimise data storage solutions for ML datasets, model artefacts, and feature stores
Implement comprehensive monitoring, logging, and alerting solutions for ML model performance and infrastructure health
Collaborate with ML engineers and data scientists to provide robust infrastructure for model training and deployment
Lead and implement security best practices for ML systems, including model security and data protection
Requirements
3+ years of experience in ML infrastructure or ML platform engineering
Strong proficiency in Python for ML pipeline automation and tooling
Extensive experience with Slurm cluster management for large-scale ML workloads
Proven track record with Terraform and Infrastructure as Code for ML environments
Solid understanding of GCP's ML-specific services (Vertex AI, AI Platform, etc.)
Experience with distributed training systems and model serving infrastructure
Experience with ML observability tools and performance monitoring
Excellent problem-solving skills with a focus on ML system reliability and optimisation
Nice to have
Experience scaling large language model (LLM) infrastructure
Knowledge of ML-specific orchestration tools (e.g., MLflow, Ray)
Experience with high-performance computing for ML training
Contributions to ML infrastructure-related open-source projects
Experience with GPU/TPU cluster management and optimisation
Background in ML operations (MLOps) or AI reliability engineering
Familiarity with vector databases and efficient embedding storage/retrieval
What we offer
Be at the cutting edge of AI and LLM technology
Work on challenging problems that impact users' daily lives
Collaborative and innovative work environment
Opportunities for professional growth and learning
Competitive salary and benefits package