Tech Lead, AI Compute Infrastructure
On-site · San Francisco, California, United States or Toronto, Ontario, Canada
Job Summary
The Tech Lead for AI Compute Infrastructure will build the scalable platform powering HeyGen's generative video models. Responsibilities include:
- Designing and implementing robust, efficient compute infrastructure
- Optimizing GPU utilization across thousands of devices
- Developing large-scale AI job frameworks for high-volume, multi-modal data ingestion, distributed model training, and continuous evaluation
- Enhancing observability with tracing and visualization tools
- Accelerating pipelines by integrating CUDA kernels and distributed training libraries
- Managing infrastructure with modern cloud and container technologies (Kubernetes, Ray) to enable elastic, cost-efficient scaling
Required Qualifications
- Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
- 5+ years of full-time industry experience in large-scale MLOps, AI infrastructure, or HPC systems
- Experience with data frameworks and standards such as Ray, Apache Spark, and LanceDB
- Strong proficiency in Python and C++
- Deep understanding of Kubernetes and Ray
- Experience with PyTorch, TensorFlow, or JAX