Nuance Labs · 3 days ago

Member of Technical Staff — Model Optimization and Inference (New Grad)

Nuance Labs

$200,000–$300,000 per year

On-site · Seattle, Washington, United States


Type: Full Time
Level: Entry Level
Education: Master's Degree
Company size: Startup

Job Summary

Member of Technical Staff focused on model optimization and inference for a full-duplex, multimodal, real-time AI system. Responsibilities include:

Optimizing inference end to end across the model stack (LLMs, audio models, diffusion components)
Implementing and tuning KV cache strategies for long-context conversations
Extending inference serving frameworks (vLLM, SGLang, TensorRT-LLM)
Profiling and benchmarking latency and throughput
Building internal tooling for profiling and test harnesses
Accelerating diffusion model inference with caching and kernel optimizations
Applying quantization to reduce memory footprint while preserving quality
Collaborating with research and infrastructure teams to ship optimized models from day one

Required Qualifications

BS, MS, or PhD in CS, ML, or a related field — completed or in the final stretch
Strong fundamentals in LLM inference or ML systems
Exposure to inference serving frameworks (e.g., vLLM, SGLang, TensorRT-LLM)
Strong Python and PyTorch skills
A systematic approach to profiling and optimization
Curiosity about diffusion inference, quantization, or other inference-time acceleration techniques

Desired Qualifications

Familiarity with CUDA or Triton is a plus
Experience with model optimization, latency reduction, and real-time inference
Open-source contributions or internship/research experience in ML systems are welcome
Curiosity about diffusion inference and quantization techniques
Bonus: internship or research in LLM inference or model serving
Experience delivering under real-time latency SLAs
Willingness to grow in a fast-paced, research-to-production environment
