Speech Research Intern-2
$93,600 per year
Remote · United States or Redmond, Washington, United States
Job Summary
PhD Research Intern role focused on designing and evaluating speech-first models, including spoken language models that reason over audio and interact conversationally.
Responsibilities include prototyping end-to-end speech dialogue systems, aligning speech encoders with text backbones via lightweight adapters, developing efficient speech tokenization and temporal compression for long-form audio, and building an evaluation harness that covers ASR, ST, SLU, and speech QA with streaming metrics.
Projects involve prototyping a conversational SLM with a speech encoder and adapters, creating data recipes that blend speech with instruction-following corpora, and shipping a minimal demo with streaming inference and logging.
Required skills include PhD candidacy in CS/EE (or a related field), Python/PyTorch with GPU experience, knowledge of Transformers/SSMs, and experience in at least one area such as discrete speech tokens, modality alignment via adapters, or post-training/instruction tuning for speech tasks.
Preferred qualifications include experience with neural speech codecs/vocoders, multilingual or code-switching speech, robustness and safety evaluation, distributed training (FSDP/DeepSpeed), speech toolkits such as ESPnet, SpeechBrain, and NVIDIA NeMo, and PyTorch ecosystem tools (CUDA, torchaudio/librosa, ONNX/TensorRT).
The role is based in Redmond, WA or remote, with flexible scheduling. Compensation is $45 per hour (annualized to $93,600) with a stipend and opportunities to publish, mentor, and access GPU infrastructure.
Required Qualifications
- PhD candidate in CS/EE (or related) with research in speech, audio ML, or multimodal LMs
- Fluency in Python and PyTorch, with hands-on GPU training; familiarity with torchaudio or librosa
- Working knowledge of modern sequence models (Transformers or SSMs) and training best practices
- Depth in at least one area: (a) discrete speech tokens/temporal compression, (b) modality alignment to LLMs via adapters, or (c) post-training/instruction tuning for speech tasks
- Strong experimentation habits: clean code, ablations, reproducibility, and clear reporting