Centific · 2 days ago

Speech Research Intern-2

Centific

$93,600 per year

Remote · United States or Redmond, Washington, United States

Type

Full Time

Level

Entry Level

Education

Doctorate Or Professional Degree

Company size

Unknown

Job Summary

This PhD research internship focuses on designing and evaluating speech-first models, including spoken language models (SLMs) that reason over audio and interact conversationally.

Responsibilities include prototyping end-to-end speech dialogue systems, aligning speech encoders with text backbones via lightweight adapters, developing efficient speech tokenization and temporal compression for long-form audio, and building an evaluation harness that covers ASR, ST, SLU, and speech QA with streaming metrics. Projects include prototyping a conversational SLM with a speech encoder and adapters, creating data recipes that blend speech with instruction-following corpora, and shipping a minimal demo with streaming inference and logging.

Required skills include PhD candidacy in CS/EE (or a related field), Python and PyTorch with GPU experience, knowledge of Transformers or SSMs, and experience in at least one of the following areas: discrete speech tokens, modality alignment via adapters, or post-training/instruction tuning for speech tasks. Preferred qualifications include experience with neural speech codecs and vocoders, multilingual or code-switching speech, robustness and safety evaluation, distributed training (FSDP/DeepSpeed), toolkits such as ESPnet, SpeechBrain, and NVIDIA NeMo, and PyTorch ecosystem tools (CUDA, torchaudio/librosa, ONNX/TensorRT).

The role is based in Redmond, WA or remote, with flexible scheduling. Compensation is $45 per hour (annualized to $93,600) with a stipend, along with opportunities to publish and mentor and access to GPU infrastructure.

Required Qualifications

PhD candidate in CS/EE (or related) with research in speech, audio ML, or multimodal LMs
Fluency in Python and PyTorch, with hands-on GPU training; familiarity with torchaudio or librosa
Working knowledge of modern sequence models (Transformers or SSMs) and training best practices
Depth in at least one area: (a) discrete speech tokens/temporal compression, (b) modality alignment to LLMs via adapters, or (c) post-training/instruction tuning for speech tasks
Strong experimentation habits: clean code, ablations, reproducibility, and clear reporting
