Senior Software Engineer - Model Training & AI Evals
Remote · India
Job Summary
The Senior Software Engineer will own end-to-end evaluation and benchmarking infrastructure for LLMs and base models, contribute hands-on to post-training pipelines, and lead domain-specific benchmarks and synthetic data generation to drive model improvements. Responsibilities include designing task-level evaluation frameworks, building comparative benchmarking pipelines, producing capability gap reports, tracking regressions across model versions, and collaborating with product, curriculum, and research teams to translate eval insights into post-training and data flywheel decisions. The role requires hands-on experience with SFT, RLHF, RLAIF, DPO, PPO, reward modeling, and data quality criteria, plus strong software engineering skills (Python, PyTorch/JAX) and experience with CI/CD and experiment tracking.
Required Qualifications
- 5+ years of ML/AI engineering experience, with at least 2–3 years focused on large language models
- Direct, hands-on experience at an LLM lab, AI research organization, or equivalent frontier AI team
- Familiarity with the full model lifecycle: pre-training data, post-training alignment, eval, and production deployment
- Deep practical expertise in post-training methods: SFT, RLHF, RLAIF, DPO, PPO
- Experience with reward modeling, preference data curation, and quality control for alignment pipelines
- Demonstrated experience designing LLM evaluation frameworks beyond standard benchmarks
- Hands-on experience building synthetic data generation pipelines for addressing model capability gaps
- Experience validating synthetic data quality through downstream model performance experiments
- Proven track record of comparative benchmarking across leading foundation models
- Experience training or fine-tuning vertical/industry-specific foundation models
- Strong software engineering fundamentals: Python, PyTorch or JAX, distributed training
- Publications or applied research contributions in LLM evaluation or alignment (preferred)
- Experience with multi-modal models or agents with external tool/API use
- Exposure to red-teaming, adversarial evaluation, or safety benchmarking
- Model distillation, speculative decoding, or inference optimization experience
- Prior experience in education, STEM, legal, biomedical, or enterprise software vertical