Research Scientist, Benchmarks & Evaluations
Remote · United States
Job Summary
Lead the design of benchmarks and evaluations for AI data quality. Own the science of evaluation across DataLab by designing tasks that distinguish model capabilities, validating with human baselines and reliability analyses, and pressure-testing evaluations for contamination and elicitation gaps. Publish research establishing Protege's evaluation data as standards for frontier AI labs, enterprises, and policymakers. Translate findings into deployable evaluation datasets in collaboration with data and engineering teams, and manage the statistical machinery that determines annotator trust and calibration to produce trustworthy scores for customers.
Required Qualifications
- Advanced degree (PhD preferred, or MS/BS plus equivalent industry experience) in a quantitative field
- Hands-on experience evaluating LLMs, agents, or other ML systems
- Experience with annotator quality and inter-rater reliability
- Excellent scientific writing and communication
Desired Qualifications
- Experience with RL evaluation techniques, agentic RL pipelines, or latent-variable models of annotator skill
- Track record of published benchmarks or evaluation papers