Protege · 3 weeks ago

Research Scientist, Benchmarks & Evaluations

Protege

Remote · United States

Type

Full Time

Level

Senior Level

Education

Doctorate Or Professional Degree

Company size

Unknown

Job Summary

Lead the design of benchmarks and evaluations for AI data quality. Own the science of evaluation across DataLab by designing tasks that distinguish model capabilities, validating with human baselines and reliability analyses, and pressure-testing evaluations for contamination and elicitation gaps. Publish research establishing Protege's evaluation data as standards for frontier AI labs, enterprises, and policymakers. Translate findings into deployable evaluation datasets in collaboration with data and engineering teams, and manage the statistical machinery that determines annotator trust and calibration to produce trustworthy scores for customers.

Required Qualifications

Advanced degree (PhD preferred, or MS/BS plus equivalent industry experience) in a quantitative field
Hands-on experience evaluating LLMs, agents, or other ML systems
Experience with annotator quality and inter-rater reliability
Excellent scientific writing and communication

Desired Qualifications

Bonus: RL evaluation techniques, agentic RL pipelines, latent-variable models of annotator skill
Track record of published benchmarks or evaluation papers
