Protege · 3 weeks ago

Research Scientist, Benchmarks & Evaluations

Remote · United States

Type: Full Time
Level: Senior Level
Education: Doctorate or Professional Degree
Company size: Unknown

Job Summary

Lead the design of benchmarks and evaluations for AI data quality. Own the science of evaluation across DataLab by designing tasks that distinguish model capabilities, validating with human baselines and reliability analyses, and pressure-testing evaluations for contamination and elicitation gaps. Publish research establishing Protege's evaluation data as standards for frontier AI labs, enterprises, and policymakers. Translate findings into deployable evaluation datasets in collaboration with data and engineering teams, and manage the statistical machinery that determines annotator trust and calibration to produce trustworthy scores for customers.

Required Qualifications

  • Advanced degree (PhD preferred, or MS/BS plus equivalent industry experience) in a quantitative field
  • Hands-on experience evaluating LLMs, agents, or other ML systems
  • Experience with annotator quality and inter-rater reliability
  • Excellent scientific writing and communication

Desired Qualifications

  • Bonus: RL evaluation techniques, agentic RL pipelines, latent-variable models of annotator skill
  • Track record of published benchmarks or evaluation papers