AI Eval / Testing (Eval Engineer)
Hybrid · Dallas, Texas, United States
Job Summary
The AI Evaluation and Testing Engineer, based in Dallas, TX, ensures that generative AI models and applications are safe, accurate, and trustworthy, and that they deliver an elegant user experience. Responsibilities:
- Build and maintain AI evaluation pipelines to test, measure, and evaluate AI systems
- Implement traces, spans, and session tracking for observability
- Define AI quality metrics and KPIs (factuality, faithfulness, toxicity, grounding precision/recall, latency, cost)
- Implement evaluation and testing automation for end-to-end regression testing at scale
- Define release gates in CI/CD
- Perform adversarial testing and root-cause analysis
- Collaborate with cross-functional teams (product, engineering, linguistics, customer support) to shape human-AI interaction
The role also includes AI Platform Admin, AI Reusable Utility, and AI Common Infrastructure responsibilities. It places strong emphasis on Python, testing frameworks, evaluation tools (LangSmith, DeepEval, TruLens, Promptfoo), agent frameworks (LangChain, CrewAI, LlamaIndex), and observability, in a startup-paced environment.
Required Qualifications
- 5+ years of experience with Python and testing frameworks such as pytest
- 5+ years of hands-on experience with evaluation tools like LangSmith, DeepEval, TruLens, or Promptfoo
- 3 to 5 years of familiarity with agentic workflows built on LangChain, CrewAI, or LlamaIndex
- Understanding of tracing and session tracking to map how errors propagate in RAG systems
- 5+ years of experience applying software testing fundamentals, including writing test plans, executing test cases, and producing detailed reports
- Strong analytical and debugging skills and attention to detail
- 5+ years of experience with Python, scripting, and test automation frameworks and tools such as pytest, Selenium, and Robot Framework
- Working knowledge of generative AI models, AI agents, and related concepts such as retrieval-augmented generation (RAG), prompt engineering, context engineering, explainability, traceability, observability, guardrails, reasoning, and specificity
- Understanding of how testing conventional software differs from evaluating generative AI systems
- Team player with ability to collaborate with remote and cross-functional teams
- Go-getter attitude for fast-paced startup environments
- Experience with AI evaluation frameworks such as Arize, Braintrust, DeepEval, LangSmith, Ragas
- AI safety and red teaming experience (prompt injection, jailbreak, adversarial and stress testing)
- Experience with different AI evaluation methods (human-in-the-loop, LLM-as-a-judge)
- Education: Degree in Computer Science, Data Science, Linguistics, or closely related fields
- Compliance with responsible AI principles and data policies
- Disclosure of subcontractors and offshore delivery locations
Desired Qualifications
- Python
- pytest
- LangSmith
- DeepEval
- TruLens
- Promptfoo
- LangChain
- CrewAI
- LlamaIndex
- observability
- Selenium
- Robot Framework
- testing automation
- regression testing
- CI/CD
- adversarial testing
- red teaming
- quality metrics
- data governance
- privacy
- RAG
- LLM-as-a-judge
- human-in-the-loop