NTT DATA · today

AI Eval / Testing (Eval Engineer)

NTT DATA

Hybrid · Dallas, Texas, United States

Type

Full Time

Level

Senior Level

Education

Bachelor's Degree

Company size

Enterprise

Job Summary

NTT DATA is hiring an AI Evaluation and Testing Engineer in Dallas, TX to ensure that generative AI models and applications are safe, accurate, and trustworthy, and that they deliver an elegant user experience. Responsibilities: build and maintain AI evaluation pipelines that test and measure AI systems; implement traces, spans, and session tracking for observability; define AI quality metrics and KPIs (factuality, faithfulness, toxicity, grounding precision/recall, latency, cost); automate evaluation and end-to-end regression testing at scale; define release gates in CI/CD; perform adversarial testing and root-cause analysis; and collaborate with cross-functional teams (product, engineering, linguistics, customer support) to shape human-AI interaction. The role also covers AI Platform Admin, AI Reusable Utility, and AI Common Infrastructure responsibilities. It places a strong emphasis on Python, testing frameworks, evaluation tools (LangSmith, DeepEval, TruLens, Promptfoo), agent frameworks (LangChain, CrewAI, LlamaIndex), and observability, in a startup-paced environment.

Required Qualifications

5+ years of strong proficiency in Python and testing frameworks like pytest
5+ years of hands-on experience with evaluation tools like LangSmith, DeepEval, TruLens, or Promptfoo
3 to 5 years of familiarity with agentic workflows built on LangChain, CrewAI, or LlamaIndex
Understanding of tracing and session tracking to map how errors propagate in RAG systems
5+ years of strong software testing fundamentals and expertise in writing test plans, executing test cases, and generating detailed reports
Strong analytical and debugging skills and attention to detail
5+ years of proficiency in Python, scripting, and software testing automation frameworks and tools such as pytest, Selenium, and Robot Framework
Working knowledge of generative AI models, AI agents, and related concepts such as retrieval-augmented generation (RAG), prompt engineering, context engineering, explainability, traceability, observability, guardrails, reasoning, and specificity
Understanding of the differences between testing conventional software and evaluating generative AI systems
Team player with ability to collaborate with remote and cross-functional teams
Go-getter attitude for fast-paced startup environments
Experience with AI evaluation frameworks such as Arize, Braintrust, DeepEval, LangSmith, and Ragas
AI safety and red teaming experience (prompt injection, jailbreak, adversarial and stress testing)
Knowledge of different AI evaluation methods (human-in-the-loop, LLM-as-a-judge)
Education: Degree in Computer Science, Data Science, Linguistics, or a closely related field
Compliance with responsible AI principles and data policies
Disclosure of subcontractors and offshore delivery locations

Desired Qualifications

Python
pytest
LangSmith
DeepEval
TruLens
Promptfoo
LangChain
CrewAI
LlamaIndex
observability
Selenium
Robot Framework
testing automation
regression testing
CI/CD
adversarial testing
red teaming
quality metrics
data governance
privacy
RAG
LLM-as-a-judge
human-in-the-loop
