How monday Service Uses LangSmith and LangGraph to Build Reliable AI Service Agents
monday Service implemented an eval-driven development framework using LangSmith and LangGraph to build and monitor customer-facing AI service agents, achieving 8.7x faster evaluation cycles for IT, HR, and Legal support workflows.
Impact
- 8.7x faster evaluation cycles
- 4.1x faster from parallelization alone
Challenge
Building reliable customer-facing AI agents where minor prompt deviations cascade into incorrect outcomes, with no efficient way to test and validate agent behavior before production.
Solution
monday Service implemented an eval-driven development framework using LangSmith for evaluation and tracing, and LangGraph for agent orchestration, combining offline regression testing with online trajectory monitoring.
Full Story
monday Service, the enterprise service management arm of monday.com, set out to build production-grade AI agents capable of handling complex, multi-turn customer conversations across IT, HR, and Legal departments. The fundamental challenge: in agentic systems, even minor prompt or tool-call deviations can cascade into significantly incorrect outcomes, making traditional development approaches insufficient.
The team created an eval-driven development (EDD) framework resting on two pillars. Offline evaluations serve as a safety net, running hundreds of test scenarios against sanitized IT tickets before any code reaches production. Online evaluations act as a real-time monitor, scoring entire multi-turn conversation trajectories with LLM-as-judge metrics and tracking business signals such as automated resolution and containment rates. LangSmith provided the evaluation platform and tracing infrastructure, while LangGraph powered the ReAct-based agent architecture.
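The trajectory-scoring idea can be sketched in a few lines. Everything below is illustrative, not monday Service's actual code: `Turn`, `TrajectoryScore`, and `judgeTrajectory` are hypothetical names, and the "judge" is a deterministic heuristic standing in for an LLM-as-judge call so the sketch runs without model access.

```typescript
// One turn in a multi-turn conversation trajectory: the unit both pillars score.
interface Turn {
  role: "user" | "agent" | "tool";
  content: string;
}

// Signals a trajectory judge might emit (illustrative, not monday's metrics).
interface TrajectoryScore {
  resolved: boolean;   // did the agent reach a resolution step?
  toolErrors: number;  // tool turns that reported an error
}

// Stand-in for an LLM-as-judge call; a real judge would prompt a model
// with the full trajectory and parse a structured verdict.
function judgeTrajectory(turns: Turn[]): TrajectoryScore {
  const toolErrors = turns.filter(
    (t) => t.role === "tool" && t.content.includes("error")
  ).length;
  const resolved = turns.some(
    (t) => t.role === "agent" && t.content.toLowerCase().includes("resolved")
  );
  return { resolved, toolErrors };
}

// Offline pillar: replay the judge over a fixed set of sanitized test tickets,
// so a prompt or tool-call regression is caught before code reaches production.
const regressionSet: Turn[][] = [
  [
    { role: "user", content: "My VPN keeps disconnecting." },
    { role: "tool", content: "kb.search ok: 3 articles found" },
    { role: "agent", content: "Resolved: updated your VPN client config." },
  ],
];

const scores = regressionSet.map(judgeTrajectory);
console.log(scores); // [ { resolved: true, toolErrors: 0 } ]
```

The same scoring function can serve both pillars: replayed over a frozen regression set offline, and applied to live traces online.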
The results demonstrated the power of the approach: evaluation speed improved by 8.7x, from 162 seconds to just 18 seconds per evaluation cycle, through parallelization and concurrent LLM scoring. The team can now evaluate hundreds of examples in minutes rather than hours, enabling rapid iteration on agent behavior. The Evaluations as Code (EaC) pattern they pioneered treats AI judges as versioned TypeScript objects in source control, integrated directly into CI/CD pipelines for continuous quality assurance.
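The parallelization behind the speedup can be sketched with a simple worker pool that scores many examples concurrently instead of one at a time. This is a generic concurrency pattern, not monday Service's code: `scoreExample` is a hypothetical stand-in for a single slow LLM-judge call.

```typescript
// Stand-in for one LLM-judge call; the delay simulates model latency.
async function scoreExample(example: string): Promise<number> {
  await new Promise((resolve) => setTimeout(resolve, 50));
  return example.length % 2 === 0 ? 1 : 0; // placeholder score
}

// Score all examples with a bounded worker pool: each worker repeatedly
// pulls the next unscored example, so up to `concurrency` judge calls
// are in flight at once.
async function scoreAll(
  examples: string[],
  concurrency: number
): Promise<number[]> {
  const results: number[] = new Array(examples.length);
  let next = 0;
  const workers = Array.from({ length: concurrency }, async () => {
    while (next < examples.length) {
      const i = next++; // safe: JS is single-threaded between awaits
      results[i] = await scoreExample(examples[i]);
    }
  });
  await Promise.all(workers);
  return results;
}

const examples = ["ticket-1", "ticket-02", "ticket-3", "ticket-04"];
scoreAll(examples, 4).then((scores) => console.log(scores)); // [ 1, 0, 1, 0 ]
```

With judge calls dominated by network and model latency rather than CPU, wall-clock time for a batch shrinks roughly in proportion to the concurrency limit, which is the effect the team measured as a 4.1x parallelization gain.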