Enterprise Software · Customer Service & Support

How monday Service Uses LangSmith and LangGraph to Build Reliable AI Service Agents

monday Service implemented an eval-driven development framework using LangSmith and LangGraph to build and monitor customer-facing AI service agents, achieving 8.7x faster evaluation cycles for IT, HR, and Legal support workflows.

Impact

8.7x faster evaluation cycles

4.1x speedup from parallelization

Challenge

Building reliable customer-facing AI agents where minor prompt deviations cascade into incorrect outcomes, with no efficient way to test and validate agent behavior before production.

Solution

monday Service implemented an eval-driven development framework using LangSmith for evaluation and tracing, and LangGraph for agent orchestration, with offline regression testing and online trajectory monitoring.

Tools & Technologies

LangSmith · LangGraph

Full Story

monday Service, the enterprise service management arm of monday.com, set out to build production-grade AI agents capable of handling complex, multi-turn customer conversations across IT, HR, and Legal departments. The fundamental challenge: in agentic systems, even minor prompt or tool-call deviations can cascade into significantly incorrect outcomes, making traditional development approaches insufficient.

The team's answer was an eval-driven development (EDD) framework built on two pillars. Offline evaluations serve as a safety net, running hundreds of test scenarios against sanitized IT tickets before any code reaches production. Online evaluations act as a real-time monitor, scoring entire multi-turn conversation trajectories with LLM-as-judge metrics and tracking business signals such as automated resolution and containment rates. LangSmith provided the evaluation platform and tracing infrastructure, while LangGraph powered the ReAct-based agent architecture.
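To make the offline pillar concrete, here is a minimal sketch in TypeScript using the langsmith JS SDK's evaluate helper (evaluator signature per recent versions of the SDK). The serviceAgent stub, the gradeWithLlm placeholder, the automated_resolution metric key, and the sanitized-it-tickets dataset name are all assumptions for illustration; monday.com's actual agent, judges, and datasets are not public.

```ts
import { evaluate } from "langsmith/evaluation";

// Hypothetical stand-in for the LangGraph-powered service agent under test.
async function serviceAgent(inputs: Record<string, unknown>) {
  return { answer: `resolved: ${JSON.stringify(inputs)}` };
}

// Placeholder judge call; in practice this would prompt a grader LLM
// with the full multi-turn trajectory and a grading rubric.
async function gradeWithLlm(
  outputs?: Record<string, unknown>,
  reference?: Record<string, unknown>
): Promise<boolean> {
  return JSON.stringify(outputs) === JSON.stringify(reference);
}

// LLM-as-judge metric that scores a whole trajectory, not a single reply.
async function automatedResolution({
  outputs,
  referenceOutputs,
}: {
  outputs?: Record<string, unknown>;
  referenceOutputs?: Record<string, unknown>;
}) {
  const resolved = await gradeWithLlm(outputs, referenceOutputs);
  return { key: "automated_resolution", score: resolved ? 1 : 0 };
}

// Offline safety net: replay a dataset of sanitized tickets against the
// agent before any code ships, with judges scored concurrently.
await evaluate(serviceAgent, {
  data: "sanitized-it-tickets", // assumed dataset name
  evaluators: [automatedResolution],
  maxConcurrency: 8,
});
```

Concurrency of this kind, running many scenarios and judge calls in parallel rather than sequentially, is also where the parallelization gains reported below come from.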

The results demonstrated the power of the approach: evaluation speed improved by 8.7x, from 162 seconds to just 18 seconds per evaluation cycle, through parallelization and concurrent LLM scoring. The team can now evaluate hundreds of examples in minutes rather than hours, enabling rapid iteration on agent behavior. The Evaluations as Code (EaC) pattern they pioneered treats AI judges as versioned TypeScript objects in source control, integrated directly into CI/CD pipelines for continuous quality assurance.
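The EaC pattern itself is easiest to picture as a judge definition that lives in the repository like any other module. The Judge interface, field names, and threshold below are hypothetical, a sketch of the idea rather than monday.com's actual schema:

```ts
// judges/containment.v3.ts: an AI judge as a versioned, reviewable artifact.
// The Judge shape and every value below are illustrative assumptions.

export interface Judge {
  name: string;
  version: number;       // bumped through normal code review, like any change
  model: string;         // grader model pinned for reproducible scoring
  prompt: string;        // grading rubric, diffable in pull requests
  passThreshold: number; // CI gate: minimum mean score required to merge
}

export const containmentJudge: Judge = {
  name: "containment",
  version: 3,
  model: "grader-model-id", // placeholder model identifier
  prompt:
    "Given the full conversation trajectory, return 1 if the agent fully " +
    "resolved the ticket without human handoff, otherwise return 0.",
  passThreshold: 0.85,
};
```

Because the judge is an ordinary exported object, a CI step can import it, run it over the offline dataset, and fail the build when the mean score drops below passThreshold, which is what turns evaluation into a merge gate rather than a separate manual process.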

Similar Cases

Asana
Campaign brief review cycle: days to ~15 minutes

Asana built AI Teammates — autonomous agents powered by Claude Opus 4.6 — that work alongside human teams within existing Asana workflows. The agents handle campaign brief drafting, launch tracking, compliance review, and HR triage, compressing brief review cycles from multiple days down to approximately 15 minutes. A multi-agent architecture routes tasks to specialized subagents optimized for speed or reasoning.

Enterprise Software · Claude Opus 4.6 · Claude API
Jamf
Performance review skill build time: under 45 minutes

Jamf deployed Claude Enterprise across 16 departments, then built interactive workflow skills using Claude Cowork that transformed manual spreadsheet-based processes into guided, conversational experiences. Performance reviews that previously required months of effort are now built in under 45 minutes, and non-engineering teams independently create custom data dashboards.

Enterprise Software · Claude Enterprise · Claude Cowork
Duvo
Annualized savings at Rohlik Group within three months: €2.8M+

Duvo builds AI agents that orchestrate multi-step workflows across fragmented enterprise systems—ERPs, supplier portals, spreadsheets, and email—without requiring API integrations. Using Claude, Duvo's agents handle tasks such as supplier chasing, price monitoring, and promotional setup with human-in-the-loop approval for high-risk actions. Early customer Rohlik Group realized €2.8M in annualized savings within three months of deployment.

Operations Automation · Enterprise Software · Claude