Technology, Customer Service

How monday Service Uses LangSmith and LangGraph to Build Reliable AI Service Agents

monday Service implemented an eval-driven development framework using LangSmith and LangGraph to build and monitor customer-facing AI service agents, achieving 8.7x faster evaluation cycles for IT, HR, and Legal support workflows.

Impact

Evaluation speed improvement: 8.7x faster

Parallelization benefit: 4.1x faster

Challenge

monday Service needed to build reliable customer-facing AI agents, but minor prompt deviations can cascade into incorrect outcomes, and the team had no efficient way to test and validate agent behavior before production.

Solution

monday Service implemented an eval-driven development framework using LangSmith for evaluation and tracing and LangGraph for agent orchestration, with offline regression testing and online trajectory monitoring.

Full Story

monday Service, the enterprise service management arm of monday.com, set out to build production-grade AI agents capable of handling complex, multi-turn customer conversations across IT, HR, and Legal departments. The fundamental challenge: in agentic systems, even minor prompt or tool-call deviations can cascade into significantly incorrect outcomes, making traditional development approaches insufficient.

The team built an eval-driven development (EDD) framework that rests on two pillars. Offline evaluations serve as a safety net: hundreds of test scenarios run against sanitized IT tickets before any code reaches production. Online evaluations act as a real-time monitor: they score entire multi-turn conversation trajectories with LLM-as-judge metrics and track business signals such as automated resolution and containment rates. LangSmith provided the evaluation platform and tracing infrastructure, while LangGraph powered the ReAct-based agent architecture.
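The business signals mentioned above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Conversation` shape, the field names, and the definitions themselves (containment as "never escalated to a human", automated resolution as "resolved without escalation") are not taken from monday Service's implementation, which may define these rates differently.

```typescript
// Hypothetical record of a finished conversation; field names are assumptions.
interface Conversation {
  id: string;
  escalatedToHuman: boolean; // the agent handed off to a human
  resolved: boolean;         // the user's issue was marked resolved
}

interface BusinessSignals {
  containmentRate: number;
  automatedResolutionRate: number;
}

// Assumed definitions:
//   containment          = share of conversations never escalated to a human
//   automated resolution = share resolved without any escalation
function computeSignals(conversations: Conversation[]): BusinessSignals {
  if (conversations.length === 0) {
    return { containmentRate: 0, automatedResolutionRate: 0 };
  }
  const contained = conversations.filter((c) => !c.escalatedToHuman);
  const autoResolved = contained.filter((c) => c.resolved);
  return {
    containmentRate: contained.length / conversations.length,
    automatedResolutionRate: autoResolved.length / conversations.length,
  };
}

// Illustrative data only, not real tickets.
const sample: Conversation[] = [
  { id: "a", escalatedToHuman: false, resolved: true },
  { id: "b", escalatedToHuman: false, resolved: false },
  { id: "c", escalatedToHuman: true, resolved: true },
  { id: "d", escalatedToHuman: false, resolved: true },
];

const signals = computeSignals(sample);
console.log(signals);
```

In this sketch a conversation that was escalated and later resolved counts toward neither rate, which is one possible design choice; a production dashboard would need to agree on such edge cases explicitly.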

Evaluation speed improved by 8.7x, from roughly 162 seconds to roughly 18 seconds per evaluation cycle, through parallelization and concurrent LLM scoring. The team can now evaluate hundreds of examples in minutes rather than hours, which enables rapid iteration on agent behavior. The team also pioneered an Evaluations as Code (EaC) pattern, which treats AI judges as versioned TypeScript objects in source control and integrates them directly into CI/CD pipelines for continuous quality assurance.
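As a rough sketch of how a judge might live in source control as a versioned TypeScript object and be run concurrently, consider the following. All names (`Judge`, `mapWithConcurrency`, `evaluate`, `politenessJudge`) are hypothetical and are not LangSmith APIs or monday Service's code. The judge here is a deterministic stand-in; a real one would presumably call an LLM with the rubric.

```typescript
// Hypothetical types for illustration; not part of any real library.
interface Example {
  id: string;
  output: string; // the agent reply to be judged
}

interface Judge {
  name: string;
  version: string; // bumped in source control whenever the rubric changes
  rubric: string;
  score: (example: Example) => Promise<number>;
}

interface Score {
  judge: string;
  version: string;
  exampleId: string;
  score: number;
}

// Stand-in judge: a regex check instead of an LLM call.
const politenessJudge: Judge = {
  name: "politeness",
  version: "1.2.0",
  rubric: "Score 1 if the reply opens with a greeting, else 0.",
  score: async (example) => (/^(hi|hello)\b/i.test(example.output) ? 1 : 0),
};

// Run `worker` over `items` with at most `limit` calls in flight,
// preserving input order in the results.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so no two runners share an index
      results[index] = await worker(items[index]);
    }
  }
  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Score every example with one judge, tagging results with the judge version.
async function evaluate(judge: Judge, examples: Example[], limit = 4): Promise<Score[]> {
  return mapWithConcurrency(examples, limit, async (example) => ({
    judge: judge.name,
    version: judge.version,
    exampleId: example.id,
    score: await judge.score(example),
  }));
}

// Illustrative examples only.
const examples: Example[] = [
  { id: "t1", output: "Hello, your VPN profile has been reset." },
  { id: "t2", output: "Your ticket is closed." },
  { id: "t3", output: "Hi! Access has been granted." },
];

evaluate(politenessJudge, examples, 2).then((scores) => console.log(scores));
```

Recording the judge version alongside each score lets a CI run attribute a change in results to either the agent or the rubric, which is one plausible reason to version judges as code.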
