We sat down with two of our Software Engineers, Nikita Rule and Johnathan Pagnutti, to better understand how Clockwise is approaching AI evaluations to improve our product development and user experience. What follows is drawn from that conversation.
We’ve all experienced it first-hand: AI can be unpredictable. What works one day may not work the same way the next. With the slightest change of context, an LLM’s response can become entirely different.
Reliability and consistency are the underpinnings of earned trust. When AI models behave unpredictably, our willingness to depend on them erodes. This is one of the biggest hurdles for AI agents to overcome.
At Clockwise, we’ve encountered this first-hand, as our customers look to us for agentic management of their most personal and valuable resource: time.
Because LLMs are inherently non-deterministic, evaluations, commonly called "evals" in the AI space, have become critical infrastructure: systematic tests that measure whether AI models behave as intended. They help teams understand whether their models are actually solving the problems they're meant to solve, whether they're introducing unintended behaviors, and whether changes to the underlying models improve or degrade performance on real-world tasks.
In the case of Clockwise, evals are an increasingly important step in our product development. An AI that makes decisions with direct impact on people's lives and work simply has to be held to the highest bar. A poorly timed event, inviting the wrong attendee to a meeting, a misread of your priorities… these errors don’t just create bad user experiences; they disrupt your workday and can do real, lasting damage to your relationships.
This presents us with a new challenge: how do we systematically test AI that manages something as complex and personal as human time?
Step one in building evaluations for Clockwise's AI features: create the right testing environment.
The Fake Calendar Problem
For Clockwise, that meant generating dummy calendars. While time-consuming, this task at least seemed straightforward: we have nearly a decade of calendar data, deep expertise in how people work, and plenty of engineering talent. How hard could it be to create synthetic test calendars?
Turns out, incredibly hard.
Despite our efforts to inject randomness, our synthetic calendars looked nothing like real human behavior. Events were too evenly distributed. Meeting patterns were too rational. Attendees didn’t reflect real-world relationship patterns. Real human schedules are chaotic: back-to-back meetings punctuated by inexplicable gaps, recurring 1:1s that drift across time zones, holds that never get converted — you know, life.
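To make concrete what "too perfect" looks like, here's a toy sketch of the kind of naive generator we're describing. Everything in it (the names, the slot math, the attendee pool) is made up for illustration; it isn't Clockwise's actual code.

```python
# Illustrative sketch only: a naive synthetic-calendar generator.
# All names and parameters here are hypothetical.
import random
from datetime import datetime, timedelta

WORK_START_HOUR = 9
SLOTS_PER_DAY = 16          # 30-minute slots between 9:00 and 17:00
PEOPLE = [f"person{i}@example.com" for i in range(8)]

def naive_fake_calendar(week_start: datetime, num_events: int = 20) -> list[dict]:
    """Scatter meetings uniformly at random across a work week.

    The result is exactly the problem described above: events land too evenly,
    attendees are drawn independently of any real social graph, and nothing
    resembles the back-to-back clusters, drifting 1:1s, or dead zones of a
    real human calendar.
    """
    events = []
    for _ in range(num_events):
        day = random.randrange(5)               # Mon-Fri, all equally likely
        slot = random.randrange(SLOTS_PER_DAY)  # every slot equally likely
        start = week_start + timedelta(days=day, hours=WORK_START_HOUR, minutes=30 * slot)
        events.append({
            "start": start,
            "end": start + timedelta(minutes=30),                            # every meeting the same length
            "attendees": random.sample(PEOPLE, k=random.randint(2, 4)),      # attendees chosen at random
        })
    return events
```

Injecting more randomness into a generator like this just produces a different flavor of uniformity; it doesn't produce the structure that real schedules have.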
Our fake calendars were too perfect to be insightful, rendering them inadequate for testing AI that needs to understand the real intricacies of human work patterns.
This problem reveals something fundamental about evaluating AI in the calendar and scheduling space: human behavior around time is remarkably complex, and a synthetic environment struggles to reproduce it faithfully.
A Real Calendar Solution
After struggling to generate fake calendars that looked real, we started using anonymized, de-identified, and randomized internal Clockwise production data for our testing environments.
This approach has trade-offs. Clockwise users are already a self-selected group of people who care about calendar optimization, so the dataset inherently reflects quirks and biases. But it's real. It captures the messiness and unpredictability of how people organize their time. It includes the conflicts, the odd patterns, the scheduling behaviors that wouldn't occur to us if we were trying to manufacture them from scratch.
Using real data means we can test our AI against the genuine complexity of human calendars rather than idealized versions of what we think busy calendars look like. One day, we hope to use our real data to generate actually useful and insightful synthetic calendars (a nice comparison: synthetic diamonds are human-made simulacra that are nearly indistinguishable from the real thing). For now, we're focused on creating the right level of imperfection in our test environments: not so chaotic that the evals become meaningless, but not so sanitized that we miss edge cases.
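For a flavor of what "anonymized, de-identified, and randomized" can mean in practice, here's a simplified, hypothetical sketch (not our actual pipeline): attendees get stable pseudonyms, free-text content is stripped out, and timestamps are jittered, while the structural shape of the schedule is preserved.

```python
# Illustrative sketch only: one way to de-identify calendar events before they
# enter a test environment. The salt, field names, and jitter window are
# assumptions for the example, not Clockwise's actual pipeline.
import hashlib
import random
from datetime import timedelta

SALT = "rotate-me-per-dataset"

def pseudonymize(email: str) -> str:
    """Replace an email address with a stable, non-reversible pseudonym."""
    digest = hashlib.sha256((SALT + email.lower()).encode()).hexdigest()[:12]
    return f"user-{digest}@example.test"

def deidentify_event(event: dict) -> dict:
    """Strip identifying content while preserving the *shape* of the schedule.

    Expects an event dict with "start"/"end" datetimes and an "attendees" list.
    """
    jitter = timedelta(minutes=random.choice([-15, 0, 15]))   # shift times, keep duration
    return {
        "start": event["start"] + jitter,
        "end": event["end"] + jitter,
        "attendees": [pseudonymize(a) for a in event["attendees"]],
        "title": "busy",                                      # drop titles and descriptions entirely
        "recurring": event.get("recurring", False),           # keep structural signals
    }
```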
Our Approach: Scaling Nine Years of Intuition
Since 2016, Clockwise has amassed hundreds of millions of calendar data points to build intuition around how people manage their time and what makes scheduling AI helpful versus intrusive. We've learned through user research, customer support conversations, and iterative product development. This intuition has been invaluable, but it doesn't scale on its own.
Evals offer us a path to scale intuition systematically. We're building frameworks that allow us to measure things we've previously only been able to evaluate through gut feeling (read: not very scientific) — whether a scheduling suggestion is genuinely helpful given someone's preferences, whether the AI is picking up on user intent correctly, whether we're minimizing hallucinations in our responses.
Notably, some aspects of our evals still require human judgment. When we're evaluating whether our AI is striking the right tone or whether a suggestion feels presumptuous, we want an actual human in the loop to make that decision.
The foundation we're building is designed to become more systematic and robust over time, layering quantitative insights on top of the qualitative learnings we’ve gathered.
Measuring Our Reliability
We're also continuously rethinking how we score our evaluations. Many eval frameworks assume you're working with a continuous metric — something that can be scored on a spectrum. Some of our tests are closer to pass/fail (either the AI correctly understood the user's intent or it didn't), while other tests may have some nuance in how we measure reliability. The big question we’re looking to answer is “How well can the LLM + MCP server achieve what this person is intending?”
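For the pass/fail cases, the check itself can be very simple. The sketch below is purely illustrative; `EvalCase`, `run_agent`, and the `"intent"` field are hypothetical stand-ins, not our actual eval harness.

```python
# Illustrative only: what a binary, pass/fail eval check might look like.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str              # what the user asked for
    expected_intent: str     # the intent the assistant must recognize

def run_case(case: EvalCase, run_agent: Callable[[str], dict]) -> bool:
    """Binary outcome: either the assistant understood the intent or it didn't."""
    response = run_agent(case.prompt)
    return response.get("intent") == case.expected_intent

# Example, with a fake agent that always schedules a meeting:
fake_agent = lambda prompt: {"intent": "schedule_meeting"}
case = EvalCase(prompt="Find 30 minutes with Sam next week", expected_intent="schedule_meeting")
print(run_case(case, fake_agent))   # True
```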
Another useful framework for us has been thinking about predictability rather than continuous scoring. When we expect a test to pass and it passes — or expect it to fail and it fails — that's good. It means we understand our system's boundaries. The problematic cases are the surprises: unexpected passes or unexpected failures that reveal gaps in our understanding.
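A rough sketch of that predictability lens, with made-up data: rather than computing a score, we compare what we expected each test to do against what it actually did, and the surprises are what we investigate.

```python
# Illustrative sketch of the "predictability" framing described above.
# The results list is hypothetical data.
def classify(expected_to_pass: bool, passed: bool) -> str:
    if passed and expected_to_pass:
        return "expected pass"     # we understand this boundary
    if not passed and not expected_to_pass:
        return "expected fail"     # a known limitation, also fine
    return "surprise"              # unexpected pass or fail: a gap in our understanding

# (expected_to_pass, passed) pairs for four hypothetical tests
results = [(True, True), (True, False), (False, False), (False, True)]

summary: dict[str, int] = {}
for expected, actual in results:
    label = classify(expected, actual)
    summary[label] = summary.get(label, 0) + 1

print(summary)   # {'expected pass': 1, 'surprise': 2, 'expected fail': 1}
```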
More ambiguous user requests require human-in-the-loop evaluation and scoring. Take names, for example. Sometimes we expect the AI to just get it: if you ask to meet with John and there’s only one John at your company, the system should easily figure out who you meant.
But when Clockwise receives a scheduling request with an ambiguously named attendee, there are situations where we want the AI to ask for clarification. If you want to meet with Sam and there are a bunch of Sams at your company, Clockwise should ask you which Sam you mean.
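In pseudo-code form, the behavior we're evaluating looks something like the sketch below. The directory lookup and clarification message are hypothetical, purely to illustrate the distinction between "just get it" and "ask first."

```python
# Illustrative sketch of the attendee-disambiguation behavior described above.
def resolve_attendee(name: str, directory: list[dict]) -> dict:
    """Resolve a first name to a person, or ask the user to clarify."""
    matches = [p for p in directory if p["name"].split()[0].lower() == name.lower()]
    if len(matches) == 1:
        return {"status": "resolved", "person": matches[0]}      # only one John: just get it
    if not matches:
        return {"status": "not_found", "message": f"I couldn't find anyone named {name}."}
    options = ", ".join(p["name"] for p in matches)
    return {"status": "needs_clarification",                     # several Sams: ask, don't guess
            "message": f"Which {name} did you mean: {options}?"}

directory = [
    {"name": "Sam Rivera", "email": "sam.r@example.com"},
    {"name": "Sam Okafor", "email": "sam.o@example.com"},
    {"name": "John Doe", "email": "john@example.com"},
]
print(resolve_attendee("John", directory)["status"])    # resolved
print(resolve_attendee("Sam", directory)["message"])    # Which Sam did you mean: Sam Rivera, Sam Okafor?
```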
What's Next
We're treating evals as a living, breathing practice that evolves as AI capabilities develop and as we learn from our own experimentation. The problems we're working on — understanding user intent in the domain of time, minimizing hallucinations in scheduling contexts, maintaining appropriate tone when making suggestions about someone's day — these are hard problems, and ones we haven’t fully solved!
That's what makes this work exciting.
After nine years of building intuition about calendars and time management, we're finding new ways to encode that knowledge, scale it, and use it to ship AI features with greater confidence. The fake calendar problem taught us that even the foundational building blocks of evals require rethinking assumptions. We're learning to embrace that complexity rather than try to engineer it away.
And a shout-out to our users: thank you for trusting Clockwise in the very personal and high-stakes domain of your calendar. Your continued support is what allows us to keep evolving.

