twobody.ai
Experiments in studying systems that change with every interaction.
The Problem
UX research has always assumed you're studying a stable system. You observe users, find patterns, make recommendations, ship changes. The product stays put long enough for your insights to matter.
With AI products, that assumption falls apart. Both sides of the equation are moving. The model gets updated. User expectations shift as they learn what AI can and can't do. By the time you've synthesized your findings, the thing you studied isn't quite the thing that's shipping.
Models update on schedules you don't control. Behavior drifts between versions in ways that aren't always documented. The product your user tested last Tuesday might respond differently today.
People's mental models of AI are calibrating in real time. Someone confused by ChatGPT six months ago might be a power user now. The "novice" and "expert" categories we rely on are less stable than they used to be.
This is a collection of experiments and thoughts on how UX research tools and practices might need to evolve.
The Problem Space
How the pieces connect:
CORE PROBLEM: Both the AI and user are moving targets.
APPROACH 1 (Scale the N): Compensate for instability with volume.
APPROACH 2 (Control AI variance): Hold the system constant during testing.
APPROACH 3 (Rethink what to measure): New metrics for both user and AI.
APPROACH 4 (Go upstream): Abstract each side, then compare.
H1: Synthetic users find the same issues as real users.
H2: Passive AI moderators improve some data.
H3: We can detect when research is invalidated.
H4: Trust calibration follows measurable patterns.
H5: AI analysis works for structured questions.
H6: Surfacing AI confidence improves user calibration.
RELATED RESEARCH: 📓 NotebookLM deep dive coming soon.
Hypotheses in Detail
H1 (Testing)
AI-simulated users can identify the same major usability issues as real users for most routine evaluations.
If true, this would let us run cheap, fast sanity checks before investing in real user research. The interesting question isn't "does it work" but "where does it break down." My guess is it fails on anything involving trust, emotional response, or domain expertise.
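If this hypothesis survives contact with data, the harness for it could be very small. Below is a minimal sketch of a synthetic-user sanity check, assuming an OpenAI-style chat API; the persona, task, model name, and FRICTION-flag convention are all illustrative choices, not a fixed protocol.

```python
# Illustrative sketch: a "synthetic user" walks through a task description
# and reports friction points. Persona, prompts, and model name are assumptions.
from openai import OpenAI

client = OpenAI()

PERSONA = (
    "You are a first-time user of a grocery delivery app. "
    "You are in a hurry and mildly skeptical of AI features."
)
TASK = "Find last week's order and reorder it with one substitution."

def run_synthetic_session(ui_description: str) -> str:
    """Ask the simulated user to narrate the task and flag confusion."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[
            {"role": "system", "content": PERSONA},
            {
                "role": "user",
                "content": (
                    f"The interface is described as:\n{ui_description}\n\n"
                    f"Attempt this task: {TASK}\n"
                    "Narrate each step. Prefix anything confusing with FRICTION:."
                ),
            },
        ],
    )
    return response.choices[0].message.content

def friction_points(transcript: str) -> list[str]:
    """Collect the lines the synthetic user flagged as confusing."""
    return [line for line in transcript.splitlines() if line.startswith("FRICTION:")]
```

Comparing these flagged lines against issues found by a small real-user study is one way to probe where the breakdown happens.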
H2 (Exploring)
AI moderators playing passive observer roles—and human moderator removal in specific contexts—can improve data quality.
Two related ideas here. First: some participants perform more naturally without a human researcher watching. Second: AI moderators can adopt passive observer roles that humans simply cannot—infinitely patient, never reacting, intervening only when things go off the rails. The question is whether this captures what matters while reducing observer effects. I suspect the answer depends heavily on task type and participant comfort with AI.
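As a sketch of what "intervening only when things go off the rails" might mean in practice, here is a minimal trigger policy; the silence threshold and help phrases are assumptions a pilot study would need to tune.

```python
# Sketch of a passive-moderator policy: stay silent unless a narrow trigger
# fires. The triggers below (long silence, explicit help request) are
# assumptions; the research question is whether intervening this rarely
# still yields usable data.
HELP_PHRASES = ("i'm stuck", "what do i do", "is this broken")
MAX_SILENCE_SECONDS = 90  # assumption

def should_intervene(last_utterance: str, seconds_since_activity: float) -> bool:
    if seconds_since_activity > MAX_SILENCE_SECONDS:
        return True
    return any(phrase in last_utterance.lower() for phrase in HELP_PHRASES)

def moderator_turn(last_utterance: str, seconds_since_activity: float) -> str | None:
    """Return a prompt only when a trigger fires; otherwise say nothing."""
    if not should_intervene(last_utterance, seconds_since_activity):
        return None
    return "No rush. Can you tell me what you're trying to do right now?"
```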
H3 (Queued)
We can detect when an AI product has changed enough that previous research findings no longer apply.
Right now, research invalidation is vibes-based. Someone notices the product feels different, maybe. What if we could instrument this—track behavioral signatures over time and flag when drift crosses a threshold? Not sure if it's possible, but worth exploring.
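A rough sketch of what instrumenting this could look like: run a fixed probe set against the product on a schedule, compute some behavioral signature, and test whether today's distribution still matches the baseline. The signature used here (response length) is only a stand-in, and the threshold is a guess.

```python
# Rough sketch of drift detection for an AI product: compare a behavioral
# signature measured today against a stored baseline, and flag when the
# difference is unlikely to be noise. Real signatures might be refusal rate,
# tone scores, or tool-use frequency rather than response length.
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # assumption: tune against known-stable periods

def signature(responses: list[str]) -> list[float]:
    """Placeholder behavioral signature: response length per probe."""
    return [float(len(r.split())) for r in responses]

def has_drifted(baseline_responses: list[str], current_responses: list[str]) -> bool:
    """Two-sample test: are today's responses drawn from the same distribution?"""
    stat, p_value = ks_2samp(signature(baseline_responses), signature(current_responses))
    return p_value < DRIFT_P_VALUE

# Usage: run the same probe prompts weekly; if has_drifted(...) returns True,
# flag findings collected against the old baseline for re-validation.
```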
H4 (Queued)
User trust calibration follows predictable patterns that can be measured longitudinally.
People start with some mental model of what AI can do. They use it, get surprised (positively or negatively), and adjust. Over time, their expectations stabilize—or don't. If there's a pattern here, we could design for trust calibration instead of just measuring satisfaction at a point in time. The tricky part is that people are often bad at introspecting on their own mental models, so we'd need both self-reported measures and behavioral proxies—what people say they expect vs. how they actually behave.
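A sketch of the behavioral-proxy side, assuming each session logs a self-reported expectation of AI success alongside the actual outcome; the Brier-style score below is one candidate calibration measure, not the only one.

```python
# Sketch: quantify trust calibration per participant over time.
# Each record pairs a self-reported expectation ("how likely is the AI to
# get this right?", 0.0-1.0) with the observed outcome (1 = it did).
from dataclasses import dataclass

@dataclass
class SessionRecord:
    expected_success: float  # self-report before the task, 0.0-1.0
    actual_success: int      # 1 if the AI output was acceptable, else 0

def calibration_error(records: list[SessionRecord]) -> float:
    """Mean squared gap between expectation and outcome (a Brier score).
    Lower means better calibrated; track this longitudinally per participant."""
    if not records:
        return float("nan")
    return sum((r.expected_success - r.actual_success) ** 2 for r in records) / len(records)

# H4 predicts this curve has a shape: high error early, falling as expectations
# adjust, and possibly jumping again after a model update.
```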
H5 (Exploring)
AI analysis of qualitative data approaches human-level quality for well-structured research questions.
"AI can analyze interviews" is too broad. The real question is: for which types of analysis, with what constraints, and how do you know when it's working? My hunch is it's good at finding patterns in explicit statements and bad at interpreting what people didn't say.
H6 (Queued)
Surfacing AI confidence signals to users improves their trust calibration.
This connects the "measure the AI" angle with user outcomes. If the AI can tell you when it's uncertain—and you actually expose that to users—does their mental model calibrate better? This is a design intervention that could be tested empirically. The interesting questions: what form should confidence signals take? Do users actually attend to them? Does it help, or just create anxiety?
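A sketch of one possible intervention, assuming the model API exposes token log-probabilities and that average token probability is a usable confidence proxy (itself a contested assumption); the thresholds and labels are placeholders to be tested, not recommendations.

```python
# Sketch: map a crude model-confidence proxy to a user-facing label.
# Whether average token probability tracks real reliability, and whether
# users attend to the label at all, is exactly what H6 would have to test.
import math
from openai import OpenAI

client = OpenAI()

def answer_with_confidence(question: str) -> tuple[str, str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[{"role": "user", "content": question}],
        logprobs=True,
    )
    choice = response.choices[0]
    token_probs = [math.exp(t.logprob) for t in choice.logprobs.content]
    avg_prob = sum(token_probs) / len(token_probs)

    # Thresholds are placeholders; the research question is what users do with them.
    if avg_prob > 0.9:
        label = "High confidence"
    elif avg_prob > 0.6:
        label = "Moderate confidence: worth double-checking"
    else:
        label = "Low confidence: treat as a guess"
    return choice.message.content, label
```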
I'll share what I learn as I go. New experiments roughly every two weeks.