What Happened When a Researcher Spent 90 Days Testing an AI Behavior Coach

A researcher applies behavior change science to her own AI coaching experiment — tracking what worked, what failed, and what the data revealed about the gap between research and practice.

Ananya Krishnan studies digital health and behavior change at a research institute. She has read the primary literature on just-in-time adaptive interventions (JITAIs), implementation intentions, and AI coaching. In 2024, she decided to run a structured personal experiment: ninety days of AI-assisted behavior change, documented with the same rigor she would apply to any study she critiques.

What follows is her account. It is a single case study — no control condition, one subject, no generalizability guarantees. Its value is in the specificity of the process documentation and the precision of the observations.


The Setup

Ananya identified three target behaviors she had previously attempted and failed to sustain:

  1. A writing habit (three 45-minute sessions per week, specifically for a long-delayed research monograph)
  2. A daily shutdown ritual (15 minutes at end of workday to clear open loops and plan the following morning)
  3. A weekly review (30 minutes each Sunday to assess the prior week and set priorities)

She chose these three specifically because they represented different habit types: a creative production behavior, a transition ritual, and a reflective practice. She wanted to test whether AI coaching was differentially effective across these types.

Her methodology was explicitly grounded in Gollwitzer’s implementation intention framework and the self-monitoring research from Burke et al. She pre-registered her success criteria (not with a formal registry — just in a dated document) and committed to logging behavioral data daily regardless of AI coaching frequency.

The tools she used:

  • A general-purpose AI assistant (Claude) for weekly review and implementation intention refinement
  • Beyond Time for time tracking and behavioral frequency logging
  • A simple spreadsheet for daily binary logging (did the behavior occur: yes/no)

She explicitly did not use a dedicated AI coaching app. Her reasoning: she wanted to understand what the AI contribution was in isolation, without confounding it with features specific to any particular coaching product.
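
For concreteness, here is a minimal sketch (in Python) of the kind of binary log and completion-rate summary this setup implies. The file name, column names, and behavior labels are illustrative assumptions, not Ananya's actual template.

  # Minimal sketch: per-behavior completion rates from a daily yes/no log.
  # Assumes a CSV with one row per behavior per day; names are illustrative.
  import csv
  from collections import defaultdict

  def completion_rates(path):
      """Return the fraction of logged days each behavior was completed."""
      done = defaultdict(int)
      total = defaultdict(int)
      with open(path, newline="") as f:
          for row in csv.DictReader(f):  # columns: date, behavior, done
              total[row["behavior"]] += 1
              done[row["behavior"]] += row["done"].strip().lower() == "yes"
      return {behavior: done[behavior] / total[behavior] for behavior in total}

  # Example rows in behavior_log.csv:
  #   date,behavior,done
  #   2024-03-04,writing,yes
  #   2024-03-04,shutdown_ritual,no
  print(completion_rates("behavior_log.csv"))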


Weeks 1–3: The Novelty Window

Ananya had anticipated this. She knew from the research that novelty effects would make the first four weeks unreliable as evidence of genuine behavior change.

She logged high completion rates across all three behaviors — above 80% for writing, near-perfect for shutdown ritual, and 100% for weekly review. Her AI interactions in this period were long, exploratory, and felt generative.

Her observation at week three: “The AI conversations are excellent. I look forward to them. I am also aware that I cannot tell whether I am changing behavior or performing change for an invisible audience — which is, structurally, the same problem that affects study participants who know they are being observed.”

She recorded this as the Hawthorne problem of self-study: when you are both the researcher and the subject, the act of observation changes the phenomenon.

She continued but flagged the first three weeks’ data as likely inflated.


Weeks 4–8: The First Differentiation

By week five, the novelty had worn off. Ananya’s completion rates began to differentiate:

  • Writing sessions: holding at 70–75%
  • Daily shutdown ritual: dropping to 45–50%
  • Weekly review: holding at 90–95%

The pattern was not what she had expected. She had assumed the daily shutdown ritual would be easiest — it was the shortest and least cognitively demanding. The writing sessions, requiring 45 uninterrupted minutes, should have been hardest.

What the data showed was that cue reliability was the key variable, not session length or cognitive demand.

The writing sessions were anchored to a reliable cue: morning coffee, before email. The cue was consistent, specific, and not easily displaced. Once she added the implementation intention (“If I have made morning coffee and not yet opened email, then I will write for 45 minutes before doing anything else”), the behavior became almost automatic within ten days.

The daily shutdown ritual had no reliable cue. Her workday ended at variable times depending on meetings. The implementation intention she had written — “If it is 5:30pm, then I will begin shutdown ritual” — was regularly defeated because she was frequently in meetings or deep in work at 5:30pm. The cue was a clock time rather than a behavioral event.

The weekly review held because Sunday morning was a consistent context with few competing demands.

The AI’s role in diagnosing this: Ananya brought her week-five data to a Claude session and described the pattern. The AI immediately identified the cue reliability problem with the shutdown ritual and suggested an event-based cue instead: “If I close my last meeting or task of the day, then I will begin the shutdown ritual before opening any new task.” She revised the implementation intention and saw completion rates rise to 65–70% over the following two weeks.

“The AI did not discover anything I could not have discovered myself,” she noted. “But it gave me a structured prompt to examine the data, and the structured prompt was the intervention. Without it, I would have adjusted vaguely. With it, I changed something specific.”
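
One way to make the cue-quality finding concrete is to encode each implementation intention as a cue-behavior pair tagged with its cue type. The sketch below is a hypothetical encoding, not a tool from the experiment; the class and field names are illustrative. It shows the original clock-time plan next to the week-five revision:

  # Hypothetical encoding of implementation intentions as cue-behavior
  # pairs tagged by cue type; the class and field names are illustrative.
  from dataclasses import dataclass
  from typing import Literal

  @dataclass
  class ImplementationIntention:
      cue: str
      cue_type: Literal["clock_time", "behavioral_event"]
      behavior: str

  # The original plan, regularly defeated by meetings running past 5:30pm:
  original = ImplementationIntention(
      cue="it is 5:30pm",
      cue_type="clock_time",  # fires at a fixed time, easily displaced
      behavior="begin the shutdown ritual",
  )

  # The week-five revision, anchored to a behavioral event instead:
  revised = ImplementationIntention(
      cue="I close my last meeting or task of the day",
      cue_type="behavioral_event",  # fires whenever the workday actually ends
      behavior="begin the shutdown ritual before opening any new task",
  )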


Weeks 9–12: Testing Fade-Out

At week nine, Ananya deliberately reduced her AI interactions. Instead of weekly review sessions with Claude, she reviewed her behavioral data herself and only consulted the AI when she identified a specific pattern she could not explain.

Her prediction: if the habits had genuinely internalized, they would hold without AI prompting. If they depended on AI scaffolding, they would degrade.

The results were mixed and informative:

Writing habit: Held at 70% completion without any AI interaction. The cue was strong enough to sustain the behavior independently. Ananya considered this successful habit formation: the behavior was running on its own cue-routine-reward structure.

Shutdown ritual: Dropped to 40% without AI review sessions. The behavior had not automatized. The event-based cue (“last meeting or task”) was insufficiently reliable because “last task” was ambiguous — she frequently told herself a task was not done and kept working. The AI check-ins had been functioning as an external accountability mechanism rather than as a scaffold for genuine automaticity.

Weekly review: Held at 85%. The contextual anchoring to Sunday morning was strong enough to sustain the behavior.

The finding she had not anticipated: The shutdown ritual had been the most improved behavior during AI coaching (going from near-zero to 65–70%) but was also the most dependent on continued AI support. She had been tracking and celebrating the improvement without noticing that the mechanism was accountability rather than automaticity.

“This is the exact confound that the research literature identifies,” she wrote in her notes, “and I still fell into it. The tool was creating a performative effect — I was doing the behavior because I knew I was logging it and reviewing it with AI. When the review stopped, so did a significant portion of the behavior.”


What the 90 Days Revealed

Ananya’s documented observations, organized by research question:

Does AI-assisted implementation intention design work?

Yes, with clear caveats. AI was effective at helping her identify weak cues (clock times vs. behavioral events) and revise implementation intentions accordingly. The value was in the structured analytical prompt rather than in the AI generating the plan — she was always the one to commit to the revised intention.

Does AI-assisted self-monitoring add value over manual self-monitoring?

Partially. The combination of binary behavioral logging (her own spreadsheet) and AI pattern analysis produced insights she would not have generated from the raw data alone. The AI’s ability to quickly identify the cue reliability pattern in her week-five data was the clearest example.

Does AI coaching produce genuine automaticity, or just accountability effects?

This was the most honest and unsettling finding. For the writing habit, genuine automaticity appeared to form — the behavior held without coaching. For the shutdown ritual, the improvement appeared to be primarily an accountability effect. The distinction matters enormously, because accountability effects disappear when the accountant leaves.

What would she do differently?

“I would be more aggressive about testing for automaticity earlier. At week six, I should have run a one-week no-AI experiment for each behavior rather than waiting until week nine. The earlier I know whether a behavior is genuinely automatic or scaffold-dependent, the sooner I can address the underlying cue design rather than continuing to paper over it with accountability.”


The Practical Takeaways

For anyone attempting something similar:

Run your behavioral log independently of your AI interactions. If the log and the AI conversations are the same thing, you cannot separate accountability effects from genuine change.

Test for automaticity at six weeks by cutting AI prompting for seven days. The behaviors that hold are automatic. The behaviors that fall apart are accountability-dependent.
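
A minimal sketch of how that seven-day comparison could be scored, assuming per-behavior completion rates for the coached baseline and the no-AI week are already in hand. The 15-point drop threshold is an illustrative choice, not a value from the experiment:

  # Sketch: classify each behavior after a seven-day no-AI fade-out test.
  # The max_drop threshold is an assumed cutoff, not from the experiment.
  def classify_fadeout(baseline, no_ai_week, max_drop=0.15):
      """Label behaviors that hold as automatic, and behaviors that
      fall off by more than max_drop as accountability-dependent."""
      labels = {}
      for behavior, base_rate in baseline.items():
          drop = base_rate - no_ai_week.get(behavior, 0.0)
          labels[behavior] = ("automatic" if drop <= max_drop
                              else "accountability-dependent")
      return labels

  # Roughly the week-nine pattern from this experiment:
  baseline = {"writing": 0.72, "shutdown_ritual": 0.67, "weekly_review": 0.90}
  no_ai = {"writing": 0.70, "shutdown_ritual": 0.40, "weekly_review": 0.85}
  print(classify_fadeout(baseline, no_ai))
  # {'writing': 'automatic', 'shutdown_ritual': 'accountability-dependent',
  #  'weekly_review': 'automatic'}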

When AI is most useful: identifying specific structural problems in your implementation intentions (cue reliability, behavior ambiguity, success criterion vagueness). When it is least useful: general motivation and encouragement, which produce engagement without mechanism.

The hardest thing to track is not whether you did the behavior — that is a simple yes/no. The hardest thing is whether you did the behavior because of the cue, or because you were being observed. That distinction requires the courage to stop observing yourself for a week and see what survives.


The experiment is available for replication: the documents (implementation intentions, behavioral log template, weekly review prompts) are described above in sufficient detail to reproduce them. If you run a similar experiment, the most valuable data you can generate is the fade-out period — what holds when the scaffolding comes down.


Tags: researcher tests AI coach, AI behavior change case study, implementation intentions experiment, habit automaticity test, self-monitoring research

Frequently Asked Questions

  • Was this a formal research study?

    No. This is a documented personal experiment, structured around published behavior change frameworks. It is a case study — single subject, no control condition — and should be interpreted accordingly. The value is in the process transparency and the specific observations, not in generalizable causal claims.
  • What were the target behaviors in this experiment?

    Three behaviors: a consistent writing habit (three 45-minute sessions per week), a daily shutdown ritual (15 minutes at end of workday), and a weekly review practice (30 minutes each Sunday). All three were existing behavioral intentions the subject had previously failed to sustain.
  • What made the difference between the habits that succeeded and those that failed?

    Cue reliability. The writing habit succeeded because it was anchored to a reliable existing behavior (morning coffee). The shutdown ritual was inconsistent because the end of the workday was poorly defined and variable. Cue quality predicted success more than AI coaching intensity.
  • What role did Beyond Time play in the experiment?

Beyond Time was used for the time-tracking component of self-monitoring: logging how much time was actually spent on target behaviors versus the time intended. This provided the behavioral frequency data on which the weekly AI reviews were based.