How a Team Lead Built an AI Planning Stack That Actually Survived 6 Months

A case study following one engineering lead through three versions of her AI planning stack—what collapsed, what she changed, and the minimal stack that held.

Nadia is an engineering lead at a mid-sized SaaS company. She manages a team of seven, owns three active projects, and spends roughly half her working hours in meetings she did not schedule.

She is not unusual. But her path through three versions of an AI planning stack—what broke each one and what finally held—illustrates the structural lessons that are easy to miss when you are evaluating tools in isolation.

This is her story, reconstructed with her permission from her planning logs and our conversations.


Baseline: The Pre-AI State

Before any AI planning tools, Nadia’s system was a combination of Linear (for team tasks), a personal Notion page (for weekly notes and priorities), Google Calendar (for meetings), and a mental model of what mattered that she reconstructed every Monday morning.

The mental reconstruction took 30–45 minutes. By Wednesday, it was usually outdated. By Friday, she was reacting to the loudest thing in front of her rather than the most important thing in her plan.

The symptom was not chaos. Her team delivered. But she consistently felt that her time was going toward the tasks that had arrived most recently, not the tasks with the highest leverage for her team’s output. The gap between intention and execution ran at roughly 40%: she completed about 60% of what she had planned in a given week, with the remainder rolling forward indefinitely.

Her friction point was clearly prioritization, not capture. She knew what needed to happen. She could not reliably choose between competing demands in real time.


Version 1: The Elaborate Stack

In January, Nadia built what she called her “proper system.” It included:

  • Claude for weekly planning sessions
  • ChatGPT with a Notion plugin for daily check-ins that pulled her task list
  • Notion AI to auto-generate weekly review summaries from her Notion log
  • Zapier automations to push Linear tickets to Notion when they were assigned to her
  • Google Calendar as the scheduling layer

Five tools with four integration points. Total configuration time: one weekend.

The system worked well for two weeks. The weekly Claude session produced clear priorities. The daily ChatGPT check-ins surfaced the right tasks. The automations kept Notion populated without manual entry.

Then the Zapier automation stopped firing after a Linear API update. Nadia did not notice for four days because the flow of tasks into Notion was gradual. When she did notice, her Notion view was missing three dozen tasks. She spent two hours reconciling.

The following week, she skipped the Notion AI review because she was behind from the reconciliation work. The week after that, she skipped the ChatGPT check-ins because a deadline week had eliminated her morning routine. By week six, she was using only Claude for the weekly session and had quietly abandoned everything else.

Version 1 lesson: The automation layer created a single point of failure that was invisible until it caused significant debt. The stack was only as strong as its most fragile integration.


Version 2: The Simplified Stack

After Version 1’s collapse, Nadia stripped back to three tools and eliminated all automations.

  • Linear for team and personal task management (no change—already the system of record)
  • Claude for a 30-minute weekly planning session on Sunday evenings
  • Google Calendar for scheduling, with time blocks added manually after the Claude session

No automations. No cross-tool data sync. Data transfer between Claude and Calendar was deliberate and manual: she read the Claude output and created calendar blocks herself.

This manual transfer step—which she had tried to automate in Version 1—turned out to serve a useful function. The act of translating Claude’s recommendations into calendar blocks forced her to evaluate whether the recommended time allocations were realistic. The automation had removed that friction. The manual step restored it.

Version 2 ran for eight weeks without a structural failure.

The weekly Claude session used a consistent prompt:

Here is my task list for the coming week from Linear:
[paste]

My non-negotiable meetings are:
[paste from calendar]

My single most important deliverable this week is: [X]
My team's current blocker that I need to unblock is: [Y]

Given these constraints, rank the top five tasks I should personally own
this week (not delegate). Explain the ordering. Flag anything I should
push back on rather than absorb.

The prompt’s key feature was the explicit “push back rather than absorb” instruction. It gave Claude permission to recommend that Nadia decline or defer work, which was the prioritization conversation she had been unable to have with herself. The AI could hold the ground she was too close to the work to hold.


The Remaining Failure: The Review Gap

Version 2 worked well at prioritization and scheduling. The persistent failure was the review layer.

Nadia’s Sunday planning session produced a useful week plan. But without a structured Friday review, the following Sunday’s session started from scratch. She could not ask Claude to compare last week’s plan to what actually happened because she had not logged what actually happened.

She tried adding a five-minute Friday log to her calendar: one sentence per weekday about what she had worked on and what she had not. Over four weeks, she filled in 11 of the 20 day-entries, below the threshold where the data was reliable enough to use.

The log failed because it was a new behavior with no existing trigger. It had no habit anchor.

The solution was attaching the Friday log to a meeting she was already running: her Friday team standup. She added a two-minute personal note after the standup ended, while the work was still in front of her. Completion rose to 18 of the 20 day-entries in the subsequent month.

With a consistent Friday log, the Sunday Claude session became significantly more useful:

Here is what I planned to do last week:
[paste from previous Sunday session]

Here is what I actually did:
[paste from Friday logs]

Where did my plan and reality diverge most? What is the most likely
structural reason—not a one-off event but a repeating pattern?
What should I build into next week's plan to account for it?

The gap analysis produced by Claude consistently identified patterns Nadia had not named. Over six weeks, three recurring patterns emerged: she was systematically underestimating meeting preparation time, she was accepting new requests on Tuesday afternoons when her energy and judgment were lowest, and her earliest protected deep work block started at 11am even though her best work happened before 9am.

Each of these was correctable once named. None of them had been visible before the review loop was in place.


Version 3: The Durable Stack

At month four, Nadia added one more tool: Beyond Time for daily scheduling. The Sunday Claude session produced a priority-ranked task list. Beyond Time converted that list into a daily time-block view, tracking actuals against the plan throughout the week. The Friday log drew from Beyond Time’s actuals rather than from memory.

The full stack:

Tool               Role                                          Frequency
Linear             Team and personal task store                  Continuous
Claude             Weekly prioritization and review reasoning    Sunday + Friday
Google Calendar    Hard commitments and external meetings        Continuous
Beyond Time        Daily time allocation and actuals tracking    Daily

Four tools with zero automations between them. The handoffs are all deliberate human decisions.

This stack has now run without structural failure for six months. The planning accuracy metric Nadia tracks—completed intended tasks as a percentage of planned tasks—has moved from 60% to 78%.


The Four Lessons Worth Extracting

Lesson 1 — Automations that remove human decision points often remove useful information. The manual transfer from Claude’s recommendations to Google Calendar, which Nadia had tried to automate, turned out to be where she evaluated whether recommendations were realistic. Automating it removed the quality check.

Lesson 2 — Stack collapses are usually silent before they are catastrophic. Nadia did not know her Zapier automation had broken for four days because the failure was gradual. Complex stacks with many integration points need a weekly “is everything working?” check that is separate from the planning session itself.

Lesson 3 — Review loops require habit anchors, not calendar invites. A standalone Friday review did not stick. A two-minute note attached to an existing meeting did. The behavior was identical; the trigger was different.

Lesson 4 — The right stack is usually smaller than you think and stays stable longer than you expect. Nadia’s Version 1 stack was more impressive. Her Version 3 stack is more useful. The simplification was not a downgrade; it was the point.


Start Where Nadia Finished, Not Where She Started

If you are building an AI planning stack for a management or leadership role, the Version 3 configuration is a reasonable starting point: one task store, one reasoning AI, one calendar, and one daily scheduling tool. Four tools with clear roles and no automations.

Run it for six weeks. Do not add anything during those six weeks. Evaluate the one thing that still fails, and address only that.


Tags: AI planning case study, engineering lead productivity, AI tools for managers, planning stack design, weekly review system

Frequently Asked Questions

  • What does a successful AI planning stack look like for an engineering lead?

    Based on this case study, the most durable stack was a four-tool configuration: Linear for team and personal task management, Claude for weekly prioritization and review, Google Calendar for hard commitments, and Beyond Time for daily time allocation. Each tool had one clear role, and data moved between tools through deliberate manual handoffs rather than automations.
  • Why do elaborate AI planning stacks collapse for managers?

    Managers face higher planning interruption rates than individual contributors. A system that requires consistent daily maintenance will always lose to meeting-driven days. The stacks that survive for managers are the ones that can tolerate missed days without accumulating debt.
  • How long does it take for a new AI planning stack to show results?

    In this case, meaningful improvement in schedule accuracy appeared at week six—after the initial configuration overhead had been absorbed and the habits around weekly review had stabilized. Four weeks is the minimum evaluation period; eight is more reliable.