From Roadmap to Reality: Building My Learning Engineer Agent (and Wrestling with Web Research)
A founder-style diary on building an agentic EdTech AI platform, tackling web research reliability issues, and applying STORM-inspired thinking.
~5 minute read · personal build diary entry
I’m at an interesting point in my Learning Engineer Agent journey.
On paper, the roadmap is clear. I know the product vision, the business model, the pricing, the architecture, and the exact sequence of capabilities I want to build from Days 14–20. But in practice, I’m discovering what every builder eventually learns: the hardest part of agentic AI is not writing a beautiful roadmap — it’s making systems reliable under real-world constraints.
This entry is my attempt to capture where I am right now: what I’m building, what’s working, what’s painful, and what I’m learning.
What I’m building (in one line)
I’m building an AI-powered, multi-agent EdTech system that can turn expert knowledge into full course design artifacts — starting from research and knowledge base generation, then moving into adaptive Socratic interviews, outlines, and storyboarded slides.
The core differentiator I care about most is this:
Not static “fill a form and get content” workflows, but truly agentic interactions — especially live expert interviews that adapt in real time.
That interview experience is my portfolio centerpiece. It’s the part I want people to feel is different.
If you want to follow the build in public, here’s my repo: learning-engineer-agent on GitHub.
Current reality: I’m between “promising prototype” and “production-grade system”
I already have meaningful progress:
- Flutter web app live
- Auth and Firestore persistence working
- Diagnostic service deployed
- Interview service deployed
- Invitation workflow in place
So this is not a conceptual project anymore. It’s running. But I’m now entering the stage where quality expectations jump. Days 14+ are about research depth, adaptive reasoning, and consistency. That’s exactly where weak spots become visible.
The challenge that keeps showing up: web research reliability
I’ve been asking a very practical question lately:
If I can get high-quality research from the ChatGPT UI, why is it so hard to make production, API-based research consistently great?
The answer I’m converging on is that UI quality usually comes from a lot of hidden orchestration: query planning, multi-step search, ranking, synthesis, and quality controls. In my own stack, I have to build that layer explicitly.
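To make that concrete, here is a minimal sketch of what that explicit layer might look like in my pipeline. Every name in it (Source, plan_queries, rank_sources, and so on) is a hypothetical placeholder, not a call from any real search or LLM SDK:

```python
# Minimal sketch of the research layer a chat UI normally hides.
# Every name here (Source, plan_queries, web_search, ...) is a hypothetical
# placeholder, not a call from any real search or LLM SDK.
from dataclasses import dataclass

@dataclass
class Source:
    url: str
    snippet: str
    credibility: float  # 0.0-1.0, however the pipeline chooses to score it

def plan_queries(topic: str) -> list[str]:
    # In practice: ask an LLM to decompose the topic into a few focused queries.
    return [f"{topic} fundamentals", f"{topic} common misconceptions", f"{topic} teaching examples"]

def web_search(query: str) -> list[Source]:
    # Placeholder for whichever search API the stack actually calls.
    return []

def rank_sources(sources: list[Source], top_k: int = 8) -> list[Source]:
    # Deduplicate by URL, then keep the most credible results.
    unique = {s.url: s for s in sources}
    return sorted(unique.values(), key=lambda s: s.credibility, reverse=True)[:top_k]

def synthesize(topic: str, sources: list[Source]) -> str:
    # Placeholder for an LLM call that writes the KB entry from the ranked
    # evidence, citing each source it uses.
    return f"KB entry for {topic} grounded in {len(sources)} sources"

def research(topic: str) -> str:
    queries = plan_queries(topic)
    evidence = [s for q in queries for s in web_search(q)]
    return synthesize(topic, rank_sources(evidence))
```

The value isn't in these particular stubs; it's that each step becomes a seam where I can add logging, caching, and quality checks later.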
The friction points are now clear:
- Rate limits and quota bursts when research runs in parallel (see the backoff sketch below)
- Output variability when grounded search can't enforce the strict JSON schema I want
- Parsing brittleness when model output format drifts
- Quality drift where sources are technically valid but pedagogically weak
- Silent degradation risk where pipelines “complete” but with thin insights
None of these are unsolvable. But all of them are product-defining.
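The rate-limit one, for example, is mostly plumbing once parallel searches are capped and retried. A minimal sketch, assuming a generic async search_fn and a stand-in QuotaExceeded error (neither is a real client API):

```python
# Sketch: cap parallel research calls and back off on quota errors.
# `search_fn` and `QuotaExceeded` are stand-ins for whatever client the
# pipeline actually uses; the pattern is the point, not the names.
import asyncio
import random

class QuotaExceeded(Exception):
    """Stand-in for the 429-style error the real search client raises."""

async def search_with_backoff(search_fn, query: str, sem: asyncio.Semaphore, retries: int = 4):
    """Run one search under a shared concurrency cap, backing off on quota errors."""
    async with sem:
        for attempt in range(retries):
            try:
                return await search_fn(query)
            except QuotaExceeded:
                # Exponential backoff with jitter so parallel tasks don't
                # all retry at the same instant.
                await asyncio.sleep(2 ** attempt + random.uniform(0, 1))
    return None  # caller treats None as "research unavailable", not a crash

async def run_research(search_fn, queries: list[str], max_concurrent: int = 3):
    sem = asyncio.Semaphore(max_concurrent)  # created inside the running loop
    tasks = [search_with_backoff(search_fn, q, sem) for q in queries]
    return await asyncio.gather(*tasks)
```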
The important mindset shift I’m making
Initially, I was thinking: “How do I make web search better?”
Now I’m thinking: “How do I make my system robust even when web search is imperfect?”
That’s a huge shift.
I’m moving toward a layered reliability mindset:
- Search as one signal, not gospel
- Evidence quality gating before generation (see the sketch after this list)
- Fallback paths when research is weak
- Explicit degradation states instead of fake confidence
- Human-editable checkpoints at every stage
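Here's a rough sketch of what gating plus explicit degradation states could look like; the thresholds and the three grades are assumptions I'll tune, not settled design:

```python
# Sketch: grade the evidence bundle before generation and make weak research
# an explicit, visible state instead of generating confident-sounding output
# anyway. The thresholds and grade names are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum

class ResearchGrade(Enum):
    STRONG = "strong"              # generate normally
    THIN = "thin"                  # generate, but flag for human review
    INSUFFICIENT = "insufficient"  # skip generation, ask the expert for input

@dataclass
class EvidenceBundle:
    sources: list          # ranked sources from the research step
    distinct_domains: int  # crude proxy for credibility mix
    coverage_score: float  # 0-1: how much of the topic outline is covered

def grade_evidence(bundle: EvidenceBundle) -> ResearchGrade:
    if len(bundle.sources) >= 6 and bundle.distinct_domains >= 3 and bundle.coverage_score >= 0.7:
        return ResearchGrade.STRONG
    if len(bundle.sources) >= 3 and bundle.coverage_score >= 0.4:
        return ResearchGrade.THIN
    return ResearchGrade.INSUFFICIENT
```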
This aligns with my product philosophy anyway: human-in-the-loop isn’t a compromise — it’s a trust strategy.
How I came across STORM (and why it clicked)
While exploring ways to close the quality gap, I started looking at open-source research systems that simulate deep, iterative exploration. That’s where I came across STORM-style research pipelines.
https://github.com/stanford-oval/storm
What I like about STORM isn’t hype — it’s structure:
- Topic decomposition
- Multi-step evidence gathering
- Iterative synthesis
- Better grounding for long-form outputs
It mirrors how serious research actually happens: not as a single prompt, but as a guided process with intermediate reasoning and evidence refinement. That instantly felt compatible with my roadmap, especially Day 14 (research + KB generation) and Days 16–17 (adaptive interview intelligence), where the quality of upstream understanding determines everything downstream.
I don’t see STORM as something I’ll blindly copy-paste. I see it as a conceptual template I can adapt into my own system architecture.
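If I do adapt it, the shape would be something like the loop below: decompose the topic into perspectives, let each perspective drive questions and follow-ups, then synthesize from the accumulated notes. This is my own sketch of the idea, not code from the stanford-oval/storm repo, and llm and search are just placeholder callables:

```python
# A STORM-inspired loop adapted to course research, not a port of the
# stanford-oval/storm codebase. `llm` and `search` are placeholder callables
# (prompt -> text, query -> evidence text); every prompt here is illustrative.

def storm_style_research(topic: str, llm, search, rounds: int = 2) -> dict:
    notes: dict[str, list[str]] = {}

    # 1. Topic decomposition: look at the topic from a few expert perspectives.
    perspectives = llm(f"List 3 distinct expert perspectives on '{topic}', one per line").splitlines()

    for perspective in perspectives:
        notes[perspective] = []
        question = llm(f"As {perspective}, ask one key question about '{topic}'")

        # 2. Multi-step evidence gathering: each answer seeds the next question.
        for _ in range(rounds):
            evidence = search(question)
            notes[perspective].append(evidence)
            question = llm(
                f"Given this evidence:\n{evidence}\n"
                f"Ask the most important follow-up question about '{topic}'"
            )

    # 3. Iterative synthesis grounded in the perspective-tagged notes.
    draft = llm(f"Write a grounded knowledge-base entry on '{topic}' using these notes:\n{notes}")
    return {"notes": notes, "draft": draft}
```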
Where I think EdTech “agentic” systems become real
My current conviction:
Agentic AI in EdTech becomes real only when three things coexist:
- Reliable retrieval (fresh + trustworthy evidence)
- Pedagogical reasoning (not generic summarization)
- Operational guardrails (cost, quality, observability, fallbacks)
If any one of these is missing, the experience feels fragile. A lot of demos look magical. My goal is different: I want this to work repeatedly for real course creators, with predictable cost and transparent quality.
What I’m doing next (immediate build intentions)
Over the next implementation stretch, I want to harden the research layer before scaling complexity:
- Add stronger research quality checks (source count, credibility mix, coverage)
- Introduce fallback strategy when primary search output is weak
- Improve parse resilience (schema validation + repair + safe defaults; sketch after this list)
- Surface uncertainty clearly in the UI
- Feed human edits back into future prompt/retrieval tuning
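For the parse-resilience item, the pattern I have in mind looks roughly like this. I'm using pydantic purely as an example validator, and the ResearchSummary fields are illustrative, not my actual schema:

```python
# Sketch: validate model output against a schema, try one cheap repair, then
# fall back to safe defaults instead of crashing the pipeline. pydantic is
# used only as an example validator; the fields are illustrative.
from pydantic import BaseModel, ValidationError

class ResearchSummary(BaseModel):
    key_points: list[str] = []
    sources: list[str] = []
    confidence: float = 0.0  # surfaced in the UI, never hidden

def strip_markdown_fences(text: str) -> str:
    # Common drift: the model wraps its JSON in ```json ... ``` fences.
    cleaned = text.strip().removeprefix("```json").removeprefix("```")
    return cleaned.removesuffix("```").strip()

def parse_research_output(raw: str) -> ResearchSummary:
    for candidate in (raw, strip_markdown_fences(raw)):
        try:
            return ResearchSummary.model_validate_json(candidate)
        except (ValidationError, ValueError):
            continue
    # Safe default: an explicitly empty, zero-confidence result the UI can flag.
    return ResearchSummary()
```

The point is that a malformed response degrades into an explicitly empty, low-confidence object the UI can flag, instead of crashing the pipeline or silently shipping junk.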
In short: less “one-shot magic,” more “resilient system design.”
Personal note to future me
This phase is uncomfortable because it’s less glamorous than shipping shiny features. But this is probably the most important part of the build.
If I can get this right, the rest of the pipeline — adaptive interviews, outlines, storyboards, revision intelligence — will stand on a solid foundation.
If I ignore it, everything downstream will look polished but feel unreliable.
So this is the work.
And honestly, this is also the part I’ll be proud of when I look back.
Closing thought
I started this project to prove I can build true agentic AI products, not wrappers.
Right now, that proof is being forged in the messy middle: rate limits, malformed outputs, quality filters, and architecture decisions that users will never see — but will absolutely feel.
That’s the diary entry for today.