Saturday afternoon, 3:18 PM. I sent a Telegram message to my AI orchestration system:
"Can we try making a TikTok video? I think we can do something around looking at the New York Times obituaries dataset."
By 5:38 PM I had:
- 2,102 obituaries pulled from the NYT API and embedded in 1,536-dimensional vector space
- A Reddit research brief grounding the project in existing academic work
- A complete analysis pipeline with embedding clusters, statistical breakdowns, and temporal trends
- An IEEE-format research paper with six original figures
- A TikTok video, produced and uploaded
- A second-pass peer review that caught 12 issues in the paper before I ever opened it
Two hours and twenty minutes. No research assistants. No grad students. One instruction.
This article is about the pipeline that produced all of it, because understanding what an agentic system can actually do with a single instruction matters more than any individual output.
What Actually Ran
Four minutes after my message, the system had decomposed the request into six jobs with dependency chains and dispatched them:
3:18 PM Message received
3:22 PM Decomposition complete. Six jobs dispatched:
[parallel]
├── Job 6323: NYT API data collection (2000-2006)
├── Job 6324: Reddit research scan
│ (r/dataisbeautiful, r/media_criticism,
│ r/journalism, r/todayilearned)
│
[sequential, waiting on 6323]
├── Job 6325: Embedding generation + statistical analysis
│
[sequential, waiting on 6325]
├── Job 6326: IEEE paper drafting
├── Job 6328: TikTok script + video production
│
[sequential, waiting on 6326]
└── Job 6327: Peer review (different model)
4:57 PM Reddit research delivered
5:08 PM 2,102 obituaries collected, embedded, analyzed
5:19 PM Paper drafted (IEEE format, 6 figures, 7 tables)
5:26 PM TikTok video produced and uploaded
5:38 PM Peer review complete (12 issues caught, all fixed)
The decomposition is the part that matters. I did not specify jobs, dependencies, or models. I said "TikTok video" and "obituaries." The system decided it needed data collection before analysis, analysis before paper drafting, and a separate model for peer review. It decided Reddit research could run in parallel with data collection. It decided TikTok production could branch off from analysis results without waiting for the paper.
Each job ran as an independent agent session with its own context window, tools, and model assignment. The orchestration engine managed capacity, tracked dependencies, and routed outputs between jobs. When Job 6323 finished collecting obituaries, the engine automatically released Job 6325 to start embedding them.
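The release logic an engine like this needs is small. Here is a minimal sketch of dependency-aware scheduling, assuming nothing about the real system beyond the job graph described above (the job labels are illustrative, not actual identifiers):

```python
from collections import defaultdict

class JobQueue:
    """Minimal dependency-aware scheduler: a job becomes runnable only
    when every job it depends on has completed."""

    def __init__(self):
        self.deps = {}                      # job -> unfinished prerequisites
        self.dependents = defaultdict(set)  # job -> jobs blocked on it
        self.ready = set()                  # runnable jobs

    def add(self, job, depends_on=()):
        pending = set(depends_on)
        self.deps[job] = pending
        for d in pending:
            self.dependents[d].add(job)
        if not pending:
            self.ready.add(job)

    def complete(self, job):
        # A finished job may unblock the jobs waiting on it.
        self.ready.discard(job)
        for waiter in self.dependents.pop(job, set()):
            self.deps[waiter].discard(job)
            if not self.deps[waiter]:
                self.ready.add(waiter)

# The six jobs from the timeline above (labels are illustrative):
q = JobQueue()
q.add("6323_collect")
q.add("6324_reddit")
q.add("6325_analyze", depends_on=["6323_collect"])
q.add("6326_paper", depends_on=["6325_analyze"])
q.add("6328_tiktok", depends_on=["6325_analyze"])
q.add("6327_review", depends_on=["6326_paper"])

q.complete("6323_collect")
print(sorted(q.ready))  # ['6324_reddit', '6325_analyze']
```

Completing the data-collection job releases the analysis job while the Reddit scan stays runnable in parallel, which is exactly the behavior described above.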
The interesting part of agentic AI is not what any single model can do. It is what happens when you coordinate six of them around a shared goal with dependency-aware scheduling.
The Reddit Scan Changed Everything
Before the system touched a single obituary, Job 6324 swept Reddit for what people actually argue about when it comes to death and newspapers.
Three signals came back that directly shaped the research:
The selection problem is what people care about. The most upvoted obituary content on Reddit (49,770 upvotes for the Emmy Noether story) focuses on who gets erased, not how they are described. This pointed the analysis toward gatekeeping, not treatment.
Prior art exists but asks a different question. A 2025 PNAS paper (Markowitz et al.) analyzed 38 million obituaries, but used paid family notices from Legacy.com, not editorial obituaries. Different dataset, different methodology. The NYT editorial obituary is a curated judgment about whose life merits public record. A family-submitted notice is a paid announcement. The distinction matters.
The euphemism angle has traction. "Died suddenly" tracking and opioid death language pulled 35,000 upvotes with 1,194 comments. People are fascinated by what obituaries refuse to say directly.
Without this scan, the paper would have been a generic gender analysis. With it, the paper differentiated itself from the closest prior work and asked a sharper question: in the NYT specifically, does the inequality show up in how women are covered, or in whether they are covered at all?
What the Pipeline Produced
The analysis — conducted by Penelope Lawrence using this pipeline — found that NYT obituary gender inequality is driven by selection (78% male subjects) rather than treatment (word counts were statistically identical, p=0.80), with domain-specific patterns in embedding space revealing how narrowly the paper defines obituary-worthy women. The full research, including five distinct findings and an IEEE-format paper, is available on her research page.
What matters here is that the pipeline produced all of it — data collection, embedding, statistical analysis, paper drafting, and peer review — from a single instruction in under two and a half hours.
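The selection-versus-treatment distinction reduces to two standard tests. A minimal sketch with stylized stand-in data, assuming SciPy (the real analysis ran on the 2,102-record corpus and reported p=0.80; the group sizes below only mirror the headline 78/22 split):

```python
import numpy as np
from scipy import stats

# Stylized stand-in data: identical length distributions for both
# groups (treatment parity), but a 78/22 split in who gets covered
# (selection skew). These are not the real word counts.
male_wc = np.tile([700.0, 800.0, 900.0, 1000.0], 195)   # 780 obituaries
female_wc = np.tile([700.0, 800.0, 900.0, 1000.0], 55)  # 220 obituaries

# Treatment: Welch's t-test on word counts. Identical distributions
# give t = 0, p = 1 -- no difference in how subjects are covered.
t, p = stats.ttest_ind(male_wc, female_wc, equal_var=False)
print(f"treatment: t={t:.2f}, p={p:.2f}")  # t=0.00, p=1.00

# Selection: binomial test of the 78% male share against parity.
sel = stats.binomtest(780, n=1000, p=0.5)
print(f"selection: p={sel.pvalue:.1e}")  # vanishingly small
```

A non-significant treatment test alongside an overwhelming selection test is the statistical signature of the finding: the inequality lives in who gets an obituary, not in how long it is.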
The Second Pass Caught Real Problems
After the paper was drafted, a separate job ran a peer review on a different model (Claude Sonnet 4.6, whereas the paper was written by Claude Opus 4.6). It caught 12 issues, including:
Statistical gaps. The gender comparison table was missing standard deviations. Added.
Writing tells. 15 em dashes throughout the paper. Replaced with proper punctuation.
A trivially true claim presented as empirical. The paper noted that "widow of" appears only in female obituaries as if this were a discovery. "Widow" refers to a woman by definition. The reviewer flagged it and the framing was corrected to focus on the frequency and cultural implications, not the directionality.
Missing visualization. The UMAP embedding space plot was referenced in the text but not included as a figure. Added as Figure 7.
This is why multi-model review matters. A single model produces output that looks right to itself. A different model, with different training and different blind spots, catches assumptions the first one does not question. The trivially-true-claim catch is the perfect example. Opus wrote it, read it, and thought it was a finding. Sonnet immediately saw the tautology.
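The gate itself is just a loop with a reviewer distinct from the author. A hypothetical sketch: `review_gate`, `review_fn`, and `revise_fn` are names I am introducing here, not the system's actual API, and the toy stubs only hunt em dashes, echoing one of the 12 findings:

```python
def review_gate(draft, review_fn, revise_fn, max_rounds=3):
    """Route a draft through a reviewer model distinct from the author,
    looping until the reviewer reports no issues (or rounds run out)."""
    issues = []
    for _ in range(max_rounds):
        issues = review_fn(draft)         # second model: find problems
        if not issues:
            break
        draft = revise_fn(draft, issues)  # first model: apply fixes
    return draft, issues

# Toy stand-ins for the two model calls:
def toy_review(draft):
    return ["em dash found"] if "—" in draft else []

def toy_revise(draft, issues):
    return draft.replace("—", ", ")

final, open_issues = review_gate("selection—not treatment—drives it",
                                 toy_review, toy_revise)
print(final, open_issues)  # selection, not treatment, drives it []
```

The design choice that matters is that `review_fn` and `revise_fn` are bound to different models, so the reviewer's blind spots do not coincide with the author's.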
The Limitations Are Real
No pipeline eliminates the constraints of its inputs. The gender classifier failed on 43.8% of records. The date range covers only seven years. The embeddings carry their training data's biases. The research paper — authored by Penelope Lawrence — documents all of these limitations transparently.
From a pipeline perspective, the system handled these constraints correctly: it surfaced the limitations in the analysis output rather than hiding them, and the multi-model review pass caught presentation issues the drafting model missed.
What This Tells You About Agentic AI
I have been running a production orchestration system with 500+ AI workers across four model families for several months now. It handles email triage, web publishing, trading analysis, content production, and research. This obituary project was unusual only in that it started from a single casual message and ended with a research paper.
Three things made this possible that would not have worked a year ago:
Automatic decomposition. I did not plan six jobs. I described what I wanted in plain language and the system figured out the dependency graph. Data collection before analysis. Analysis before paper. Review by a different model after the paper. TikTok branching from analysis without waiting for the paper. Getting this decomposition right is the difference between a chatbot and an orchestration system.
Multi-model coordination. The Reddit scan ran on one model. Embeddings used OpenAI's API. The paper was drafted by Claude Opus. The review was done by Claude Sonnet. The TikTok narration used ElevenLabs. Five different AI services coordinated through a single job queue with dependency tracking. No one model could have done all of this.
Quality gates that actually work. The peer review pass was not decorative. It caught a tautological claim, missing statistics, and presentation issues. The paper that shipped was meaningfully better than the first draft. Single-pass AI output is a draft. Treating it as final is how you publish embarrassing work.
The bottleneck in research is no longer execution. It is judgment: knowing which question to ask and whether the answer is real.
Data collection takes minutes now. Statistical analysis takes minutes. Even paper writing takes minutes. What takes actual thought is the upstream decision: is this question worth asking? And the downstream judgment: is this finding real, or is it an artifact of the method?
The obituary corpus was chosen because it is one of the richest cultural texts a newspaper produces — every obituary is an editorial decision about whose life merits the public record. Penelope Lawrence designed the research questions. The system I built executed the measurement.
What Was Produced
For the record, here is the full artifact list from one Telegram message:
| Artifact | Detail |
|---|---|
| Raw data | 2,102 NYT obituaries, 3.0 MB JSONL |
| Embeddings | 2,102 x 1,536 float32 matrix, 12.9 MB |
| Analysis | 27 clusters, gender stats, temporal trends, domain hierarchy |
| Visualizations | 7 figures (UMAP, gender, temporal, word count, keywords, relationships, domain) |
| Research paper | IEEE two-column format, LaTeX source + compiled PDF, 7 figures, 7 tables |
| Peer review | 12 issues identified and resolved by a second model |
| TikTok video | Data narration with ElevenLabs voice, produced and uploaded |
| Reddit brief | Prior art scan across 4 subreddits, 17 KB |
All from "can we try making a TikTok video about NYT obituaries."
The system over-delivered because that is what well-designed orchestration does. It recognized that a TikTok video about obituaries requires data, that data enables analysis, that analysis supports a paper, and that a paper needs review. Each step created the preconditions for the next. I asked for a video. I got a research pipeline.
That is the difference between prompting a model and instructing a system.