There are too many good finance podcasts. That sounds like a small problem until you care about markets. Then it becomes a research infrastructure problem.
Useful ideas rarely arrive as clean database rows. They show up as a guest riffing on a company, a host pushing back on consensus, or an analyst casually naming the bottleneck that makes a whole theme work. Some of it is noise. Some of it is salesmanship. Some of it is wrong. But some of it is early signal, and no human can keep up with all of it.
So I built Token Machine: a local agent pipeline that listens to finance shows, transcribes them, identifies speakers, extracts financial claims, separates broad themes from explicit instruments, rejects dirty rows, and stores everything in a local database. The goal is not an agent that says "buy this." The goal is a research machine that leaves receipts.
This is a technical project writeup, not investment advice. The performance figures below are historical paper backtests and diagnostics, not live brokerage performance or a prediction of future returns.
The Shape Of The System
The pipeline is local-first because the valuable part is not just the model output. It is the audit trail: transcript text, speaker metadata, raw model payloads, ticker-resolution reasons, scoreable flags, outcome rows, backtests, and portfolio decisions all joined together in one local memory system.
The Leaderboard Was The Trap
The first tempting product was a leaderboard: which finance shows make the best market calls? That is fun, but it quickly exposed the harder problem. A leaderboard only matters if the data underneath it refuses to lie.
Some early rows attributed calls to host groups instead of one human. Some broad themes were forced into proxy tickers. Some private companies were treated as if they had public baseline prices. Some transcript chunks contained cross-talk that had been collapsed into one block. The system got more useful when it started rejecting its own outputs.
The drop from 2,056 extracted calls to 232 strict scoreable calls is the story: only about 11% of what the model extracted survived the gates. In finance, an agent that extracts more is not automatically better. An agent that knows when to reject its own output is far more useful.
What Makes A Call Scoreable?
Token Machine has two lanes. If a speaker explicitly names a public company, ETF, index, crypto, commodity fund, or raw ticker, the extraction can enter the explicit_instrument lane. If the speaker is talking about a sector, country, style, private company, or broad macro theme, it goes into the macro_theme lane.
The model is not allowed to guess proxy tickers. If someone says "semiconductors," the system cannot silently turn that into NVDA or SOXX. If someone says "Nvidia," a reviewed deterministic alias can map that to NVDA. The local model extracts what was said. The resolver decides whether it is tradeable.
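The lane and resolver split described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual resolver: the `REVIEWED_ALIASES` table, the `kind` labels, and the `Resolution` shape are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical reviewed alias table. The real project's aliases are not shown here.
REVIEWED_ALIASES = {"nvidia": "NVDA", "nvda": "NVDA"}

# Mention kinds the writeup lists as eligible for the explicit lane (labels assumed).
EXPLICIT_KINDS = {"public_company", "etf", "index", "crypto", "commodity_fund", "ticker"}


@dataclass
class Resolution:
    lane: str               # "explicit_instrument" or "macro_theme"
    ticker: Optional[str]   # None whenever nothing tradeable was resolved
    reason: str             # stored so every rejection leaves a receipt

    @property
    def scoreable(self) -> bool:
        return self.lane == "explicit_instrument" and self.ticker is not None


def resolve(mention: str, kind: str) -> Resolution:
    """Map what the speaker said to a ticker only via reviewed aliases."""
    if kind not in EXPLICIT_KINDS:
        # Sectors, countries, styles, private companies, macro themes: no proxy guessing.
        return Resolution("macro_theme", None, f"kind '{kind}' is not an explicit instrument")
    ticker = REVIEWED_ALIASES.get(mention.strip().lower())
    if ticker is None:
        return Resolution("explicit_instrument", None, "no reviewed alias; refusing to guess")
    return Resolution("explicit_instrument", ticker, "matched reviewed alias")


print(resolve("Nvidia", "public_company"))
print(resolve("semiconductors", "sector"))
```

The point of the design is that the lookup is a plain dictionary, so the same input always resolves the same way and every rejection carries a reason string.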
| Gate | Rows Flagged At Publication |
|---|---|
| Strict resolver rejects current scoreable rows | 0 |
| Strict resolver would change ticker | 0 |
| Scoreable rows missing speaker entity | 0 |
| Outcomes attached to unscoreable parents | 0 |
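A publication gate of this kind can be expressed as a simple all-zeros check. The sketch below assumes the four counts have already been computed elsewhere; the dictionary keys and function names are illustrative, not the project's actual identifiers.

```python
from typing import List, Mapping


def failing_gates(gate_counts: Mapping[str, int]) -> List[str]:
    """Return the names of gates whose violation count is not zero."""
    return [name for name, count in gate_counts.items() if count != 0]


def publication_allowed(gate_counts: Mapping[str, int]) -> bool:
    """Publishing is allowed only when every gate reports zero violations."""
    return not failing_gates(gate_counts)


# Counts mirror the table above; the key names are assumed for illustration.
gates = {
    "strict_resolver_rejects_scoreable_rows": 0,
    "strict_resolver_would_change_ticker": 0,
    "scoreable_rows_missing_speaker_entity": 0,
    "outcomes_attached_to_unscoreable_parents": 0,
}

print(publication_allowed(gates))
```

Returning the list of failing gate names, rather than only a boolean, keeps the block reason visible when a leaderboard is refused.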
Where The Trading Proof Comes In
The podcast agent is upstream research infrastructure. It does not place trades. But the project includes a historical proof harness that asks a stricter question: can pipeline-derived evidence become a replayable trading rule set, with a ledger, drawdowns, benchmark comparison, and position-level winners and losers?
Historical Proof Pack
Strategy: pipeline_barbell_max_deployment
Window: 2020-01-01 through 2026-05-06
Starting equity: $10,000
Benchmark: SPY ETF proxy for the S&P 500
Headline Result
The cached proof report ended at $48,743.88, compared with $24,653.06 for SPY over the same window. That is +387.4% for the strategy versus +146.5% for SPY.
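The percentages follow directly from the ending balances. As a quick check, assuming both the strategy and the SPY benchmark start from the same $10,000:

```python
def total_return_pct(start_equity: float, end_equity: float) -> float:
    """Simple total return over the window, in percent."""
    return (end_equity / start_equity - 1.0) * 100.0


START = 10_000.00
strategy_pct = total_return_pct(START, 48_743.88)
spy_pct = total_return_pct(START, 24_653.06)

print(f"strategy: {strategy_pct:+.1f}%")   # strategy: +387.4%
print(f"SPY:      {spy_pct:+.1f}%")        # SPY:      +146.5%
print(f"gap:      {strategy_pct - spy_pct:+.1f} percentage points")
```

These are total returns over the whole window, not annualized figures.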
Best And Worst Ideas
A useful backtest should expose the uncomfortable parts, not just the top number. The proof report includes position lifecycle tables, a trade ledger, allocation snapshots, drawdown charts, and contribution by ticker.
Best Idea
CVNA: +$12,344
2023-11-06 to 2024-02-05. Roughly +637.9% on bought notional.
Worst Idea
F: -$933
2025-11-03, still open in the packaged report. Roughly -13.5% on bought notional.
Best Realized Sell
META: +$1,897
2023-08-07 sell, 9.1228 shares at $314.10.
Worst Realized Sell
DHI: -$207
2026-02-02 sell, 16.5947 shares at $149.34.
The Strategy Stack
The historical proof is one layer of a larger research system. The current stack separates discovery, scoring, portfolio construction, and allocation review.
| Layer | What It Does |
|---|---|
| Market-call scoring | Measures strict podcast calls against SPY-relative forward outcomes. |
| Portfolio signal modes | Convert eligible evidence into weekly portfolio targets and ablations. |
| Capital queue | Ranks add, trim, hedge, research, hold, and blocked decisions. |
| Pipeline barbell | Selects single-stock leader and venture names from pipeline evidence. |
| Alpha Committee | Uses local or higher-reasoning model review only after deterministic selection. |
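For the market-call scoring layer, a SPY-relative forward outcome can be sketched as the call's forward return minus SPY's return over the same horizon. This is an assumed formulation for illustration only: the function names, the sign flip for bearish calls, and the example prices are hypothetical and not taken from the project.

```python
def forward_return(entry_price: float, exit_price: float) -> float:
    """Simple return between the call date and the end of the horizon."""
    return exit_price / entry_price - 1.0


def spy_relative_outcome(
    ticker_entry: float,
    ticker_exit: float,
    spy_entry: float,
    spy_exit: float,
    direction: str = "long",
) -> float:
    """Excess return versus SPY; sign flipped for bearish calls (an assumption)."""
    excess = forward_return(ticker_entry, ticker_exit) - forward_return(spy_entry, spy_exit)
    return excess if direction == "long" else -excess


# Hypothetical prices: the stock gains 10% while SPY gains 5% over the horizon.
outcome = spy_relative_outcome(100.0, 110.0, 400.0, 420.0)
print(f"{outcome:+.2%}")
```

Scoring against SPY instead of raw return means a bullish call that merely rode a rising market does not count as skill.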
In a fresh no-store diagnostic run through 2026-05-12, the SPY-core modes carried most of the return while pure podcast/alpha modes were much more muted. I like that result because it keeps the project honest. The podcast system is not magic. It improves research intake. Portfolio construction, sizing, taxes, liquidity, and execution still matter.
What I Built
- Local podcast ingestion, transcription, diarization, and speaker polishing pipeline.
- Structured market-call extraction with strict explicit-instrument and macro-theme lanes.
- Deterministic ticker resolver that rejects proxy guesses and stores rejection reasons.
- Local PGlite/Postgres memory for transcripts, claims, outcomes, forecasts, and portfolio state.
- Publication audit that blocks public leaderboards unless strict resolver and speaker gates are clean.
- Historical trading proof pack with equity curve, drawdown, trade ledger, and position lifecycle.
- Capital queue and Alpha Committee review layer for bounded paper-trading decisions.
Why This Belongs In My Portfolio
My core work is data systems and geospatial software, but this project hits the same engineering muscles: messy source data, entity resolution, spatial-like attribution problems, quality gates, local databases, dashboards, reproducible analysis, and user-facing storytelling.
The project started with a fun question: can I rank finance shows by what happened after they made calls? The better question became: can a local agent pipeline make my research intake harder to fool?
That is the part I am proud of. The output is not just a clever summary. It is a system that can say no.
Final caveat: this is historical paper research. It uses cached market data, simplified fills, and diagnostic assumptions. The point is process: local agents extract, deterministic systems gate, backtests decide, and humans remain accountable.