OpenKitMule Guide · 2026-07-03
Local LLM News Summarization at Zero API Cost: Ollama and a 24-Hour Rolling Window
If you build any product that consumes news feeds, three costs show up quickly:
- API calls to summarize each article. A few hundred articles a day on Claude Sonnet costs around a dollar; at hundreds of articles per hour, it stops being small.
- Duplicate work. The same story appears across ten sources with slightly different wording. Naive pipelines resummarize each copy.
- Cloud dependency. Every article you scrape goes through a third party. If the vendor throttles, breaks, or changes pricing, your pipeline breaks with it.
The fix is to run the summarizer locally on Ollama, dedupe on a rolling 24-hour window, and keep the entire pipeline on one machine. This post walks through the design and links to a free working kit at the end.
1. Why Ollama is the practical baseline
Ollama gives you a stable HTTP endpoint at http://localhost:11434 with the same request shape whether you are running Llama 3, Mistral, Qwen, DeepSeek-Coder, or Phi. That interface stability is what makes it a good foundation for a real pipeline.
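As a minimal sketch of that interface, the snippet below posts a non-streaming request to Ollama's `/api/generate` route using only the Python standard library. The helper names `build_request` and `generate` are illustrative, and the model tags in the usage note are assumptions about what you have pulled locally.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request; only `model` changes between models."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With a server running, `generate("llama3.1:8b", "Say hello.")` and `generate("mistral:7b", "Say hello.")` differ only in the model tag, which is the point of the stable interface.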
Rough hardware guide as of 2026:
| Model | Approx VRAM | Practical machine |
|---|---|---|
| Llama 3.1 8B (q4) | ~5 GB | Any M-series Mac, RTX 4060, or better |
| Qwen2.5 7B (q4) | ~5 GB | Same as above; strong for Chinese content |
| Mistral 7B (q4) | ~5 GB | Fast, good English summarization |
| Llama 3.1 70B (q4) | ~40 GB | Requires a workstation-class GPU |
For news summarization the 7-8B tier is the sweet spot. It handles a 500-1000 token article in 1-3 seconds on consumer hardware, which is fast enough that you never need to batch.
2. The 24-hour rolling window is the real trick
A news pipeline that summarizes every article it ever sees is a pipeline that pays to compress the same story ten times. The design that keeps compute low is:
- Store a fingerprint of every article seen in the last 24 hours. A cheap starting point is a SHA-256 of the normalized title plus the first 200 characters of the body. An exact hash only catches copies that are identical after normalization; rewritten variants need a fuzzier fingerprint.
- On new article arrival, check the fingerprint set. If a match exists, skip summarization entirely and only bump the "seen at" timestamp, so a story that is still circulating stays in the window instead of being summarized again.
- Age out entries older than 24 hours. This is a rolling window, not a total dedupe log. A story that legitimately re-emerges on day two gets summarized again; a story that just circulates for 12 hours does not.
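The three steps above can be sketched as follows. This is one possible reading, not the kit's actual code: the normalization rule (lowercase, collapsed whitespace) and the names `fingerprint` and `RollingWindow` are assumptions, and the exact hash only matches copies that are identical after normalization.

```python
import hashlib
import re
import time
from typing import Dict, Optional


def _normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(title: str, body: str) -> str:
    """SHA-256 over the normalized title plus the first 200 characters of the body."""
    key = _normalize(title) + "\n" + _normalize(body)[:200]
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


class RollingWindow:
    """Remembers fingerprints for `ttl` seconds after they were last seen."""

    def __init__(self, ttl: float = 24 * 3600):
        self.ttl = ttl
        self.seen: Dict[str, float] = {}  # fingerprint -> last "seen at" timestamp

    def is_new(self, fp: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        cutoff = now - self.ttl
        # Age out entries whose last sighting is older than the window.
        self.seen = {k: t for k, t in self.seen.items() if t >= cutoff}
        new = fp not in self.seen
        self.seen[fp] = now  # record a new entry, or bump an existing one
        return new
```

Only articles for which `is_new` returns `True` go on to the summarizer. Passing `now` explicitly makes the expiry logic easy to test without waiting a day.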
In practice this removes 60-80% of the LLM calls on a real Chinese A-share news feed. That is the difference between "a hobby" and "a pipeline that pays for its own hardware."
3. Prompt template that keeps output small
Local LLMs are cheaper than API models but not free. Output tokens are what actually cost you time on consumer hardware. Design the prompt to force a one-line summary:
Summarize the following news article in exactly one sentence,
under 30 words, in English. Do not add commentary or lists.
Title: {title}
Body: {body_first_500_chars}
Empirically this cuts output tokens by roughly 5x compared with an unbounded "summarize this article" prompt, and Llama 3.1 8B respects the constraint about 95% of the time. That is close enough for a topic dashboard.
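A small sketch of how the template might be filled and how the roughly 5% of non-compliant outputs could be caught on the client. The function names are illustrative, and counting whitespace-separated words is an assumed approximation of the "under 30 words" rule.

```python
PROMPT_TEMPLATE = (
    "Summarize the following news article in exactly one sentence,\n"
    "under 30 words, in English. Do not add commentary or lists.\n"
    "Title: {title}\n"
    "Body: {body_first_500_chars}"
)

MAX_WORDS = 30


def build_prompt(title: str, body: str) -> str:
    """Fill the template, truncating the body to its first 500 characters."""
    return PROMPT_TEMPLATE.format(
        title=title.strip(),
        body_first_500_chars=body.strip()[:500],
    )


def respects_limit(summary: str, max_words: int = MAX_WORDS) -> bool:
    """True if the summary is non-empty, a single line, and under the word limit."""
    text = summary.strip()
    return bool(text) and "\n" not in text and len(text.split()) < max_words
```

When `respects_limit` fails, retrying once or truncating at the first sentence is usually enough for a dashboard. Ollama also accepts a `num_predict` value under `options` in the request body, which caps generated tokens on the server side as a second guard.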
If you need structured output (keyword tags, sentiment, entity list), ask for JSON with an explicit schema and validate on the client. Local models drift more than frontier API models; validation is cheap insurance.
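Client-side validation can be as small as the sketch below. The schema (a `summary` string, a `tags` list, and a `sentiment` label) is a hypothetical example, not the kit's format; adapt the fields to whatever your prompt asks for.

```python
import json
from typing import Optional

# Hypothetical schema for illustration: field name -> required Python type.
SCHEMA = {"summary": str, "tags": list, "sentiment": str}
ALLOWED_SENTIMENT = {"positive", "neutral", "negative"}


def parse_structured(raw: str) -> Optional[dict]:
    """Return the validated fields, or None if the output drifted from the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # e.g. the model wrapped the JSON in chatty prose
    if not isinstance(data, dict):
        return None
    for field, expected in SCHEMA.items():
        if not isinstance(data.get(field), expected):
            return None
    if data["sentiment"] not in ALLOWED_SENTIMENT:
        return None
    if not all(isinstance(tag, str) for tag in data["tags"]):
        return None
    # Keep only the expected fields and drop anything extra the model added.
    return {field: data[field] for field in SCHEMA}
```

A `None` result is the signal to retry or to fall back to the plain one-sentence prompt. Setting `"format": "json"` in the Ollama request body helps the model emit parseable JSON, but it does not enforce your field names, so the check is still worth keeping.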
4. Six-source aggregation without a crawler farm
Most public news sources expose either an RSS feed, a lightweight JSON endpoint, or a stable HTML pattern. You do not need Playwright for any of these; a single Python process with requests + BeautifulSoup fetches six sources in a few seconds. Typical candidates, by category:
- Financial news feeds (Cailianshe, EastMoney, Snowball for A-share; Bloomberg RSS, Yahoo Finance RSS for US).
- General news wire (Reuters top-headlines RSS, AP wire).
- Domain-specific (Hacker News firehose, arXiv "new submissions" RSS by category).
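For the RSS case, a fetcher can be very small. The sketch below uses only the standard library (`urllib` and `xml.etree`) rather than requests + BeautifulSoup, purely to stay dependency-free. It assumes plain RSS 2.0 with `<item>` elements; Atom feeds, JSON endpoints, and HTML pages need their own parsers, and the helper names are illustrative.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_rss(xml_data) -> List[Dict[str, str]]:
    """Extract title, link, and description from each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(xml_data)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            # Many feeds carry only a teaser here; treat it as the article body.
            "body": (item.findtext("description") or "").strip(),
        })
    return articles


def fetch_feed(url: str, timeout: float = 10.0) -> List[Dict[str, str]]:
    """Download one feed and parse it; pass bytes so the XML parser handles encoding."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_rss(resp.read())
```

Calling `fetch_feed` once per source in a loop is enough at this scale; six sequential requests with a 10-second timeout fit comfortably inside a 60-second cycle.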
Fetch on a 60-second timer. Feed each new article into the dedupe check, then into the Ollama summarizer if it survives. Persist the one-line summary to disk (JSON, SQLite, or Markdown) so downstream consumers do not need to re-run the pipeline.
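One timer tick of that loop might look like the sketch below, with SQLite as the store. The table layout, the inline simplified fingerprint, and the `summarize` callback (any function taking a title and body and returning a one-line summary) are all assumptions for illustration.

```python
import hashlib
import sqlite3
import time


def open_store(path: str = "summaries.db") -> sqlite3.Connection:
    """Open (or create) the SQLite file that downstream consumers read from."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS summaries ("
        "fingerprint TEXT PRIMARY KEY, title TEXT, link TEXT, "
        "summary TEXT, created_at REAL)"
    )
    return conn


def run_cycle(articles, seen, summarize, conn, now=None, ttl=24 * 3600) -> int:
    """Dedupe, summarize survivors, persist. Returns how many were summarized."""
    now = time.time() if now is None else now
    # Age out fingerprints last seen more than `ttl` seconds ago.
    for old in [k for k, t in seen.items() if now - t > ttl]:
        del seen[old]
    count = 0
    for art in articles:
        key = art["title"].strip().lower() + art["body"][:200]
        fp = hashlib.sha256(key.encode("utf-8")).hexdigest()
        duplicate = fp in seen
        seen[fp] = now  # record or bump the "seen at" timestamp
        if duplicate:
            continue
        summary = summarize(art["title"], art["body"])
        conn.execute(
            "INSERT OR REPLACE INTO summaries VALUES (?, ?, ?, ?, ?)",
            (fp, art["title"], art.get("link", ""), summary, now),
        )
        count += 1
    conn.commit()
    return count
```

The outer loop is then just `run_cycle(...)` followed by `time.sleep(60)`. Keeping `seen` as a plain dict means the window resets on restart; persisting it in a second table is a straightforward extension if that matters.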
5. Cost comparison: local vs API summarization
Assume 500 articles/day survive dedupe. Each article: ~800 input tokens, ~50 output tokens (the one-sentence prompt above).
| Setup | Marginal cost / day | Latency | Notes |
|---|---|---|---|
| Local Llama 3.1 8B on M2 Mac | $0.00 (ambient electricity) | ~1.5 s / article | Runs indefinitely with no API bill |
| Gemini 1.5 Flash API | ~$0.04 | ~0.7 s / article | Cheap but non-zero and rate-limited |
| GPT-4o mini API | ~$0.08 | ~0.8 s / article | Higher quality summaries |
| Claude 3.5 Sonnet API | ~$1.66 | ~1.2 s / article | Overkill for one-sentence summaries |
Local is not cheaper because Sonnet is expensive; local is cheaper because you already paid for the hardware. The marginal cost per article stays near zero no matter how many you process, while an API bill grows linearly with volume.
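The API rows can be reproduced approximately from the stated token counts. The per-million-token prices below are assumed list prices, not figures from this post, so check current vendor pricing before relying on them; the results land in the same range as the table rather than matching it to the cent.

```python
ARTICLES_PER_DAY = 500
INPUT_TOKENS = 800   # per article
OUTPUT_TOKENS = 50   # per article, with the one-sentence prompt

# Assumed list prices in USD per million tokens: (input, output).
PRICES = {
    "Gemini 1.5 Flash": (0.075, 0.30),
    "GPT-4o mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}


def daily_cost(in_price: float, out_price: float,
               articles: int = ARTICLES_PER_DAY) -> float:
    """Daily API spend in USD for the given per-million-token prices."""
    per_article = (INPUT_TOKENS * in_price + OUTPUT_TOKENS * out_price) / 1_000_000
    return articles * per_article


for name, (in_price, out_price) in PRICES.items():
    print(f"{name}: ${daily_cost(in_price, out_price):.3f} per day")
```

Because the cost is linear in `articles`, scaling to hundreds of articles per hour multiplies every API row by the same factor while the local row stays flat.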
For a longer discussion of API cost math, see the LLM API cost estimation guide.
6. When to still use an API model
- You need multilingual summarization across 20+ languages with high accuracy. Frontier API models are still ahead here.
- You need reasoning over the article, not just compression. Local 7-8B models are good summarizers, mediocre reasoners.
- You need low variance for regulatory reporting. API models are more consistent between calls.
- You do not own the machine that would run the model 24/7.
Everything else - dashboards, personal readers, internal signal pipelines - is a natural fit for local Ollama.
7. A working kit you can run in five minutes
The News Refiner Kit implements the design above end-to-end: six-source aggregation, SHA-based 24-hour rolling dedupe, Ollama summarization with the one-sentence prompt, and JSON/Markdown output adapters. It ships with the OpenKitMule agent-prompt pattern, so you can paste one prompt into Claude Code or Cursor and let it install Ollama, pull the model, and start the pipeline.
For a smaller warm-up kit with no external dependencies (Python stdlib only), see PromptForge. For the workflow pattern behind these kits, see the prompt-path pattern write-up.
8. Related reading
- Estimating LLM API costs before you ship - if you decide to stay on the API side after all.
- The prompt-path pattern - how to hand a whole codebase to a coding agent in two lines.
- News Refiner Kit - the pipeline described above, downloadable free.
- Agent Scraper Kit - a heavier scraper when the six-source RSS path is not enough.