# BRUZZ scraping instructions

- Primary list source: `https://www.bruzz.be/rss.xml`.
- Important: the RSS feed carries only the title, URL, date, and a short description of each item. It does **not** contain the full story.
- Workflow:
  1. Fetch RSS.
  2. Keep only items published on the target date whose URL is under `https://www.bruzz.be/actua/`.
  3. Scrape every selected article page with Lightpanda: `/usr/local/bin/lightpanda fetch --dump markdown --obey_robots <article-url>`.
  4. Use the scraped page text for summaries/excerpts, falling back to RSS description only if page scraping fails.
- Do **not** use `https://www.bruzz.be/actua` as the main listing page: as of 2026-05-14 it returns a Drupal 404 page.
- Keep Brussels news only; skip navigation, live radio blocks, ads, and repeated teasers.
- Language: the source is Dutch. Translate summaries into the target review language when writing the final review.
- Cache paths:
  - raw RSS: `docs/cache/press-raw/bruzz-rss.xml`
  - article metadata: `docs/cache/press-raw/bruzz-articles.json`
  - full scraped article pages: `docs/cache/press-raw/bruzz-pages/*.md`
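
A minimal Python sketch of workflow steps 2 and 3, assuming the feed is standard RSS 2.0 with `<item>` elements that carry `<link>`, `<pubDate>` (RFC 822 format), `<title>`, and `<description>`. The helper names `select_items` and `lightpanda_command` are hypothetical, and comparing the date in the timestamp's own UTC offset is an assumption rather than a documented rule. The Lightpanda arguments are copied verbatim from the workflow above.

```python
from datetime import date
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET

ARTICLE_PREFIX = "https://www.bruzz.be/actua/"
LIGHTPANDA = "/usr/local/bin/lightpanda"


def select_items(rss_xml: str, target: date) -> list[dict]:
    """Return RSS items published on `target` whose link is under ARTICLE_PREFIX."""
    root = ET.fromstring(rss_xml)
    selected = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        pub = item.findtext("pubDate")
        if not link.startswith(ARTICLE_PREFIX) or not pub:
            continue
        try:
            published = parsedate_to_datetime(pub.strip())
        except (TypeError, ValueError):
            # Unparseable date: skip the item rather than guess.
            continue
        # Assumption: compare the calendar date in the feed's own UTC offset.
        if published.date() != target:
            continue
        selected.append({
            "url": link,
            "title": (item.findtext("title") or "").strip(),
            "published": published.isoformat(),
            # Kept as the fallback text if page scraping fails.
            "description": (item.findtext("description") or "").strip(),
        })
    return selected


def lightpanda_command(url: str) -> list[str]:
    """Build the argument list for scraping one article page (hypothetical helper)."""
    return [LIGHTPANDA, "fetch", "--dump", "markdown", "--obey_robots", url]
```

The command list can be passed to `subprocess.run(..., capture_output=True, text=True)`; on a non-zero exit code or empty output, fall back to the item's `description` as described in step 4.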
