WPChunker

A simple enough WordPress pages/posts extractor and semantic chunker for LLM-ready applications. :)

Pulls all public WordPress posts & pages, scrubs them to plain text (no HTML, no shortcodes, no Gutenberg comments), and turns them into retrieval-friendly chunks. Supports three segmentation strategies:

  • Auto (semantic) — uses Nebius embeddings (Qwen/Qwen3-Embedding-8B) to cut on topic boundaries.
  • Agent — uses a local Ollama LLM to choose boundaries (no embeddings needed).
  • Uniform — splits at evenly spaced boundaries within the token budget.

Outputs one .txt where chunks are separated by a triple newline (two blank lines). Includes concurrency, progress bars, colored logs, and sensible filtering.
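
Downstream ingestion stays trivial; a minimal sketch of reading the output back, assuming the default chunks file name (chunks_all.txt):

from pathlib import Path

# Chunks are delimited by a triple newline (\n\n\n) in the consolidated file.
text = Path("chunks_all.txt").read_text(encoding="utf-8")
chunks = [c.strip() for c in text.split("\n\n\n") if c.strip()]
print(f"{len(chunks)} chunks ready to embed/index")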


Why this tool?

Made for RAG & search indexing. It:

  1. Finds content via the WP REST API, with an XML sitemap fallback (see the example after this list).

  2. Cleans aggressively: strips navigation, headers/footers, scripts/styles/iframes, Gutenberg comments (<!-- wp:... -->), and WordPress shortcodes (paired & self-closing).

  3. Splits to sentences, then:

    • Auto: computes sentence embeddings and finds semantic “valleys”.
    • Agent: asks a local LLM (Ollama) to pick natural boundaries.
  4. Packs chunks to a token budget with optional overlap.

  5. Skips low-value text (too short, boilerplatey).

  6. Emits a single .txt with \n\n\n separators.
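
The step-1 endpoints are standard WordPress routes you can probe by hand (paging follows the X-WP-TotalPages response header; the sitemap path shown is the WordPress-core default and can differ per site):

curl -s "https://example.com/wp-json/wp/v2/posts?status=publish&per_page=100&page=1"
curl -s "https://example.com/wp-json/wp/v2/pages?status=publish&per_page=100&page=1"
curl -s "https://example.com/wp-sitemap.xml"   # core sitemap index since WordPress 5.5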


Highlights

  • REST API + Sitemap fallback
  • TXT-only ingestion for agent mode; robust HTML→TXT cleaner everywhere
  • Embedding-based or LLM-based chunking
  • Concurrency + pretty progress bars (Rich)
  • Heuristics to drop near-empty / repetitive chunks
  • One file output for simple ingestion

Requirements

  • Python 3.9+

  • One of:

    • Nebius Studio API key (for --segmenter auto / uniform)
    • Ollama running locally (for --segmenter agent)

Optional (detected automatically):

  • tiktoken (better token counting; else a length/4 fallback, sketched after this list)
  • numpy (semantic mode; else uniform fallback)
  • ruptures (change-point fallback in semantic mode)
  • syntok (nicer sentence splitting)
  • rich (pretty console)
  • python-dotenv (load .env)
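
The detection pattern behind the tiktoken fallback looks roughly like this (a sketch, not the tool's actual code; the cl100k_base encoding choice is an assumption):

def make_token_counter():
    # Prefer tiktoken when available; otherwise approximate ~4 characters per token.
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        return lambda text: len(enc.encode(text))
    except ImportError:
        return lambda text: max(1, len(text) // 4)  # the length/4 fallback

count_tokens = make_token_counter()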

Installation

git clone https://gitgud.syszima.xyz/skeleton/WPChunker.git
cd WPChunker
python -m venv .venv
source .venv/bin/activate 
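# On Windows: .venv\Scripts\activate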
pip install -r requirements.txt

Configuration

Copy and edit .env (all have sane defaults unless noted):

cp .env.example .env
# === Required for semantic (auto/uniform) modes ===
NEBIUS_API_KEY=your_nebius_api_key_here

# Embeddings config
NEBIUS_BASE_URL=https://api.studio.nebius.com/v1/embeddings
NEBIUS_EMBEDDING_MODEL=Qwen/Qwen3-Embedding-8B

# HTTP
USER_AGENT=WPTextExtractor/1.0 (+https://example.com)

# Chunking defaults
CHUNK_MAX_TOKENS=512
CHUNK_OVERLAP_TOKENS=64
MIN_CHUNK_CHARS=100
WORKERS=6
CHUNKS_FILE=chunks_all.txt

# === Agent (Ollama) mode ===
OLLAMA_HOST=http://localhost:11434
AGENTIC_MODEL=llama3.1:8b
AGENTIC_TIMEOUT=120
AGENTIC_MAX_CHUNKS=0        # 0 = auto
OLLAMA_CTX=8192
OLLAMA_TEMPERATURE=0
AGENTIC_DEBUG=0
AGENTIC_DEBUG_DIR=agentic_debug

# Segmentation tuning (semantic "auto")
SEGMENT_ALPHA=1.0          
SEGMENT_WINDOWS=2,3,5     

Note: CLI flags always override .env.


Quick start

Auto (semantic embeddings)

python3 wp-chunker.py https://example.com \
  -o out_dir \
  --segmenter auto \
  --chunk-max-tokens 512 \
  --chunk-overlap-tokens 64 \
  --min-chunk-chars 100 \
  --workers 6 \
  --chunks-file chunks_all.txt

Agent (local LLM via Ollama, no embeddings required)

# Make sure Ollama is running and the model is pulled:
#   ollama pull llama3.1:8b

python3 wp-chunker.py https://example.com \
  -o out_dir \
  --segmenter agent \
  --agentic-model llama3.1:8b \
  --ollama-host http://localhost:11434 \
  --agentic-timeout 120 \
  --chunk-max-tokens 512 \
  --chunk-overlap-tokens 64
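
If the agent run stalls, first verify that Ollama is reachable and the model is present:

curl -s http://localhost:11434/api/tags   # lists locally pulled models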

Uniform (even spacing; embeddings optional)

python3 wp-chunker.py https://example.com \
  -o out_dir \
  --segmenter uniform \
  --chunk-max-tokens 512 \
  --chunk-overlap-tokens 64

CLI Reference (selected)

  • url (positional) — Base URL of the WordPress site.
  • -o, --out — Output directory (stores posts/, pages/, and the final chunks file).
  • --segmenter {auto,uniform,agent} — Choose segmentation strategy (default: auto).
  • --no-sitemap-fallback — Disable the XML-sitemap fallback that runs when the REST API fails.
  • --nebius-key — Nebius API key (overrides env).
  • --embedding-model — Embedding model id.
  • --chunk-max-tokens — Target max tokens per chunk.
  • --chunk-overlap-tokens — Overlap tokens between adjacent chunks.
  • --min-chunk-chars — Drop chunks shorter than this many chars.
  • --workers — Concurrent workers for chunking.
  • --chunks-file — Name of the single consolidated .txt.

Agent-specific:

  • --agentic-model — Ollama model name/tag (e.g., llama3.1:8b).
  • --ollama-host — Ollama host URL.
  • --agentic-timeout — Seconds to wait for LLM response.
  • --agentic-max-chunks — Cap the number of chunks the agent may return (0 = auto).
  • --ollama-ctx — Requested context window.
  • --ollama-temperature — Sampling temperature (0 = deterministic).
  • --debug-agentic / --agentic-debug-dir — Save agent prompts/outputs for inspection.
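
For example, to capture the agent's prompts and raw replies from a run (uses only the flags listed above):

python3 wp-chunker.py https://example.com \
  -o out_dir \
  --segmenter agent \
  --debug-agentic \
  --agentic-debug-dir agentic_debug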

Cleaning guarantees

  • Removes: <script>, <style>, <noscript>, <iframe>, <svg>, <template>, <form>, and common boilerplate containers.
  • Strips Gutenberg comments like <!-- wp:paragraph -->.
  • Strips WordPress shortcodes (paired and self-closing), e.g. [gallery ids="..."] ... [/gallery], [contact-form-7 id="..."], [shortcode /] (sketched after this list).
  • Decodes HTML entities, normalizes whitespace, collapses blank lines.
  • In agent mode, content is additionally passed through a last-mile sanitizer to guarantee TXT-only ingestion before the LLM sees it.
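
A minimal Python sketch of the comment/shortcode stripping (illustrative regexes, not the tool's exact patterns; a real cleaner must also decide per shortcode whether inner text is kept):

import re

GUTENBERG = re.compile(r"<!--\s*/?wp:.*?-->", re.DOTALL)              # <!-- wp:paragraph --> and closers
PAIRED    = re.compile(r"\[(\w[\w-]*)[^\]]*\].*?\[/\1\]", re.DOTALL)  # [gallery ...] ... [/gallery]
SINGLE    = re.compile(r"\[/?\w[\w-]*[^\]]*\]")                       # [shortcode /] and orphan closers

def strip_wp_markup(text: str) -> str:
    text = GUTENBERG.sub("", text)
    text = PAIRED.sub("", text)   # drops paired shortcodes with their inner content
    return SINGLE.sub("", text)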

How it works (pipeline)

  1. Fetch: REST API (/wp-json/wp/v2/{posts,pages}) for status=publish; fallback to XML sitemaps.

  2. Clean: robust HTML→TXT + shortcode/Gutenberg removal.

  3. Split: sentence segmentation (optionally via syntok).

  4. Segment:

    • Auto: embeddings → similarity → valley detection (+ change-point fallback via ruptures).
    • Agent: send numbered sentences to Ollama; model returns index spans.
  5. Pack: respect --chunk-max-tokens, add tail overlap (--chunk-overlap-tokens); steps 4 and 5 are sketched after this list.

  6. Filter:

    • Skip pages with < 2 sentences.
    • Skip chunks with < MIN_CHUNK_CHARS (default 100).
    • Drop repetitive/boilerplatey segments.
  7. Write: a single text file with triple-newline separators.
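
A minimal Python sketch of steps 4 and 5 on the auto path, under simplifying assumptions (boundaries at local similarity minima below mean minus one standard deviation; the real scorer involves SEGMENT_ALPHA / SEGMENT_WINDOWS and the ruptures fallback):

import numpy as np

count_tokens = lambda s: max(1, len(s) // 4)  # stand-in; see the tiktoken sketch above

def boundaries(emb: np.ndarray) -> list[int]:
    # emb: (n_sentences, dim) L2-normalized embeddings; assumes >= 2 sentences
    sim = (emb[:-1] * emb[1:]).sum(axis=1)   # cosine similarity of adjacent sentences
    cut = sim.mean() - sim.std()             # valley threshold (illustrative)
    return [i + 1 for i in range(len(sim))
            if sim[i] < cut
            and (i == 0 or sim[i] <= sim[i - 1])
            and (i == len(sim) - 1 or sim[i] <= sim[i + 1])]

def pack(sentences: list[str], bounds: list[int],
         max_tokens: int = 512, overlap_tokens: int = 64) -> list[str]:
    # Greedy packing to the token budget, carrying a sentence tail as overlap.
    segs, prev = [], 0
    for b in bounds + [len(sentences)]:
        segs.append(sentences[prev:b])
        prev = b
    chunks, cur, cur_tok = [], [], 0
    for seg in segs:
        seg_tok = sum(count_tokens(s) for s in seg)
        if cur and cur_tok + seg_tok > max_tokens:
            chunks.append(" ".join(cur))
            tail, t = [], 0
            for s in reversed(cur):          # rebuild ~overlap_tokens of tail
                tail.insert(0, s)
                t += count_tokens(s)
                if t >= overlap_tokens:
                    break
            cur, cur_tok = tail, t
        cur.extend(seg)
        cur_tok += seg_tok
    if cur:
        chunks.append(" ".join(cur))
    return chunks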


Tips & notes

  • If numpy or ruptures aren't installed, auto mode still works but may degrade to simpler heuristics; uniform always works.
  • If tiktoken isn't available, token counts are approximated as len(text)//4.
  • There is no embedding cache; repeated runs will re-embed (auto mode).
  • Only public status=publish posts/pages are fetched.
  • For very large sites, consider lowering --workers if you hit rate limits.

License

MIT