# WPChunker
Pulls all public WordPress posts & pages, scrubs them to plain text (no HTML, no shortcodes, no Gutenberg comments), and turns them into retrieval-friendly chunks. Supports two segmentation strategies:

- Auto (semantic) — uses Nebius embeddings (`Qwen/Qwen3-Embedding-8B`) to cut on topic boundaries.
- Agent — uses a local Ollama LLM to choose boundaries (no embeddings needed).

Outputs one `.txt` where chunks are separated by three blank lines. Includes concurrency, progress bars, colored logs, and sensible filtering.
## Why this tool?
Made for RAG & search indexing. It:
- Finds content via the WP REST API, with XML sitemap fallback.
- Cleans aggressively: strips navigation, headers/footers, scripts/styles/iframes, Gutenberg comments (`<!-- wp:... -->`), and WordPress shortcodes (paired & self-closing).
- Splits to sentences, then:
  - Auto: computes sentence embeddings and finds semantic “valleys” (see the sketch after this list).
  - Agent: asks a local LLM (Ollama) to pick natural boundaries.
- Packs chunks to a token budget with optional overlap.
- Skips low-value text (too short, boilerplatey).
- Emits a single `.txt` with `\n\n\n` separators.
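For intuition, here is a minimal sketch of the “valley” idea behind auto mode, assuming `numpy` and a precomputed sentence-embedding matrix; `find_valley_boundaries` and its exact thresholding are illustrative, though the `alpha` knob mirrors `SEGMENT_ALPHA`:

```python
import numpy as np

def find_valley_boundaries(embeddings: np.ndarray, alpha: float = 1.0) -> list[int]:
    """Return sentence indices where a new chunk should start.

    embeddings: (n_sentences, dim) array of sentence embeddings.
    alpha: how many standard deviations below the mean similarity
           a dip must fall to count as a valley (cf. SEGMENT_ALPHA).
    """
    # Cosine similarity between each sentence and the next one.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)  # shape (n - 1,)

    # A boundary is a local minimum that dips well below the average.
    threshold = sims.mean() - alpha * sims.std()
    boundaries = []
    for i in range(1, len(sims) - 1):
        if sims[i] < threshold and sims[i] <= sims[i - 1] and sims[i] <= sims[i + 1]:
            boundaries.append(i + 1)  # the next sentence opens a new chunk
    return boundaries
```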
## Highlights
- REST API + Sitemap fallback (see the fetch sketch below)
- TXT-only ingestion for agent mode; robust HTML→TXT cleaner everywhere
- Embedding-based or LLM-based chunking
- Concurrency + pretty progress bars (Rich)
- Heuristics to drop near-empty / repetitive chunks
- One file output for simple ingestion
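For context, here is a minimal sketch of the paginated REST discovery mentioned above; `fetch_published` is a hypothetical helper, not the script’s actual function:

```python
import requests

def fetch_published(base_url: str, kind: str = "posts", per_page: int = 100) -> list[dict]:
    """Page through /wp-json/wp/v2/{posts,pages} until exhausted."""
    items, page = [], 1
    while True:
        resp = requests.get(
            f"{base_url.rstrip('/')}/wp-json/wp/v2/{kind}",
            params={"status": "publish", "per_page": per_page, "page": page},
            headers={"User-Agent": "WPTextExtractor/1.0"},
            timeout=30,
        )
        if resp.status_code == 400:  # WP signals "past the last page" with 400
            break
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        items.extend(batch)
        page += 1
    return items
```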
## Requirements
- Python 3.9+
- One of:
  - Nebius Studio API key (for `--segmenter auto` / `uniform`)
  - Ollama running locally (for `--segmenter agent`)

Optional (detected automatically):

- `tiktoken` (better token counting; else length/4 fallback)
- `numpy` (semantic mode; else uniform fallback)
- `ruptures` (change-point fallback in semantic mode)
- `syntok` (nicer sentence splitting)
- `rich` (pretty console)
- `python-dotenv` (load `.env`)
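This kind of auto-detection usually boils down to a try/except import at module load; a sketch using `tiktoken` (the same pattern applies to the other optional packages), with `count_tokens` as a hypothetical name:

```python
try:
    import tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text: str) -> int:
        return len(_enc.encode(text))
except ImportError:
    def count_tokens(text: str) -> int:
        # Documented fallback: roughly 4 characters per token.
        return max(1, len(text) // 4)
```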
## Installation

```bash
git clone https://gitgud.syszima.xyz/skeleton/WPChunker.git
cd WPChunker
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
## Configuration

Copy and edit `.env` (all values have sane defaults unless noted):

```bash
cp .env.example .env
```

```ini
# === Required for semantic (auto/uniform) modes ===
NEBIUS_API_KEY=your_nebius_api_key_here

# Embeddings config
NEBIUS_BASE_URL=https://api.studio.nebius.com/v1/embeddings
NEBIUS_EMBEDDING_MODEL=Qwen/Qwen3-Embedding-8B

# HTTP
USER_AGENT=WPTextExtractor/1.0 (+https://example.com)

# Chunking defaults
CHUNK_MAX_TOKENS=512
CHUNK_OVERLAP_TOKENS=64
MIN_CHUNK_CHARS=100
WORKERS=6
CHUNKS_FILE=chunks_all.txt

# === Agent (Ollama) mode ===
OLLAMA_HOST=http://localhost:11434
AGENTIC_MODEL=llama3.1:8b
AGENTIC_TIMEOUT=120
AGENTIC_MAX_CHUNKS=0   # 0 = auto
OLLAMA_CTX=8192
OLLAMA_TEMPERATURE=0
AGENTIC_DEBUG=0
AGENTIC_DEBUG_DIR=agentic_debug

# Segmentation tuning (semantic "auto")
SEGMENT_ALPHA=1.0
SEGMENT_WINDOWS=2,3,5
```

Note: CLI flags always override `.env`.
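One plausible way this precedence is implemented (an assumption, shown for a single flag): load `.env` first, then seed the `argparse` defaults from the environment, so an explicit CLI flag always wins:

```python
import argparse
import os

from dotenv import load_dotenv

load_dotenv()  # pull .env into os.environ

parser = argparse.ArgumentParser()
parser.add_argument(
    "--chunk-max-tokens",
    type=int,
    default=int(os.getenv("CHUNK_MAX_TOKENS", "512")),
)
args = parser.parse_args()
# args.chunk_max_tokens: CLI flag if given, else the .env value, else 512
```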
## Quick start

### Auto (semantic embeddings)

```bash
python3 wp-chunker.py https://example.com \
  -o out_dir \
  --segmenter auto \
  --chunk-max-tokens 512 \
  --chunk-overlap-tokens 64 \
  --min-chunk-chars 100 \
  --workers 6 \
  --chunks-file chunks_all.txt
```
### Agent (local LLM via Ollama, no embeddings required)

```bash
# Make sure Ollama is running and the model is pulled:
# ollama pull llama3.1:8b

python3 wp-chunker.py https://example.com \
  -o out_dir \
  --segmenter agent \
  --agentic-model llama3.1:8b \
  --ollama-host http://localhost:11434 \
  --agentic-timeout 120 \
  --chunk-max-tokens 512 \
  --chunk-overlap-tokens 64
```
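Under the hood, agent mode talks to Ollama’s HTTP API. A rough sketch of such a boundary request (the actual prompt, parsing, and retries in `wp-chunker.py` will differ):

```python
import requests

def ask_boundaries(
    sentences: list[str],
    host: str = "http://localhost:11434",
    model: str = "llama3.1:8b",
) -> str:
    numbered = "\n".join(f"{i}: {s}" for i, s in enumerate(sentences))
    prompt = (
        "Split the numbered sentences below into coherent chunks. "
        "Reply with one 'start-end' index span per line.\n\n" + numbered
    )
    resp = requests.post(
        f"{host}/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"num_ctx": 8192, "temperature": 0},
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]  # e.g. "0-4\n5-11\n..."
```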
### Uniform (even spacing; embeddings optional)

```bash
python3 wp-chunker.py https://example.com \
  -o out_dir \
  --segmenter uniform \
  --chunk-max-tokens 512 \
  --chunk-overlap-tokens 64
```
## CLI Reference (selected)

- `url` (positional) — Base URL of the WordPress site.
- `-o, --out` — Output directory (stores `posts/`, `pages/`, and the final chunks file).
- `--segmenter {auto,uniform,agent}` — Choose segmentation strategy (default: `auto`).
- `--no-sitemap-fallback` — Disable sitemap crawl if the REST API fails.
- `--nebius-key` — Nebius API key (overrides env).
- `--embedding-model` — Embedding model id.
- `--chunk-max-tokens` — Target max tokens per chunk.
- `--chunk-overlap-tokens` — Overlap tokens between adjacent chunks.
- `--min-chunk-chars` — Drop chunks shorter than this many chars.
- `--workers` — Concurrent workers for chunking.
- `--chunks-file` — Name of the single consolidated `.txt`.

Agent-specific:

- `--agentic-model` — Ollama model name/tag (e.g., `llama3.1:8b`).
- `--ollama-host` — Ollama host URL.
- `--agentic-timeout` — Seconds to wait for an LLM response.
- `--agentic-max-chunks` — Cap the number of chunks the agent may return (0 = auto).
- `--ollama-ctx` — Requested context window.
- `--ollama-temperature` — Sampling temperature (0 = deterministic).
- `--debug-agentic` / `--agentic-debug-dir` — Save agent prompts/outputs for inspection.
## Cleaning guarantees

- Removes `<script>`, `<style>`, `<noscript>`, `<iframe>`, `<svg>`, `<template>`, `<form>`, and common boilerplate containers.
- Strips Gutenberg comments like `<!-- wp:paragraph -->`.
- Strips WordPress shortcodes (paired and self-closing), e.g. `[gallery ids="..."] ... [/gallery]`, `[contact-form-7 id="..."]`, `[shortcode /]`.
- Decodes HTML entities, normalizes whitespace, collapses blank lines.
- In agent mode, content is additionally passed through a last-mile sanitizer to guarantee TXT-only ingestion before the LLM sees it.
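The shortcode rules above can be expressed with a couple of regexes; this is an illustrative sketch of the approach, not the script’s exact patterns:

```python
import re

# Gutenberg block comments: <!-- wp:paragraph --> ... <!-- /wp:paragraph -->
GUTENBERG_COMMENT = re.compile(r"<!--\s*/?wp:.*?-->", re.DOTALL)
# Paired shortcodes: [gallery ids="..."] ... [/gallery]
PAIRED_SHORTCODE = re.compile(r"\[(\w[\w-]*)[^\]]*\].*?\[/\1\]", re.DOTALL)
# Self-closing / bare shortcodes: [contact-form-7 id="..."], [shortcode /]
BARE_SHORTCODE = re.compile(r"\[\w[\w-]*[^\]]*\]")

def strip_wp_markup(text: str) -> str:
    text = GUTENBERG_COMMENT.sub("", text)
    text = PAIRED_SHORTCODE.sub("", text)
    text = BARE_SHORTCODE.sub("", text)
    return text
```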
## How it works (pipeline)

1. Fetch: REST API (`/wp-json/wp/v2/{posts,pages}`) for `status=publish`; fallback to XML sitemaps.
2. Clean: robust HTML→TXT + shortcode/Gutenberg removal.
3. Split: sentence segmentation (optionally via `syntok`).
4. Segment:
   - Auto: embeddings → similarity → valley detection (+ change-point fallback via `ruptures`).
   - Agent: send numbered sentences to Ollama; the model returns index spans.
5. Pack: respect `--chunk-max-tokens`, add tail overlap (`--chunk-overlap-tokens`); see the sketch after this list.
6. Filter:
   - Skip pages with fewer than 2 sentences.
   - Skip chunks shorter than `MIN_CHUNK_CHARS` (default 100).
   - Drop repetitive/boilerplatey segments.
7. Write: a single text file with triple-newline separators.
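In miniature, the packing step looks like this; `pack` is a hypothetical helper that uses the documented length/4 approximation in place of real token counting and omits the filtering heuristics:

```python
def pack(segments: list[str], max_tokens: int = 512, overlap_tokens: int = 64) -> list[str]:
    def toks(s: str) -> int:
        return max(1, len(s) // 4)  # crude stand-in for tiktoken

    chunks, current = [], []
    for seg in segments:
        if current and sum(map(toks, current)) + toks(seg) > max_tokens:
            chunks.append(" ".join(current))
            # Carry a short tail of the finished chunk forward as overlap.
            tail, budget = [], overlap_tokens
            for s in reversed(current):
                if toks(s) > budget:
                    break
                tail.insert(0, s)
                budget -= toks(s)
            current = tail
        current.append(seg)
    if current:
        chunks.append(" ".join(current))
    return chunks
```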
## Tips & notes

- If `numpy` or `ruptures` aren’t installed, auto mode still works but may degrade to simpler heuristics; uniform always works.
- If `tiktoken` isn’t available, token counts are approximated as `len(text)//4`.
- There is no embedding cache; repeated runs will re-embed (auto mode).
- Only public `status=publish` posts/pages are fetched.
- For very large sites, consider lowering `--workers` if you hit rate limits.
## License
MIT