Compile a markdown wiki into RAG-ready chunks for any vector database. The chunking step, not an orchestrator — bring your own vector DB.
wiki42 is an open-source Python library that compiles a folder of markdown wiki pages into RAG-ready chunks for any vector database — one chunk per page, YAML frontmatter as typed metadata, [[wikilinks]] as an outgoing edge list, multilingual E5 embeddings via Pinecone Inference or local sentence-transformers.
Output is parquet or jsonl with {id, text, embedding, metadata} dicts ready to upsert into Pinecone, FAISS, Chroma, Qdrant, Weaviate — or a plain NumPy array. The library hands you portable data; what you do with it is your call.
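Because the output is plain records, a first retrieval test needs no vector database at all. The following is a minimal sketch, not part of wiki42: it assumes a JSONL export with one `{id, text, embedding, metadata}` record per line and ranks chunks by brute-force cosine similarity in pure Python. The function names and the 3-dimensional toy vectors are illustrative assumptions; real embeddings would be 1024-dimensional.

```python
import json
import math


def load_chunks(jsonl_text):
    """Parse JSONL (one chunk dict per line), skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks, query_vec, top_k=3):
    """Rank chunks that carry an embedding by similarity to query_vec."""
    scored = [
        (cosine(c["embedding"], query_vec), c["id"])
        for c in chunks
        if c.get("embedding") is not None  # chunks compiled without embeddings are skipped
    ]
    return sorted(scored, reverse=True)[:top_k]


# Toy records standing in for a real export (hypothetical ids and vectors).
SAMPLE = "\n".join(json.dumps(r) for r in [
    {"id": "a#0", "text": "passage: alpha", "embedding": [1.0, 0.0, 0.0], "metadata": {}},
    {"id": "b#0", "text": "passage: beta", "embedding": [0.0, 1.0, 0.0], "metadata": {}},
    {"id": "c#0", "text": "passage: gamma", "embedding": None, "metadata": {}},
])

hits = search(load_chunks(SAMPLE), [0.9, 0.1, 0.0])
```

A linear scan like this is only practical for small wikis; past a few thousand chunks an index such as FAISS or a hosted vector DB is the usual choice.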
Extracted from the internal toolchain at 42rows.com, the AI sales-intelligence platform that ships agents grounded on customer-specific wikis. Open-sourced as the chunking step of a RAG pipeline. Not an orchestrator. Not a vector store. Not yet another MCP server.
wiki42 compile ./my-wiki --out chunks.parquet
wiki42 compile https://github.com/user/wiki/archive/main.zip --out chunks.jsonl --no-embed
wiki42 compile ./big-wiki --split-h2 1500 --model intfloat/multilingual-e5-large --out chunks.parquet
Three install paths. pip is the default; Docker ships a non-root image with the MCP server bundled; clone-from-source uses uv workspaces for contributors.
# Cloud embeddings (Pinecone Inference, ~150 MB install)
pip install wiki42
# Local embeddings (sentence-transformers, ~1.2 GB)
pip install "wiki42[local]"
# Both backends available at runtime
pip install "wiki42[all]"
# CLI is available after install
wiki42 --version
wiki42 compile ./my-wiki --out chunks.parquet
From a markdown wiki to a populated vector index in five lines. Same shape works with FAISS, Chroma, Qdrant, Weaviate, or a plain NumPy array.
from wiki42 import compile_wiki
from pinecone import Pinecone
# 1. Compile a markdown wiki into chunks (one per page, with E5 embeddings)
result = compile_wiki("./my-wiki/") # ~5 s on a 1000-page wiki
# 2. Upsert into your vector DB of choice
index = Pinecone(api_key="...").Index("my-wiki")
index.upsert(vectors=[
    (c["id"], c["embedding"], c["metadata"])
    for c in result.to_list()
])
# Same shape works for FAISS, Chroma, Qdrant, Weaviate, or a plain NumPy array.
Two benches on the same real 1,851-page Italian sales wiki. Same LLM (Gemini 2.5 Flash via Vertex AI), same context-budget cap, answers graded against author-written ground truth. Scripts and raw results live in benchmarks/; reproduce them or run your own.
Bench A. 5 questions whose terms literally appear in the file or segment name. This is the worst case for embeddings: grep already finds the right document by name.
| Metric | Filesystem grep | wiki42 | Δ |
|---|---|---|---|
| Input tokens to LLM | 16,354 ± 987 | | −84% |
| End-to-end latency | 10.9 s | | tie |
| Answer quality | grounded | | tie |
Bench B. 5 questions written after reading the wiki, phrased so the literal words do not appear in document titles. Retrieval has to work by meaning.
| Metric | Filesystem grep | wiki42 | Δ |
|---|---|---|---|
| Input tokens to LLM | 13,489 ± 2,916 | | −82% |
| Answer quality (0–3 × 5, max 15) | 7 / 15 | | +71% |
| Hallucinated facts | 2 | 0 | wiki42 |
When wiki42 does not help. If the wiki has 30 pages or fewer and your agent already has
filesystem access (for example, Claude Code running locally), grep -rli is faster and cheaper.
wiki42 earns its keep when (a) the wiki ships to consumers without filesystem access
(Claude Desktop, Cursor, Cline, hosted apps), (b) queries are semantic rather than
keyword-matched, or (c) the same wiki is queried many times.
Stable IDs across recompiles. Idempotent upserts. Every YAML frontmatter field on the page
lands in metadata as-is — your vector DB's metadata filter can target any of them.
{
  "id": "companies/acme-logistics#0",            // stable across recompiles
  "text": "passage: Acme Logistics S.p.A. ...",  // E5 expects this prefix
  "embedding": [0.012, -0.089, /* ... */],       // 1024-dim float, or null
  "metadata": {
    "slug": "companies/acme-logistics",
    "title": "Acme Logistics S.p.A.",
    "kind": "company",                           // any frontmatter field
    "confidence": 0.88,
    "wikilinks_out": ["segments/retail-warehouse", "products/wms-suite"],
    "char_count": 2147
  }
}
wiki42 was extracted from the internal toolchain at 42rows.com, an AI sales-intelligence platform that ships agents grounded on customer-specific wikis. We needed a clean way to turn a markdown wiki into vectors that work in any vector DB: without lock-in, without LLM-teacher chunk inflation, without losing the frontmatter signal. We open-sourced the result. If it is useful to you, a star on GitHub helps people find it; a look at what 42rows actually does helps us.
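The chunk record above carries two things that are useful even before any vector search: frontmatter fields for filtering and `wikilinks_out` as an edge list. A minimal sketch of both, working on plain dicts in that shape. The helper names `filter_chunks` and `backlinks` and the second sample record are hypothetical and are not part of wiki42's API.

```python
from collections import defaultdict

# Hypothetical records in the documented chunk shape (text and embeddings omitted).
chunks = [
    {"id": "companies/acme-logistics#0",
     "metadata": {"slug": "companies/acme-logistics", "kind": "company",
                  "confidence": 0.88,
                  "wikilinks_out": ["segments/retail-warehouse", "products/wms-suite"]}},
    {"id": "segments/retail-warehouse#0",
     "metadata": {"slug": "segments/retail-warehouse", "kind": "segment",
                  "confidence": 0.6, "wikilinks_out": []}},
]


def filter_chunks(chunks, **equals):
    """Keep chunks whose metadata matches every key=value pair given."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in equals.items())]


def backlinks(chunks):
    """Invert wikilinks_out: map each target slug to the slugs that link to it."""
    incoming = defaultdict(set)
    for c in chunks:
        for target in c["metadata"].get("wikilinks_out", []):
            incoming[target].add(c["metadata"]["slug"])
    return dict(incoming)


companies = filter_chunks(chunks, kind="company")
incoming = backlinks(chunks)
```

In a real deployment the equality filter would normally be pushed down to the vector DB's own metadata filter; the in-memory version is mainly handy for inspecting a compile result.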
No. wiki42 is the markdown→chunks step of a RAG pipeline, as a standalone library. It does not orchestrate, route, evaluate, or call LLMs. Use whatever orchestrator you want downstream — LangChain, LlamaIndex, DSPy, or none.
Token-window chunking on opinionated wikis amplifies source bias: the same claim split across 5 chunks gets retrieved 5× and over-weights downstream LLM reasoning. Author-curated pages are already coherent units of meaning. Bench B (5 hard semantic queries on a real 1,851-page Italian wiki) shows +71% answer quality vs the grep baseline and 0 hallucinated facts vs 2.
No. Pass embedding_model="intfloat/multilingual-e5-large" to run sentence-transformers locally on CPU — no API calls, no network. embedding_model=None skips embedding entirely so you can plug your own. Pinecone Inference is the default because it is the fastest for non-batched cloud workloads.
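When embedding is skipped, the missing vectors can be filled in afterwards. A minimal sketch of that step, assuming skipped chunks carry `"embedding": None` as in the record shape documented earlier. `attach_embeddings` and `toy_embed` are hypothetical names, and `toy_embed` is a hash-based placeholder with no semantic meaning; swap in a call to your own model.

```python
import hashlib


def toy_embed(text, dim=8):
    """Placeholder embedder: deterministic but semantically meaningless.

    Replace with a call to a real model; only the signature matters here.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def attach_embeddings(chunks, embed_fn):
    """Return copies of chunks with missing embeddings filled by embed_fn."""
    out = []
    for c in chunks:
        if c.get("embedding") is None:
            c = {**c, "embedding": embed_fn(c["text"])}  # copy, do not mutate the input
        out.append(c)
    return out


# Hypothetical chunks as they might look after compiling without embeddings.
raw = [
    {"id": "a#0", "text": "passage: alpha", "embedding": None, "metadata": {}},
    {"id": "b#0", "text": "passage: beta", "embedding": [0.5] * 8, "metadata": {}},
]
ready = attach_embeddings(raw, toy_embed)
```

If your model is not from the E5 family, it is worth checking whether the `text` field still carries the `passage: ` prefix and stripping it before embedding.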
Yes. The default model is E5 multilingual large (1024-dim), which covers Italian, English, German, French, Spanish, Portuguese and 90+ more languages out of the box. The published benchmark wiki is Italian.
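One detail matters at query time with E5 models: chunk text is stored with a `passage: ` prefix, and the E5 family expects the question side to carry a matching `query: ` prefix, for non-English text as well. A small sketch of prefix helpers follows; the function names are hypothetical, and the source does not say whether wiki42 ships an equivalent.

```python
def as_e5_query(question):
    """Prefix a user question the way E5 models expect for queries."""
    question = question.strip()
    return question if question.startswith("query: ") else "query: " + question


def as_e5_passage(text):
    """Prefix document text the way E5 models expect for passages."""
    text = text.strip()
    return text if text.startswith("passage: ") else "passage: " + text


query_text = as_e5_query("  Which customers run a warehouse management system? ")
```

Embed `query_text`, not the raw question, before searching the index. Omitting the prefix still returns results, but retrieval quality typically drops because the model was trained with the prefixes in place.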
Open source, MIT, Python 3.10+. Issues and pull requests are open — we read all of them.