wiki42

Compile a markdown wiki into RAG-ready chunks for any vector database. The chunking step, not an orchestrator — bring your own vector DB.

Open source · MIT · Python 3.10+ · pip install wiki42 · ~5 s on a 1000-page wiki

What it does

wiki42 is an open-source Python library that compiles a folder of markdown wiki pages into RAG-ready chunks for any vector database — one chunk per page, YAML frontmatter as typed metadata, [[wikilinks]] as an outgoing edge list, multilingual E5 embeddings via Pinecone Inference or local sentence-transformers.
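
To make the page-to-chunk mapping concrete, here is a minimal sketch of how one markdown page could be turned into one chunk. It is an illustration, not wiki42's actual parser: `page_to_chunk` and the two regular expressions are hypothetical names, and the frontmatter handling is deliberately naive (flat `key: value` lines only; real YAML needs a YAML parser).

```python
import re

# Hypothetical patterns for illustration; wiki42's real parser may differ.
FRONTMATTER_RE = re.compile(r"\A---\s*\n(.*?)\n---\s*\n?", re.DOTALL)
# Captures the slug from [[slug]], [[slug|label]] and [[slug#anchor]].
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def _coerce(value: str):
    """Best-effort typing for flat scalars (int, float, else string)."""
    value = value.strip().strip("\"'")
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def page_to_chunk(slug: str, raw: str) -> dict:
    """Build one chunk dict from one page, mirroring the documented shape."""
    metadata = {"slug": slug}
    body = raw
    match = FRONTMATTER_RE.match(raw)
    if match:
        body = raw[match.end():]
        for line in match.group(1).splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip():
                metadata[key.strip()] = _coerce(value)
    # Deduplicate link targets while preserving first-seen order.
    links = list(dict.fromkeys(s.strip() for s in WIKILINK_RE.findall(body)))
    metadata["wikilinks_out"] = links
    metadata["char_count"] = len(body)
    return {
        "id": f"{slug}#0",
        "text": f"passage: {body.strip()}",
        "embedding": None,  # filled in later by an embedding backend
        "metadata": metadata,
    }
```

Calling `page_to_chunk("companies/acme-logistics", raw_page)` on a page with `kind` and `confidence` in its frontmatter yields a dict whose `metadata` carries those fields next to `wikilinks_out`.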

Output is parquet or jsonl with {id, text, embedding, metadata} dicts ready to upsert into Pinecone, FAISS, Chroma, Qdrant, Weaviate — or a plain NumPy array. The library hands you portable data; what you do with it is your call.
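
Because the output is plain data, the standard library is enough to consume it. The sketch below assumes the JSONL file holds one chunk object per line with the keys listed above; `read_jsonl` and `split_by_embedding` are hypothetical helper names, not part of wiki42.

```python
import json
from pathlib import Path


def read_jsonl(path) -> list[dict]:
    """Load one chunk dict per non-empty line."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def split_by_embedding(chunks: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate chunks that already carry a vector from those that do not.

    Chunks compiled with --no-embed are assumed to have "embedding": null.
    """
    ready = [c for c in chunks if c.get("embedding") is not None]
    pending = [c for c in chunks if c.get("embedding") is None]
    return ready, pending
```

Chunks in `ready` can go straight to an upsert call; chunks in `pending` still need vectors from whichever embedding model you choose.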

Extracted from the internal toolchain at 42rows.com, the AI sales-intelligence platform that ships agents grounded on customer-specific wikis. Open-sourced as the chunking step of a RAG pipeline. Not an orchestrator. Not a vector store. Not yet another MCP server.

What's different

  • 01 One chunk per page, no LLM teacher. Author-curated wiki pages are the atomic unit of meaning. Token-window chunking and LLM-teacher Q&A pre-generation amplify source bias and bloat indices 10–30×. Bench B on a 1,851-page Italian wiki: +71% answer quality vs filesystem grep, and 0 hallucinated facts vs 2 for grep.
  • 02 Frontmatter as filterable metadata. [[wikilinks]] as a graph. Every YAML field on a page becomes a filterable field on its chunk — kind=="decision" AND confidence>0.8 works in any vector DB with metadata filters. [[slug]] references are parsed into metadata.wikilinks_out, an outgoing edge list ready for graph traversal.
  • 03 Bring your own vector DB. Output is parquet or jsonl. Zero lock-in. Pinecone, FAISS, Chroma, Qdrant, Weaviate, or plain NumPy — your call. Pinecone server-side embedding chunks mid-sentence and keeps vectors in Pinecone; wiki42 keeps pages intact and hands you portable dicts.
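
The filter kind=="decision" AND confidence>0.8 from point 02 can be written as a Mongo-style dict, the form Pinecone accepts in the `filter` argument of a query. The sketch below evaluates the same dict locally against chunk metadata, which is handy for NumPy or FAISS setups that have no built-in filtering. `matches` is a hypothetical helper and supports only the comparison operators shown.

```python
import operator

# Mongo-style filter; several fields in one dict are combined with AND.
DECISION_FILTER = {"kind": {"$eq": "decision"}, "confidence": {"$gt": 0.8}}

_OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata: dict, flt: dict) -> bool:
    """Return True if every condition in the filter holds for this metadata."""
    for field, conditions in flt.items():
        if field not in metadata:
            return False
        for op, expected in conditions.items():
            try:
                if not _OPS[op](metadata[field], expected):
                    return False
            except TypeError:
                # Incomparable types (e.g. a string vs a number) never match.
                return False
    return True


def filter_chunks(chunks: list[dict], flt: dict) -> list[dict]:
    """Keep only the chunks whose metadata satisfies the filter."""
    return [c for c in chunks if matches(c["metadata"], flt)]
```

Filtering before similarity search keeps the candidate set small; filtering after it is simpler but can return fewer than k results.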

Inputs

  • Local markdown directory
  • Local .zip file
  • https:// URL to a wiki zip (GitHub archives auto-unwrapped)
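
GitHub branch archives wrap every file in a single top-level folder (for example `wiki-main/`), which is why they need unwrapping. A minimal sketch of one way to do that with the standard library follows. It is not wiki42's implementation: `markdown_pages` is a hypothetical name, and deriving the slug from the relative path minus `.md` is an assumption based on the chunk example on this page.

```python
import io
import zipfile
from pathlib import PurePosixPath


def markdown_pages(zip_bytes: bytes) -> dict[str, str]:
    """Map slug -> markdown text, dropping a single wrapper folder if present."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        files = [n for n in zf.namelist() if not n.endswith("/")]
        first_parts = {PurePosixPath(n).parts[0] for n in files}
        # "Wrapped" means every file lives under the same top-level directory.
        wrapped = len(first_parts) == 1 and all(
            len(PurePosixPath(n).parts) > 1 for n in files
        )
        pages = {}
        for name in files:
            parts = PurePosixPath(name).parts
            rel = PurePosixPath(*parts[1:]) if wrapped else PurePosixPath(name)
            if rel.suffix.lower() == ".md":
                pages[rel.with_suffix("").as_posix()] = zf.read(name).decode("utf-8")
        return pages
```

The same function handles a zip made directly from a wiki folder, because the wrapper is only stripped when every entry shares it.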

Outputs

  • Parquet
  • JSONL
  • In-memory list[dict]

Example commands

wiki42 compile ./my-wiki --out chunks.parquet
wiki42 compile https://github.com/user/wiki/archive/main.zip --out chunks.jsonl --no-embed
wiki42 compile ./big-wiki --split-h2 1500 --model intfloat/multilingual-e5-large --out chunks.parquet

Install

Three install paths. pip is the default; Docker ships a non-root image with the MCP server bundled; clone-from-source uses uv workspaces for contributors.

# Cloud embeddings (Pinecone Inference, ~150 MB install)
pip install wiki42

# Local embeddings (sentence-transformers, ~1.2 GB)
pip install "wiki42[local]"

# Both backends available at runtime
pip install "wiki42[all]"

# CLI is available after install
wiki42 --version
wiki42 compile ./my-wiki --out chunks.parquet

Quick start

From a markdown wiki to a populated vector index in a few lines. The same chunk shape works with FAISS, Chroma, Qdrant, Weaviate, or a plain NumPy array.

from wiki42 import compile_wiki
from pinecone import Pinecone

# 1. Compile a markdown wiki into chunks (one per page, with E5 embeddings)
result = compile_wiki("./my-wiki/")          # ~5 s on a 1000-page wiki

# 2. Upsert into your vector DB of choice
index = Pinecone(api_key="...").Index("my-wiki")
index.upsert(vectors=[
    (c["id"], c["embedding"], c["metadata"])
    for c in result.to_list()
])

# Same shape works for FAISS, Chroma, Qdrant, Weaviate, or a plain NumPy array.
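
For the plain NumPy route, a brute-force cosine search over the chunk list is often enough at wiki scale. The sketch below assumes each chunk follows the {id, text, embedding, metadata} shape; `top_k` is a hypothetical helper. The query vector must come from the same embedding model as the chunks, and E5 models expect query text to be prefixed with "query: " before embedding.

```python
import numpy as np


def top_k(chunks: list[dict], query_vector, k: int = 3) -> list[tuple[str, float]]:
    """Return the k most similar chunk ids with their cosine similarity."""
    embedded = [c for c in chunks if c.get("embedding") is not None]
    if not embedded:
        return []
    matrix = np.asarray([c["embedding"] for c in embedded], dtype=np.float32)
    query = np.asarray(query_vector, dtype=np.float32)
    # Normalise rows and query so the dot product equals cosine similarity.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = matrix @ query
    order = np.argsort(-scores)[:k]
    return [(embedded[i]["id"], float(scores[i])) for i in order]
```

A full scan like this is linear in the number of pages, so an approximate index only starts to pay off on much larger collections.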

Benchmark

Two benches on the same real 1,851-page Italian sales wiki. Same LLM (Gemini 2.5 Flash via Vertex AI), same context-budget cap, answers graded against author-written ground truth. Scripts and raw results live in benchmarks/ — reproduce them or run your own.

Bench A — easy queries (keyword in title)

5 questions whose terms literally appear in the file or segment name. Worst case for embeddings: grep already finds the right document by name.

Metric                  Filesystem grep    wiki42    Δ
Input tokens to LLM     16,354 ± 987                 −84%
End-to-end latency      10.9 s                       tie
Answer quality          grounded                     tie

Bench B — hard semantic queries (no keyword overlap)

5 questions written after reading the wiki, phrased so the literal words do not appear in document titles. Retrieval has to work by meaning.

Metric                              Filesystem grep    wiki42    Δ
Input tokens to LLM                 13,489 ± 2,916               −82%
Answer quality (0–3 × 5, max 15)    7 / 15                       +71%
Hallucinated facts                  2                  0         wiki42

When wiki42 does not help. If the wiki has 30 pages or fewer and your agent already has filesystem access (for example, Claude Code running locally), grep -rli is faster and cheaper. wiki42 earns its keep when (a) the wiki ships to consumers without filesystem access (Claude Desktop, Cursor, Cline, hosted apps), (b) queries are semantic rather than keyword-matched, or (c) the same wiki is queried many times.

What's in each chunk

Stable IDs across recompiles. Idempotent upserts. Every YAML frontmatter field on the page lands in metadata as-is — your vector DB's metadata filter can target any of them.

{
  "id":       "companies/acme-logistics#0",       // stable across recompiles
  "text":     "passage: Acme Logistics S.p.A. ...", // E5 expects this prefix
  "embedding": [0.012, -0.089, /* ... */],        // 1024-dim float, or null
  "metadata": {
    "slug":          "companies/acme-logistics",
    "title":         "Acme Logistics S.p.A.",
    "kind":          "company",                   // any frontmatter field
    "confidence":    0.88,
    "wikilinks_out": ["segments/retail-warehouse", "products/wms-suite"],
    "char_count":    2147
  }
}
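
The `wikilinks_out` list is enough to rebuild the wiki's link graph without touching the source files. A minimal sketch, assuming every chunk carries `metadata.slug` and `metadata.wikilinks_out` as in the example above; `build_graph` and `neighbors_within` are hypothetical helpers, useful for example to pull in pages one or two links away from a retrieved hit.

```python
from collections import deque


def build_graph(chunks: list[dict]) -> dict[str, list[str]]:
    """Adjacency list slug -> outgoing slugs, merged across chunks of a page."""
    graph: dict[str, list[str]] = {}
    for chunk in chunks:
        meta = chunk["metadata"]
        targets = graph.setdefault(meta["slug"], [])
        for target in meta.get("wikilinks_out", []):
            if target not in targets:
                targets.append(target)
    return graph


def neighbors_within(graph: dict[str, list[str]], start: str, max_hops: int) -> dict[str, int]:
    """Breadth-first search: slug -> hop distance, for pages within max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, []):  # dangling links have no entry
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    seen.pop(start)
    return seen
```

Links that point to pages missing from the wiki still appear as neighbours; they simply have no outgoing edges of their own.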

Why we built this

wiki42 was extracted from the internal toolchain at 42rows.com, an AI sales-intelligence platform that ships agents grounded on customer-specific wikis. We needed a clean way to turn a markdown wiki into vectors that work in any vector DB — without lock-in, without LLM-teacher chunk inflation, without losing the frontmatter signal. We open-sourced the result. If it is useful to you, a star on GitHub helps people find it; a look at what 42rows actually does helps us.

FAQ

Is this another LangChain or LlamaIndex?

No. wiki42 is the markdown→chunks step of a RAG pipeline, as a standalone library. It does not orchestrate, route, evaluate, or call LLMs. Use whatever orchestrator you want downstream — LangChain, LlamaIndex, DSPy, or none.

Why one chunk per page instead of token-windowed chunks?

Token-window chunking on opinionated wikis amplifies source bias: the same claim split across 5 chunks gets retrieved 5× and is over-weighted in downstream LLM reasoning. Author-curated pages are already coherent units of meaning. Bench B (5 hard semantic queries on a real 1,851-page Italian wiki) shows +71% answer quality vs the grep baseline and 0 hallucinated facts vs 2.

Do I have to use Pinecone for embeddings?

No. Pass embedding_model="intfloat/multilingual-e5-large" to run sentence-transformers locally on CPU — no API calls, no network. embedding_model=None skips embedding entirely so you can plug your own. Pinecone Inference is the default because it is the fastest for non-batched cloud workloads.
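
If you skip embedding and plug in your own model, the remaining work is to fill the `embedding` field of each chunk. The sketch below shows one way to do that with any batch embedding callable. `attach_embeddings` and `toy_embed` are hypothetical names, and `toy_embed` is a deterministic stand-in with no semantic meaning, used only so the example runs without a model download.

```python
import hashlib
import math


def attach_embeddings(chunks: list[dict], embed_fn, batch_size: int = 32) -> list[dict]:
    """Return copies of the chunks with "embedding" filled by embed_fn.

    embed_fn takes a list of strings and returns one vector per string.
    """
    out = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors = embed_fn([c["text"] for c in batch])
        if len(vectors) != len(batch):
            raise ValueError("embed_fn must return one vector per input text")
        for chunk, vector in zip(batch, vectors):
            out.append({**chunk, "embedding": [float(x) for x in vector]})
    return out


def toy_embed(texts: list[str], dim: int = 8) -> list[list[float]]:
    """Stand-in embedder: hash bytes scaled to a unit vector. Demo only."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        raw = [b - 127.5 for b in digest[:dim]]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm for x in raw])
    return vectors
```

In practice `embed_fn` could wrap a sentence-transformers model, for example `SentenceTransformer(name).encode` with `normalize_embeddings=True`. Since the chunk `text` already starts with "passage: ", no extra prefix is needed for E5 models.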

Does it work with non-English wikis?

Yes. The default model is E5 multilingual large (1024-dim), which covers Italian, English, German, French, Spanish, Portuguese and 90+ more languages out of the box. The published benchmark wiki is Italian.

Where does the name "wiki42" come from?

wiki42 was extracted from the internal toolchain at 42rows.com, an AI sales-intelligence platform that ships agents grounded on customer-specific markdown wikis. We needed a clean way to turn a wiki into vectors that work in any vector DB. We open-sourced the result.

Install it. Star it. Tell us what breaks.

Open source, MIT, Python 3.10+. Issues and pull requests are open — we read all of them.