Updated 2026-03-18

Web Crawler Connector

Crawl a site or section of a site into RetainDB when a single URL is too narrow but a full unconstrained crawl would be messy.

Use the web crawler connector when you need multiple related pages from the same site and you want RetainDB to follow links for you.

This is the right tool for a docs section, help center, or small internal website. It is the wrong tool for “crawl the whole internet starting from our homepage.”

Use this connector when

- a single page is not enough
- the site has a clear boundary you can describe
- you want crawl-based discovery instead of a sitemap file

Create the source

bash

curl -X POST "https://api.retaindb.com/v1/projects/proj_123/sources" \
  -H "Authorization: Bearer $RETAINDB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Acme Docs Crawl",
    "connector_type": "web",
    "config": {
      "start_url": "https://docs.acme.com",
      "max_depth": 3,
      "allow_paths": ["/guides", "/reference"]
    }
  }'

Why `allow_paths` matters

Without a boundary, crawls get noisy fast.

Use path constraints and a reasonable `max_depth` so that you ingest the part of the site you actually want, not navigation clutter, changelogs, or irrelevant marketing pages.
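Before creating the source, it can help to check a few known URLs against the boundary you plan to use. The sketch below is a local sanity check only: `in_boundary` is a hypothetical helper, and it assumes `allow_paths` entries behave as path prefixes matched on whole segments. RetainDB's actual matching rules may differ, so treat the result as a rough preview.

```shell
# Hypothetical helper: succeeds when the URL's path equals, or sits under,
# one of the given prefixes. ASSUMES prefix matching on whole path segments.
in_boundary() {
  url=$1
  shift
  # Drop scheme and host, then any query string or fragment.
  url_path=$(printf '%s\n' "$url" | sed 's|^[a-zA-Z]*://[^/]*||; s|[?#].*$||')
  [ -n "$url_path" ] || url_path=/
  for prefix in "$@"; do
    case "$url_path" in
      "$prefix"|"$prefix"/*) return 0 ;;
    esac
  done
  return 1
}

# Preview which sample URLs would fall inside ["/guides", "/reference"].
for candidate in \
  "https://docs.acme.com/guides/install" \
  "https://docs.acme.com/reference/api" \
  "https://docs.acme.com/changelog"
do
  if in_boundary "$candidate" /guides /reference; then
    echo "keep $candidate"
  else
    echo "skip $candidate"
  fi
done
```

If pages you care about show up as skipped, widen the prefixes before the first sync rather than after.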

Start sync and monitor it

Trigger a sync for the source, then check its status. Replace src_123 with your own source ID.

bash

curl -X POST "https://api.retaindb.com/v1/sources/src_123/sync" \
  -H "Authorization: Bearer $RETAINDB_API_KEY"

bash

curl "https://api.retaindb.com/v1/sources/src_123/status" \
  -H "Authorization: Bearer $RETAINDB_API_KEY"
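
To avoid re-running the status call by hand, you could wrap it in a small polling loop. This is a minimal sketch, not an official client: `source_status` and `wait_for_sync` are hypothetical helper names, and because the shape of the status response is not shown on this page, the loop takes the "finished" marker as an argument instead of assuming a field name.

```shell
RETAINDB_API="https://api.retaindb.com/v1"

# Fetch the raw status body for one source (same endpoint as the call above).
source_status() {
  curl -fsS "$RETAINDB_API/sources/$1/status" \
    -H "Authorization: Bearer $RETAINDB_API_KEY"
}

# Poll until the response body contains DONE_MARKER.
# Returns 0 when the marker is seen, 1 if a request fails,
# and 2 if the attempts run out first.
wait_for_sync() {
  source_id=$1
  done_marker=$2        # text that signals completion in YOUR status response
  attempts=${3:-30}     # illustrative default
  delay=${4:-10}        # seconds between polls, illustrative default
  i=0
  while [ "$i" -lt "$attempts" ]; do
    body=$(source_status "$source_id") || return 1
    printf '%s\n' "$body"
    case "$body" in
      *"$done_marker"*) return 0 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  return 2
}

# Example call (the marker string here is an assumption, check a real response first):
# wait_for_sync src_123 '"status":"completed"'
```

Run the plain status call once to see what a finished sync looks like, then pass a distinctive piece of that response as the marker.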

What a good first crawl looks like

Start smaller than you think:

- one docs section
- low depth
- explicit allow paths

Once retrieval looks good, expand the crawl boundary.
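
A first crawl along those lines might look like the sketch below. It reuses the endpoint and field names from the create-source example; the specific values (a `max_depth` of 1 and a single `/guides` path) are illustrative, and `create_first_crawl` is a hypothetical wrapper name.

```shell
# Narrow starting point: one section, shallow depth, one explicit path.
# Field names match the create-source example; the values are illustrative.
FIRST_CRAWL_CONFIG='{
  "name": "Acme Docs Crawl (guides only)",
  "connector_type": "web",
  "config": {
    "start_url": "https://docs.acme.com",
    "max_depth": 1,
    "allow_paths": ["/guides"]
  }
}'

# Hypothetical wrapper: create the source in the given project.
create_first_crawl() {
  curl -X POST "https://api.retaindb.com/v1/projects/$1/sources" \
    -H "Authorization: Bearer $RETAINDB_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$FIRST_CRAWL_CONFIG"
}

# Example call:
# create_first_crawl proj_123
```

When results from this narrow crawl look right, raise `max_depth` or add a second path, one change at a time, so it stays clear which change introduced any noise.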

Common mistakes

Crawl explosion

If the site fans out quickly, the crawler may spend effort on pages you do not care about. Tighten `allow_paths` first.

Duplicate or low-value pages

If the site has many similar pages, navigation shells, or printer views, retrieval results can be noisier than expected.

Using it for JS-heavy pages

If critical content only appears after client-side rendering, the AI Browser connector may be a better fit.
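
One rough way to tell whether a page depends on client-side rendering is to fetch it without a browser and look for a phrase you expect to see. The sketch below is a heuristic only, and `in_raw_html` is a hypothetical helper name: it says nothing about how the crawler itself fetches pages, just whether the text is present in the HTML the server returns.

```shell
# Succeeds if PHRASE appears in the HTML returned by a plain HTTP GET.
# No JavaScript runs, so text injected client-side will not be found.
in_raw_html() {
  curl -fsSL "$1" | grep -qiF -- "$2"
}

# Example call (URL and phrase are placeholders):
# if in_raw_html "https://docs.acme.com/guides/install" "Installation steps"; then
#   echo "phrase found in server-rendered HTML"
# else
#   echo "phrase missing from raw HTML; the page may render it client-side"
# fi
```

Pick a phrase from the body of the page, not from the title or navigation, since those often appear in the raw HTML even when the main content does not.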

Next step

If the site already has a reliable sitemap, use the sitemap connector instead. If you only need one page, go back to the URL connector.
