Connectors

Updated 2026-03-18

Web Crawler Connector

Crawl a site or section of a site into RetainDB when a single URL is too narrow but a full unconstrained crawl would be messy.

Use the web crawler connector when you need multiple related pages from the same site and you want RetainDB to follow links for you.

This is the right tool for a docs section, help center, or small internal website. It is the wrong tool for “crawl the whole internet starting from our homepage.”

Use this connector when

  • a single page is not enough
  • the site has a clear boundary you can describe
  • you want crawl-based discovery instead of a sitemap file

Create the source

bash
curl -X POST "https://api.retaindb.com/v1/projects/proj_123/sources" \
  -H "Authorization: Bearer $RETAINDB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Acme Docs Crawl",
    "connector_type": "web",
    "config": {
      "start_url": "https://docs.acme.com",
      "max_depth": 3,
      "allow_paths": ["/guides", "/reference"]
    }
  }'

Why allow_paths matters

Without a boundary, crawls get noisy fast.

Use path constraints and a reasonable max_depth so you ingest the part of the site you actually want instead of navigation clutter, changelogs, or irrelevant marketing pages.
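
To make the effect of a boundary concrete, the sketch below filters a handful of discovered paths against an allow list. It assumes allow_paths entries are matched as whole-segment path prefixes, which this page does not confirm. The is_allowed helper and the sample paths are hypothetical and exist only for illustration; they are not part of RetainDB.

```shell
# Illustration only: assumes allow_paths entries match as whole-segment
# path prefixes. RetainDB's actual matching rules may differ.

# is_allowed PATH PREFIX... -> exit 0 if PATH equals a prefix or sits beneath it.
is_allowed() {
  candidate=$1
  shift
  for prefix in "$@"; do
    case $candidate in
      "$prefix" | "$prefix"/*) return 0 ;;
    esac
  done
  return 1
}

# Hypothetical paths a crawler might discover from the docs homepage.
for discovered in /guides/quickstart /reference/api/sources /changelog /pricing; do
  if is_allowed "$discovered" /guides /reference; then
    echo "crawl  $discovered"
  else
    echo "skip   $discovered"
  fi
done
```

Under that assumption, only the two paths beneath /guides and /reference are crawled, while /changelog and /pricing are skipped.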

Start sync and monitor it

bash
curl -X POST "https://api.retaindb.com/v1/sources/src_123/sync" \
  -H "Authorization: Bearer $RETAINDB_API_KEY"

bash
curl "https://api.retaindb.com/v1/sources/src_123/status" \
  -H "Authorization: Bearer $RETAINDB_API_KEY"
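
The status call returns a single snapshot, so monitoring a sync usually means polling. The sketch below shows one way to do that. It assumes the status response is JSON with a string field named "status" whose terminal values are "completed" and "failed"; this page does not document the response shape, so treat those names as placeholders. fetch_status, extract_state, and wait_for_sync are hypothetical helpers, not part of any RetainDB tooling.

```shell
# fetch_status SOURCE_ID -> prints the raw status response (same call as above).
fetch_status() {
  curl -fsS "https://api.retaindb.com/v1/sources/$1/status" \
    -H "Authorization: Bearer $RETAINDB_API_KEY"
}

# extract_state -> reads JSON on stdin and prints the value of a "status" string field.
# ASSUMPTION: the field name "status" is a guess. This is crude text matching;
# a real script would use a JSON parser such as jq.
extract_state() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# wait_for_sync SOURCE_ID [MAX_ATTEMPTS] [DELAY_SECONDS]
# Returns 0 on "completed", 1 on "failed", 2 on fetch error or timeout.
# ASSUMPTION: "completed" and "failed" are placeholder state names.
wait_for_sync() {
  source_id=$1
  max_attempts=${2:-30}
  delay=${3:-10}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    body=$(fetch_status "$source_id") || return 2
    state=$(printf '%s' "$body" | extract_state)
    echo "attempt $attempt: ${state:-unknown}" >&2
    case $state in
      completed) return 0 ;;
      failed) return 1 ;;
    esac
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 2
}
```

A call such as `wait_for_sync src_123 30 10` would check every 10 seconds for up to 30 attempts. Bounding the attempts keeps a stuck sync from blocking a script indefinitely.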

What a good first crawl looks like

Start smaller than you think:

  • one docs section
  • low depth
  • explicit allow paths

Once retrieval looks good, expand the crawl boundary.
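
A first crawl along these lines might look like the request below. It reuses the endpoint and fields from the create example earlier on this page. The specific values (a max_depth of 1, a single /guides path, the source name, and a start URL inside the section) are illustrative choices, not defaults documented by RetainDB, and create_starter_source is a hypothetical wrapper.

```shell
# Starter payload: one docs section, low depth, one explicit allow path.
# Values are illustrative, not documented defaults.
starter_payload='{
  "name": "Acme Docs Crawl (guides only)",
  "connector_type": "web",
  "config": {
    "start_url": "https://docs.acme.com/guides",
    "max_depth": 1,
    "allow_paths": ["/guides"]
  }
}'

# create_starter_source PROJECT_ID -> creates the narrow source in that project.
create_starter_source() {
  curl -fsS -X POST "https://api.retaindb.com/v1/projects/$1/sources" \
    -H "Authorization: Bearer $RETAINDB_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$starter_payload"
}
```

Run it as `create_starter_source proj_123`. Starting this narrow makes it easy to judge retrieval quality before adding paths or depth.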

Common mistakes

Crawl explosion

If the site fans out quickly, the crawler may spend effort on pages you do not care about. Tighten allow_paths first, then lower max_depth if the crawl is still too broad.

Duplicate or low-value pages

If the site has many near-identical pages, navigation shells, or printer views, retrieval results can get noisier than expected. Keep allow_paths narrow enough to leave those pages out.

Using it for JS-heavy pages

If critical content only appears after client-side rendering, the AI Browser connector may be a better fit.
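
One rough way to tell whether a page depends on client-side rendering is to fetch its raw HTML and look for a phrase you expect to see on the page. The sketch below assumes a plain HTTP fetch approximates what a non-rendering crawler sees; this page does not say how the RetainDB crawler handles JavaScript. html_contains and check_page are hypothetical helpers.

```shell
# html_contains PHRASE -> reads HTML on stdin; exit 0 if PHRASE appears literally.
html_contains() {
  grep -qF -- "$1"
}

# check_page URL PHRASE -> fetches URL without executing JavaScript and reports the result.
check_page() {
  if curl -fsSL "$1" | html_contains "$2"; then
    echo "found in raw HTML: a link-following crawl can likely see it"
  else
    echo "not found in raw HTML (or fetch failed): content may be rendered client-side"
  fi
}
```

For example, `check_page "https://docs.acme.com/guides" "Getting started"` checks for a phrase that is visible in the browser. The phrase here is made up; pick one from the body text of the page rather than from its title or navigation.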

Next step

If the site already has a reliable sitemap, use the sitemap connector instead. If you only need one page, go back to the URL connector.
