# AI Systems Catalog

This catalog documents known or likely AI-related crawlers, search agents, and
assistant retrieval systems. Prefer official sources. When an official source is
missing, mark the entry as unverified.

Fetched/reviewed: 2026-06-23.

## Top Five Initial Targets

### OpenAI / ChatGPT

Official source: https://platform.openai.com/docs/bots

Known agents:

- `OAI-SearchBot`: search indexing for ChatGPT search features.
- `GPTBot`: web crawling for training generative AI foundation models.
- `ChatGPT-User`: user-initiated actions in ChatGPT and Custom GPTs.
- `OAI-AdsBot`: landing page checks for ads.

Published IP sources:

- https://openai.com/searchbot.json
- https://openai.com/gptbot.json
- https://openai.com/chatgpt-user.json

Expected behavior:

- `OAI-SearchBot` and `GPTBot` should be distinguishable by user-agent and
  published IP range.
- `ChatGPT-User` is user-triggered and may not behave like a crawler.
- OpenAI states that the robots.txt controls for each agent are independent, so
  search and training access should be tested separately.
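
Testing search and training access separately can be sketched with Python's standard-library robots.txt parser. The robots.txt content, the example URL, and the helper name `robots_verdicts` are hypothetical. The result shows how Python's parser reads the file, which is not necessarily how OpenAI's crawlers interpret it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of training crawls, keep search indexing.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def robots_verdicts(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, per agent token, whether robots_txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


verdicts = robots_verdicts(
    ROBOTS_TXT,
    "https://example.com/docs/page",  # placeholder URL
    ["GPTBot", "OAI-SearchBot", "ChatGPT-User"],
)
for agent, allowed in verdicts.items():
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

An agent with no matching group, such as `ChatGPT-User` here, falls through to "allowed" in this parser, so each token needs its own explicit rule if it is meant to be restricted.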

### Google / Gemini / Google Search

Official sources:

- https://developers.google.com/crawling/docs/crawlers-fetchers/overview-google-crawlers
- https://developers.google.com/crawling/docs/crawlers-fetchers/verify-google-requests

Known categories:

- Common crawlers such as Googlebot.
- Special-case crawlers such as AdsBot.
- User-triggered fetchers.

Published IP sources:

- https://developers.google.com/static/crawling/ipranges/common-crawlers.json
- https://developers.google.com/static/crawling/ipranges/special-crawlers.json
- https://developers.google.com/static/crawling/ipranges/user-triggered-fetchers.json
- https://developers.google.com/static/crawling/ipranges/user-triggered-fetchers-google.json
- https://developers.google.com/static/crawling/ipranges/user-triggered-agents.json

Expected behavior:

- Google distinguishes automatic crawlers from user-triggered fetchers.
- Google recommends verification by user-agent, source IP, reverse DNS, and
  forward DNS.
- Google crawlers may use HTTP/1.1 or HTTP/2 and may use caching headers such
  as ETag and If-None-Match.
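
The reverse-then-forward DNS check can be sketched as below. The hostname suffixes are an assumption based on Google's verification guidance and should be confirmed against the official page. The function names are hypothetical, and the resolvers are injectable so the logic can be exercised offline with documentation-range addresses.

```python
import ipaddress
import socket
from typing import Callable, Iterable

# Assumption: suffixes to accept; confirm against Google's verification docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def reverse_lookup(ip: str) -> str:
    """PTR lookup: return the primary hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> set[str]:
    """A/AAAA lookup: return every address the hostname resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def verify_by_dns(
    ip: str,
    suffixes: Iterable[str] = GOOGLE_SUFFIXES,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], set[str]] = forward_lookup,
) -> bool:
    """True only if the PTR name has an accepted suffix and resolves back to ip."""
    try:
        claimed = ipaddress.ip_address(ip)
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(tuple(suffixes)):
            return False
        resolved = {ipaddress.ip_address(addr) for addr in forward(host)}
    except (OSError, ValueError):
        # Lookup failures and unparsable addresses count as "not verified".
        return False
    return claimed in resolved


# Offline demonstration with fake resolvers and documentation-range IPs.
def fake_reverse(ip: str) -> str:
    return "crawl-192-0-2-10.googlebot.com"  # hypothetical PTR name


def fake_forward(host: str) -> set[str]:
    return {"192.0.2.10"}


print(verify_by_dns("192.0.2.10", reverse=fake_reverse, forward=fake_forward))
print(verify_by_dns("198.51.100.7", reverse=fake_reverse, forward=fake_forward))
```

The second call fails because the forward lookup does not return the requesting address, which is the case this check is designed to catch when a PTR record is forged.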

### Microsoft / Bing / Copilot

Official sources:

- https://www.bing.com/webmasters/help/which-crawlers-does-bing-use-8c184ec0
- https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26
- https://www.bing.com/toolbox/bingbot.json

Known agents:

- `bingbot`
- `BingPreview`
- `adidxbot`

Expected behavior:

- Copilot web citations are generally expected to depend on Bing indexing or
  Bing-backed retrieval rather than a separate public Copilot crawler identity.
- Verification should combine user-agent and Microsoft-published IP ranges.

### Anthropic / Claude

Official source:

- https://support.claude.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler

Known agents:

- `ClaudeBot`: model-development crawling.
- `Claude-User`: user-initiated Claude retrieval.
- `Claude-SearchBot`: search quality and search response crawling.

Published IP source:

- https://claude.com/crawling/bots.json

Expected behavior:

- Anthropic separates model training, user retrieval, and search optimization.
- Anthropic supports robots.txt and the non-standard `Crawl-delay` directive.
- A request from an IP in the published list indicates Anthropic origin, but
  bot role still needs user-agent and behavior evidence.
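
As a rough sketch of collecting behavior evidence, the declared `Crawl-delay` can be read with Python's standard-library parser and compared against observed request times. The robots.txt content, the timestamps, and the helper `short_gaps` are hypothetical. Python's parser is used here only as a stand-in and may not match how Anthropic's crawlers read the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a per-agent Crawl-delay.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Crawl-delay: 10
Disallow: /private/

User-agent: Claude-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns the declared delay in seconds, or None if unset.
claudebot_delay = parser.crawl_delay("ClaudeBot")
searchbot_delay = parser.crawl_delay("Claude-SearchBot")


def short_gaps(timestamps: list[float], delay: float) -> list[float]:
    """Return the gaps between consecutive requests that are shorter than delay."""
    ordered = sorted(timestamps)
    return [later - earlier
            for earlier, later in zip(ordered, ordered[1:])
            if later - earlier < delay]


# Hypothetical request times (seconds) for one agent taken from server logs.
observed = [0.0, 10.0, 15.0, 30.0]
if claudebot_delay is not None:
    print("declared delay:", claudebot_delay)
    print("gaps shorter than declared delay:", short_gaps(observed, claudebot_delay))
```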

### Perplexity

Official source:

- https://docs.perplexity.ai/docs/resources/perplexity-crawlers

Known agents:

- `PerplexityBot`: search result discovery and linking.
- `Perplexity-User`: user-requested page fetches.

Published IP sources:

- https://www.perplexity.com/perplexitybot.json
- https://www.perplexity.com/perplexity-user.json

Expected behavior:

- `PerplexityBot` should respect robots.txt for indexing/discovery.
- `Perplexity-User` is user-triggered and Perplexity says it generally ignores
  robots.txt rules.
- Identification should combine source IP and user-agent.
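
A minimal sketch of combining the two signals follows. The JSON schema (`prefixes` entries with `ipv4Prefix` or `ipv6Prefix`) is an assumption about the published files and should be checked against the live documents. The sample prefixes are documentation ranges, and the function names and verdict labels are hypothetical.

```python
import ipaddress
import json

# Assumed schema, with documentation-range prefixes standing in for real data.
SAMPLE_JSON = """
{"prefixes": [
  {"ipv4Prefix": "192.0.2.0/24"},
  {"ipv6Prefix": "2001:db8::/32"}
]}
"""


def load_networks(doc: str) -> list:
    """Parse a published-ranges document into ip_network objects."""
    networks = []
    for item in json.loads(doc).get("prefixes", []):
        cidr = item.get("ipv4Prefix") or item.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def classify(user_agent: str, ip: str, token: str, networks: list) -> str:
    """Combine a user-agent token match with a published-range match."""
    ua_match = token.lower() in user_agent.lower()
    addr = ipaddress.ip_address(ip)
    ip_match = any(addr in net for net in networks)
    if ua_match and ip_match:
        return "verified"
    if ua_match:
        return "ua-only"  # claims the token, but the IP is outside the ranges
    if ip_match:
        return "ip-only"  # IP is in the ranges, but the token is absent
    return "no-match"


networks = load_networks(SAMPLE_JSON)
# Hypothetical user-agent string containing the documented token.
ua = "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
print(classify(ua, "192.0.2.44", "PerplexityBot", networks))
print(classify(ua, "198.51.100.9", "PerplexityBot", networks))
```

Keeping the "ua-only" and "ip-only" outcomes distinct, rather than collapsing them into a single failure, preserves the evidence needed to decide later whether a request was spoofed or simply mislabeled.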

## Other Major Systems To Add

### Apple / Applebot

Official source: https://support.apple.com/en-us/119829

Known agents:

- `Applebot`
- `Applebot-Extended`
- `iTMS`

Notes:

- Apple says Applebot data powers Spotlight, Siri, and Safari, and may be used
  to train Apple foundation models.
- `Applebot-Extended` is a control signal for training use and does not crawl
  pages itself.
- Apple publishes IP CIDRs at https://search.developer.apple.com/applebot.json.

### Meta

Official source: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers

Known agents (from a secondary search result, not yet confirmed officially):

- `meta-externalagent`
- `meta-externalfetcher`
- `facebookexternalhit`

Status:

- Needs confirmation against the official page. The page did not render
  through the lightweight fetch during this pass.

### ByteDance

Known agents:

- `Bytespider`

Status:

- Needs official source verification.

### Common Crawl

Known agents:

- `CCBot`

Status:

- Important because many AI data pipelines use Common Crawl datasets, although
  Common Crawl is not itself an AI assistant. Needs separate handling.

### Brave Search

Known agents:

- Brave Search bot identities need official confirmation.

Status:

- Relevant because Brave Search can be used by AI answer engines.

### You.com

Known agents:

- `YouBot`

Status:

- Needs official source verification.

### Cohere, Mistral, xAI, Amazon, Databricks, Baidu, Yandex

Status:

- Track as potential targets. Add entries only once official documentation or
  observed traffic supports them.
