# Signal Catalog

## Strong Signals

### Server HTML Request

Present when any system fetches an experiment page.

Useful for:

- HTTP-only crawlers
- non-rendering AI retrieval
- search indexers

Limitations:

- no proof of rendering
- the `User-Agent` header may be spoofed

### Subresource Requests

Present when a system fetches images, CSS, scripts, or JSON.

Useful for:

- distinguishing renderers from HTML-only fetchers
- inferring a client's resource-loading policy (which asset types it fetches and which it skips)

Limitations:

- some crawlers fetch selected resources without executing scripts

### Client Runtime Events

Present only when JavaScript executes.

Useful for:

- browser-backed agents
- humans
- headless browsers

Limitations:

- humans can block JS
- bots can synthesize events
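
The three strong signals nest: a runtime event implies script execution, which implies more than a bare HTML fetch. A minimal sketch of collapsing them into one coarse tier follows. The `FetchTier` enum and the `fetch_tier` function are hypothetical names used only for illustration, and the tier says nothing about whether the client is a human or a bot.

```python
from enum import Enum


class FetchTier(Enum):
    """Highest level of page loading observed for one visit (hypothetical naming)."""
    NONE = "none"
    HTML_ONLY = "html_only"
    SUBRESOURCES = "subresources"
    JS_EXECUTED = "js_executed"


def fetch_tier(saw_html: bool, saw_subresource: bool, saw_runtime_event: bool) -> FetchTier:
    # Check the strongest evidence first: a client runtime event only
    # exists if JavaScript actually ran.
    if saw_runtime_event:
        return FetchTier.JS_EXECUTED
    # Subresource fetches without runtime events suggest a renderer or a
    # crawler that downloads selected assets without executing scripts.
    if saw_subresource:
        return FetchTier.SUBRESOURCES
    if saw_html:
        return FetchTier.HTML_ONLY
    return FetchTier.NONE
```

For example, `fetch_tier(True, True, False)` yields `FetchTier.SUBRESOURCES`, which matches the "fetches selected resources without executing scripts" case listed above.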

## Medium Signals

### User-Agent

Well-behaved crawlers usually identify themselves in the `User-Agent` header.

Examples:

- `GPTBot`
- `ChatGPT-User`
- `OAI-SearchBot`
- `ClaudeBot`
- `Claude-User`
- `PerplexityBot`
- `Google-Extended` (a robots.txt control token; Google does not send it as a request User-Agent)
- `GoogleOther`
- `bingbot`
- `Bytespider`
- `CCBot`

Limitations:

- trivial to spoof
- product names change
- some agents use generic browser UAs
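
A minimal sketch of token matching against the examples above. The names `KNOWN_CRAWLER_TOKENS` and `match_crawler_token` are hypothetical, and the list is only a snapshot that would need upkeep as product names change. A match is a claim made by the client, not proof of identity.

```python
import re
from typing import Optional

# Tokens taken from the examples in this section. `Google-Extended` is left
# out because it is a robots.txt control token and does not appear in
# request headers.
KNOWN_CRAWLER_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-User",
    "PerplexityBot", "GoogleOther", "bingbot", "Bytespider", "CCBot",
]

# One case-insensitive alternation; re.escape guards against special characters.
_TOKEN_RE = re.compile(
    "|".join(re.escape(token) for token in KNOWN_CRAWLER_TOKENS),
    re.IGNORECASE,
)


def match_crawler_token(user_agent: Optional[str]) -> Optional[str]:
    """Return the canonical token found in the UA string, or None."""
    if not user_agent:
        return None
    found = _TOKEN_RE.search(user_agent)
    if found is None:
        return None
    # Map the matched text back to the spelling used in the list above.
    lowered = found.group(0).lower()
    return next(t for t in KNOWN_CRAWLER_TOKENS if t.lower() == lowered)
```

Returning the token rather than a boolean keeps the finding explainable: the log can record which product name the client claimed.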

### IP / ASN / Reverse DNS

Useful for detecting cloud, datacenter, and product-owned networks.

Limitations:

- requires enrichment data
- VPNs and corporate networks can look similar
- crawlers can proxy through many networks
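
A minimal sketch of checking a client address against published crawler ranges, using only the standard-library `ipaddress` module. The `PUBLISHED_CRAWLER_RANGES` table is hypothetical: its CIDR blocks are reserved documentation ranges, not real crawler networks, and a real deployment would load the lists that operators publish.

```python
import ipaddress
from typing import Optional

# Hypothetical enrichment data. These are documentation-only address blocks
# (RFC 5737 and RFC 3849), used here as placeholders.
PUBLISHED_CRAWLER_RANGES = {
    "example-crawler": ["192.0.2.0/24", "2001:db8::/32"],
}

# Parse the CIDR strings once at import time.
_NETWORKS = [
    (owner, ipaddress.ip_network(cidr))
    for owner, cidrs in PUBLISHED_CRAWLER_RANGES.items()
    for cidr in cidrs
]


def match_published_range(ip: str) -> Optional[str]:
    """Return the owner whose published range contains `ip`, or None."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed addresses (for example a garbled forwarded header) match nothing.
        return None
    for owner, network in _NETWORKS:
        # Compare only like with like: IPv4 against IPv4, IPv6 against IPv6.
        if addr.version == network.version and addr in network:
            return owner
    return None
```

Reverse DNS and ASN lookups need network access or an external database, so they are left out of this sketch.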

### Fetch Metadata Headers

Fetch metadata headers such as `sec-fetch-site`, `sec-fetch-mode`, and
`sec-fetch-dest`, together with the `sec-ch-ua` client hint, can reveal
browser behavior.

Limitations:

- absent in many non-browser clients
- can be missing in privacy-hardened browsers
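
A minimal sketch of summarizing these headers for one request. The function name `summarize_fetch_metadata` and the keys of the returned dict are assumptions made for illustration. Absence of the headers is recorded, not treated as a verdict, given the limitations above.

```python
from typing import Mapping

FETCH_METADATA_HEADERS = ("sec-fetch-site", "sec-fetch-mode", "sec-fetch-dest")


def summarize_fetch_metadata(headers: Mapping[str, str]) -> dict:
    """Report which browser-style headers a request carried."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value.strip().lower() for name, value in headers.items()}
    present = [h for h in FETCH_METADATA_HEADERS if h in lowered]
    return {
        "fetch_metadata_present": present,
        "has_client_hints": "sec-ch-ua" in lowered,
        # Browsers mark an address-bar or link navigation this way.
        "top_level_navigation": (
            lowered.get("sec-fetch-mode") == "navigate"
            and lowered.get("sec-fetch-dest") == "document"
        ),
    }
```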

## Weak Signals

### Mouse Movement

Real users often move the pointer, but the absence of movement does not prove that a client is a bot.

Use as:

- positive human-likelihood signal
- browser-capability signal

Do not use as:

- sole bot classifier

### Timing

Fast sequential page hits may indicate crawler traversal.

Limitations:

- network batching and caches distort timing
- humans can open many tabs
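
A minimal sketch of a timing check over the page-hit timestamps of one client. The thresholds `max_median_gap` and `min_hits` are placeholder assumptions, not tuned values. The median is used so that one long pause does not hide an otherwise rapid traversal.

```python
from statistics import median
from typing import List, Sequence


def inter_request_gaps(timestamps: Sequence[float]) -> List[float]:
    """Gaps in seconds between consecutive hits, after sorting by time."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def looks_like_fast_traversal(
    timestamps: Sequence[float],
    max_median_gap: float = 0.5,  # assumed threshold, in seconds
    min_hits: int = 5,            # assumed minimum sample size
) -> bool:
    """True when enough page hits arrive with a small median gap."""
    if len(timestamps) < min_hits:
        # Too few hits to say anything about pacing.
        return False
    return median(inter_request_gaps(timestamps)) <= max_median_gap
```

Because a human opening many tabs can produce the same pattern, the result fits best as a weak input to a score, not as a label on its own.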

## Recommended Scoring Model

Start with additive, explainable scoring:

- known AI UA: +0.5 bot confidence
- known HTTP library UA: +0.4 automation confidence
- HTML request with no subresource fetch within the following 10 seconds: +0.25 crawler confidence
- subresource fetches but no JS event: +0.15 renderer/non-JS confidence
- JS load plus pointer/scroll/focus: +0.35 human/browser confidence
- datacenter ASN: +0.2 automation confidence
- published crawler IP match: +0.5 bot confidence

Store scores as versioned heuristics so findings can say which classifier
version produced the label.
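
A minimal sketch of this model, assuming each confidence label keeps its own running total. The signal names, the label keys, and the `CLASSIFIER_VERSION` string are hypothetical; only the weights come from the list above. No cap is applied to the totals, which is also an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical version tag; bump it whenever a rule or weight changes.
CLASSIFIER_VERSION = "heuristic-v1"

# signal name -> (confidence label, weight). Weights mirror the list above.
RULES: Dict[str, Tuple[str, float]] = {
    "known_ai_ua": ("bot", 0.5),
    "known_http_library_ua": ("automation", 0.4),
    "html_without_subresources": ("crawler", 0.25),
    "subresources_without_js": ("renderer_non_js", 0.15),
    "js_with_interaction": ("human_browser", 0.35),
    "datacenter_asn": ("automation", 0.2),
    "published_crawler_ip": ("bot", 0.5),
}


def score(signals: Iterable[str]) -> dict:
    """Sum rule weights per label and report which rules fired."""
    totals: Dict[str, float] = defaultdict(float)
    fired = []
    # Deduplicate and sort so the output is deterministic.
    for name in sorted(set(signals)):
        if name not in RULES:
            continue  # unknown signals are ignored, not guessed at
        label, weight = RULES[name]
        totals[label] += weight
        fired.append(name)
    return {
        "version": CLASSIFIER_VERSION,
        # Round to hide floating-point noise from repeated addition.
        "scores": {label: round(total, 2) for label, total in totals.items()},
        "fired": fired,
    }
```

Keeping `fired` next to `version` lets a finding explain both which rules contributed and which rule set was in force when the label was produced.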
