# Finding 001: AI Bot Identification Requires Multi-Signal Evidence

## Status

Proposed.

## Summary

The first version of this project should not identify an AI system from any
single request property. Official provider documentation shows that major
providers operate multiple agents for different purposes: automatic search
crawlers, training crawlers, user-triggered fetchers, and preview/fetch tools.
The tracker must therefore store raw evidence and classify each event with a
confidence score.

## Hypothesis

A reliable AI tracker needs to combine user-agent, IP range, reverse DNS, server
request behavior, subresource behavior, JavaScript capability, crawl depth, and
prompt-time correlation.
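
One way to read this hypothesis is as a weighted sum over the listed signals.
The sketch below is illustrative only: the signal names mirror the list above,
but the weights, the `combine_signals` function, and the 0-to-1 strength scale
are assumptions, not the project's actual classifier. It is written in Python
for brevity, although the project itself runs on Node.

```python
# Hypothetical weights: illustrative values only, not the project's tuned numbers.
SIGNAL_WEIGHTS = {
    "user_agent": 0.30,
    "ip_range": 0.25,
    "reverse_dns": 0.15,
    "request_behavior": 0.10,
    "subresource_behavior": 0.05,
    "javascript_capability": 0.05,
    "crawl_depth": 0.05,
    "prompt_time_correlation": 0.05,
}


def combine_signals(signals: dict[str, float]) -> float:
    """Return a weighted score in [0, 1] from per-signal strengths in [0, 1].

    Missing signals contribute nothing, so a single matching property
    cannot reach a high total on its own.
    """
    total = 0.0
    for name, weight in SIGNAL_WEIGHTS.items():
        # Clamp each strength so one malformed input cannot dominate the score.
        strength = min(max(signals.get(name, 0.0), 0.0), 1.0)
        total += weight * strength
    return round(total, 4)
```

Under these assumed weights, a request that matches only on user-agent is
capped at 0.3, which reflects the point that no single property should decide
the classification.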

## Test Setup

- Project version: `0.1.0`
- Classifier version: `0.1.0`
- Test URL: `/lab/root`
- Triggering action: local `curl` smoke test
- Timestamp window: immediate
- Environment: local Node server on port `8787`

## Raw Evidence

Local smoke tests produced server-side `server_page` events for `/lab/root`
with `curl/8.7.1`. The classifier initially recorded evidence for `curl` but
classified the event as `unknown`, which indicated that the score threshold and
entity types needed refinement.

## Expected Result

An HTTP client user-agent such as `curl` should not be labeled as a specific AI
system, but it should be classified as a likely automation or HTTP client.

## Observed Result

After the initial `unknown` result, the classifier was adjusted so that
automation clients are classified as `likely_automation_client` or
`likely_ai_or_automation`, depending on total score.
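
The score-to-label step could look roughly like the sketch below. The three
labels come from this finding; the threshold values, the ordering (a higher
score maps to `likely_ai_or_automation`), and the `label_for_score` name are
assumptions made for illustration, since the real cut-offs are not recorded
here.

```python
# Hypothetical cut-offs; this finding does not state the real threshold values.
AUTOMATION_THRESHOLD = 0.2
AI_OR_AUTOMATION_THRESHOLD = 0.5


def label_for_score(total_score: float) -> str:
    """Map a total evidence score to one of the labels used in this finding."""
    # Assumed ordering: stronger combined evidence yields the broader AI label.
    if total_score >= AI_OR_AUTOMATION_THRESHOLD:
        return "likely_ai_or_automation"
    if total_score >= AUTOMATION_THRESHOLD:
        return "likely_automation_client"
    return "unknown"
```

Neither label names a specific AI system, which matches the expected result
for a generic HTTP client such as `curl`.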

## Interpretation

This finding supports a broad rule: identity, network, and behavior evidence
should be stored separately. A system can be automated without being AI, and a
system can be AI-backed without revealing a branded crawler user-agent.
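
A minimal sketch of that separation, assuming a record shape that is not taken
from the project's code: all class and field names below are hypothetical, and
the loopback IP is assumed from the local test environment. Only the
`curl/8.7.1` user-agent, the `/lab/root` path, and the `server_page` event type
come from this finding.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class IdentityEvidence:
    user_agent: str  # what the client claims to be


@dataclass
class NetworkEvidence:
    ip: str
    reverse_dns: Optional[str] = None  # None until a lookup has been recorded


@dataclass
class BehaviorEvidence:
    path: str
    event_type: str
    executed_javascript: bool = False
    subresources_fetched: int = 0


@dataclass
class EvidenceRecord:
    # Raw evidence is kept in three independent groups...
    identity: IdentityEvidence
    network: NetworkEvidence
    behavior: BehaviorEvidence
    # ...and the classification is stored beside it, so it can be recomputed.
    classification: str = "unknown"
    score: float = 0.0


record = EvidenceRecord(
    identity=IdentityEvidence(user_agent="curl/8.7.1"),
    network=NetworkEvidence(ip="127.0.0.1"),  # assumed loopback address
    behavior=BehaviorEvidence(path="/lab/root", event_type="server_page"),
)
```

Keeping the classification outside the evidence groups means a later classifier
version can relabel old events without touching what was observed.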

## Limitations

- This finding is based on local smoke testing and official documentation, not
  on live observation of the five AI systems named in the proposed next test.
- The confidence model is still heuristic and must be validated by repeated
  tests.

## Proposed Next Test

Run the same `/lab/root` URL through ChatGPT, Claude, Perplexity, Gemini, and
Copilot after the site is reachable from the public internet. Compare server
HTML requests, resource fetches, JavaScript events, IP enrichment, and prompt
timestamp correlation.
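
Prompt timestamp correlation can be sketched as a simple window check: given
the time a prompt was submitted, keep only the requests that arrive shortly
afterwards. The `requests_in_window` helper and the 30-second default are
assumptions for illustration; a suitable window would need to be measured
during the live test.

```python
from datetime import datetime, timedelta, timezone


def requests_in_window(
    prompt_time: datetime,
    request_times: list[datetime],
    window_seconds: float = 30.0,  # hypothetical default, not a measured value
) -> list[datetime]:
    """Return request timestamps within window_seconds after the prompt."""
    window = timedelta(seconds=window_seconds)
    # Requests that precede the prompt cannot have been triggered by it.
    return [
        t for t in request_times
        if timedelta(0) <= t - prompt_time <= window
    ]


prompt = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
observed = [
    prompt - timedelta(seconds=10),   # before the prompt: excluded
    prompt + timedelta(seconds=5),    # inside the window: kept
    prompt + timedelta(seconds=120),  # too late: excluded
]
matches = requests_in_window(prompt, observed)
```

A match inside the window is only one signal among several. On a public site,
unrelated traffic can land in the same window by coincidence.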

## Sources

- OpenAI crawler roles: https://platform.openai.com/docs/bots
- Anthropic crawler roles: https://support.claude.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
- Google crawler/fetcher categories: https://developers.google.com/crawling/docs/crawlers-fetchers/overview-google-crawlers
- Google verification method: https://developers.google.com/crawling/docs/crawlers-fetchers/verify-google-requests
- Perplexity crawler roles: https://docs.perplexity.ai/docs/resources/perplexity-crawlers
- Bing IP source: https://www.bing.com/toolbox/bingbot.json
