# Project Plan

## Mission

Build a local-first research system that can document how AI crawlers, AI search
agents, assistant browsing tools, and ordinary browsers retrieve websites.

The end goal is a server-side AI tracker that is reliable because it preserves
raw evidence and explains why each classification was made.

## Principles

- Do not classify from one signal. Store raw evidence and compute confidence.
- Separate crawler, AI assistant browsing, search indexer, HTTP client,
  browser-like agent, and human browser.
- Track absence of behavior: HTML-only fetch, no resources, no JavaScript,
  no depth traversal.
- Compare server-side logs, tracking-pixel hits, and first-party on-page
  analytics events for every standard test attempt.
- Treat consent dialogs, cookie banners, stale DNS, IPv4/IPv6 mismatches, and
  cache artifacts as research signals when they change observed AI behavior.
- Add network enrichment: ASN, reverse DNS, published crawler IP ranges,
  datacenter/VPN/proxy hints.
- Keep findings reproducible with exact URL, prompt, timestamp window, raw
  evidence, expected result, observed result, and limitations.
- Treat mouse movement as a human-likelihood hint, not proof.
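
As a rough illustration of the first principle, the sketch below combines several boolean signals into a confidence score and keeps the reasons alongside the label. The signal names, weights, and thresholds are hypothetical placeholders, not values this project has settled on.

```python
from dataclasses import dataclass, field


@dataclass
class Classification:
    label: str
    confidence: float
    reasons: list = field(default_factory=list)


# Hypothetical signal names and weights, chosen only for illustration.
SIGNAL_WEIGHTS = {
    "ua_matches_known_crawler": 0.3,
    "ip_in_published_range": 0.4,
    "html_only_fetch": 0.2,
    "no_js_execution": 0.1,
}


def classify(signals: dict) -> Classification:
    """Combine boolean signals into a score and record why each point was added."""
    score = 0.0
    reasons = []
    for name, weight in SIGNAL_WEIGHTS.items():
        if signals.get(name):
            score += weight
            reasons.append(f"{name} (+{weight})")
    score = round(score, 2)
    # A single signal is never enough to assign a crawler label.
    if len(reasons) < 2:
        return Classification("unclassified", score, reasons)
    label = "likely_ai_crawler" if score >= 0.6 else "possible_ai_crawler"
    return Classification(label, score, reasons)
```

Keeping `reasons` next to the score is what lets a dashboard explain a classification instead of only stating it.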

## Phases

### Phase 1: Local Evidence Harness

Status: in progress.

- Serve observed pages locally.
- Log server-side HTML requests.
- Log subresource requests.
- Log client capability events.
- Store events locally in `data/events.json`.
- Show raw event records in a dashboard.
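
A minimal sketch of the local event store, assuming `data/events.json` holds a single JSON array. The field names (`received_at`, `kind`, `path`) are illustrative assumptions; the real harness may use a different schema.

```python
import json
import time
from pathlib import Path


def append_event(event: dict, path: Path = Path("data/events.json")) -> list:
    """Append one raw event to a JSON array on disk and return the full list."""
    path.parent.mkdir(parents=True, exist_ok=True)
    events = json.loads(path.read_text()) if path.exists() else []
    # Stamp the server-side receive time; the caller's fields are stored unmodified.
    events.append({"received_at": time.time(), **event})
    path.write_text(json.dumps(events, indent=2))
    return events
```

Rewriting the whole file on every event is simple but not safe under concurrent writers; an append-only JSON Lines file would be one alternative if that becomes a problem.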

### Phase 2: Known AI System Catalog

Status: started.

- Maintain an official-source catalog of crawler names, user agents, IP range
  endpoints, robots behavior, and expected fetch behavior.
- Begin with the top five systems: OpenAI, Google, Microsoft, Anthropic, and
  Perplexity.
- Expand to Apple, Meta, ByteDance, Common Crawl, Brave, You.com, Cohere,
  Mistral, xAI, Amazon, and other relevant agents.

### Phase 3: Network Enrichment

Status: started.

- Fetch and cache official IP range JSON where available.
- Add CIDR matching for incoming request IPs.
- Add reverse DNS and forward DNS verification.
- Add ASN lookup and datacenter classification.
- Record enrichment version and source timestamp on each event.
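
The CIDR matching and reverse/forward DNS steps could look roughly like the sketch below, which uses only the Python standard library. The `ranges` structure and both function names are assumptions; the resolver functions are injectable so the check can be exercised without network access.

```python
import ipaddress
import socket
from typing import Optional


def match_cidr(ip: str, ranges: dict) -> Optional[str]:
    """Return the first operator whose published CIDR list contains ip."""
    addr = ipaddress.ip_address(ip)
    for operator, cidrs in ranges.items():
        for cidr in cidrs:
            network = ipaddress.ip_network(cidr, strict=False)
            if addr.version == network.version and addr in network:
                return operator
    return None


def forward_confirmed_rdns(
    ip: str,
    reverse=socket.gethostbyaddr,
    forward=socket.getaddrinfo,
) -> Optional[str]:
    """Return the PTR hostname only if it resolves forward to the same IP."""
    try:
        hostname = reverse(ip)[0]
        infos = forward(hostname, None)
    except OSError:  # covers socket.herror and socket.gaierror
        return None
    # Plain string comparison; differently formatted IPv6 text would not match.
    forward_ips = {info[4][0] for info in infos}
    return hostname if ip in forward_ips else None
```

A hostname returned here still has to be checked against the operator's documented domain suffix before it counts as evidence.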

### Phase 4: Experiment Matrix

Status: planned.

- Test each AI system with standard prompts and URLs.
- Run the same prompt families across ChatGPT, Claude, Gemini, Perplexity,
  Copilot/Bing, and, once access is available, other major systems.
- Use OpenRouter as a model-access layer for broad prompt-response experiments.
- Compare direct page requests, resource fetches, JavaScript execution, crawl
  depth, robots behavior, and timing.
- Add cookie-consent and interstitial variants to measure whether AI retrieval
  accepts, rejects, ignores, or gets blocked by consent UI.
- Store results as findings.
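
One way to enumerate the matrix is a plain cross product of systems, prompt families, and page variants. The prompt-family and variant names below are hypothetical; only the system names come from this plan.

```python
from itertools import product

SYSTEMS = ["ChatGPT", "Claude", "Gemini", "Perplexity", "Copilot/Bing"]
PROMPT_FAMILIES = ["summarize_url", "answer_from_url"]  # hypothetical names
PAGE_VARIANTS = ["plain", "cookie_consent", "interstitial"]  # hypothetical names


def slug(name: str) -> str:
    """Make a system name safe to embed in an attempt id."""
    return name.lower().replace("/", "-")


def build_matrix(systems=SYSTEMS, prompts=PROMPT_FAMILIES, variants=PAGE_VARIANTS):
    """Return one planned attempt per (system, prompt family, page variant)."""
    return [
        {
            "attempt_id": f"{slug(s)}.{p}.{v}",
            "system": s,
            "prompt_family": p,
            "page_variant": v,
        }
        for s, p, v in product(systems, prompts, variants)
    ]
```

A stable `attempt_id` per cell gives server logs, pixel hits, and on-page events a shared key to correlate on.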

### Phase 5: Reproducible Findings

Status: planned.

- Every finding must include hypothesis, test setup, timestamp window, raw
  evidence, expected result, observed result, reproduction steps, limitations,
  and confidence estimate.
- Findings must be written so another model, agent, or human can rerun them.
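
The required fields can be enforced with a small record type. This is a sketch under the assumption that findings are held as Python objects before export; the class and method names are illustrative.

```python
from dataclasses import dataclass, fields


@dataclass
class Finding:
    hypothesis: str
    test_setup: str
    timestamp_window: tuple  # (start, end), e.g. ISO 8601 strings
    raw_evidence: list
    expected_result: str
    observed_result: str
    reproduction_steps: list
    limitations: str
    confidence: float

    def missing_fields(self) -> list:
        """List required fields that are empty, in declaration order."""
        return [
            f.name
            for f in fields(self)
            if getattr(self, f.name) in ("", [], None)
        ]
```

A finding with a non-empty `missing_fields()` result would be held back from publication until it is complete.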

## Near-Term Build Tasks

1. Run the official IP range refresh and inspect the local cache.
2. Add request grouping by session/correlation id.
3. Add export of selected events into finding templates.
4. Add robots experiment pages with allow/disallow variants.
5. Add noindex/nosnippet/nofollow experiment pages.
6. Add bait links that are present in the HTML source but not rendered to users.
7. Add sitemap-only and robots-only discovery tests.
8. Add reverse DNS and forward DNS verification.
9. Add a public tunnel/subdomain once the LAN tests are stable.
10. Refresh the OpenRouter model catalog and add budget-limited model test runs.
11. Add first-party on-page analytics events to standard manual prompt fixtures.
12. Add tracking-pixel correlation to every standard prompt attempt.
13. Add cookie-consent popup fixtures and manual packet prompts.
14. Expand the ChatGPT manual packet baseline to Claude, Gemini, Perplexity,
    Copilot/Bing, and, once access is available, other major systems.
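
For task 2, request grouping could start as simply as the sketch below, which also summarizes which kinds of behavior were absent in each group. The `correlation_id` and `kind` field names and the kind values are assumptions about the event schema.

```python
from collections import defaultdict


def group_by_correlation(events, key="correlation_id"):
    """Bucket events by correlation id; events without one share a bucket."""
    groups = defaultdict(list)
    for event in events:
        groups[event.get(key, "uncorrelated")].append(event)
    return dict(groups)


def summarize(group):
    """Report which request kinds were seen, so absences are explicit."""
    kinds = {event.get("kind") for event in group}
    return {
        "html": "html_request" in kinds,
        "subresources": "subresource" in kinds,
        "client_events": "client_event" in kinds,
    }
```

A group with `html` true and the other two false is the HTML-only fetch pattern that the principles ask to track.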