# Fingerprint Confidence Model

## Goal

Classify observed visits without pretending that one signal is enough.

Each event should preserve its raw evidence alongside a confidence score, so
the event can be re-scored later as the classifier improves.

## Entity Types

- `search_index_crawler`: automatic crawler for search/indexing.
- `training_crawler`: automatic crawler for model training data.
- `assistant_user_fetcher`: fetch caused by a user prompt or action.
- `http_client`: curl, wget, Python requests, HTTPX, Go client, etc.
- `browser_like_agent`: headless or full browser without enough human signals.
- `human_browser`: browser with interaction signals.
- `unknown`: insufficient evidence.

## Signal Weights

### Identity Signals

- Official user-agent match: +45 to +60.
- Official IP range match: +35 to +50.
- Same-provider official user-agent plus official IP range: strong confidence
  boost, with raw range evidence stored on the event.
- Reverse DNS and forward DNS verification: weak positive evidence only when
  the PTR hostname forward-confirms back to the source IP.
- Provider role match, such as a `ChatGPT-User` fetch that coincides with a
  ChatGPT prompt timestamp: +20.
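
The identity weights above can be read as additive contributions to a
provider-identity score. The sketch below is illustrative only: the function
name `scoreIdentity`, the input flags, and the choice of the low end of each
documented range are assumptions. The same-provider boost and the DNS weight
have no documented number, so they are left as options that default to 0.

```javascript
// Illustrative weights: the low end of each range documented above.
// The source gives ranges, not fixed values, so these picks are assumptions.
const IDENTITY_WEIGHTS = {
  officialUserAgent: 45, // documented range: +45 to +60
  officialIpRange: 35, // documented range: +35 to +50
  providerRoleMatch: 20, // documented: +20
};

// Hypothetical helper: sums identity contributions and records the reasons.
function scoreIdentity(signals, options = {}) {
  // No numbers are documented for these two, so they default to 0.
  const { combinedBoost = 0, dnsConfirmedWeight = 0 } = options;
  const reasons = [];
  let score = 0;

  const add = (points, reason) => {
    if (points > 0) {
      score += points;
      reasons.push({ reason, points });
    }
  };

  if (signals.officialUserAgent) {
    add(IDENTITY_WEIGHTS.officialUserAgent, "official_user_agent");
  }
  if (signals.officialIpRange) {
    add(IDENTITY_WEIGHTS.officialIpRange, "official_ip_range");
  }
  if (signals.officialUserAgent && signals.officialIpRange && signals.sameProvider) {
    add(combinedBoost, "same_provider_ua_and_ip");
  }
  // DNS counts only when the PTR hostname forward-confirms to the source IP.
  if (signals.dnsStatus === "forward_confirmed") {
    add(dnsConfirmedWeight, "dns_forward_confirmed");
  }
  if (signals.providerRoleMatch) {
    add(IDENTITY_WEIGHTS.providerRoleMatch, "provider_role_match");
  }

  // Clamp to 100, the top of the confidence bands (an assumed scale).
  return { score: Math.min(100, score), reasons };
}

const identityResult = scoreIdentity({
  officialUserAgent: true,
  officialIpRange: true,
  sameProvider: true,
  providerRoleMatch: false,
  dnsStatus: "no_ptr",
});
console.log(identityResult.score); // 80
```

Keeping the `reasons` array next to the score is one way to make a later
re-score explainable, since each contribution stays visible on the event.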

### Behavior Signals

- HTML request only, with no resource fetch after the wait window: +15 crawler/fetcher.
- CSS/image resource fetch: +10 browser-like or rendering fetcher.
- JavaScript event: +20 browser-capable.
- Pointer or scroll event: +20 human-likelihood, not human proof.
- Deep crawl beyond depth 2: +20 crawler-likelihood.
- Exact prompt-time correlation: +20 assistant-user-fetcher likelihood.
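
Unlike the identity weights, each behavior signal above points at a different
likelihood, so a single running total would lose information. A minimal sketch
of one way to keep them apart follows. The flag names, the dimension names,
and `scoreBehavior` are hypothetical; only the point values come from the
list.

```javascript
// One rule per documented behavior signal. Flag and dimension names are
// assumptions; the points are the values listed above.
const BEHAVIOR_RULES = [
  { flag: "htmlOnlyAfterWait", dimension: "crawlerOrFetcher", points: 15 },
  { flag: "fetchedCssOrImage", dimension: "browserLikeOrRenderer", points: 10 },
  { flag: "javascriptEvent", dimension: "browserCapable", points: 20 },
  { flag: "pointerOrScroll", dimension: "human", points: 20 },
  { flag: "crawlDepthOver2", dimension: "crawler", points: 20 },
  { flag: "exactPromptTimeMatch", dimension: "assistantUserFetcher", points: 20 },
];

// Returns points per likelihood dimension for the flags that were observed.
function scoreBehavior(observed) {
  const likelihood = {};
  for (const rule of BEHAVIOR_RULES) {
    if (observed[rule.flag]) {
      likelihood[rule.dimension] = (likelihood[rule.dimension] || 0) + rule.points;
    }
  }
  return likelihood;
}

const browserVisit = scoreBehavior({
  fetchedCssOrImage: true,
  javascriptEvent: true,
  pointerOrScroll: true,
});
console.log(browserVisit);
// { browserLikeOrRenderer: 10, browserCapable: 20, human: 20 }
```

The `human` dimension here is a likelihood only; a scripted browser can reach
the same value by synthesizing pointer events.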

### Negative/Absence Signals

- No JavaScript event after HTML and resource fetches: weak evidence against full browser.
- No subresource requests: weak evidence for HTML-only fetcher.
- No depth traversal: weak evidence against broad crawler.
- Robots-disallowed page fetched by a user agent documented as an automatic
  crawler: potential policy finding, not automatic identity proof.

## Confidence Bands

- 95 to 100: official UA plus official IP/rDNS verification plus behavior
  consistent with the provider role.
- 80 to 94: official UA plus either IP verification or strong behavior.
- 65 to 79: official UA only, or official IP plus prompt-time behavior.
- 50 to 64: weak provider hints but no independent confirmation.
- Under 50: do not label beyond broad class.
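
The bands above amount to a threshold lookup. A minimal sketch, assuming the
score is a number from 0 to 100; the band labels and the `confidenceBand`
function are hypothetical names, and only the boundaries come from the list.

```javascript
// Lower bound of each band, highest first. Labels are illustrative.
const CONFIDENCE_BANDS = [
  { min: 95, label: "verified_identity_and_behavior" },
  { min: 80, label: "ua_plus_ip_or_strong_behavior" },
  { min: 65, label: "ua_only_or_ip_plus_prompt_time" },
  { min: 50, label: "weak_provider_hints" },
  { min: 0, label: "broad_class_only" },
];

// Maps a 0-100 confidence score to the first band whose lower bound it meets.
function confidenceBand(score) {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`confidence must be between 0 and 100, got ${score}`);
  }
  return CONFIDENCE_BANDS.find((band) => score >= band.min).label;
}

console.log(confidenceBand(82)); // ua_plus_ip_or_strong_behavior
console.log(confidenceBand(49)); // broad_class_only
```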

## Raw Evidence Required

For each classification, preserve:

- request id
- timestamp
- path
- user-agent
- IP
- source headers
- parent event id
- resource behavior
- JavaScript capability event, if any
- enrichment source versions
- classifier version
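
As a rough illustration of the list above, an evidence record could be checked
for completeness before it is stored. The camelCase field names, the
`missingEvidenceFields` helper, and every value in the sample record are
placeholders rather than the lab's actual schema. The JavaScript capability
event is treated as optional because the list marks it "if any", and a `null`
parent id for a top-level request is an assumption.

```javascript
// Hypothetical field names mapped from the list above.
const REQUIRED_EVIDENCE_FIELDS = [
  "requestId",
  "timestamp",
  "path",
  "userAgent",
  "ip",
  "sourceHeaders",
  "parentEventId",
  "resourceBehavior",
  "enrichmentSourceVersions",
  "classifierVersion",
];

// Returns the required fields that are absent (undefined) on a record.
// A null value counts as present, for example a request with no parent.
function missingEvidenceFields(record) {
  return REQUIRED_EVIDENCE_FIELDS.filter((field) => record[field] === undefined);
}

// Placeholder values only; 203.0.113.0/24 is a documentation address range.
const exampleEvidence = {
  requestId: "req-0001",
  timestamp: "2025-01-01T00:00:00.000Z",
  path: "/example-page",
  userAgent: "ExampleBot/1.0",
  ip: "203.0.113.7",
  sourceHeaders: { accept: "text/html" },
  parentEventId: null,
  resourceBehavior: { htmlRequested: true, subresourceCount: 0 },
  javascriptCapabilityEvent: null,
  enrichmentSourceVersions: { officialIpRanges: "example-version" },
  classifierVersion: "example-0",
};

console.log(missingEvidenceFields(exampleEvidence)); // []
```

Recording the enrichment and classifier versions on every event is what makes
a later re-score comparable with the original classification.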

Official IP range matches are stored in:

```text
fingerprint.network.officialIpMatches
```

DNS identity evidence is stored in:

```text
fingerprint.network.dns
```

`forward_confirmed` can lightly strengthen provider identity, especially when
it agrees with a claimed user-agent or official IP range. `forward_mismatch`,
`ptr_error`, and `no_ptr` stay visible as unverified signals or limitations.
DNS-only provider evidence remains too weak to assign an entity type by itself.
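
The four DNS states can be derived from a reverse lookup and the forward
lookups of its hostnames. The sketch below assumes those lookups have already
been done and are passed in as plain data. The input shape, the function
names, and the role labels (`corroborating`, `dns_only`, `unverified`) are
hypothetical; only the four state names come from the text.

```javascript
// Derives one DNS state from lookup results gathered elsewhere.
// ptrNames: hostnames from the reverse lookup of sourceIp.
// forwardAddresses: map of hostname -> addresses from forward lookups.
function classifyDnsEvidence({ sourceIp, ptrNames, ptrError, forwardAddresses = {} }) {
  if (ptrError) return "ptr_error";
  if (!ptrNames || ptrNames.length === 0) return "no_ptr";
  // Plain string comparison; IPv6 addresses would need normalizing first.
  const confirmed = ptrNames.some((name) =>
    (forwardAddresses[name] || []).includes(sourceIp)
  );
  return confirmed ? "forward_confirmed" : "forward_mismatch";
}

// DNS never assigns an entity type alone; it only supports other evidence.
function dnsEvidenceRole(status, hasIndependentProviderSignal) {
  if (status !== "forward_confirmed") return "unverified";
  return hasIndependentProviderSignal ? "corroborating" : "dns_only";
}

const dnsStatus = classifyDnsEvidence({
  sourceIp: "192.0.2.10",
  ptrNames: ["crawler.example.net"],
  forwardAddresses: { "crawler.example.net": ["192.0.2.10"] },
});
console.log(dnsStatus); // forward_confirmed
console.log(dnsEvidenceRole(dnsStatus, false)); // dns_only
```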

## Group Evidence Score

Event-level `fingerprint.confidence` is not the same as group-level evidence
confidence.

The lab also scores grouped evidence with:

```text
group.evidenceScore
```

This score is produced by `netlify/lib/evidence-score.mjs` and attached to
request groups, visitor sessions, and test attempts by `buildEventGroups()`.

The first version uses only direct lab evidence:

- server-side page/static request
- subresource fetches
- JavaScript capability beacons
- pointer or scroll interaction
- multi-path activity
- official IP range matches
- forward-confirmed DNS, DNS mismatches, and missing/error DNS states
- event-level identity fingerprint confidence

It intentionally does not treat model response text as direct evidence.
Response claims must be correlated with lab traffic before becoming finding
evidence.
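
To make the distinction concrete, here is a sketch of how the direct evidence
kinds listed above might be collected for one group while model response text
is excluded. It is not the contents of `netlify/lib/evidence-score.mjs`: the
event shape, the `kind` values, and `collectDirectEvidence` are assumptions,
and no weights are applied because the text does not specify any.

```javascript
// Hypothetical event shape: { kind, path, officialIpMatches, dnsStatus,
// fingerprintConfidence }. Every kind value is an assumption.
function collectDirectEvidence(events) {
  // Model response claims are not direct lab evidence, so drop them first.
  const direct = events.filter((e) => e.kind !== "model_response_claim");
  const isServerRequest = (e) => e.kind === "page" || e.kind === "static";
  const pagePaths = new Set(direct.filter(isServerRequest).map((e) => e.path));
  const dnsStatuses = new Set(direct.map((e) => e.dnsStatus).filter(Boolean));

  return {
    serverRequest: direct.some(isServerRequest),
    subresourceFetch: direct.some((e) => e.kind === "subresource"),
    javascriptBeacon: direct.some((e) => e.kind === "js_beacon"),
    interaction: direct.some((e) => e.kind === "pointer" || e.kind === "scroll"),
    multiPath: pagePaths.size > 1,
    officialIpMatch: direct.some((e) => (e.officialIpMatches || []).length > 0),
    dnsStatuses: [...dnsStatuses].sort(),
    maxFingerprintConfidence: Math.max(
      0,
      ...direct.map((e) => e.fingerprintConfidence || 0)
    ),
  };
}

const groupEvidence = collectDirectEvidence([
  { kind: "page", path: "/a", fingerprintConfidence: 80, dnsStatus: "no_ptr" },
  { kind: "subresource", path: "/a.css", fingerprintConfidence: 80 },
  { kind: "model_response_claim", text: "I opened /b" },
]);
console.log(groupEvidence.multiPath); // false
console.log(groupEvidence.maxFingerprintConfidence); // 80
```

In this example the claimed visit to `/b` does not set `multiPath`, because no
lab request for that path exists in the group.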

## Important Caveats

- User-agents are spoofable.
- IP ranges change.
- Datacenter traffic is not automatically AI traffic.
- Human users can block JavaScript.
- Bots can run browsers and synthesize pointer events.
- Assistant answers may use cached/indexed content and not fetch the page at all.
