# Finding 018: Perplexity deep dive — undisclosed crawlers, third-party index, and the tracking blind spot

## Date

2026-06-28

## Status

Published

## Summary

A deep dive into Perplexity's retrieval architecture reveals that the content
Perplexity served about the lab in Finding 017 ("search for" variant) may
have been acquired through channels that are invisible to the lab server's
instrumented event logging.

The lab's current event store has not recorded a visit from `PerplexityBot` or
`Perplexity-User` in the 9 retained instrumented events reviewed, and no
reviewed event came from an IP in Perplexity's published IP ranges. Yet
Perplexity returned detailed, accurate content from the lab site with direct
quotes and source URLs.

Three possible explanations exist:

1. **Undisclosed crawlers with spoofed user-agents.** According to Wikipedia
   (citing Wired and Cloudflare analyses), Perplexity uses undisclosed web
   crawlers with spoofed user-agent strings to scrape content. These crawlers
   would not be identifiable as Perplexity in server logs.

2. **Third-party search index.** Perplexity may surface content through a
   third-party search index (e.g., Bing API, Brave Search API) that has
   crawled the lab independently. Perplexity's API docs describe a
   `web_search` tool whose results carry `last_updated` dates, which is
   consistent with retrieval from an index rather than a live fetch.

3. **PerplexityBot visited but was not logged.** The lab's event store is
   capped at 1000 events and the lab only went live recently. If PerplexityBot
   visited before the event store was active, or if the event was evicted,
   it would not appear in current logs.

This finding documents the tracking blind spot: Perplexity can surface your
content to users without any detectable signal in your server logs.

## Method

### Server log analysis

The lab server's production event store (Netlify Blobs) was queried for all
events since the lab went live. All 9 events were examined for PerplexityBot
or Perplexity-User user-agents, and all IPs were checked against Perplexity's
published IP ranges.

### Official documentation review

Perplexity's official crawler documentation was fetched and reviewed:
- https://docs.perplexity.ai/docs/resources/perplexity-crawlers
- https://www.perplexity.com/perplexitybot.json (IP ranges)
- https://www.perplexity.com/perplexity-user.json (IP ranges)
- https://docs.perplexity.ai/docs/agent-api/tools/web-search.md
- https://docs.perplexity.ai/docs/agent-api/tools/fetch-url-content.md

### Third-party source review

Wikipedia's Perplexity AI article was reviewed. It cites Wired and
Cloudflare analyses reporting that Perplexity uses undisclosed web crawlers
with spoofed user-agent strings.

### IP range cross-reference

All IPs in the lab's event store were cross-referenced against Perplexity's
published IP ranges for both PerplexityBot and Perplexity-User.
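
The cross-reference can be sketched with Python's standard `ipaddress` module. This is a minimal illustration, not the lab's actual tooling: the function name `match_perplexity_range` is hypothetical, and the CIDR blocks are copied from the ranges listed under Raw Evidence, so they should be re-fetched from Perplexity's published JSON files before being relied on.

```python
import ipaddress
from typing import Optional

# CIDR blocks as listed in this finding (assumption: still current when you run this).
PERPLEXITY_RANGES = {
    "PerplexityBot": [
        "107.20.236.150/32", "3.224.62.45/32", "18.210.92.235/32",
        "3.222.232.239/32", "3.211.124.183/32", "3.231.139.107/32",
        "18.97.1.228/30", "18.97.9.96/29",
    ],
    "Perplexity-User": [
        "44.208.221.197/32", "34.193.163.52/32",
        "18.97.21.0/30", "18.97.43.80/29",
    ],
}

# Parse each CIDR string once so lookups are cheap.
_NETWORKS = {
    agent: [ipaddress.ip_network(cidr) for cidr in cidrs]
    for agent, cidrs in PERPLEXITY_RANGES.items()
}


def match_perplexity_range(ip: str) -> Optional[str]:
    """Return the published agent name whose range contains `ip`, else None."""
    addr = ipaddress.ip_address(ip)
    for agent, networks in _NETWORKS.items():
        if any(addr in network for network in networks):
            return agent
    return None
```

Calling `match_perplexity_range` on each event's source IP and counting the non-`None` results gives the "IP range matches" figure. A `None` result only means the IP is outside the published ranges; it says nothing about who operates the IP.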

## Raw Evidence

### Perplexity's official crawler documentation

Perplexity publishes two user agents:

1. **PerplexityBot** — "designed to surface and link websites in search
   results on Perplexity." Full UA string:
   `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)`
   Should respect robots.txt. Published IP ranges:
   - 107.20.236.150/32
   - 3.224.62.45/32
   - 18.210.92.235/32
   - 3.222.232.239/32
   - 3.211.124.183/32
   - 3.231.139.107/32
   - 18.97.1.228/30
   - 18.97.9.96/29

2. **Perplexity-User** — "supports user actions within Perplexity. When users
   ask Perplexity a question, it might visit a web page to help provide an
   accurate answer." Full UA string:
   `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)`
   Generally ignores robots.txt (user-triggered). Published IP ranges:
   - 44.208.221.197/32
   - 34.193.163.52/32
   - 18.97.21.0/30
   - 18.97.43.80/29
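
A user-agent check for these two declared agents might look like the sketch below. The function name `declared_perplexity_agent` is hypothetical. A match only shows what the client claims to be: user-agent strings are trivially spoofable, so a hit should be paired with a check against the published IP ranges, and a miss does not rule out a crawler that identifies itself as a browser.

```python
import re
from typing import Optional

# Match the product token followed by a version, e.g. "PerplexityBot/1.0".
# Requiring "/<digit>" avoids matching the info URL at the end of the UA string.
_DECLARED_AGENT = re.compile(r"\b(PerplexityBot|Perplexity-User)/\d", re.IGNORECASE)

_CANONICAL = {
    "perplexitybot": "PerplexityBot",
    "perplexity-user": "Perplexity-User",
}


def declared_perplexity_agent(user_agent: str) -> Optional[str]:
    """Return the Perplexity agent a request declares itself as, or None."""
    match = _DECLARED_AGENT.search(user_agent or "")
    if match is None:
        return None
    return _CANONICAL[match.group(1).lower()]
```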

### Lab server log analysis

- Total events in production: 9
- PerplexityBot user-agent matches: 0
- Perplexity-User user-agent matches: 0
- Perplexity IP range matches: 0

None of the 9 events retained in the lab's production event store came from a
Perplexity IP address or used a Perplexity user-agent string.

### Wikipedia / Wired / Cloudflare

From Wikipedia's Perplexity AI article:

> "According to separate analyses by Wired and, later, Cloudflare, Perplexity
> uses undisclosed web crawlers with spoofed user-agent strings to scrape the
> content of websites that prohibit or explicitly block web scraping."

### Perplexity API documentation

Perplexity's Agent API offers two relevant tools:
- `web_search` — searches the web and retrieves relevant web page contents
- `fetch_url` — fetches and extracts content from specific URLs

The `web_search` tool returns `last_updated` dates for each result, suggesting
that Perplexity maintains or accesses a cached index of web content with
metadata about when it was last updated.

## Interpretation

### How Perplexity acquired the lab's content

Given that:
- No retained event shows a PerplexityBot visit to the lab server (0 of 9 events)
- No retained event came from an IP in Perplexity's published ranges
- Perplexity returned accurate, detailed content with direct quotes
- Perplexity cited `https://ai-crawler-lab.kaistone.ai` as a source

The available evidence does not identify a single acquisition channel. The
plausible channels are:

1. **Undisclosed crawlers.** Perplexity's undisclosed crawlers (reported by
   Wired and Cloudflare, as cited by Wikipedia) may have crawled the lab site
   using spoofed user-agents that appear as regular browser traffic. The two
   events from `198.23.130.204` (a server IP, not a residential IP) with a
   Firefox user-agent could be an undisclosed Perplexity crawler, but nothing
   in the logs attributes them to Perplexity.

2. **Third-party search index.** Perplexity may use Bing or another search
   API as a backend. Our Copilot test showed no Bing search results for the
   domain, but Bing may still have crawled the site without surfacing it in
   Copilot's filtered results. If so, Perplexity could retrieve content
   through that index without directly hitting the origin server.

3. **The lab's own public pages.** Perplexity's search may have found the
   lab's content through the public-facing pages (index, findings, etc.)
   which are linked from kaistone.ai and potentially other referrers. The
   content Perplexity quoted matches the lab's home page text closely.

### The tracking blind spot

The key implication for AEO/SEO practitioners is that **Perplexity can
surface your content without any detectable signal in your server logs**.
This creates a blind spot:

1. **You cannot detect Perplexity retrieval through standard analytics.**
   No PerplexityBot user-agent, no Perplexity IP, no detectable request.

2. **You cannot distinguish Perplexity's undisclosed crawlers from regular
   browser traffic.** The spoofed user-agents look like normal browsers.

3. **You cannot reliably correlate Perplexity chat responses with
   server-side events.** When a user asks Perplexity about your site, a live
   request may never reach your server. In the lab's case, the content
   appears to have been served from a cache or index.

### How to track Perplexity

Despite the blind spot, there are several approaches:

1. **Robots.txt + PerplexityBot IP monitoring.** Allow PerplexityBot in
   robots.txt and monitor for requests from Perplexity's published IP ranges.
   This will detect the *official* crawler but not the undisclosed ones.

2. **Unique content fingerprints.** Embed unique, identifiable content
   (tracking pixels, unique text strings, unique URLs) in your pages. If
   Perplexity surfaces this content, you can confirm it was crawled and
   identify which content was indexed.

3. **Client-side JavaScript beacon.** Deploy a JavaScript beacon that fires
   on page load and sends a request to your server. If the beacon request
   arrives from a Perplexity IP or a headless browser, it may indicate
   Perplexity crawling. However, this only works if Perplexity's crawler
   executes JavaScript, which our findings suggest AI crawlers generally do
   not.

4. **DNS-based tracking.** Use unique subdomains or DNS records that are
   referenced in your content. If Perplexity resolves these DNS names, you
   can detect the lookup in DNS logs.

5. **Perplexity response monitoring.** Manually search Perplexity for your
   domain periodically and compare the returned content against your current
   page content. Changes in the returned content suggest re-crawling or an
   index refresh.

6. **Allow PerplexityBot and log aggressively.** Explicitly allow
   PerplexityBot in robots.txt, then log all requests from Perplexity's IP
   ranges with full headers. This maximizes the chance of detecting official
   crawler visits, even if it doesn't catch undisclosed crawlers.
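
As an illustration of the unique-content-fingerprint approach, the sketch below derives one stable token per page and checks a saved Perplexity answer for any of them. Everything here is an assumption made for illustration: the `lab-fp-` prefix, the function names `make_fingerprint` and `find_fingerprints`, and the use of an HMAC over the page path are not taken from the lab's implementation.

```python
import hashlib
import hmac
from typing import Dict, List


def make_fingerprint(secret: bytes, page_path: str, length: int = 12) -> str:
    """Derive a stable, hard-to-guess token for one page (hypothetical scheme)."""
    digest = hmac.new(secret, page_path.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"lab-fp-{digest[:length]}"


def find_fingerprints(response_text: str, fingerprints: Dict[str, str]) -> List[str]:
    """Return the page paths whose token appears in a captured response."""
    lowered = response_text.lower()
    return [path for path, token in fingerprints.items() if token in lowered]


# Build one token per page; embed each token in the visible text of its page.
pages = ["/", "/findings/018"]
fingerprints = {path: make_fingerprint(b"replace-with-a-private-key", path) for path in pages}
```

Because each token is tied to a single page, a token showing up in a Perplexity answer indicates which page was acquired, though not which channel acquired it. A token that never appears is inconclusive, since an answer may paraphrase or omit that part of the page.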

## Limitations

- The lab server's event store is capped at 1000 events and the lab only went
  live recently. Historical PerplexityBot visits may have been evicted.
- The lab does not currently log requests that don't hit the `/lab/root` or
  `/track/` endpoints. PerplexityBot may have visited the home page or other
  public pages without triggering a logged event.
- The two events from `198.23.130.204` have not been definitively attributed
  to any provider. This IP could belong to a hosting provider, a VPN, or an
  undisclosed crawler.
- The Wikipedia claims about undisclosed crawlers cite Wired and Cloudflare
  analyses from 2024. Perplexity's crawling practices may have changed since
  then.
- Because the Perplexity quote matches the public homepage and the lab may not
  log every request to `/`, absence from `/api/hits` does not rule out an
  earlier plain homepage fetch by an official, third-party, or spoofed crawler.
- This analysis was conducted with web search and web fetch tools that were
  rate-limited, which prevented a full review of all primary sources.

## Publication Thesis Verification

- Thesis: Perplexity can surface content without a matching `PerplexityBot` or
  `Perplexity-User` event in the current instrumented logs, likely through an
  indexed/cache path. The specific acquisition channel is unresolved; possible
  channels include an unlogged homepage fetch, a third-party index, official
  crawling outside the retained events, or undisclosed/spoofed crawlers. This
  creates a tracking blind spot for server-side analytics.
- Source: Lab server production event store (0 Perplexity visits),
  Perplexity official crawler docs, Wikipedia (citing Wired and Cloudflare),
  Perplexity API docs.
- Method: Server log analysis, IP range cross-reference, official
  documentation review, third-party source review.
- Bias: The lab is a small, recently-launched site with limited traffic.
  Perplexity's behavior may differ for larger, more established sites.
- Consensus: Consistent with Finding 017 (Perplexity served cached content
  with no origin hit). Consistent with Wikipedia/Wired/Cloudflare reports
  of undisclosed crawlers.
- Invalidation: Monitor server logs over a longer period (weeks/months) to
  catch PerplexityBot visits. Submit the lab domain to Perplexity for
  crawling and monitor for visits. Set up unique content fingerprints and
  monitor for their appearance in Perplexity responses.
- Verdict: The tracking blind spot is well-supported for the observed
  Perplexity response path. The specific acquisition channel remains
  unresolved.
- Confidence: high for the tracking blind spot; medium for the specific
  acquisition channel (undisclosed crawlers vs. third-party index).
- Additional tests suggested: monitor server logs for PerplexityBot or
  Perplexity-associated crawlers over time; test whether blocking known
  Perplexity crawler IPs changes the quality of Perplexity responses; run
  Perplexity tests on a brand-new domain with no third-party index presence
  to verify whether responses degrade.

## Follow-up tasks

1. Allow PerplexityBot in the lab's robots.txt and monitor for official
   crawler visits.
2. Deploy unique content fingerprints on lab pages to detect Perplexity
   indexing.
3. Investigate the two events from `198.23.130.204` and determine whether
   this IP belongs to a hosting provider used by Perplexity.
4. Submit the lab domain to Perplexity for crawling via their documented
   process and monitor for visits.
5. Test whether Perplexity's `fetch_url` API tool (Agent API) produces a
   detectable origin hit when fetching a lab page.
6. Periodically search Perplexity for the lab domain and compare returned
   content against current page content to detect re-crawling.
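
For follow-up task 1, a quick sanity check that a robots.txt draft actually permits PerplexityBot can be sketched with Python's standard `urllib.robotparser`. The sample rules and the `/private/` path are hypothetical and are not the lab's real robots.txt. The check reflects how Python's parser reads the file, which may differ in edge cases from how a given crawler interprets it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft: allow PerplexityBot everywhere, restrict everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Report whether `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

Running `is_allowed(SAMPLE_ROBOTS_TXT, "PerplexityBot", "/findings/018")` before deploying a change helps confirm that a later absence of PerplexityBot events is not simply the result of an accidental disallow rule.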
