# Finding 030: Perplexity Sitemap And Robots Poem Prompt Still Produced No Page Hit

## Date

2026-06-29

## Status

Published

## Summary

Perplexity was given a fresh native incognito web prompt for a short-lived poem
URL that was explicitly advertised in both `robots.txt` and `sitemap.xml` while
the test was active:
`/lab/perplexity-poem/sitemap-ledger-4b7e91c2`. The page used a distinct poem
marker, visible row and word counts, an acrostic, hidden decoys, resource fetch
opportunities, and an afterword link.

Perplexity still answered that it could not reliably open the page. ClickHouse
recorded no public request for the exact poem path, run id, test id,
`robots.txt`, or `sitemap.xml` during the enabled prompt window. The only
matching discovery rows were operator localhost verification requests. Compared
with Finding 020, adding the page to robots and sitemap did not produce a
direct-origin poem hit in this single Perplexity free-web run.

## Hypothesis

If Perplexity's failure in Finding 020 was caused by the URL being absent from
site discovery surfaces, making a fresh poem page available in `robots.txt` and
`sitemap.xml` during the prompt window might cause Perplexity to fetch the page
or at least the discovery files.

## Test Setup

- Run id: `perplexity-poem-discoverable-20260629-001`
- Attempt id: `perplexity-poem-discoverable-20260629-001-p01`
- Prompt id: `signal-almanac-poem`
- Target path: `/lab/perplexity-poem/sitemap-ledger-4b7e91c2`
- Test state enabled: `2026-06-29T10:16:45.944Z`
- Operator robots/sitemap verification: `2026-06-29T10:17:10Z` to
  `2026-06-29T10:17:11Z`
- Prompt submitted: approximately `2026-06-29T10:18:00.000Z`
- Test state disabled: `2026-06-29T10:19:45.286Z`
- Surface: Perplexity web, free plan, native incognito thread
- Fixture marker: `POEM-SITEMAP-BEACON-64`
- Visible poem rows: 8
- Visible poem words: 64
- Acrostic: `SITEMAPS`

While enabled with `discoverable: true`, the local server dynamically added:

```text
Allow: /lab/perplexity-poem/sitemap-ledger-4b7e91c2
```

to `robots.txt`, and added:

```xml
<url><loc>https://ai-crawler-lab.kaistone.ai/lab/perplexity-poem/sitemap-ledger-4b7e91c2</loc></url>
```

to `sitemap.xml`. After disablement, the same URL returned HTTP 410 and the
dynamic robots/sitemap entries disappeared.
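
The enable/disable behavior described above can be sketched as follows. This is a minimal illustration, not the lab server's actual code: the function names, the boolean `enabled` and `discoverable` parameters, and the 200 status while enabled are assumptions. Only the `Allow` line, the sitemap `<loc>` entry, and the post-disable HTTP 410 come from this run.

```python
from typing import List

# Values taken from this finding.
TARGET_PATH = "/lab/perplexity-poem/sitemap-ledger-4b7e91c2"
ORIGIN = "https://ai-crawler-lab.kaistone.ai"


def robots_extra_lines(enabled: bool, discoverable: bool) -> List[str]:
    """Extra robots.txt lines contributed by the test (hypothetical helper)."""
    if enabled and discoverable:
        return [f"Allow: {TARGET_PATH}"]
    return []


def sitemap_extra_entries(enabled: bool, discoverable: bool) -> List[str]:
    """Extra sitemap.xml entries contributed by the test (hypothetical helper)."""
    if enabled and discoverable:
        return [f"<url><loc>{ORIGIN}{TARGET_PATH}</loc></url>"]
    return []


def poem_status(enabled: bool) -> int:
    """Status for the poem path: 410 Gone after disablement.

    The 200 while enabled is an assumption; the finding only records the 410.
    """
    return 200 if enabled else 410
```

Keeping the discovery entries and the page status behind the same `enabled` flag means a post-disable check of all three surfaces, as in the operator verification rows, confirms the whole fixture was withdrawn together.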

## Raw Evidence

Prompt and answer artifacts:

- Prompt packet:
  `research/manual-client-runs/perplexity-poem-discoverable-20260629-001.prompts.json`
- Answer packet:
  `research/manual-client-runs/perplexity-poem-discoverable-20260629-001.answers.json`

Perplexity response excerpt:

> I couldn't reliably open that page from here, so I can't verify the poem's
> text, line count, acrostic, or afterword from the page itself.

The Perplexity UI showed `Completed 3 steps`, `19 sources`, and a visible
domain citation for `ai-crawler-lab.kaistone.ai`, but none of these UI
indicators corresponded to a public direct-origin request for the target path.

No ClickHouse rows matched:

- `test_id = perplexity-poem-discoverable-20260629-001-p01`
- `run_id = perplexity-poem-discoverable-20260629-001`
- path containing `/lab/perplexity-poem/sitemap-ledger-4b7e91c2` from a public
  requester during the enabled window
- Perplexity user-agent or provider identity during
  `2026-06-29T10:16:00Z` to `2026-06-29T10:21:00Z`

Rows observed for the target surfaces in the run window:

| Event ID | Timestamp | Path | IP | User-Agent | Note |
|---|---:|---|---|---|---|
| `mqz2cyjj-h4v5q2i9` | `2026-06-29T10:17:10.911Z` | `/robots.txt` | `::ffff:127.0.0.1` | `curl/8.7.1` | operator verification while enabled |
| `mqz2cyma-tu1olpyh` | `2026-06-29T10:17:11.008Z` | `/sitemap.xml` | `::ffff:127.0.0.1` | `curl/8.7.1` | operator verification while enabled |
| `mqz2gkng-uf8mt8t9` | `2026-06-29T10:19:59.520Z` | `/sitemap.xml` | `::ffff:127.0.0.1` | `curl/8.7.1` | operator verification after disable |
| `mqz2gko1-wyg457do` | `2026-06-29T10:19:59.530Z` | `/robots.txt` | `::ffff:127.0.0.1` | `curl/8.7.1` | operator verification after disable |
| `mqz2gkoa-307xmb0f` | `2026-06-29T10:19:59.537Z` | `/lab/perplexity-poem/sitemap-ledger-4b7e91c2` | `::ffff:127.0.0.1` | `curl/8.7.1` | operator 410 check after disable |
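
The attribution rule applied to the table above (a row counts only if it falls inside the enabled window and comes from a non-loopback address) can be sketched in Python. The row tuples and window timestamps are copied from this finding; the helper names and the tuple layout are illustrative assumptions, not the lab's query code.

```python
import ipaddress
from datetime import datetime

ENABLED_AT = "2026-06-29T10:16:45.944Z"
DISABLED_AT = "2026-06-29T10:19:45.286Z"

# (event_id, timestamp, path, ip) copied from the evidence table.
ROWS = [
    ("mqz2cyjj-h4v5q2i9", "2026-06-29T10:17:10.911Z", "/robots.txt", "::ffff:127.0.0.1"),
    ("mqz2cyma-tu1olpyh", "2026-06-29T10:17:11.008Z", "/sitemap.xml", "::ffff:127.0.0.1"),
    ("mqz2gkng-uf8mt8t9", "2026-06-29T10:19:59.520Z", "/sitemap.xml", "::ffff:127.0.0.1"),
    ("mqz2gko1-wyg457do", "2026-06-29T10:19:59.530Z", "/robots.txt", "::ffff:127.0.0.1"),
    ("mqz2gkoa-307xmb0f", "2026-06-29T10:19:59.537Z",
     "/lab/perplexity-poem/sitemap-ledger-4b7e91c2", "::ffff:127.0.0.1"),
]


def parse_ts(text):
    """Parse an ISO-8601 UTC timestamp; 'Z' is rewritten for older Pythons."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def is_loopback(ip_text):
    """True for loopback addresses, including IPv4-mapped IPv6 forms."""
    addr = ipaddress.ip_address(ip_text)
    # Unwrap ::ffff:a.b.c.d explicitly; older Pythons do not do this
    # when evaluating is_loopback on an IPv6Address.
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:
        addr = mapped
    return addr.is_loopback


def classify(row):
    event_id, ts, path, ip = row
    in_window = parse_ts(ENABLED_AT) <= parse_ts(ts) <= parse_ts(DISABLED_AT)
    return {"event_id": event_id, "path": path,
            "in_window": in_window, "public": not is_loopback(ip)}


classified = [classify(row) for row in ROWS]
public_in_window = [r for r in classified if r["in_window"] and r["public"]]
print(public_in_window)  # [] : no attributable public row in the enabled window
```

Under this rule the first two rows fall inside the enabled window but are loopback, and the last three fall after disablement, so none is attributable to Perplexity.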

ClickHouse exact-match query result:

```text
exact_test_rows = 0
```

## Expected Result

If Perplexity used the supplied URL or the fresh robots/sitemap entries to
retrieve the poem, the lab should have recorded at least one public
`server_page` event for the exact path, with the test id carried in the query
string. A browser-like fetch could also have produced stylesheet, image,
script, JSON, tracking-pixel, afterword, hidden-link, or JavaScript beacon
events.

## Observed Result

- Perplexity did not answer from the poem page.
- No public event reached the exact poem path while it was enabled.
- No event carried the exact `test_id` or `run_id`.
- No PerplexityBot, Perplexity provider-classified, or Perplexity user-agent row
  appeared in the bounded run window.
- No afterword, resource, tracking-pixel, or JavaScript beacon event appeared.
- The only robots/sitemap rows were operator localhost checks.

## Comparison With Finding 020

Finding 020 used a hidden, non-discoverable poem URL and observed no exact poem
hit, but did observe official `PerplexityBot` fetching `/` and `/robots.txt`
during the active prompt window. This follow-up used a fresh poem URL that was
explicitly present in robots and sitemap, but observed neither the poem hit nor
the related PerplexityBot discovery crawl.

In this pair of runs, robots/sitemap exposure was not enough to make Perplexity
free-web incognito reliably fetch a fresh short-lived URL live. That conclusion
is bounded: it describes these two manual Perplexity web runs, not all
Perplexity products or time-delayed indexing behavior.

## Limitations

- Single follow-up run on the free Perplexity web surface.
- Operator verification intentionally touched `robots.txt` and `sitemap.xml`
  through localhost with the public host header; those rows are excluded from
  Perplexity attribution.
- The raw retained JSON endpoint was empty during review because the local
  `data/events.json` file contained multiple JSON values, so ClickHouse is the
  primary query source for this finding.
- Delayed crawler visits after disablement would be separate post-window
  observations.
- Perplexity's `19 sources` UI count was not expanded into a durable source list
  before disablement; the preserved answer text is sufficient for the no-content
  result, but not for source-page taxonomy.
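
For the `data/events.json` limitation above, one possible way to read a file holding several JSON values back to back is the standard library's `json.JSONDecoder.raw_decode`. This is a sketch under assumptions: the finding does not record the file's exact layout, so the `sample` string and the `iter_json_values` helper are hypothetical.

```python
import json


def iter_json_values(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Hypothetical sample: two JSON objects separated by newlines.
sample = '{"path": "/robots.txt"}\n{"path": "/sitemap.xml"}\n'

try:
    json.loads(sample)
    single_document = True
except json.JSONDecodeError:
    # A strict single-document parse rejects the trailing second value.
    single_document = False

values = list(iter_json_values(sample))
```

A reader of this kind would only make the retained file usable as a cross-check; ClickHouse remains the primary query source for this finding.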

## Publication Thesis Verification

- Thesis: Advertising a fresh short-lived poem page in robots and sitemap did
  not cause Perplexity free-web incognito to fetch the exact page, test id,
  robots file, or sitemap file during the active prompt window.
- Reviewer: pending separate-agent verification.
- Source evaluation: Primary sources are ClickHouse direct-origin rows, the
  prompt packet, answer packet, controlled-browser Perplexity response, and the
  state-file enable/disable timestamps.
- Method check: Strong for bounded no-hit against ClickHouse; limited by
  single-run sample size and by operator verification rows that must be excluded
  from provider attribution.
- Bias or funding check: Lab-owned infrastructure and operator-authored prompt;
  Perplexity UI text is treated as model/client output, not proof of retrieval.
- Consensus or triangulation: Triangulated across browser answer extraction,
  ClickHouse exact-match zero rows, post-disable HTTP 410, and dynamic
  robots/sitemap removal checks.
- Retraction or invalidation check: A later public crawler event for the same
  path would revise delayed-discovery interpretation but would not prove
  active-window content retrieval unless it occurred while the page was enabled.
- Verdict: `supported`
- Confidence: medium-high for bounded no-hit; medium for generalizing across
  Perplexity surfaces.
- Additional tests suggested: repeat with paid Perplexity, expand and preserve
  Perplexity source lists before disablement, submit the sitemap URL directly,
  wait longer before prompt submission after sitemap exposure, and test without
  query parameters.
