What is the safest crawler policy for a public site that wants search traffic?

Keep search and answer-discovery crawlers open, especially Googlebot and OAI-SearchBot, then make separate policy decisions for training-use controls such as GPTBot, Google-Extended, Applebot-Extended, and CCBot.

Is this benchmark legal advice?

No. It is a source-backed technical reference for crawler policy reviews. Verify each official source and ask qualified counsel for legal or licensing decisions.

Can user-agent strings prove a crawler is real?

No. User-agent strings can be spoofed. Treat them as a first clue and verify high-stakes crawler claims with official IP ranges, reverse DNS, or provider-specific guidance where available.
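The reverse DNS check mentioned above can be sketched in Python as a forward-confirmed lookup: resolve the IP to a hostname, confirm the hostname ends in a domain the operator documents, then resolve that hostname back and confirm it returns the same IP. This is a minimal sketch, not an operator-endorsed implementation. The function name `verify_crawler_ip`, its parameters, and the injectable lookup callables are hypothetical, and the hostname suffixes you pass in are an assumption to confirm against each operator's current documentation.

```python
import socket
from typing import Callable, Iterable


def verify_crawler_ip(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse_lookup: Callable[[str], str] = lambda addr: socket.gethostbyaddr(addr)[0],
    forward_lookup: Callable[[str], list] = lambda host: socket.gethostbyname_ex(host)[2],
) -> bool:
    """Return True if ip reverse-resolves to an allowed host that resolves back to ip.

    allowed_suffixes should start with a dot (for example ".googlebot.com") so that
    a lookalike such as "evilgooglebot.com" does not pass. The suffix values are an
    assumption: take them from the operator's own verification guidance.
    The default forward lookup covers IPv4 only.
    """
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        # No PTR record or lookup failure: treat as unverified.
        return False

    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False

    try:
        # Forward-confirm: the claimed hostname must map back to the same IP.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

A call such as `verify_crawler_ip(request_ip, (".googlebot.com", ".google.com"))` performs live DNS lookups, so cache results and run it only for requests where the crawler claim matters. Where an operator publishes IP ranges instead of reverse DNS guidance, compare against those ranges rather than using this check.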

How should teams cite this dataset?

Cite the page URL, version date, and the specific source row you used. Link to the JSON or CSV distribution when you need a machine-readable audit trail.

Source-backed crawler policy benchmark

AI crawler policy benchmark for sites that want visibility without giving everything away.

This benchmark compares official crawler docs for the user agents most small teams ask about first: OpenAI, Google, Apple, Perplexity, and Common Crawl. Use it to decide which crawlers to allow for discovery, which training-use agents to block, and which user-triggered fetchers need log monitoring instead of simple robots.txt rules.

Version 2026-06-24: robots.txt is an access preference, not authentication. Keep private content behind login. Do not rely on robots.txt to protect secrets, customer data, unpublished pricing, or licensed material.

Download the benchmark dataset

The crawler policy matrix is available as machine-readable JSON and CSV so other guides, audits, and support replies can cite the same source-backed rows instead of copying the table by hand.

JSON dataset

Includes metadata, source URLs, crawler categories, robots.txt strategy notes, and verification guidance.

Download JSON

CSV dataset

Flat table for spreadsheets, audits, and lightweight comparison workflows.

Download CSV

User-agent list

Human-readable token list for robots.txt, logs, and quick crawler policy reviews.

Open user-agent list

Benchmark methodology

What is included

Each row covers one crawler or fetcher token, the operator, documented purpose, practical robots.txt strategy, whether robots.txt applies, verification method, and official source URL.

What is not included

This benchmark does not claim ranking factors, legal rights, or guaranteed AI citations. It maps public crawler controls so teams can make cleaner policy decisions.

Source standard

Rows prefer official documentation from the operator. Where operators separate search crawlers from training-use controls, the benchmark treats those as separate policy choices.

Update cadence

Review before launches, during monthly crawler-policy updates, and whenever OpenAI, Google, Apple, Perplexity, or Common Crawl changes its crawler documentation.

How to cite this benchmark

Use this citation block in client notes, GitHub issues, or crawler policy pull requests.

Source: LLMs.txt Kit, AI crawler policy benchmark, version 2026-06-24.
URL: https://llmstxtkit.com/data/ai-crawler-policy-benchmark.html
Machine-readable data:
- https://llmstxtkit.com/data/ai-crawler-policy-benchmark.json
- https://llmstxtkit.com/data/ai-crawler-policy-benchmark.csv
Use: crawler policy review only; verify official source links before legal or licensing decisions.

Fast recommendation

Discovery

Keep search crawlers open

Blocking Googlebot, Applebot, OAI-SearchBot, or PerplexityBot can reduce eligibility for their search or answer surfaces.

Training use

Separate search from training

Use specific tokens such as GPTBot, Google-Extended, and Applebot-Extended when your policy allows search but not model training use.

Proof

Log and verify bots

Match user-agent strings against published IP lists or reverse DNS where available. User-agent text alone can be spoofed.

Crawler policy matrix

Operator	Token	Documented purpose	Robots.txt strategy	Source
OpenAI	`OAI-SearchBot`	Automatic search crawl for ChatGPT search visibility and Search opt-outs.	Allow if ChatGPT search eligibility matters.	OpenAI crawler docs
OpenAI	`GPTBot`	Foundation-model improvement crawl, separate from the Search crawler.	Allow only if training-use policy permits it.	OpenAI crawler docs
OpenAI	`ChatGPT-User`	User-initiated fetches from ChatGPT or Custom GPT actions, not automatic web crawling.	Monitor in logs; robots.txt may not apply to user-initiated requests.	OpenAI crawler docs
Google	`Googlebot`	Google Search crawling across Search features and related surfaces.	Allow for Google Search visibility unless a page should not be indexed.	Google common crawlers
Google	`Google-Extended`	Publisher control for some Gemini and Vertex AI grounding/training uses; it is not Google Search crawling, Search inclusion control, or a Search ranking signal.	Disallow when policy allows Search but opts out of these AI uses.	Google common crawlers
Apple	`Applebot`	Search technology for Apple experiences including Spotlight, Siri, and Safari.	Allow for Apple ecosystem discovery; avoid blocking render-critical assets.	Applebot documentation
Apple	`Applebot-Extended`	Usage control for Apple's generative foundation model training; it does not crawl webpages by itself.	Disallow to opt out of Apple foundation-model training while keeping Applebot discovery.	Applebot documentation
Perplexity	`PerplexityBot`	Perplexity's documented crawler for indexing and answer retrieval surfaces.	Allow if Perplexity answer visibility matters; verify with official IP lists for WAF rules.	Perplexity crawler docs
Perplexity	`Perplexity-User`	User-requested fetcher; Perplexity says it is not used for web crawling or training collection.	Monitor separately from crawler rules because user-requested fetches may ignore robots.txt.	Perplexity crawler docs
Common Crawl	`CCBot`	Common Crawl's open web crawl used for research and datasets.	Allow for open data participation; disallow if your policy forbids broad dataset reuse.	Common Crawl CCBot docs

Copy-ready policy snippets

Choose one starting point, then adjust for your legal, licensing, and growth goals.

Discovery-first policy

User-agent: *
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot
Allow: /

Sitemap: https://example.com/sitemap.xml

Search yes, broad training no

User-agent: *
Allow: /

User-agent: Googlebot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Applebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
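Before publishing, it can help to check that a snippet does what the policy intends for each token. The Python sketch below parses the "Search yes, broad training no" snippet and reports whether each token may fetch `/`. It is a deliberately small illustration: `parse_groups` and `is_allowed` are hypothetical helpers, matching is by exact token with a fallback to `*`, and wildcard patterns (`*`, `$`) inside paths are not handled. Each operator's own parser is the final authority.

```python
POLICY = """\
User-agent: *
Allow: /

User-agent: Googlebot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Applebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def parse_groups(robots_txt: str) -> dict:
    """Map each lowercase user-agent token to its list of (directive, path) rules."""
    groups = {}
    current_agents = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def is_allowed(groups: dict, token: str, path: str = "/") -> bool:
    """Longest matching prefix wins; on a tie, allow wins. No rules means allowed."""
    rules = groups.get(token.lower(), groups.get("*", []))
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty Disallow matches nothing
        is_allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


groups = parse_groups(POLICY)
for token in ("Googlebot", "OAI-SearchBot", "Applebot", "PerplexityBot",
              "GPTBot", "Google-Extended", "Applebot-Extended", "CCBot"):
    print(f"{token}: {'allowed' if is_allowed(groups, token) else 'blocked'}")
```

Exact-token matching is used here on purpose: a parser that matches by substring could apply the `Applebot` group to `Applebot-Extended`, which would hide the training opt-out in a check like this.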

How to prove the policy is working

Publish the robots.txt change with a dated changelog entry.
Confirm the final robots.txt URL returns HTTP 200 and references the sitemap.
Record server logs by user agent, path, status code, and referrer.
Verify important bots with official IP lists or reverse DNS where the operator provides them.
Compare crawler events before and after the policy change so traffic drops are explainable.
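The logging and comparison steps above can be sketched as a small Python log summary. It assumes access logs in the common "combined" format used by default in many Apache and nginx setups; adjust the pattern if your format differs. The `summarize` helper, the token list, and the sample lines in the usage note are illustrative, not taken from real traffic. The sketch also assumes Google-Extended and Applebot-Extended do not appear as request user agents, since they are robots.txt control tokens rather than crawlers.

```python
import re
from collections import Counter

# Tokens from the matrix that are expected to appear in User-Agent headers.
CRAWLER_TOKENS = (
    "OAI-SearchBot", "GPTBot", "ChatGPT-User", "Googlebot",
    "Applebot", "PerplexityBot", "Perplexity-User", "CCBot",
)

# Matches the request, status, size, referrer, and user-agent fields of a
# combined-format log line.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize(lines):
    """Count requests per (crawler token, HTTP status) pair.

    Matching is on user-agent text only, so these counts include any
    spoofed traffic until the source IPs are verified separately.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        agent = match.group("agent").lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in agent:
                counts[(token, match.group("status"))] += 1
                break
    return counts
```

Running `summarize(open("access.log"))` (the file name is a placeholder) once for the week before a robots.txt change and once for the week after gives two counters that can be compared token by token, which makes a drop in, say, `("GPTBot", "200")` traceable to the dated policy change.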

The current LLMs.txt Kit preview already uses a public proof dashboard that classifies crawler and referral events from the VPS event log.

Benchmark FAQ

Which crawlers should a new public site allow?

For discovery, keep Googlebot, OAI-SearchBot, Applebot, and PerplexityBot open unless your policy says otherwise. Decide separately on GPTBot, Google-Extended, Applebot-Extended, and CCBot.

Does blocking Google-Extended block Google Search?

No. Google documents Google-Extended as a standalone product token and says it does not impact Search inclusion or Search ranking.

Does blocking GPTBot block ChatGPT search?

No. OpenAI documents OAI-SearchBot as the crawler for ChatGPT search features. GPTBot is a separate training-use decision, so you can disallow GPTBot while keeping OAI-SearchBot allowed.

How do I turn this into proof?

Publish dated robots.txt rules, submit your sitemap after final-domain launch, and monitor server logs for crawler user agents plus status codes.
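As one way to start on that proof, the Python sketch below checks a published robots.txt for the two basics: an HTTP 200 response and at least one `Sitemap:` line. The helper names `check_robots_body` and `fetch_and_check` are hypothetical, and the URL in the usage note is a placeholder. Treat it as a quick smoke test rather than a complete audit.

```python
import urllib.error
import urllib.request


def check_robots_body(status: int, body: str) -> list:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    has_sitemap = any(
        line.strip().lower().startswith("sitemap:") for line in body.splitlines()
    )
    if not has_sitemap:
        problems.append("no Sitemap line found")
    return problems


def fetch_and_check(url: str) -> list:
    """Fetch robots.txt (following redirects) and run the checks on the final response."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
            return check_robots_body(response.status, body)
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses raise HTTPError; report the status code.
        return check_robots_body(err.code, "")
    except urllib.error.URLError as err:
        return [f"request failed: {err.reason}"]
```

For example, `fetch_and_check("https://example.com/robots.txt")` returns `[]` when both checks pass. Saving the result alongside the dated changelog entry gives a simple record that the rules were live when the log comparison began.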

Sources checked