This benchmark compares official crawler docs for the user agents most small teams ask about first: OpenAI, Google, Apple, Perplexity, and Common Crawl. Use it to decide which crawlers to allow for discovery, which training-use agents to block, and which user-triggered fetchers need log monitoring instead of simple robots.txt rules.
The crawler policy matrix is available as machine-readable JSON and CSV so other guides, audits, and support replies can cite the same source-backed rows instead of copying the table by hand.

- JSON dataset: Includes metadata, source URLs, crawler categories, robots.txt strategy notes, and verification guidance.
- CSV: Flat table for spreadsheets, audits, and lightweight comparison workflows.
- User-agent list: Human-readable token list for robots.txt, logs, and quick crawler policy reviews.

Each row covers one crawler or fetcher token, the operator, documented purpose, practical robots.txt strategy, whether robots.txt applies, verification method, and official source URL.
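As a rough illustration of how the flat table could be consumed, the sketch below groups tokens by operator. The column names (`operator`, `token`, `robots_txt_strategy`) are assumptions made for this example, and the sample rows are copied from the table on this page; the published CSV may use different headers, so check the real file before relying on them.

```python
import csv
import io
from collections import defaultdict

# Illustrative sample only: the column names are assumed, and the rows are
# copied from the table on this page rather than from the real CSV file.
SAMPLE_CSV = """operator,token,robots_txt_strategy
OpenAI,OAI-SearchBot,Allow if ChatGPT search eligibility matters.
OpenAI,GPTBot,Allow only if training-use policy permits it.
Apple,Applebot-Extended,Disallow to opt out of Apple foundation-model training while keeping Applebot discovery.
"""


def tokens_by_operator(csv_text: str) -> dict[str, list[str]]:
    """Group crawler tokens under their operator, preserving row order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["operator"]].append(row["token"])
    return dict(grouped)


if __name__ == "__main__":
    for operator, tokens in tokens_by_operator(SAMPLE_CSV).items():
        print(operator, tokens)
```

Reading the rows through `csv.DictReader` keeps the lookup keyed by header name, so a changed column order in the real file would not break the grouping.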
This benchmark does not claim ranking factors, legal rights, or guaranteed AI citations. It maps public crawler controls so teams can make cleaner policy decisions.
Rows prefer official documentation from the operator. Where operators separate search crawlers from training-use controls, the benchmark treats those as separate policy choices.
Review this benchmark before launches, during monthly crawler-policy updates, and whenever OpenAI, Google, Apple, Perplexity, or Common Crawl changes its crawler documentation.
Use this citation block in client notes, GitHub issues, or crawler policy pull requests.
Source: LLMs.txt Kit, AI crawler policy benchmark, version 2026-06-24.
URL: https://llmstxtkit.com/data/ai-crawler-policy-benchmark.html
Machine-readable data:
- https://llmstxtkit.com/data/ai-crawler-policy-benchmark.json
- https://llmstxtkit.com/data/ai-crawler-policy-benchmark.csv
Use: crawler policy review only; verify official source links before legal or licensing decisions.
Blocking Googlebot, Applebot, OAI-SearchBot, or PerplexityBot can reduce eligibility for their search or answer surfaces.
Use specific tokens such as GPTBot, Google-Extended, and Applebot-Extended when your policy allows search but not model training use.
Check user-agent strings against published IP lists or reverse DNS where available, because user-agent text alone can be spoofed.
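The IP-list half of that check can be sketched as follows. The CIDR ranges here are placeholders from the reserved documentation address space, not real crawler ranges, and `PUBLISHED_RANGES`, `claimed_token`, and `is_verified` are hypothetical names for this example. In practice the ranges would come from each operator's official list, and reverse DNS would be a separate check.

```python
import ipaddress
from typing import Optional

# Placeholder ranges from reserved documentation address space (assumption).
# Replace them with the ranges each operator actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24", "2001:db8::/32"],
}


def claimed_token(user_agent: str) -> Optional[str]:
    """Return the crawler token a user-agent string claims to be, if any."""
    lowered = user_agent.lower()
    # Check longer tokens first so a longer name is not shadowed by a prefix.
    for token in sorted(PUBLISHED_RANGES, key=len, reverse=True):
        if token.lower() in lowered:
            return token
    return None


def is_verified(user_agent: str, remote_ip: str) -> bool:
    """True only when the claimed token matches one of its listed ranges."""
    token = claimed_token(user_agent)
    if token is None:
        return False
    address = ipaddress.ip_address(remote_ip)
    return any(
        address in ipaddress.ip_network(cidr)
        for cidr in PUBLISHED_RANGES[token]
    )


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; GPTBot/1.1)"
    print(is_verified(ua, "192.0.2.10"))   # inside the placeholder range
    print(is_verified(ua, "203.0.113.5"))  # outside it, so likely spoofed
```

A request that claims a crawler token but arrives from outside the listed ranges is a candidate for the "spoofed" bucket rather than an automatic block; how to treat it is a policy decision.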
| Operator | Token | Documented purpose | Robots.txt strategy | Source |
|---|---|---|---|---|
| OpenAI | OAI-SearchBot | Automatic search crawl for ChatGPT search visibility and Search opt-outs. | Allow if ChatGPT search eligibility matters. | OpenAI crawler docs |
| OpenAI | GPTBot | Foundation-model improvement crawl, separate from the Search crawler. | Allow only if training-use policy permits it. | OpenAI crawler docs |
| OpenAI | ChatGPT-User | User-initiated fetches from ChatGPT or Custom GPT actions, not automatic web crawling. | Monitor in logs; robots.txt may not apply to user-initiated requests. | OpenAI crawler docs |
| Google | Googlebot | Google Search crawling across Search features and related surfaces. | Allow for Google Search visibility unless a page should not be indexed. | Google common crawlers |
| Google | Google-Extended | Publisher control for some Gemini and Vertex AI grounding/training uses; it is not Google Search crawling, Search inclusion control, or a Search ranking signal. | Disallow when policy allows Search but opts out of these AI uses. | Google common crawlers |
| Apple | Applebot | Search technology for Apple experiences including Spotlight, Siri, and Safari. | Allow for Apple ecosystem discovery; avoid blocking render-critical assets. | Applebot documentation |
| Apple | Applebot-Extended | Usage control for Apple's generative foundation model training; it does not crawl webpages by itself. | Disallow to opt out of Apple foundation-model training while keeping Applebot discovery. | Applebot documentation |
| Perplexity | PerplexityBot | Perplexity's documented crawler for indexing and answer retrieval surfaces. | Allow if Perplexity answer visibility matters; verify with official IP lists for WAF rules. | Perplexity crawler docs |
| Perplexity | Perplexity-User | User-requested fetcher; Perplexity says it is not used for web crawling or training collection. | Monitor separately from crawler rules because user-requested fetches may ignore robots.txt. | Perplexity crawler docs |
| Common Crawl | CCBot | Common Crawl's open web crawl used for research and datasets. | Allow for open data participation; disallow if your policy forbids broad dataset reuse. | Common Crawl CCBot docs |
Choose one starting point, then adjust for your legal, licensing, and growth goals.
Open to all crawlers, with the search and answer crawlers named explicitly:

    User-agent: *
    Allow: /

    User-agent: OAI-SearchBot
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Applebot
    Allow: /

    Sitemap: https://example.com/sitemap.xml
Search crawlers allowed, training-use and dataset tokens blocked:

    User-agent: *
    Allow: /

    User-agent: Googlebot
    Allow: /

    User-agent: OAI-SearchBot
    Allow: /

    User-agent: Applebot
    Allow: /

    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    Sitemap: https://example.com/sitemap.xml
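Before publishing a template, it can help to confirm that each token resolves to the intended rule. The sketch below is a deliberately simplified evaluator written for this page, not a full robots.txt implementation. It assumes exact, case-insensitive token matching with a fallback to `*`, treats the longest matching path prefix as the winner, and ignores wildcards. The function names are hypothetical.

```python
# The second template above, reproduced as a string for the check.
TEMPLATE = """\
User-agent: *
Allow: /

User-agent: Googlebot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Applebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def parse_groups(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased user-agent token to its (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    current_agents: list[str] = []
    collecting_agents = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not collecting_agents:  # a new group starts here
                current_agents = []
                collecting_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            collecting_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
        # Other fields, such as Sitemap, are ignored in this sketch.
    return groups


def is_allowed(groups, token: str, path: str = "/") -> bool:
    """Longest matching prefix wins; ties and no match default to allow."""
    rules = groups.get(token.lower(), groups.get("*", []))
    best_directive, best_prefix = "allow", ""
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        longer = len(prefix) > len(best_prefix)
        tie_for_allow = len(prefix) == len(best_prefix) and directive == "allow"
        if longer or tie_for_allow:
            best_directive, best_prefix = directive, prefix
    return best_directive == "allow"


if __name__ == "__main__":
    groups = parse_groups(TEMPLATE)
    for token in ("Googlebot", "Applebot", "Applebot-Extended", "GPTBot", "CCBot"):
        print(token, "allowed" if is_allowed(groups, token) else "blocked")
```

Exact token matching matters here because `Applebot` and `Applebot-Extended` share a prefix; a matcher that compares by substring could apply the `Applebot` rule to `Applebot-Extended` and report the wrong policy.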
The current LLMs.txt Kit preview already uses a public proof dashboard that classifies crawler and referral events from the VPS event log.
For discovery, keep Googlebot, OAI-SearchBot, Applebot, and PerplexityBot open unless your policy says otherwise. Decide separately on GPTBot, Google-Extended, Applebot-Extended, and CCBot.
Blocking Google-Extended does not remove a site from Google Search. Google documents Google-Extended as a standalone product token and says it does not impact Search inclusion or Search ranking.
OpenAI documents OAI-SearchBot as the crawler for ChatGPT search features. GPTBot is a separate training-use decision.
Publish dated robots.txt rules, submit your sitemap after final-domain launch, and monitor server logs for crawler user agents plus status codes.
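The log-monitoring step can be sketched as a count of status codes per crawler token. This example assumes the common "combined" access-log layout (request, status, size, referrer, user agent); the sample lines are invented for illustration, and `KNOWN_TOKENS`, `match_token`, and `count_crawler_hits` are hypothetical names. Only tokens that appear as request user agents are listed; Google-Extended and Applebot-Extended are robots.txt controls, so they are left out.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Tokens from the benchmark table that identify themselves in requests.
KNOWN_TOKENS = (
    "OAI-SearchBot", "GPTBot", "ChatGPT-User", "Googlebot",
    "Applebot", "PerplexityBot", "Perplexity-User", "CCBot",
)

# Assumes combined log format: "request" status bytes "referrer" "user-agent".
LOG_PATTERN = re.compile(r'"[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"')

# Invented sample lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.7 - - [24/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '203.0.113.8 - - [24/Jun/2026:10:00:05 +0000] "GET /docs HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.8 - - [24/Jun/2026:10:00:09 +0000] "GET /faq HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '198.51.100.4 - - [24/Jun/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]


def match_token(user_agent: str) -> Optional[str]:
    """Return the first known crawler token found in the user-agent string."""
    lowered = user_agent.lower()
    for token in KNOWN_TOKENS:
        if token.lower() in lowered:
            return token
    return None


def count_crawler_hits(lines: Iterable[str]) -> Counter:
    """Count (token, status code) pairs, skipping non-crawler traffic."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # line does not follow the assumed layout
        status, user_agent = int(match.group(1)), match.group(2)
        token = match_token(user_agent)
        if token is not None:
            counts[(token, status)] += 1
    return counts


if __name__ == "__main__":
    for (token, status), hits in sorted(count_crawler_hits(SAMPLE_LOG).items()):
        print(f"{token} {status} {hits}")
```

A crawler you intend to allow that shows mostly 403 or 5xx responses is a sign that a WAF rule or server error is overriding the robots.txt policy.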