Daily SEO asset 71 / crawler policy

Do LLM crawlers respect robots meta tags, noindex, or X-Robots-Tag?

Published 2026-06-29. Built for CMS builders, SaaS developers, publishers, and webmasters deciding between robots.txt, robots meta tags, and X-Robots-Tag.

Short answer: do not assume that a noindex robots meta tag is a universal LLM crawler opt-out. Use robots.txt for documented crawler tokens and authentication for private content.

Fast answer

If you need to know whether noindex or robots meta tags can opt pages out of LLM crawler use, start with this framing: CMS users often need per-account or per-page controls, but most AI crawler documentation names robots.txt user-agent tokens rather than a universal nolearn or noteach meta directive. The useful deliverable is therefore a decision table covering robots.txt, robots meta tags, X-Robots-Tag, authentication, and crawler-specific AI policy rules.

This page is intentionally conservative. It treats crawler files, URL inspection, feeds, and server logs as discovery and measurement aids, not as guaranteed ranking levers.

When to use this playbook

Use it when CMS builders, SaaS developers, publishers, or webmasters who are deciding between robots.txt, robots meta tags, and X-Robots-Tag need a concrete next step and a page that can be linked from a hub, a community answer, a README, or a launch checklist. The page should help someone make a decision even if they never buy anything or contact the site owner.

The strongest pages in this topic cluster have three traits: they answer one narrow question, they include a copyable artifact, and they link to the relevant tool or proof page so the reader can act immediately.

Recommended workflow

  1. Use robots.txt for crawler access preferences that happen before a page is fetched.
  2. Use robots meta tags or X-Robots-Tag for search indexing and snippet controls after a crawler can fetch the URL.
  3. Do not assume noindex is a training opt-out for every LLM crawler unless that crawler documents support for it.
  4. Use authentication or permissions for private content, because crawler directives are requests rather than access control.
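Step 1 of the workflow can be checked locally before deploying a policy. The sketch below uses Python's standard-library urllib.robotparser to ask whether a given user-agent token may fetch a URL under a robots.txt body. The robots.txt content, the example.com URLs, and the can_fetch helper are illustrative assumptions, and the result only shows how a compliant parser reads the rules, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. The tokens are the ones named in this page;
# the rules themselves are an assumption for illustration.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /admin/
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the standard-library parser allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Google-Extended", "Googlebot"):
        allowed = can_fetch(ROBOTS_TXT, agent, "https://example.com/blog/post")
        print(agent, "allowed" if allowed else "disallowed")
```

A disallow here is still a request to the crawler. It does not replace step 4 for private content.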

Decision table

Control | When it applies | Best use | Limitation
robots.txt | Before a crawler fetches a URL. | Documented crawler user-agent access preferences, such as GPTBot, OAI-SearchBot, Googlebot, or Google-Extended. | It is a crawler directive, not account-level permission or true access control.
robots meta tag | After a crawler can fetch and parse the HTML page. | Search indexing and snippet behavior for crawlers that support the directive. | If robots.txt blocks the URL, the crawler may never see the meta tag.
X-Robots-Tag | After a crawler can fetch the HTTP response. | Indexing controls for non-HTML files or server-level response policies. | Still not a universal AI training opt-out unless a crawler documents support.
noindex | When a supporting search crawler fetches and processes the page or response. | Keeping a page out of search indexing or serving surfaces. | Do not treat it as a universal nolearn, noteach, or model-training opt-out.
Authentication | Before any crawler or user can see private content. | Tenant, account, customer, billing, admin, or private CMS pages. | Requires product or CMS permission design, not just SEO metadata.
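The decision table can also be read as a small lookup. The following is a minimal sketch, assuming a hypothetical PageNeeds record and choose_controls helper that are not part of any CMS or library. It encodes only the table's reasoning: private content goes to authentication first, documented AI crawler preferences go to robots.txt, and search indexing goes to the robots meta tag for HTML or X-Robots-Tag for other responses.

```python
from dataclasses import dataclass


@dataclass
class PageNeeds:
    """Hypothetical description of what the owner wants for one URL."""
    is_private: bool                    # tenant, account, billing, or admin content
    is_html: bool                       # False for PDFs, images, and similar files
    keep_out_of_search: bool            # wants noindex behavior in search
    block_documented_ai_crawlers: bool  # wants to disallow documented AI tokens


def choose_controls(needs: PageNeeds) -> list[str]:
    """Map the decision table to a list of controls (a sketch, not a standard)."""
    if needs.is_private:
        # Crawler directives are requests, so private content needs real access control.
        return ["authentication"]
    controls = []
    if needs.block_documented_ai_crawlers:
        controls.append("robots.txt")
    if needs.keep_out_of_search:
        controls.append("robots meta tag" if needs.is_html else "X-Robots-Tag")
    return controls


if __name__ == "__main__":
    print(choose_controls(PageNeeds(False, False, True, True)))
```

One caveat from the table still applies: if robots.txt also blocks a search crawler from the URL, that crawler may never see the noindex directive.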

Pre-publish checklist

Copyable working note

Use this as a starting point in a ticket, README, client note, or launch log. Edit it to match the real site before publishing.

Question: Do LLM crawlers respect robots meta tags?
Short answer: do not assume universal support.
Use robots.txt for documented AI crawler tokens.
Use noindex/X-Robots-Tag for search indexing controls.
Use authentication for anything private.
Test: fetch robots.txt, inspect headers/meta, then check logs.
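The "inspect headers/meta" part of the test line can be sketched with the standard library alone. The code below collects directives from a robots meta tag and from an X-Robots-Tag header for a response you have already fetched. The page_directives helper and the sample HTML are assumptions for illustration, and the sketch assumes unscoped header values (it does not handle values prefixed with a user-agent name).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the content of <meta name="robots" ...> tags."""

    def __init__(self):
        super().__init__()
        self.directives: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def page_directives(html: str, headers: dict[str, str]) -> set[str]:
    """Return the union of robots meta directives and X-Robots-Tag directives."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # Header names are case-insensitive, so compare in lowercase.
    header_value = next((v for k, v in headers.items() if k.lower() == "x-robots-tag"), "")
    from_header = [d.strip().lower() for d in header_value.split(",") if d.strip()]
    return set(parser.directives) | set(from_header)


if __name__ == "__main__":
    sample_html = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(sorted(page_directives(sample_html, {"X-Robots-Tag": "noarchive"})))
```

Finding noindex this way confirms what the page sends. It does not confirm that any LLM crawler acts on it.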

Community answer draft

This no-link draft is written to answer the technical question first. If you post it in a community, review the current thread, platform rules, and disclosure requirements before adding any owned link.

Short answer: do not treat robots meta tags or noindex as a universal LLM crawler opt-out.

The timing matters:
1. robots.txt is checked before fetching a URL, so it is the normal place to express crawler access preferences for crawlers that document and honor user-agent rules.
2. A robots meta tag or X-Robots-Tag header is only visible after the crawler fetches the page or response.
3. noindex is primarily an indexing or serving directive for search engines that support it, not a universal AI training or model-use opt-out.
4. If robots.txt blocks a URL, the crawler may never fetch the page and may never see a page-level meta tag.
5. Private account, customer, or tenant content should use authentication and permissions, because crawler directives are not access control.

For a CMS, I would separate the layers: site owner sets broad robots.txt policy, account or tenant owner controls whether a page is public, and public pages can still use noindex or X-Robots-Tag for search indexing behavior.
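The layering in that last paragraph can be sketched as a response decision inside the CMS. This is a minimal illustration under stated assumptions: CmsPage, respond, and the viewer_is_authorized flag are hypothetical names, and a real CMS would plug its own permission checks in here. The point is the ordering: access control is decided first, and the indexing header is only added to responses that are actually served.

```python
from dataclasses import dataclass


@dataclass
class CmsPage:
    """Hypothetical per-page settings controlled by the account or tenant owner."""
    is_public: bool
    noindex: bool


def respond(page: CmsPage, viewer_is_authorized: bool) -> tuple[int, dict[str, str]]:
    """Return (status code, response headers) for a page request."""
    # Layer 1: access control. Crawlers and users alike are refused private pages.
    if not page.is_public and not viewer_is_authorized:
        return 403, {}
    headers = {"Content-Type": "text/html; charset=utf-8"}
    # Layer 2: search indexing preference for pages that are actually served.
    if page.noindex:
        headers["X-Robots-Tag"] = "noindex"
    return 200, headers


if __name__ == "__main__":
    print(respond(CmsPage(is_public=True, noindex=True), viewer_is_authorized=False))
    print(respond(CmsPage(is_public=False, noindex=False), viewer_is_authorized=False))
```

The site-wide robots.txt policy sits outside this function, since it is read before any page request is made.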

Proof and measurement plan

What not to count as proof

Do not count this setup as traffic by itself. A submitted sitemap, an IndexNow receipt, a crawler log hit, or an indexing request can show discovery work, but none of them proves rankings, impressions, clicks, conversions, or AI citations. Organic proof should come from Search Console, analytics, qualified referral evidence, or server logs interpreted for the right purpose.
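For the server-log part of that evidence, a rough count of requests whose user-agent string contains a documented crawler token can be sketched as follows. The token list, the count_crawler_hits helper, and the sample log lines are assumptions for illustration. Google-Extended is left out because, as far as Google's documentation describes it, it is a robots.txt control token and not a separate HTTP user-agent string. A substring match also cannot rule out spoofed user agents, so treat the counts as discovery signals only.

```python
from collections import Counter

# Tokens taken from this page; adjust to the crawlers you actually track.
TOKENS = ("GPTBot", "OAI-SearchBot", "Googlebot")


def count_crawler_hits(log_lines, tokens=TOKENS) -> Counter:
    """Count log lines that mention each token (case-insensitive substring match)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                hits[token] += 1
    return hits


if __name__ == "__main__":
    # Hypothetical access-log lines, not real traffic.
    sample = [
        '203.0.113.5 - - "GET /blog/post HTTP/1.1" 200 "Mozilla/5.0 (compatible; GPTBot/1.0)"',
        '203.0.113.9 - - "GET /robots.txt HTTP/1.1" 200 "Mozilla/5.0 (compatible; Googlebot/2.1)"',
        '198.51.100.7 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (X11; Linux x86_64)"',
    ]
    print(count_crawler_hits(sample))
```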

The main pitfall for this topic is relying on a made-up nolearn or noteach meta tag and assuming every LLM crawler will treat it as an opt-out.

Related resources

All free tools

Continue the workflow with this related LLMs.txt Kit resource.

/tools/

Proof dashboard

/proof.html

Sources and guardrails