
robots.txt for AI Bots: Should You Block or Allow GPTBot, ClaudeBot, and PerplexityBot?

AEOprobe Team · March 15, 2026 · 9 min read

Should you block AI bots in your robots.txt? The short answer: no, not all of them. Blocking every AI crawler means your content disappears from ChatGPT answers, Perplexity citations, and Google AI Overviews. But allowing every bot without distinction means you may be giving away content for model training with no return. The right strategy in 2026 is selective: allow the bots that drive referral traffic and citations, and consider blocking those that only scrape for training data.

This guide covers every major AI crawler you need to know, how robots.txt controls them, and the exact configuration we recommend for most websites.

The 14 AI Crawlers You Need to Know

AI search is not one monolithic bot. There are at least 14 distinct crawlers operated by different companies, each with different purposes. Some fetch content to power real-time search answers. Others collect training data for large language models. Understanding the difference is critical for making informed robots.txt decisions.

| Bot Name | Operator | Purpose | Default Behavior |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Fetches pages for ChatGPT search answers and model training | Crawls unless blocked |
| ChatGPT-User | OpenAI | Real-time browsing when a ChatGPT user clicks "Browse" | Crawls unless blocked |
| ClaudeBot | Anthropic | Fetches content for Claude's web search and training | Crawls unless blocked |
| Claude-Web | Anthropic | Legacy crawler identifier (predecessor to ClaudeBot) | Crawls unless blocked |
| anthropic-ai | Anthropic | General-purpose crawler for Anthropic products | Crawls unless blocked |
| PerplexityBot | Perplexity AI | Powers Perplexity search results with real-time citations | Crawls unless blocked |
| Google-Extended | Google | Controls whether your content is used for Gemini model training | Applies unless blocked (a control token, not a separate crawler) |
| Googlebot | Google | Standard search indexing; also powers AI Overviews | Crawls unless blocked |
| Bytespider | ByteDance | Training data collection for ByteDance AI models | Crawls aggressively unless blocked |
| Amazonbot | Amazon | Indexes content for Alexa answers and Amazon search | Crawls unless blocked |
| cohere-ai | Cohere | Training data collection for Cohere language models | Crawls unless blocked |
| Meta-ExternalAgent | Meta | Fetches content for Meta AI products and model training | Crawls unless blocked |
| CCBot | Common Crawl | Open web archiving; widely used as LLM training source | Crawls unless blocked |
| Applebot-Extended | Apple | Powers Apple Intelligence features and Siri answers | Crawls unless blocked |

The critical distinction is between search-facing bots (GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, Googlebot) that can cite your content and send you traffic, and training-only bots (Bytespider, CCBot, cohere-ai) that collect data to build models without giving you visibility in return.

How robots.txt Controls AI Bot Access

The robots.txt file sits at the root of your domain (e.g., https://example.com/robots.txt) and tells web crawlers which parts of your site they may or may not access. It uses a simple protocol: you specify a User-agent and then list Allow or Disallow rules beneath it.

Here is a basic example that blocks GPTBot entirely:

User-agent: GPTBot
Disallow: /

And here is one that allows GPTBot to crawl everything except your admin area:

User-agent: GPTBot
Allow: /
Disallow: /admin/

Rules are evaluated per user-agent. If a bot does not find its own user-agent string in your robots.txt, it falls back to the wildcard rule (User-agent: *). If there is no wildcard rule, the bot assumes everything is allowed.

A few important technical details:

  • robots.txt is advisory, not enforcement. Well-behaved bots from major companies honor it. Malicious scrapers ignore it entirely. It is not a security mechanism.
  • Rules are path-prefix based. You can allow or block URL paths (and, in crawlers that support the * and $ wildcard extensions, path patterns), but not content types or HTTP headers.
  • More specific rules win. Under Google's robots.txt specification, the longest matching rule takes precedence, so Allow: /blog/ overrides Disallow: / for URLs under /blog/. Note that some older parsers instead apply rules in file order, so ordering rules from specific to general is the safest habit.
  • Changes take effect on the next crawl. After updating your robots.txt, bots will pick up the new rules the next time they visit, which could be minutes or days depending on the crawler.
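The per-agent evaluation and wildcard fallback described above can be sanity-checked locally with Python's standard-library robots.txt parser. This is a sketch with an illustrative ruleset, not a recommendation; note that `urllib.robotparser` applies rules in file order (first match wins), so the `Disallow` line is placed before the broad `Allow` to keep first-match and longest-match parsers in agreement:

```python
# Sketch: check how a robots.txt treats different user-agents,
# using Python's stdlib urllib.robotparser.
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: GPTBot
Disallow: /admin/
Allow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

# GPTBot matches its own group: /admin/ is blocked, everything else allowed
print(rp.can_fetch("GPTBot", "https://example.com/admin/settings"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))       # True

# ClaudeBot has no group of its own, so it falls back to User-agent: *
print(rp.can_fetch("ClaudeBot", "https://example.com/admin/settings"))  # True
```

Pointing the parser at a live site with `set_url()` and `read()` instead of `parse()` works the same way.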

The Case for Allowing AI Bots

The strongest argument for allowing AI crawlers is simple: AI-cited traffic converts well. When ChatGPT, Perplexity, or Google AI Overviews reference your content with a citation link, the users who click through tend to be high-intent. They have already read a summary of what you offer and are clicking to go deeper. Early data from sites tracking AI referral traffic suggests these visitors have measurably higher engagement rates compared to traditional organic search clicks.

Beyond direct traffic, allowing AI bots gives you brand visibility in AI-generated answers. As more users shift from typing queries into Google to asking questions in ChatGPT or Perplexity, your brand either appears in those answers or it does not. Blocking the bots guarantees the latter.

There is also a future-proofing argument. AI search is growing rapidly. Perplexity reported over 100 million monthly queries by late 2025. ChatGPT search is integrated into the default experience for hundreds of millions of users. The websites that show up in these answers today are building brand equity that compounds over time.

Finally, many of these bots also feed into features you already depend on. Googlebot powers AI Overviews, which now appear in a significant share of Google search results. Blocking Googlebot is not a realistic option for most sites, and since AI Overviews use the same crawler, you are already participating in AI search whether you realize it or not.

The Case for Blocking AI Bots

The main concern with allowing AI bots is content being used to train models without compensation or attribution. When GPTBot or CCBot crawls your site, the content may be incorporated into training datasets. The model then generates answers that paraphrase or synthesize your content without linking back to you. For publishers and content creators, this feels like having your work taken without permission.

There are also competitive concerns. If you run a research firm, consulting practice, or premium content business, your intellectual property is your product. Allowing AI bots to ingest your proprietary analysis means that analysis may appear in AI-generated answers available to everyone, including your competitors' customers.

Bandwidth and server load are practical concerns for smaller sites. Some AI crawlers, particularly Bytespider, have been documented crawling aggressively with high request volumes. If your hosting infrastructure is limited, aggressive crawling can impact site performance for real users.

Finally, there is the legal and ethical dimension. The question of whether AI companies have the right to use publicly accessible web content for model training is still being litigated in courts worldwide. Some site owners prefer to block training-only bots as a precautionary measure while these legal questions are resolved.

The Best robots.txt Strategy for 2026

For most websites, the recommended approach is a selective allow policy: permit bots that power search-like experiences with citations, and consider blocking bots that primarily collect training data.

Here is a robots.txt configuration that implements this strategy:

# Allow search-facing AI bots (these cite your content)
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot-Extended
Allow: /

# Allow standard search (also powers AI Overviews)
User-agent: Googlebot
Allow: /

User-agent: Amazonbot
Allow: /

# Block training-only crawlers
User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Optional: block Google from using your content for Gemini training
# while still allowing standard search indexing
User-agent: Google-Extended
Disallow: /

# Default: allow everything else
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

This configuration gives you visibility in ChatGPT, Claude, Perplexity, and Google AI Overviews while preventing your content from being scraped purely for model training by Bytespider, CCBot, and others.

Adjust based on your business model. If you run a media site that monetizes through traffic, allowing search-facing bots is almost certainly the right call. If you sell proprietary research, you may want to be more restrictive. The key is making a deliberate choice rather than leaving your robots.txt on its default settings and hoping for the best.
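Before deploying a selective policy like the one above, it is worth verifying programmatically that each bot gets the access you intend. A minimal sketch using Python's standard-library parser (the policy excerpt and bot list here are an illustrative subset, not the full configuration):

```python
from urllib.robotparser import RobotFileParser

# Illustrative excerpt of the selective policy shown above
POLICY = """\
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

def audit(robots_txt: str, bots: list[str], url: str) -> dict[str, bool]:
    """Return {bot: can_fetch} for each user-agent against one URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {bot: rp.can_fetch(bot, url) for bot in bots}

results = audit(
    POLICY,
    ["GPTBot", "PerplexityBot", "Bytespider", "CCBot", "ClaudeBot"],
    "https://example.com/blog/",
)
for bot, allowed in results.items():
    # ClaudeBot is not listed in this excerpt, so it inherits the wildcard
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```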

How to Check Your robots.txt AI Bot Rules

Most website owners have no idea what their robots.txt says about AI bots. Many sites still use a robots.txt that was written years before AI crawlers existed, which means every AI bot has full access by default.

AEOprobe's free audit checks your robots.txt against all 14 major AI crawlers and tells you exactly which bots are allowed, blocked, partially blocked, or not mentioned. You get a clear breakdown in the AI Bot Access category of your audit report.

The audit takes less than 60 seconds. Enter your URL, and you will see:

  • Which of the 14 AI bots can access your site
  • Whether you have a wildcard rule that inadvertently allows or blocks everything
  • Whether your sitemap is declared in robots.txt (critical for AI crawlers to discover your full content)
  • An overall AI Bot Access grade from A+ to F

Beyond robots.txt, the audit also evaluates your structured data, meta tags, content quality, and 5 other categories that affect how AI search engines understand and cite your content.

Run your free AEO audit now and see exactly how AI bots interact with your site.

Common Questions

Does blocking GPTBot prevent my site from appearing in ChatGPT answers?

Yes. If you block GPTBot in robots.txt, OpenAI will not crawl your pages, and your content will not appear as a cited source in ChatGPT search results. However, ChatGPT may still reference your brand or general information about your site from its training data if it was collected before you added the block.

Can I block AI training but still appear in AI search results?

Partially. You can block training-focused crawlers like CCBot and Bytespider, and disallow Google-Extended (a control token honored by Googlebot rather than a separate crawler, so standard search indexing is unaffected), while keeping search-facing bots like GPTBot and PerplexityBot allowed. However, GPTBot is used by OpenAI for both search and training. There is currently no way to tell OpenAI "use my content for search but not training" through robots.txt alone. This is a limitation of the current protocol.

How often should I review my robots.txt AI bot rules?

At least quarterly. New AI crawlers appear regularly, and existing ones change their user-agent strings or behavior. AEOprobe tracks all major crawlers so you can audit your configuration whenever a new bot enters the landscape. We recommend running an audit after any robots.txt change to verify the rules are working as intended.

What happens if I do not mention an AI bot in my robots.txt at all?

If a specific bot's user-agent is not listed in your robots.txt, it falls back to the wildcard (User-agent: *) rule. If your wildcard rule says Allow: / or has no Disallow directives, the bot can crawl your entire site. Most websites have a permissive wildcard rule, which means unlisted AI bots get full access by default. This is why it is important to explicitly address the bots you want to block.
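That fallback is easy to demonstrate with Python's standard-library parser, assuming the kind of permissive wildcard-only robots.txt common on sites whose file predates AI crawlers:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# A robots.txt with no bot-specific groups at all
rp.parse(["User-agent: *", "Allow: /"])

# GPTBot is never mentioned, so it inherits the wildcard rule
print(rp.can_fetch("GPTBot", "https://example.com/any/page"))  # True
```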

Check Your Site's AEO Score

Run a free audit across all 8 categories. See how AI search engines view your content — results in 60 seconds.

Run Free Audit