Should I Allow AI Bots in robots.txt?


The instinct is understandable. AI companies are scraping the web to train their models, and you didn’t sign up for that. So you add a few lines to your robots.txt, block GPTBot, ClaudeBot, and Google-Extended, and feel like you’ve taken back control.

But here’s the thing you might not have considered: blocking AI crawlers doesn’t just prevent training. It prevents citation. And citation is where the value is.

What AI bots actually do

Not all AI bots serve the same purpose. Understanding the difference is critical before you decide what to block.

Training crawlers collect content to train or fine-tune models. This is the use case most people object to. Your content gets absorbed into a model and used to generate responses without attribution.

Retrieval crawlers fetch your content in real time when a user asks a question. This is how ChatGPT with browsing, Perplexity, and Google AI Overviews work. They visit your page, read it, and cite it in their response with a link back to you.

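Index crawlers build a searchable index of the web that AI systems reference. PerplexityBot is one example. Google-Extended is often grouped with these, but it is not a separate crawler: it is a robots.txt token that controls whether content Google has already crawled may be used for Gemini training and grounding. According to Google’s documentation, blocking it does not affect your inclusion in Google Search, whose AI features are governed by your Googlebot rules instead.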

When you block all AI bots indiscriminately, you’re not just preventing training. You’re also preventing the retrieval and citation that drives traffic back to your site.

The real cost of blocking

Every time someone asks an LLM a question in your space and your content isn’t available for retrieval, a competitor’s content can get cited instead. That’s not a hypothetical. It happens whenever an assistant answers from live sources.

AI-powered search is growing faster than traditional search. Perplexity, ChatGPT with browsing, Google AI Overviews, and Microsoft Copilot are becoming primary research tools for a significant portion of users. If your content is invisible to these systems, you’re opting out of a discovery channel that will only get larger.

The irony is that many site owners block AI bots to “protect” their content while simultaneously spending money on SEO to make that same content more visible. Visibility is visibility. You can’t optimize for discovery and hide from discoverers at the same time.

When blocking makes sense

There are legitimate reasons to block specific AI crawlers:

Proprietary data. If your pages contain data that loses value when extracted (premium research, subscription content, proprietary databases), blocking makes sense. But use authentication and paywalls rather than robots.txt, which is advisory and not enforced.

Bandwidth concerns. Some AI crawlers are aggressive and can spike your server load. Rate limiting through your CDN or server config is more effective than robots.txt for this problem.

Specific bot objections. If you have a policy objection to a specific company’s training practices, you can block their bot while allowing others. This is a targeted approach rather than a blanket ban.
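
One way to sanity-check a targeted block before deploying it is to run the rules through the robots.txt parser in Python’s standard library. The sketch below is only an illustration under stated assumptions: ExampleTrainingBot is a made-up user agent standing in for whichever crawler you object to, and is_allowed is a hypothetical helper name, not part of any tool mentioned in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block one crawler by name and leave everyone else alone.
# "ExampleTrainingBot" is a made-up user agent, not a real crawler.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str = "https://example.com/") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleTrainingBot", "GPTBot", "PerplexityBot"):
        status = "allowed" if is_allowed(ROBOTS_TXT, agent) else "blocked"
        print(agent, status)
```

Real crawlers may match rules slightly differently from Python’s parser, so treat the result as a rough check rather than a guarantee of how any particular bot behaves.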

For most websites, the optimal configuration is: allow everything, block nothing.

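If you want nuance, check each vendor’s crawler documentation before deciding, because user-agent names do not map cleanly onto purposes. At the time of writing, OpenAI documents GPTBot as its training crawler and uses separate agents (OAI-SearchBot, ChatGPT-User) for search and user-initiated fetches, and Anthropic documents a similar split between ClaudeBot and its Claude-SearchBot and Claude-User agents. Block only the agents you can confirm are used exclusively for training, and leave retrieval and index agents allowed. If you can’t tell which is which, blocking by name risks cutting off citation along with training.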

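Crawling is allowed by default, so explicit Allow rules mainly document your intent and override any broader Disallow rule set for all user agents. A robots.txt that explicitly allows the bots named in this article looks like this: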

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Anthropic-AI
Allow: /

You can generate a properly configured robots.txt in seconds with the hey-eye robots.txt generator. It lets you toggle each AI bot individually so you can make deliberate choices about which crawlers to allow.

Check your current configuration

Many site owners don’t know what their robots.txt currently says about AI bots. Content management systems, hosting providers, and security plugins sometimes add AI bot blocks by default without telling you.

Run your site through hey-eye and check the Authority & Trust pillar. It specifically reports which AI bots are allowed and which are blocked. If you see “AI bots blocked” and you didn’t set that intentionally, your CMS or hosting provider might have done it for you.
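
If you would rather check by hand, a short script can do a rough version of the same audit. This is a minimal sketch, assuming Python’s standard-library parser is a close enough approximation of how the bots themselves read the file. The names audit_robots, audit_site, and AI_AGENTS are hypothetical and introduced here for illustration; the agent list is simply the one used in this article.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# User agents named in this article; extend the tuple for others you care about.
AI_AGENTS = ("GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot", "Anthropic-AI")


def audit_robots(robots_txt: str, agents=AI_AGENTS, path: str = "/") -> dict:
    """Return {agent: bool} saying whether each agent may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


def audit_site(base_url: str) -> dict:
    """Fetch <base_url>/robots.txt over the network and audit it.

    Raises an exception if the file is missing or the request is refused.
    """
    with urlopen(base_url.rstrip("/") + "/robots.txt", timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return audit_robots(text)


if __name__ == "__main__":
    # Sample file: one AI bot blocked outright, everyone else kept out of /private/.
    sample = (
        "User-agent: GPTBot\n"
        "Disallow: /\n"
        "\n"
        "User-agent: *\n"
        "Disallow: /private/\n"
    )
    for agent, allowed in audit_robots(sample).items():
        print(f"{agent:16} {'allowed' if allowed else 'BLOCKED'}")
```

To audit a live site, call audit_site with your own base URL. An unexpected BLOCKED entry is a cue to look at what your CMS, host, or security plugin has added.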

The bottom line

Robots.txt is not a content protection mechanism. It’s a crawl directive. Blocking AI bots doesn’t guarantee your content stays out of training data (compliance is voluntary, and determined crawlers ignore robots.txt), but it does stop compliant retrieval crawlers from fetching your pages and citing them in AI-powered search results.

Unless you have a specific, strategic reason to block, the default should be to allow. The visibility you gain from AI citations is worth far more than the theoretical protection of blocking crawlers that may not respect your directives anyway.
