
robots.txt for AI Bots

Control which AI engines can access your content

What is robots.txt?

The robots.txt file is a simple text file that tells bots and crawlers which parts of your website they can visit. Think of it like a sign at the entrance of your website that says "visitors welcome" or "private area." Every bot that follows the rules (called the Robots Exclusion Protocol) checks this file first before crawling your site.

For AI search engines, robots.txt is especially important. It controls whether AI bots like GPTBot (ChatGPT), ClaudeBot (Claude), and PerplexityBot can access your content for training and search results. Setting it up correctly lets you decide which AI systems may use your content, and for what purpose.

Your robots.txt file must be located at yoursite.com/robots.txt. Bots won't look for it anywhere else. If you don't have this file, bots assume they can crawl everything.
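
If you want to see what a rule-following bot does with this file, Python's standard library includes a parser for the Robots Exclusion Protocol. The sketch below is illustrative only; the domain, bot name, and URL are placeholders:

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (placeholder URL)
parser = RobotFileParser()
parser.set_url("https://yoursite.com/robots.txt")
parser.read()  # a missing file (404) is treated as "allow everything"

# A well-behaved crawler asks before fetching each page
print(parser.can_fetch("GPTBot", "https://yoursite.com/blog/some-post"))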

Why robots.txt Matters for AI

AI bots are different from traditional search engine crawlers. They visit your site for two main reasons:

Training Data Collection

Some AI companies use web content to train their language models. They crawl millions of pages to build knowledge bases.

You can control whether your content is used for training by blocking specific bots in robots.txt.

Search Result Generation

AI search engines crawl your content to include it in their search results and answer generation.

Allowing these bots helps your content appear in AI-generated answers, improving your GEO-Score.

The key is finding the right balance. You want AI search engines to access your content for visibility, but you might want to block certain areas or specific training bots. Your robots.txt file gives you this control.

Major AI Bot User-Agents

Each AI bot identifies itself with a unique user-agent string. Here are the most important ones:

GPTBot

OpenAI

User-agent: GPTBot

Used by: OpenAI model training (the models behind ChatGPT)

GPTBot collects content primarily for training OpenAI's models. OpenAI runs a separate crawler, OAI-SearchBot, for ChatGPT's web search results, so blocking GPTBot mainly opts your content out of training rather than out of ChatGPT search.

ClaudeBot

Anthropic

User-agent: ClaudeBot

Used by: Claude AI, Anthropic's AI assistant

ClaudeBot accesses web content to provide current information in Claude's responses. It respects robots.txt rules strictly.

PerplexityBot

Perplexity

User-agent: PerplexityBot

Used by: Perplexity AI search engine

PerplexityBot powers one of the most popular AI search engines. Allowing it improves visibility in Perplexity search results.

Google-Extended

Google

User-agent: Google-Extended

Used by: Google Gemini AI training

This is separate from Googlebot, but it is not a crawler of its own: Google-Extended is a robots.txt token that controls whether content Googlebot has already crawled may be used for Gemini training and grounding. Blocking it doesn't affect normal Google Search indexing.

FacebookBot

Meta

User-agent: FacebookBot

Used by: Meta AI, Facebook link previews

FacebookBot crawls public pages for Meta's AI and language-model features. Link previews are handled by a separate crawler (facebookexternalhit), so blocking FacebookBot should not affect how your links appear when shared.

For a complete list of AI bot user-agents with technical details, see our AI Bot User-Agents Reference.
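
If you want to recognize these crawlers in your own traffic, a simple substring check against the tokens above is usually enough. A minimal sketch, assuming you have the raw User-Agent header available; the function name is ours, not part of any library, and note that Google-Extended never appears in a User-Agent header (it exists only as a robots.txt token):

# Tokens from the list above, plus CCBot (Common Crawl), which many AI systems rely on
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "FacebookBot", "CCBot")

def is_ai_bot(user_agent: str) -> bool:
    """Return True if the User-Agent header mentions a known AI crawler."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_BOT_TOKENS)

print(is_ai_bot("Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"))  # True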

Basic robots.txt Syntax

The robots.txt file uses a simple syntax with just a few commands:

User-agent

Specifies which bot the following rules apply to. Use * for all bots.

User-agent: GPTBot
User-agent: *

Disallow

Tells bots NOT to access specific paths. Use / to block everything.

Disallow: /admin/
Disallow: /private/
Disallow: /

Allow

Tells bots they CAN access specific paths. Use this to override a broader Disallow rule.

Disallow: /admin/
Allow: /admin/public/

Crawl-delay

Sets a delay in seconds between bot requests. Not supported by all bots.

Crawl-delay: 10

Sitemap

Points bots to your XML sitemap for better crawling efficiency.

Sitemap: https://yoursite.com/sitemap.xml
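
A quick way to check how these directives combine is Python's built-in urllib.robotparser. One caveat: this parser applies rules in file order, while Google-style crawlers prefer the most specific (longest) matching path, so list the Allow line before the broader Disallow when testing an override here. A minimal sketch:

from urllib.robotparser import RobotFileParser

rules = """
User-agent: GPTBot
Allow: /admin/public/
Disallow: /admin/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("GPTBot", "/admin/settings"))      # False: blocked by Disallow
print(parser.can_fetch("GPTBot", "/admin/public/page"))   # True: the Allow rule wins
print(parser.can_fetch("GPTBot", "/blog/post"))           # True: no rule matches, so allowed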

Common robots.txt Configurations

Here are ready-to-use configurations for common scenarios:

Allow All AI Bots (Recommended for Most Sites)

This configuration welcomes all AI search engines while protecting admin areas:

# Allow all AI bots to crawl
User-agent: *
Allow: /

# Block private areas for all bots
Disallow: /admin/
Disallow: /api/
Disallow: /login/
Disallow: /dashboard/

# Sitemap location
Sitemap: https://yoursite.com/sitemap.xml

Block AI Training, Allow AI Search

Block bots used for training AI models while allowing search bots:

# Block training bots
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search bots
User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

# Default rules for other bots
User-agent: *
Allow: /
Disallow: /admin/

Sitemap: https://yoursite.com/sitemap.xml

Selective Content Access

Allow AI bots to access blog content but not product pages:

# AI bots can access blog
User-agent: GPTBot
Allow: /blog/
Disallow: /

User-agent: ClaudeBot
Allow: /blog/
Disallow: /

# Default rules
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
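
Before deploying a configuration like this, it can be worth confirming it does what you intend. A minimal sketch using Python's urllib.robotparser (the paths are placeholders; see the caveat about rule ordering in the syntax section above):

from urllib.robotparser import RobotFileParser

config = """
User-agent: GPTBot
Allow: /blog/
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(config)

print(parser.can_fetch("GPTBot", "/blog/geo-guide"))    # True: blog content stays open
print(parser.can_fetch("GPTBot", "/products/widget"))   # False: everything else is blocked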

Block All AI Bots

If you want to opt out of AI search entirely (not recommended for visibility):

# Block all known AI bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow traditional search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

Best Practices

Do These

Place robots.txt in your root directory

Use one rule per line

Include your sitemap location

Test your robots.txt after changes

Allow AI bots for better GEO visibility

Keep the file under 500KB

Avoid These

Using robots.txt for security

Blocking all bots without reason

Using regular expressions (not supported)

Forgetting to update after site changes

Blocking CSS/JS needed for page rendering

Creating multiple robots.txt files

Testing Your robots.txt

Always test your robots.txt file before deploying it. Use these methods:

Manual Testing

Visit yoursite.com/robots.txt in your browser to verify (or script the same check, as sketched after this list):

  • The file is accessible and loads correctly
  • There are no syntax errors or typos
  • All user-agent names are spelled correctly
  • Paths match your actual site structure
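
A minimal scripted version of that check, using only Python's standard library; the URL is a placeholder, and a missing file will raise urllib.error.HTTPError, which is itself a useful signal:

import urllib.request

# Placeholder URL: use your own domain
with urllib.request.urlopen("https://yoursite.com/robots.txt") as response:
    print(response.status)  # expect 200
    body = response.read().decode("utf-8", errors="replace")

print(body[:500])                    # skim the first rules for typos
print(len(body.encode()), "bytes")   # keep the file well under the ~500 KB limit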

Google Search Console

Use Google Search Console's robots.txt report (the older robots.txt Tester tool has been retired):

  • Go to Google Search Console
  • Open Settings → robots.txt report
  • Check which version of the file Google last fetched and whether it parsed with errors or warnings
  • Use the URL Inspection tool to see whether a specific URL is blocked by robots.txt

Online Validators

Use third-party robots.txt validators:

  • Robots.txt Checker: Check syntax and coverage
  • Bloffee GEO Analyzer: Validates robots.txt as part of full site analysis
  • SEO Tools: Many SEO platforms include robots.txt testing

Server Log Monitoring

Check your server logs to verify bot behavior (a scripted sketch follows this list):

  • Look for AI bot user-agent strings in access logs
  • Verify bots are respecting your rules
  • Identify any unauthorized crawling
  • Monitor crawl frequency and patterns
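
A minimal sketch of that kind of log check, assuming a text access log at a placeholder path (adjust the filename and tokens for your setup):

from collections import Counter

AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "FacebookBot", "CCBot")
hits = Counter()

# "access.log" is a placeholder; point this at your web server's access log
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        for token in AI_BOT_TOKENS:
            if token.lower() in line.lower():
                hits[token] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")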

Advanced Configurations

Rate Limiting with Crawl-delay

Control how fast bots crawl your site to reduce server load:

User-agent: GPTBot
Crawl-delay: 10
Allow: /

User-agent: ClaudeBot
Crawl-delay: 5
Allow: /

Note: Not all bots support crawl-delay. It's more reliable to use server-side rate limiting.

Wildcard Patterns

Use wildcards to match multiple paths (supported by most modern bots):

User-agent: *
# Block all PDF files
Disallow: /*.pdf$

# Block all URLs with query parameters
Disallow: /*?

# Block all admin pages
Disallow: /*/admin/
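
Python's urllib.robotparser treats these wildcards literally rather than as patterns, but you can illustrate Google-style matching yourself by translating a pattern into a regular expression. A minimal sketch; pattern_to_regex is a made-up helper, not part of any library:

import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a Google-style robots.txt path pattern into a regex."""
    anchored = pattern.endswith("$")   # a trailing $ anchors the match to the end of the URL
    core = pattern[:-1] if anchored else pattern
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile("^" + body + ("$" if anchored else ""))

rule = pattern_to_regex("/*.pdf$")
print(bool(rule.match("/files/report.pdf")))      # True: this URL would be blocked
print(bool(rule.match("/files/report.pdf?v=2")))  # False: does not end in .pdf, so not blocked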

Multiple Sitemaps

List multiple sitemaps for different content types:

Sitemap: https://yoursite.com/sitemap-pages.xml
Sitemap: https://yoursite.com/sitemap-blog.xml
Sitemap: https://yoursite.com/sitemap-products.xml
Sitemap: https://yoursite.com/sitemap-images.xml

robots.txt Quick Tips

  • Start with allowing all AI search bots for maximum visibility
  • Only block specific bots if you have a strong reason
  • Always include your sitemap location
  • Test changes before deploying to production
  • Monitor bot access in your server logs
  • Update robots.txt when you change site structure
  • Remember: robots.txt is not a security measure

Impact on Your GEO-Score

Your robots.txt configuration directly affects your AI Bot Access score, which is a key component of your overall GEO-Score.

Bloffee checks your robots.txt for:

  • Whether AI bots can access your content
  • Proper syntax and formatting
  • Accidental blocking of important pages
  • Sitemap declaration
  • Overly restrictive rules that hurt visibility

A well-configured robots.txt that welcomes AI bots can improve your GEO-Score by 10-15 points. Blocking important bots can reduce your score by 20-30 points or more.
