What is AI Bot Access?
AI Bot Access measures whether AI crawlers — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot, ChatGPT-User and others — can actually reach and read your pages. The check looks at four layers: robots.txt directives, server response codes, IP-based blocking from CDNs and WAFs, and whether content is locked behind JavaScript or paywalls. Each layer can silently kill AI visibility, and many sites are blocked at one or more without realising it.
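The robots.txt layer is the easiest of the four to self-check. A minimal sketch using Python's standard-library parser — the robots.txt content and URLs here are illustrative placeholders, not any real site's configuration:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content: a template that blocks GPTBot site-wide
# while leaving every other crawler alone (a common accidental pattern).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot"]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check whether each AI crawler may fetch a sample article URL.
for bot in AI_BOTS:
    allowed = parser.can_fetch(bot, "https://example.com/articles/some-post")
    print(f"{bot}: {'allowed' if allowed else 'BLOCKED'}")
```

Running a check like this against your live /robots.txt for every AI user agent you care about takes seconds and catches the most common silent block before any deeper audit.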
This metric is the gatekeeper for the entire GEO-Score. A perfect 100/100 on schema, citations, freshness, and structure delivers exactly zero AI citations if a single Disallow line in robots.txt or a default WAF rule turns crawlers away. Cloudflare reported in July 2025 that its network now blocks AI crawlers by default for new customers — meaning a meaningful slice of the web went dark for AI overnight.
Why AI Bot Access Matters
AI search is now a measurable share of total web traffic, but it is also the most fragile traffic source — one misconfigured rule can erase your presence from ChatGPT, Claude, and Perplexity simultaneously. Three forces explain why bot access deserves attention before any other GEO work.
Bot Access Is a Binary Gatekeeper
AI crawlers do not partially index a blocked site — they skip it entirely. If GPTBot, ClaudeBot, or PerplexityBot receives a 403, a robots.txt Disallow, or a WAF challenge, the page is treated as non-existent for AI answers. There is no "reduced visibility" outcome: it is full citation eligibility or none at all.
Most Blocking Is Accidental
Originality.ai found GPTBot is now blocked by 35.7% of the top 1,000 websites, but interviews with site owners show many of those blocks were inherited from default WAF rule sets, copy-paste robots.txt templates, or CDN bot-fight modes that classify GPTBot as a generic scraper. Few of these owners set out to block AI; they simply forgot to allow it.
AI Crawlers Are Aggressive — But Selective
Cloudflare reported GPTBot grew 305% in raw requests between May 2024 and May 2025, while PerplexityBot grew 157,490% from a small base. That volume comes with a budget: bots prioritise sites that respond fast, return 200s, and serve content in initial HTML. Sites that intermittently 5xx, hide content behind JavaScript, or rate-limit AI bots see citations drop even without an explicit block.
What the Research Says
GPTBot increased its share of all crawler traffic from 2.2% to 7.7%, with a 305% rise in raw requests over 12 months — jumping from rank #9 to rank #3 among all web crawlers. PerplexityBot showed the most explosive growth at 157,490% from a minimal baseline. Yet only 14% of analyzed domains had any specific robots.txt directives targeting AI bots — leaving the other 86% silently allowing or blocking AI traffic by accident.
João Tomé, Jorge Pacheco, Carlos Azevedo — From Googlebot to GPTBot: Who's Crawling Your Site in 2025, Cloudflare Blog, July 2025 — analysis of 3,816 top domains
GPTBot is now blocked by 35.7% of the top 1,000 websites, up from just 5% when it was first introduced in August 2023. The percentage of sites blocking GPTBot was increasing by approximately 5% per week in the early stages following the bot's announcement. Many of these blocks were inherited from default templates and CDN rules rather than deliberate policy decisions.
Originality.ai — GPTBot Blocking Tracker, August 2024 update — quarterly study of the Quantcast top 1,000 websites since GPTBot launch
Anthropic's crawl-to-referral ratio peaked near 500,000:1 early in 2025 before settling between 25,000:1 and 100,000:1, while OpenAI's GPTBot ratio spiked to roughly 3,700:1 in March 2025. This imbalance — bots taking far more than they return in human visits — is the main reason publishers are tempted to block, but for any site that is not a major news brand, blocking removes the only path to AI search citations entirely.
Cloudflare Radar — The crawl-to-click gap: AI bots, training, and referrals, 2025 — multi-month analysis of crawler-to-human-referral ratios across the Cloudflare network
3 Real-World Bot Access Scenarios
These three patterns show how the same content can be invisible or fully citable to AI depending on a few configuration lines. Each "bad" case is a real pattern observed in audits — the "good" version is the minimum fix that keeps content protected where it should be while letting AI bots through everywhere else.
Example 1: Regional News Site With Default Robots.txt
A regional news publisher uses a CMS template that ships with a robots.txt containing a User-agent: GPTBot group and a User-agent: ClaudeBot group, each followed by Disallow: /. The editorial team is not aware these lines exist. The site has high E-E-A-T, daily updated articles, and good schema, but in 18 months ChatGPT and Claude have never cited a single article. Server logs confirm GPTBot is hitting /robots.txt every few hours and walking away.
Why this fails: The Disallow on root path tells GPTBot and ClaudeBot to skip the entire domain. Both bots respect robots.txt, so all the editorial investment produces zero AI citations. The publisher cannot understand why competitors with weaker content are cited daily — until someone reads the robots.txt.
The publisher rewrites robots.txt with an explicit Allow: / rule for each of GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, and PerplexityBot, plus a Sitemap: https://news.example.com/sitemap.xml line. Server logs are sampled weekly to confirm 200 responses and to track crawl frequency per bot. Within four weeks, ChatGPT search starts citing recent articles by name.
Why this works: Explicit Allow rules override any inherited template defaults and signal intent to every AI crawler. Listing both training bots (GPTBot, ClaudeBot) and search-time bots (OAI-SearchBot, ChatGPT-User) covers both training-data citations and live answer fetches. The sitemap line tells crawlers exactly which URLs to prioritise — so new articles surface in AI answers within days, not months.
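Spelled out as a file, the rewritten robots.txt looks like this (the sitemap URL is the publisher's own example domain):

```text
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://news.example.com/sitemap.xml
```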
Example 2: E-commerce Brand Behind a Default WAF
A mid-size e-commerce brand on Cloudflare has a clean robots.txt that allows all AI bots. But its WAF has "Block AI bots" enabled in Super Bot Fight Mode and a custom rule blocking any user-agent containing "bot" that is not Googlebot or Bingbot. AI crawlers receive 403 Forbidden responses on every request. Product listings, buying guides, and category pages never enter AI training data or live search indexes.
Why this fails: Robots.txt is honest, but the WAF executes first. Cloudflare's documentation explicitly states that the AI bot block rule takes precedence over Allow Verified Bots — so even AI crawlers Cloudflare has verified by IP get blocked. The brand sees zero ChatGPT or Perplexity referrals even though its content quality scores are excellent.
The brand disables the blanket "Block AI bots" toggle and instead creates a Cloudflare AI Crawl Control allow-list for GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, PerplexityBot, and Google-Extended. The WAF custom rule is rewritten to challenge unverified bots while letting verified AI crawlers through. A monthly review checks AI crawl logs, and any new commercially relevant AI bot is added to the allow-list within 7 days.
Why this works: Verified AI bots arrive from published IP ranges that Cloudflare authenticates — the allow-list trusts the bot identity, not just the user-agent string (which scrapers can fake). The brand keeps its protection against malicious scrapers while opening the door to every AI engine that can drive purchases. Within a quarter, the brand starts appearing in ChatGPT shopping responses for buying-intent queries.
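The "trust the IP, not the user-agent string" principle can also be applied in your own middleware. A sketch using Python's standard library — the CIDR blocks below are placeholders drawn from reserved documentation ranges, not the real published ranges, which you would fetch from each vendor's feed:

```python
import ipaddress

# Published crawler IP ranges, normally fetched from each vendor's feed
# (OpenAI, for example, publishes GPTBot ranges). The CIDR blocks below
# are PLACEHOLDERS for illustration, not the real ranges.
VERIFIED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],            # placeholder (TEST-NET-1)
    "PerplexityBot": ["198.51.100.0/24"],  # placeholder (TEST-NET-2)
}

def is_verified(bot_name: str, client_ip: str) -> bool:
    """True if client_ip falls inside a published range for bot_name."""
    ip = ipaddress.ip_address(client_ip)
    return any(
        ip in ipaddress.ip_network(cidr)
        for cidr in VERIFIED_RANGES.get(bot_name, [])
    )

# A request claiming to be GPTBot from inside the published range passes;
# the same user-agent string from an unknown IP does not.
print(is_verified("GPTBot", "192.0.2.17"))   # True
print(is_verified("GPTBot", "203.0.113.9"))  # False
```

The same logic is what CDN-level "verified bot" features perform for you: a spoofed GPTBot user agent from a scraper's IP fails the range check and can be challenged, while the genuine crawler passes.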
Example 3: Subscription Publisher With Hard Paywall
A B2B subscription publication shows a 50-word teaser then a full-page login modal, served via JavaScript on page load. AI crawlers including GPTBot do not execute JavaScript, so they see the teaser plus the modal HTML. Articles are never trained on, and at search time AI engines have nothing to cite — they fall back to competitor sources who write about the same topics in the open. Subscription growth slows because the brand never appears in AI answers where decision-makers research vendors.
Why this fails: AI crawlers fetch raw HTML only and do not execute JavaScript. Because the full article text is injected client-side, the paywall is seamless for human subscribers but absolute for AI: crawlers see nothing beyond the 50-word teaser. There is no path for the publisher's expertise to enter AI training data or live answer pipelines, even though the editorial quality is the highest in the industry.
The publisher introduces a 250-word "executive summary" rendered in initial HTML for every article: the key finding, the data point, the recommendation, and the source. The full deep-dive analysis stays paywalled. Robots.txt allows GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, and Google-Extended. Schema.org Article markup with paywalledContent annotation tells crawlers which sections require subscription, while the summary section is freely indexable.
Why this works: AI crawlers now have substantive, citable content for every article — the summary is long enough to be a complete answer (per the Answer Completeness research, 200-word standalone passages are ideal). When a decision-maker asks ChatGPT "who is the leading source on X", the publisher's summary is cited and the full report click-through converts. The paywall protects subscription revenue while AI becomes a top-of-funnel acquisition channel.
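In Schema.org terms, the paywall annotation is typically expressed with isAccessibleForFree and hasPart on the Article. A minimal JSON-LD sketch — the headline and CSS class names are hypothetical and must match the publisher's own markup:

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example B2B report",
  "isAccessibleForFree": false,
  "hasPart": [
    {
      "@type": "WebPageElement",
      "isAccessibleForFree": true,
      "cssSelector": ".executive-summary"
    },
    {
      "@type": "WebPageElement",
      "isAccessibleForFree": false,
      "cssSelector": ".paywalled-analysis"
    }
  ]
}
```

The cssSelector values point crawlers at the free summary and the protected deep dive respectively, so the open section is indexable while the subscription content stays marked as paywalled rather than cloaked.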
How to Improve Your AI Bot Access Score
Do NOT Do This
- ✗ Use a blanket Disallow: / under User-agent: * or any other global block in robots.txt — this kills AI access for every crawler in one line, including the ones you want
- ✗ Leave default WAF "Block AI bots" toggles enabled without reviewing them — Cloudflare and other CDNs increasingly ship with AI blocking on by default, including for verified bots
- ✗ Block by user-agent string alone — scrapers fake "GPTBot" easily, and legitimate bots can be impersonated; verify by IP range or use CDN-verified bot lists instead
- ✗ Lock primary content behind JavaScript-rendered components or single-page-app routes — GPTBot, ClaudeBot, and PerplexityBot do not execute JavaScript and will see only the initial HTML shell
- ✗ Skip server-log monitoring of AI bots — without weekly checks of GPTBot, ClaudeBot, and PerplexityBot hits, accidental blocks can persist for months before anyone notices the missing AI traffic
Do This Instead
- ✓ Add an explicit Allow: / rule for GPTBot in robots.txt, plus equivalents for ClaudeBot, PerplexityBot, OAI-SearchBot, ChatGPT-User, Google-Extended, and Applebot-Extended
- ✓ Whitelist verified AI bots in your WAF using their published IP ranges — Cloudflare AI Crawl Control, Vercel AI Bot Manager, and Akamai all expose this
- ✓ Server-side render or pre-render the first 200-500 words of every important page so AI crawlers see substantive content in the initial HTML response
- ✓ Sample server logs weekly for GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, and Google-Extended hits — confirm 200 responses, average response time under 2 seconds, and steady crawl frequency
- ✓ If you have a paywall, expose a 200-300 word executive summary in HTML and use Schema.org paywalledContent to mark the protected sections — this preserves revenue while keeping AI citation eligibility
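The weekly log check can be a short script. A sketch, assuming combined-format access logs — the log lines below are synthetic stand-ins for a real file:

```python
import re
from collections import Counter

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot", "Google-Extended"]

# Synthetic combined-log-format lines standing in for a real access log.
LOG_LINES = [
    '203.0.113.5 - - [10/Oct/2025:10:00:00 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '203.0.113.5 - - [10/Oct/2025:10:05:00 +0000] "GET /pricing HTTP/1.1" 403 0 "-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '198.51.100.7 - - [10/Oct/2025:11:00:00 +0000] "GET /blog HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
]

# Status code sits right after the closing quote of the request line.
STATUS_RE = re.compile(r'" (\d{3}) ')

# Count HTTP status codes per AI bot: any 403s, or a bot with zero hits,
# is the signal to investigate WAF rules and robots.txt.
hits = {bot: Counter() for bot in AI_BOTS}
for line in LOG_LINES:
    match = STATUS_RE.search(line)
    if not match:
        continue
    for bot in AI_BOTS:
        if bot in line:
            hits[bot][match.group(1)] += 1

for bot in AI_BOTS:
    print(bot, dict(hits[bot]))
```

In this sample, GPTBot shows one 200 and one 403 — exactly the mixed pattern that a blanket WAF rule produces — while PerplexityBot shows zero hits, which after a week of silence would itself warrant investigation.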
Quick Tips for AI Bot Access
- • Always use explicit Allow rules per AI bot — a generic User-agent: * with Allow: / appears permissive but does not signal intent, and many WAFs override it
- • Check your CDN dashboard before robots.txt — Cloudflare's July 2025 change blocks AI crawlers by default for new customers, regardless of what your robots.txt says
- • Allow both training bots (GPTBot, ClaudeBot) and search-time bots (OAI-SearchBot, ChatGPT-User, Claude-SearchBot, PerplexityBot) — they serve different parts of the AI answer pipeline
- • Do not rely on llms.txt as your primary access mechanism — as of late 2025 no major AI crawler reads it; robots.txt remains the only universally honoured standard
- • Render critical content server-side or via SSG — only Googlebot reliably executes JavaScript among major crawlers, so JS-only content is invisible to GPTBot, ClaudeBot, and PerplexityBot
- • Sample your access logs weekly for the AI user-agent strings — a sudden drop to zero is the earliest signal of an accidental block from a CDN update or WAF rule change
Frequently Asked Questions
Should I block GPTBot to protect my content from AI training?
What is the impact of allowing AI bots on my GEO-Score?
What is the difference between GPTBot, ChatGPT-User, and OAI-SearchBot?
Does blocking Google-Extended affect my Google Search rankings?
Why are AI bots crawling my site so much without sending traffic back?
Should I implement an llms.txt file alongside robots.txt?
Related Metrics to Explore
- Page Speed
Slow responses cause AI crawlers to time out — page speed turns access from "allowed" into "actually crawlable"
- Sitemap & Discoverability
Once bots can access your site, your sitemap and link structure determine which pages they actually find
- Schema Validator
Schema markup helps AI crawlers interpret accessible pages — including paywalledContent annotations for hybrid models
- AI Optimization
The umbrella score that combines bot access, schema, structure, and freshness into a single AI-readiness signal