The Crawlability tab

The Crawlability tab checks the first gate: can the major AI systems actually reach this page, or are they being blocked before they can read it?

What this tab checks

The Crawlability tab separates two different questions: what your site rules say, and what actually happens when a bot sends a request.

Site rules — BeSeenByAI fetches and parses your robots.txt, checking whether each major AI bot is explicitly allowed, explicitly blocked, or not mentioned (which defaults to allowed). It also checks for page-level noindex directives.
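
For illustration, here is a minimal robots.txt sketch showing the three states the tab distinguishes; the bot names come from the report, but the rules themselves are only an example, not a recommendation:

    # Explicitly blocked: GPTBot may not crawl anything
    User-agent: GPTBot
    Disallow: /

    # Explicitly allowed: PerplexityBot may crawl everything
    User-agent: PerplexityBot
    Allow: /

    # ClaudeBot is not mentioned and there is no User-agent: * group,
    # so it is treated as allowed by default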

Live access — Separately, BeSeenByAI sends real fetch requests using the actual user agent strings AI bots send. This checks whether bots can reach your page in practice. A page that passes robots.txt can still fail live access due to firewalls, CDN bot protection, rate limits, or server errors.

TTFB (time to first byte) — Response speed is checked alongside access. AI systems assembling answers from multiple sources tend to process the fastest responses first. A slow TTFB doesn’t block a crawler outright, but it makes you a less attractive source when an AI is racing to include you.
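
If you want a rough spot check of TTFB outside the report, curl can print it for a single request. A minimal sketch with a placeholder URL, not the tab’s exact methodology:

    # Print the time to first byte, in seconds, for one request
    curl -s -o /dev/null -w "TTFB: %{time_starttransfer}s\n" https://example.com/your-page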

Why it matters

If AI bots cannot reach a page, nothing else in the report matters. Great content, strong structured data, and excellent performance will not help if the crawlers never get to read the page.

This is one of the most common invisible problems because “the page works in a browser” does not mean “AI crawlers can reach the page.” Bot protection tools, Cloudflare challenges, IP-based blocking, and 429 rate-limit responses are all invisible to a human visitor but fatal to AI crawlers.

The 3 types of AI bots

Not every AI bot does the same job. Understanding the difference helps you make deliberate decisions about which to allow.

Training crawlers — GPTBot, Google-Extended, anthropic-ai, CCBot, Amazonbot, FacebookBot. These feed model training datasets. Blocking them is a legitimate choice if you don’t want your content used for training, but it should be intentional.

Search and browsing agents — PerplexityBot, OAI-SearchBot, ChatGPT-User, YouBot. These fetch pages in real time to answer user queries right now. If you want to be cited when someone asks ChatGPT or Perplexity a question, these are the bots that need to reach you. Blocking these is almost never what you actually want.

Research and specialized crawlers — ClaudeBot, Applebot-Extended, Bytespider, Diffbot, Meta-ExternalAgent. These power a mix of consumer features and research uses.

A blanket User-agent: * block often takes down search agents along with training crawlers — cutting yourself out of AI search results while trying to protect against training. The per-bot view in the report shows exactly which category each block falls into.

The 4 verdict states

The report rolls the findings into one top-level verdict.

Good — robots.txt allows the major AI crawlers, HTTP responses are healthy, TTFB is competitive, and no surprising noindex directives are present.

Warning — Some selective blocking is present, some 4xx responses are showing up, or TTFB is slow enough to put you at a disadvantage. Often legitimate, sometimes not. Open the report and confirm.

Mismatch — robots.txt says allow, but the live request says no. Almost always an accidental block — the most actionable verdict.

Risk — Significant blocking at the policy layer, widespread critical responses at the runtime layer, or both. If unintentional, AI visibility is meaningfully suppressed right now.

How to read the HTTP Status Check

The HTTP Status Check makes a live request as Browser, GPTBot, ClaudeBot, and PerplexityBot, and records what your server actually returns. This is the most commonly missed part of a crawlability audit because it checks what your infrastructure does in practice, not what your robots.txt says.

Status    Meaning
2xx       Reachable — the crawler gets through
3xx       Redirected — also fine; a note appears if the final URL differs
4xx       Warning — something is rejecting the request; check if it’s intentional
429       Critical — rate-limited or soft-blocked
5xx       Critical — server error specific to that user agent, almost always a sign of runtime blocking
Timeout   Unavailable — may be intermittent; worth re-running

The pattern that matters most: 429 or 5xx returned to an AI bot while Browser gets a healthy 200. That is the mismatch, and it is the most common hidden crawlability failure.
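
You can approximate the same comparison from a terminal by requesting the page with different user agent strings and comparing the status codes. A minimal shell sketch; the user agent values are shortened placeholders and the URL is an example, so substitute the exact strings from the AI Bots reference and your own page:

    # Compare the status code returned to a browser-style request with the
    # codes returned to AI bot user agents (shortened placeholders, not the
    # full strings each bot actually sends)
    for ua in "Mozilla/5.0 (compatible; Browser)" "GPTBot" "ClaudeBot" "PerplexityBot"; do
        printf '%-40s' "$ua"
        curl -s -o /dev/null -A "$ua" -w '%{http_code}\n' https://example.com/your-page
    done
    # A 200 for the browser request alongside 403, 429, or 5xx for a bot is
    # the mismatch described above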

The mismatch finding

When robots.txt allows a bot but the HTTP Status Check shows it being rejected, the report flags this as a mismatch. Your policy says allow; your infrastructure says no. The two layers are out of sync.

Common causes:

  • Cloudflare Bot Fight Mode — on by default on every plan; classifies many AI crawlers as automated traffic to be challenged or blocked
  • Custom WAF rules targeting “AI scrapers” or generic bot user-agent patterns, often shipped as one-click defaults
  • Managed WordPress hosts running platform-level bot mitigation that customers can’t see through the admin portal
  • Security plugins with built-in blocklists that include AI bot user agents
  • Rate limiters tuned for human traffic that flag legitimate crawl bursts as abuse
  • IP-based blocks on cloud IP ranges — AI crawlers run from cloud infrastructure, so generic cloud blocks catch them by default

Common issues and what to do

All major AI bots are blocked in robots.txt — Look for User-agent: * with Disallow: /, or explicit blocks like User-agent: GPTBot with Disallow: /. If you want to block training crawlers but keep search agents, use per-user-agent rules rather than a wildcard block. If the block is unintentional, update robots.txt.
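
A minimal robots.txt sketch of that per-user-agent approach, using bots named earlier in this guide; adjust the list to match your own policy on training versus search access:

    # Block training crawlers you do not want feeding model datasets
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # Leave everything else, including search and browsing agents such as
    # OAI-SearchBot and PerplexityBot, allowed
    User-agent: *
    Allow: /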

Robots.txt is fine but live access fails (mismatch) — The problem is in your infrastructure. To find the source, run curl -I against your URL with an AI bot user agent and examine the response headers (a sketch follows the list below). Then:

  • On Cloudflare: check Security → Bots for Bot Fight Mode, and Security → WAF for custom rules targeting AI user agents
  • On managed WordPress: check security plugins for bot blocklists; if nothing explains it, open a support ticket with curl evidence
  • On your own infrastructure: review WAF rules and rate limiters for cloud-IP-range blocks or generic “automated traffic” patterns
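
As a starting point for that check, the sketch below sends a HEAD request with a bot-style user agent so you can read the status line and response headers your infrastructure actually returns. The user agent value is shortened for readability and the URL is a placeholder; use the exact string from the AI Bots reference and your own page:

    # HEAD request with an AI bot user agent; the status line and headers
    # (the Server header, for example) hint at which layer answered or
    # rejected the request
    curl -I -A "GPTBot" https://example.com/your-page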

Some bots are allowed, others are blocked — Review the per-bot breakdown to see exactly which bots are affected and which rule applies. A wildcard rule may be catching bots you didn’t intend to block.

noindex is set on the page — Find the source in your CMS or templates. The harder cases are CDN-injected X-Robots-Tag headers, which require checking your edge configuration rather than the page source.
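
One quick way to confirm whether a noindex is arriving as a header rather than in the page markup is to inspect the response headers directly; a minimal sketch with a placeholder URL:

    # Look for an X-Robots-Tag header injected at the CDN or server layer;
    # no output means no such header was returned
    curl -sI https://example.com/your-page | grep -i 'x-robots-tag'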

TTFB is slow — The fix is on the performance side. Enable caching at the edge, add a CDN if you don’t have one, and check whether your server is doing expensive uncached work on every request. See the Performance tab guide for grade thresholds.

When to re-run the audit

Crawlability can change without any action on your part. Re-run after:

  • Any change to robots.txt
  • Enabling, disabling, or reconfiguring a WAF or bot management feature
  • Migrating hosting providers or adding a CDN
  • Installing a new security plugin
  • The audit surfaced a 429 or 5xx — these can be transient; re-running confirms whether the issue is consistent or cleared on its own

Key caveats

  • Robots.txt is advisory, not enforced. Most AI bots respect it, but it is not a technical barrier.
  • Passing robots.txt does not guarantee live access. Always check both sections of the tab.
  • BeSeenByAI checks real-world access at audit time. A page accessible last week may fail today if a bot protection rule changed.

Which AI bots does BeSeenByAI check?

BeSeenByAI checks 33 AI crawlers including GPTBot and OAI-SearchBot (OpenAI/ChatGPT), ClaudeBot and Claude-SearchBot (Anthropic), PerplexityBot, Google-Extended (Google AI features), Googlebot, and crawlers from Amazon, Apple, Meta, Mistral, DuckDuckGo, Common Crawl, and others.

For the full list with user agent strings and what blocking each bot means, see the AI Bots reference.

For a deeper look at how crawlability differs from SEO and why infrastructure does more of the deciding than robots.txt, see the blog post What is Crawlability and Why AI Can’t Cite You Without It.