Can AI Crawlers Access Your Site?

Crawlability is more than robots.txt. We check 33 AI user agents, run HTTP Status Check probes for Browser + GPTBot + ClaudeBot + PerplexityBot, and detect noindex directives that can suppress visibility.

robots.txt Is the First Gate

Before any AI system can index, extract, or reference your content, it needs crawl permission. That permission is requested—but not enforced—through your robots.txt file.

Important: robots.txt is advisory guidance that crawlers are requested to honor, not a technical access control mechanism. Most major AI providers respect these rules, but compliance is not guaranteed by the protocol.

The challenge is that robots.txt rules are often written broadly—a wildcard directive intended to manage one type of bot can inadvertently block dozens of others. AI crawlers have proliferated rapidly, and many site owners haven't revisited their robots.txt configurations in years.

The consequence: you may be blocking AI systems from accessing your content without realizing it, reducing your presence in AI-powered search results and limiting how often your content surfaces in AI-generated responses.

And even when bots can crawl, pages may still be excluded from indexing via noindex directives (meta robots or X-Robots-Tag). That is why this audit checks both access and indexability signals.
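A noindex check has to look at both channels mentioned above: the X-Robots-Tag response header and the meta robots tag in the HTML. The sketch below uses only the Python standard library; the function and class names are illustrative, not the audit's actual implementation.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects content values of <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives.append(a.get("content", "").lower())

def has_noindex(html, headers):
    """True if the page carries a noindex directive in either channel."""
    # Channel 1: X-Robots-Tag response header
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return True
    # Channel 2: <meta name="robots"> in the document head
    parser = RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in d for d in parser.directives)
```

A production checker would also need to handle case-insensitive header lookup and bot-specific variants such as X-Robots-Tag: googlebot: noindex, which this sketch ignores.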

33 AI Crawlers, One Report

We retrieve your robots.txt file, test it against user agents from major AI providers, run an HTTP Status Check for Browser + GPTBot + ClaudeBot + PerplexityBot, and detect page-level noindex directives. Scope: Origin-level and URL-level checks.

Training & Dataset Crawlers

  • CCBot (Common Crawl)
  • GPTBot (OpenAI training)
  • Google-Extended (Gemini training)
  • anthropic-ai (Claude training)
  • Amazonbot
  • FacebookBot

Blocking these may reduce your content's presence in training data used to build future AI models.

Search & Browsing Agents

  • PerplexityBot
  • OAI-SearchBot (ChatGPT Search)
  • ChatGPT-User (ChatGPT browsing)
  • YouBot (You.com)

Blocking these prevents your site from appearing in real-time AI search results and AI browser lookups.

Research & Specialized Crawlers

  • ClaudeBot
  • Applebot-Extended (Apple Intelligence)
  • Bytespider (ByteDance)
  • Diffbot
  • Meta-ExternalAgent

These systems may power features in consumer products or contribute to open research datasets.

Common robots.txt Misconfigurations

Example 1 — Accidental Global Block

User-agent: *
Disallow: /

Impact: Blocks all crawlers, including every AI user agent. Fix: Use specific user-agent directives instead of a wildcard disallow—or allow specific bots you want to permit.

Example 2 — No robots.txt (Default Behavior)

# No robots.txt file present

Impact: All crawlers are allowed by default. This is the most permissive state—no action needed if you want full AI crawler access.

Example 3 — Selective Block

User-agent: GPTBot
Disallow: /

Impact: Blocks only GPTBot. All other crawlers remain unaffected. Use this pattern when you want to restrict specific providers.

Example 4 — Allow Specific Bots After Wildcard Block

User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

Impact: Wildcard blocks all crawlers, but explicitly permits GPTBot and ClaudeBot. Useful when you have a broad block but want to allow selected AI crawlers.
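Rules like Example 4 are easy to get subtly wrong, so it is worth sanity-checking them before deployment. Python's built-in robots.txt parser can do this locally against the exact rules above (a quick check, not a substitute for testing against your live site):

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The explicitly allowed bots get through...
print(rp.can_fetch("GPTBot", "https://example.com/post"))       # True
print(rp.can_fetch("ClaudeBot", "https://example.com/post"))    # True
# ...while everything else falls under the wildcard block.
print(rp.can_fetch("PerplexityBot", "https://example.com/post"))  # False
```

This works because a crawler uses the most specific matching user-agent group and ignores the wildcard group entirely when a dedicated group exists for it.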

How to Read Your Results

HTTP Status Check codes are interpreted as follows: 2xx = Reachable (good), 3xx = Redirected (good), 4xx other than 429 = warning, 429 and 5xx = critical, and timeout/no response = unavailable. Redirect notes are shown when the final URL differs from the input URL after normalization (including trailing slash differences). On a 429, the Retry-After value is shown for the Browser row when the server provides it.
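This interpretation table can be expressed as a small classifier. The functions below are an illustrative reimplementation of the mapping, not the checker's real code; the names and the seconds-only Retry-After handling are assumptions.

```python
def classify_status(status):
    """Map an HTTP status code (or None for timeout/no response) to a verdict."""
    if status is None:
        return "unavailable"      # timeout / no response
    if 200 <= status < 300:
        return "reachable"        # 2xx: good
    if 300 <= status < 400:
        return "redirected"       # 3xx: good; noted when the final URL differs
    if status == 429 or status >= 500:
        return "critical"         # rate limiting or server-side errors
    if 400 <= status < 500:
        return "warning"          # other client errors
    return "unavailable"          # anything unexpected (e.g. 1xx)

def retry_after_seconds(headers):
    """Return Retry-After as an int when the server sends the seconds form."""
    value = headers.get("Retry-After", "").strip()
    return int(value) if value.isdigit() else None
```

Note that Retry-After may also arrive as an HTTP date rather than a delay in seconds; this sketch returns None in that case rather than parsing the date.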

Good — All Major Bots Allowed

No obvious access restrictions detected. Your robots.txt permits the major AI crawlers we check, and sampled HTTP status responses are healthy (2xx/3xx). Review your configuration periodically as the crawler landscape evolves.

Warning — Some Bots Blocked

Selective blocking or warning-level HTTP responses are present. Some AI user agents are restricted by robots.txt or are receiving 4xx responses. Verify this is intentional and review any CDN or edge rules.

Risk — Major Bots Blocked

Significant access restrictions or critical runtime responses are likely reducing your AI visibility. Review robots.txt, then investigate 429/5xx errors and anti-bot controls if they are unintentional.

What robots.txt Can and Can't Do

What robots.txt Can Do

  • Request that crawlers avoid specific paths or the entire site
  • Signal your preferences to compliant, well-behaved bots
  • Help reduce server load from unwanted crawling
  • Differentiate instructions per user agent

What robots.txt Cannot Do

  • Enforce access restrictions—it is advisory only
  • Prevent determined bad actors from accessing your content
  • Guarantee that any given crawler will comply
  • Replace authentication or server-level access controls

"The protocol is not a form of access authorization."

— RFC 9309, Robots Exclusion Protocol

Frequently Asked Questions

Does having no robots.txt file block AI crawlers?

No. The absence of a robots.txt file means all crawlers are allowed by default. There is nothing to block them. If you want to restrict specific crawlers, you need to create a robots.txt file with the appropriate directives.

Can I block some AI crawlers while allowing others?

Yes. Use specific user-agent directives rather than wildcards. For example, you can block GPTBot while allowing ClaudeBot and PerplexityBot by writing separate rules for each user agent. Each AI provider publishes their crawler's user agent string in their documentation.

Does blocking AI crawlers affect traditional search rankings?

Most AI crawlers operate separately from traditional search crawlers. Blocking GPTBot, for instance, does not affect Googlebot. However, blocking Google-Extended may affect your content's use in Google's AI features. Always check which user agent corresponds to which product before adding restrictions.

Do AI crawlers actually respect robots.txt?

Most major providers—OpenAI, Anthropic, Google, and others—document that their crawlers respect robots.txt directives. Compliance is not technically enforced, however. The robots exclusion protocol is an honor system. Well-resourced, reputable companies generally follow it; less scrupulous actors may not.

Should I block AI crawlers from my site?

This is a strategic decision that depends on your goals. Blocking training crawlers (like GPTBot or CCBot) may give you some control over whether your content appears in future AI model training datasets. Blocking search crawlers (like PerplexityBot or OAI-SearchBot) reduces your visibility in those AI search products. Consider what outcome you want before adding restrictions.

How often does the set of AI crawlers change?

The landscape changes frequently as new AI products launch and existing products expand their crawling capabilities. We update our crawler list periodically to reflect new user agents and changes to existing ones. Check back regularly if you actively manage crawler access.

What is the difference between robots.txt and noindex?

robots.txt controls crawl access requests at the crawler level (usually origin-wide). noindex is a page-level indexing directive delivered via a meta robots tag or an X-Robots-Tag header. A page can be crawlable but still excluded from indexing if noindex is present.

Can a crawlable page still be excluded from AI results?

Yes. If a page includes a noindex directive, it may be excluded even when crawl access is allowed. That's why the report includes both bot access and noindex status.

Check Your AI Bot Access Rules

Enter any URL to review AI crawlability in under 30 seconds: 33-bot robots.txt parsing, HTTP Status Check (Browser + GPTBot + ClaudeBot + PerplexityBot), and page-level noindex status.

Run Free Audit
Also available as a Chrome Extension

Quick audits while you browse—all core features included, always free.

Add to Chrome