Crawlability is more than robots.txt. We check 33 AI user agents, run HTTP Status Check probes for Browser + GPTBot + ClaudeBot + PerplexityBot, and detect noindex directives that can suppress visibility.
Before any AI system can index, extract, or reference your content, it needs crawl permission. That permission is requested—but not enforced—through your robots.txt file.
The challenge is that robots.txt rules are often written broadly—a wildcard directive intended to manage one type of bot can inadvertently block dozens of others. AI crawlers have proliferated rapidly, and many site owners haven't revisited their robots.txt configurations in years.
The consequence: you may be blocking AI systems from accessing your content without realizing it, reducing your presence in AI-powered search results and limiting how often your content surfaces in AI-generated responses.
And even when bots can crawl, pages may still be excluded from indexing via noindex directives (meta robots or X-Robots-Tag). That is why this audit checks both access and indexability signals.
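Detecting those indexability signals can be sketched in a few lines. This is an illustrative helper, not the audit's actual implementation; `has_noindex` and `MetaRobotsParser` are names we invented for the example.

```python
# Sketch: detect page-level noindex signals from a response's
# headers and HTML body. Note: real HTTP header lookups should be
# case-insensitive; a plain dict is used here for brevity.
from html.parser import HTMLParser

class MetaRobotsParser(HTMLParser):
    """Collects content values of <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives.append(a.get("content", "").lower())

def has_noindex(headers, body):
    # Header form: X-Robots-Tag: noindex (may be comma-separated).
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return True
    # Meta form: <meta name="robots" content="noindex, nofollow">.
    parser = MetaRobotsParser()
    parser.feed(body)
    return any("noindex" in d for d in parser.directives)
```

Either signal alone is enough to exclude a page, which is why both the header and the markup need to be checked.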
We retrieve your robots.txt file, test it against user agents from major AI providers, run an HTTP Status Check for Browser + GPTBot + ClaudeBot + PerplexityBot, and detect page-level noindex directives. Scope: Origin-level and URL-level checks.
Blocking these may reduce your content's presence in training data used to build future AI models.
Blocking these prevents your site from appearing in real-time AI search results and AI browser lookups.
These systems may power features in consumer products or contribute to open research datasets.
User-agent: *
Disallow: /
Impact: Blocks all crawlers, including every AI user agent. Fix: Use specific user-agent directives instead of a wildcard disallow, or add explicit Allow rules for the bots you want to permit.
# No robots.txt file present
Impact: All crawlers are allowed by default. This is the most permissive state—no action needed if you want full AI crawler access.
User-agent: GPTBot
Disallow: /
Impact: Blocks only GPTBot. All other crawlers remain unaffected. Use this pattern when you want to restrict specific providers.
User-agent: *
Disallow: /
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
Impact: Wildcard blocks all crawlers, but explicitly permits GPTBot and ClaudeBot. Useful when you have a broad block but want to allow selected AI crawlers.
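The patterns above can be verified programmatically. Here is a minimal sketch using Python's standard-library parser against the "wildcard block with explicit allows" pattern; real crawlers may interpret rule precedence slightly differently, so treat results as indicative.

```python
# Test robots.txt rules against several AI user agents.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

for agent in ["GPTBot", "ClaudeBot", "PerplexityBot"]:
    verdict = "allowed" if parser.can_fetch(agent, "https://example.com/") else "blocked"
    print(f"{agent}: {verdict}")
# GPTBot is allowed by its explicit group; the other agents fall
# back to the wildcard group and are blocked.
```

This also illustrates why broad wildcard rules are risky: any crawler without its own group inherits the wildcard block.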
HTTP Status Check codes are interpreted as follows: 2xx = Reachable (good), 3xx = Redirected (good), 429 and 5xx = Critical, other 4xx = Warning, and timeout/no response = Unavailable. Redirect notes are shown when the final URL differs from the input URL after normalization (including trailing slash differences). On 429, a Retry-After value is shown for the Browser row when the server provides it.
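The mapping above can be sketched as a small classifier. The label names here are ours for illustration; the audit's internal categories may differ. Note that 429 must be checked before the generic 4xx range, since it is a client-error code that escalates to critical.

```python
def classify_status(status, timed_out=False):
    """Map an HTTP status (or timeout) to an audit-style label."""
    if timed_out or status is None:
        return "unavailable"   # timeout / no response
    if status == 429 or 500 <= status <= 599:
        return "critical"      # rate-limited or server error
    if 400 <= status <= 499:
        return "warning"       # other client errors
    if 300 <= status <= 399:
        return "redirected"    # good; noted when the final URL differs
    if 200 <= status <= 299:
        return "reachable"     # good
    return "unknown"           # 1xx or non-standard codes
```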
No obvious access restrictions detected. Your robots.txt permits the major AI crawlers we check, and sampled HTTP status responses are healthy (2xx/3xx). Review your configuration periodically as the crawler landscape evolves.
Selective blocking or warning-level HTTP responses are present. Some AI user agents are restricted, redirected, or returning 4xx responses. Verify this is intentional and review edge rules.
Significant access restrictions or critical runtime responses are likely reducing your AI visibility. Review robots.txt, then investigate 429/5xx errors and anti-bot controls if they are unintentional.
"The protocol is not a form of access authorization."
— RFC 9309, Robots Exclusion Protocol
No. The absence of a robots.txt file means all crawlers are allowed by default. There is nothing to block them. If you want to restrict specific crawlers, you need to create a robots.txt file with the appropriate directives.
Yes. Use specific user-agent directives rather than wildcards. For example, you can block GPTBot while allowing ClaudeBot and PerplexityBot by writing separate rules for each user agent. Each AI provider publishes its crawlers' user-agent strings in its documentation.
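As a sketch, that selective configuration could look like the following. With no wildcard group present, any crawler not listed remains allowed by default, so the Allow groups here are explicit documentation of intent rather than strictly required.

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
```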
Most AI crawlers operate separately from traditional search crawlers. Blocking GPTBot, for instance, does not affect Googlebot. However, blocking Google-Extended may affect your content's use in Google's AI features. Always check which user agent corresponds to which product before adding restrictions.
Most major providers—OpenAI, Anthropic, Google, and others—document that their crawlers respect robots.txt directives. Compliance is not technically enforced, however. The robots exclusion protocol is an honor system. Well-resourced, reputable companies generally follow it; less scrupulous actors may not.
This is a strategic decision that depends on your goals. Blocking training crawlers (like GPTBot or CCBot) may give you some control over whether your content appears in future AI model training datasets. Blocking search crawlers (like PerplexityBot or OAI-SearchBot) reduces your visibility in those AI search products. Consider what outcome you want before adding restrictions.
The landscape changes frequently as new AI products launch and existing products expand their crawling capabilities. We update our crawler list periodically to reflect new user agents and changes to existing ones. Check back regularly if you actively manage crawler access.
robots.txt requests crawl restrictions at the crawler level (usually origin-wide). noindex is a page-level indexing directive delivered via a meta robots tag or the X-Robots-Tag HTTP header. A page can be crawlable yet still excluded from indexes if noindex is present.
Yes. If a page includes a noindex directive, it may be excluded even when crawl access is allowed. That's why the report now includes both bot access and noindex status.
Enter any URL to review AI crawlability in under 30 seconds: 33-bot robots.txt parsing, HTTP Status Check (Browser + GPTBot + ClaudeBot + PerplexityBot), and page-level noindex status.
Run Free Audit