Enter one or more URLs (comma-separated) to check compliance policies
Learn more: Complete list of AI crawlers | RSL Standard
Results
| URL | Path | RSL | Markdown | Content Signals | GPTBot | ClaudeBot | Google-Extended | CCBot | All Bots |
|---|---|---|---|---|---|---|---|---|---|
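The per-bot columns above report whether each crawler is allowed by the site's robots.txt. A minimal sketch of that kind of check, using only the standard-library `urllib.robotparser` (the bot list and sample robots.txt here are illustrative, not PolicyCheck's actual implementation):

```python
from urllib.robotparser import RobotFileParser

# Bots matching the columns in the results table
AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]

# A sample robots.txt body (in practice, fetched from the target site)
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

def check_policies(robots_txt: str, path: str = "/") -> dict[str, bool]:
    """Return {bot: allowed?} for each AI crawler against a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in AI_BOTS}

print(check_policies(ROBOTS_TXT))
# → {'GPTBot': False, 'ClaudeBot': True, 'Google-Extended': False, 'CCBot': True}
```

Note that `robotparser` matches user-agent tokens case-insensitively and falls back to the `*` group when no specific group matches, which is why ClaudeBot and CCBot come back allowed here.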
📚 AI Crawler Reference — What each bot does
Training Crawlers (16)
• GPTBot — OpenAI (Model training)
• ClaudeBot — Anthropic (Model training)
• anthropic-ai — Anthropic (Bulk model training)
• Claude-Web — Anthropic (Web-focused training)
• Google-Extended — Google (Gemini training)
• GoogleOther — Google (Research & development)
• Meta-ExternalAgent — Meta (AI model training)
• FacebookBot — Meta (Speech recognition training)
• Applebot-Extended — Apple (Generative AI training)
• Amazonbot — Amazon (AI improvement, model training)
• CCBot — Common Crawl (Open dataset collection)
• Bytespider — ByteDance (AI training)
• cohere-ai — Cohere (LLM training)
• Diffbot — Diffbot (AI data extraction)
• Omgilibot — Webz.io (Data collection for resale)
• ImagesiftBot — The Hive (Image model training)
Search Crawlers (4)
• OAI-SearchBot — OpenAI (ChatGPT search indexing)
• PerplexityBot — Perplexity (Search indexing)
• YouBot — You.com (AI search)
• DuckAssistBot — DuckDuckGo (AI-assisted answers)
Other (6)
• ChatGPT-User — OpenAI (User-requested fetching)
• Perplexity-User — Perplexity (User-requested fetching)
• Meta-ExternalFetcher — Meta (Real-time content fetching)
• Applebot — Apple (Siri, Spotlight, Safari)
• Google-CloudVertexBot — Google (Cloud AI services)
• Amzn-SearchBot — Amazon (Alexa and Rufus search)
📋 Note on Google-Extended
The UK CMA has proposed giving publishers the ability to opt out of AI training and AI Overviews (consultation closes Feb 2026). Blocking Google-Extended via robots.txt is currently how publishers exercise this control.
🎯 Content Signals (Cloudflare AI Policy)
Content Signals allow sites to express AI usage preferences directly in robots.txt. Adopted by 3.8M+ domains using Cloudflare's managed robots.txt.
• search — Traditional search indexing (not AI summaries)
• ai-input — AI search, RAG, grounding
• ai-train — Model training & fine-tuning
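A site expresses these preferences with a `Content-Signal` line in its robots.txt. The directive syntax below follows Cloudflare's published Content Signals Policy; the specific yes/no values are just an illustration:

```
# robots.txt with Content Signals: allow search and AI input, refuse training
User-Agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
```

Crawlers that don't understand `Content-Signal` simply ignore the line, so the directive degrades gracefully alongside ordinary Allow/Disallow rules.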
⚠️ Note: Some sites block policy checkers
PolicyCheck runs from cloud infrastructure, and some sites (like Medium) block datacenter IPs to prevent scraping. This creates a paradox: robots.txt exists for bots to check policies, but compliance checkers may be blocked from accessing it.
Legitimate crawlers (Googlebot, GPTBot, ClaudeBot) solve this by publishing their IP ranges and supporting reverse DNS verification, so sites can allow them through while still blocking scrapers. Most sites work fine, but aggressive blockers may fail. Learn more
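Reverse DNS verification works by resolving the connecting IP to a hostname, checking that hostname against the crawler's published domain, then resolving the hostname forward again to confirm it maps back to the same IP (forward-confirmed reverse DNS). A sketch, with the verification domains below taken from Google's and Apple's published guidance but included here as assumptions:

```python
import socket

# Published verification domains (assumed values for illustration)
VERIFIED_DOMAINS = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Applebot": (".applebot.apple.com",),
}

def hostname_matches(hostname: str, bot: str) -> bool:
    """Check a reverse-DNS hostname against the bot's published domains."""
    return hostname.rstrip(".").endswith(VERIFIED_DOMAINS[bot])

def verify_crawler_ip(ip: str, bot: str) -> bool:
    """Forward-confirmed reverse DNS: ip -> hostname -> back to the same ip."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
    except OSError:
        return False
    if not hostname_matches(hostname, bot):
        return False                                       # spoofable name fails here
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]  # forward-confirm
    except OSError:
        return False
```

The forward-confirmation step is what defeats spoofing: anyone can set a reverse DNS record claiming to be `*.googlebot.com`, but only Google controls the forward records for that domain.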