Enter one or more URLs (comma-separated) to check compliance policies
Learn more: Complete list of AI crawlers | RSL Standard
Results
| URL | Path | RSL | Markdown | Content Signals | GPTBot | ClaudeBot | Google-Extended | CCBot | All Bots |
|---|---|---|---|---|---|---|---|---|---|
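The per-bot columns above report whether each crawler is allowed by the site's robots.txt. A minimal sketch of that kind of check, using only the standard-library `urllib.robotparser` (the bot list and sample robots.txt here are illustrative, not PolicyCheck's actual implementation):

```python
from urllib.robotparser import RobotFileParser

# Bots matching the columns in the results table
AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]

# A sample robots.txt body (in practice, fetched from the target site)
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

def check_policies(robots_txt: str, path: str = "/") -> dict[str, bool]:
    """Return {bot: allowed?} for each AI crawler against a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in AI_BOTS}

print(check_policies(ROBOTS_TXT))
# → {'GPTBot': False, 'ClaudeBot': True, 'Google-Extended': False, 'CCBot': True}
```

Note that `robotparser` matches user-agent tokens case-insensitively and falls back to the `*` group when no specific group matches, which is why ClaudeBot and CCBot come back allowed here.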
📚 AI Crawler Reference — What each bot does
Training Crawlers (16)
• GPTBot — OpenAI (Model training)
• ClaudeBot — Anthropic (Model training)
• anthropic-ai — Anthropic (Bulk model training)
• Claude-Web — Anthropic (Web-focused training)
• Google-Extended — Google (Gemini training)
• GoogleOther — Google (Research & development)
• Meta-ExternalAgent — Meta (AI model training)
• FacebookBot — Meta (Speech recognition training)
• Applebot-Extended — Apple (Generative AI training)
• Amazonbot — Amazon (AI improvement, model training)
• CCBot — Common Crawl (Open dataset collection)
• Bytespider — ByteDance (AI training)
• cohere-ai — Cohere (LLM training)
• Diffbot — Diffbot (AI data extraction)
• Omgilibot — Webz.io (Data collection for resale)
• ImagesiftBot — The Hive (Image model training)
Search Crawlers (4)
• OAI-SearchBot — OpenAI (ChatGPT search indexing)
• PerplexityBot — Perplexity (Search indexing)
• YouBot — You.com (AI search)
• DuckAssistBot — DuckDuckGo (AI-assisted answers)
Other (6)
• ChatGPT-User — OpenAI (User-requested fetching)
• Perplexity-User — Perplexity (User-requested fetching)
• Meta-ExternalFetcher — Meta (Real-time content fetching)
• Applebot — Apple (Siri, Spotlight, Safari)
• Google-CloudVertexBot — Google (Cloud AI services)
• Amzn-SearchBot — Amazon (Alexa and Rufus search)
📋 Note on Google-Extended
The UK CMA has proposed giving publishers the ability to opt out of AI training and AI Overviews (consultation closes Feb 2026). Blocking Google-Extended via robots.txt is currently how publishers exercise this control.
🎯 Content Signals (Cloudflare AI Policy)
Content Signals allow sites to express AI usage preferences directly in robots.txt. Adopted by 3.8M+ domains using Cloudflare's managed robots.txt.
• search — Traditional search indexing (not AI summaries)
• ai-input — AI search, RAG, grounding
• ai-train — Model training & fine-tuning
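A site expresses these preferences with a `Content-Signal` line in its robots.txt. The directive syntax below follows Cloudflare's published Content Signals Policy; the specific yes/no values are just an illustration:

```
# robots.txt with Content Signals: allow search and AI input, refuse training
User-Agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
```

Crawlers that don't understand `Content-Signal` simply ignore the line, so the directive degrades gracefully alongside ordinary Allow/Disallow rules.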
⚠️ Note: Some sites block policy checkers
PolicyCheck runs from cloud infrastructure, and some sites (like Medium) block datacenter IPs to prevent scraping. This creates a paradox: robots.txt exists for bots to check policies, but compliance checkers may be blocked from accessing it.
Legitimate crawlers (Googlebot, GPTBot, ClaudeBot) solve this by publishing their IP ranges and supporting reverse DNS verification, so sites can allow them through while still blocking scrapers. Most sites work fine, but aggressive blockers may fail. Learn more
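Reverse DNS verification works by resolving the connecting IP to a hostname, checking that hostname against the crawler's published domain, then resolving the hostname forward again to confirm it maps back to the same IP (forward-confirmed reverse DNS). A sketch, with the verification domains below taken from Google's and Apple's published guidance but included here as assumptions:

```python
import socket

# Published verification domains (assumed values for illustration)
VERIFIED_DOMAINS = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Applebot": (".applebot.apple.com",),
}

def hostname_matches(hostname: str, bot: str) -> bool:
    """Check a reverse-DNS hostname against the bot's published domains."""
    return hostname.rstrip(".").endswith(VERIFIED_DOMAINS[bot])

def verify_crawler_ip(ip: str, bot: str) -> bool:
    """Forward-confirmed reverse DNS: ip -> hostname -> back to the same ip."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
    except OSError:
        return False
    if not hostname_matches(hostname, bot):
        return False                                       # spoofable name fails here
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]  # forward-confirm
    except OSError:
        return False
```

The forward-confirmation step is what defeats spoofing: anyone can set a reverse DNS record claiming to be `*.googlebot.com`, but only Google controls the forward records for that domain.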