Steps to Manage AI Crawlers
If You Choose to Encourage AI Crawlers (Recommended for Group 1)
To maximize visibility in AI search results, follow InMotion Hosting’s guide to encourage AI crawlers:
1. Optimize Your robots.txt File
Update your robots.txt to allow crawlers like GPTBot, ClaudeBot, and PerplexityBot. Example:
```
# Allow beneficial AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
```
2. Test Your robots.txt File
Use Google Search Console to confirm your updated robots.txt doesn't block standard search engine bots. You can also check specific user agents programmatically, as shown below.
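Beyond Search Console, Python's standard-library urllib.robotparser can verify how your live robots.txt treats specific user agents. A minimal sketch, with example.com standing in for your own domain:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt (example.com is a placeholder)
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Verify AI crawlers are allowed and search bots remain unblocked
for agent in ["GPTBot", "ClaudeBot", "PerplexityBot", "Googlebot"]:
    print(agent, "allowed:", rp.can_fetch(agent, "https://example.com/"))
```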
3. Structure Content for AI
Use clear, concise text and structured data (e.g., schema markup, sketched below) to make your content AI-friendly. Convert PDFs to Markdown, as LLMs process this format effectively. Example:
- Original PDF: Product catalog with detailed descriptions.
- Markdown Conversion: Bullet-pointed features, prices, and specifications.
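On the schema markup side, JSON-LD gives crawlers machine-readable context about a page. A minimal Product snippet (all product details here are hypothetical) placed in the page's <head> could look like this:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Widget",
  "description": "A sample product entry for illustration only.",
  "offers": {
    "@type": "Offer",
    "price": "29.99",
    "priceCurrency": "USD"
  }
}
</script>
```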
4. Monitor Crawler Activity
Use server logs to track crawler visits (e.g., GPTBot, CCBot). InMotion Hosting is evaluating observability tools to provide insights into AI crawler behavior, though we’re not yet recommending specific solutions.
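For a quick read on crawler volume, a short script can tally requests per bot from a standard access log. This is a minimal sketch assuming the common combined log format; the log path is hypothetical:

```python
from collections import Counter

# AI crawler user-agent substrings to look for (extend as needed)
BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider", "Amazonbot"]

counts = Counter()
with open("/var/log/apache2/access.log") as log:  # hypothetical path
    for line in log:
        for bot in BOTS:
            if bot in line:
                counts[bot] += 1
                break

# Report request volume per crawler, highest first
for bot, hits in counts.most_common():
    print(f"{bot}: {hits} requests")
```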
5. Leverage Rich Content
Don’t shy away from PDFs or multimedia. AI crawlers increasingly handle rich formats, and our Markdown conversion process ensures compatibility. For example, a product datasheet converted to Markdown is more likely to be cited accurately in AI responses than the original PDF; one conversion approach is sketched below.
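One way to automate PDF-to-Markdown conversion is with the open-source Docling library (also listed in the crawler table below). This sketch follows Docling's documented quickstart; the file paths are hypothetical:

```python
from docling.document_converter import DocumentConverter

# Convert a PDF (hypothetical path) into Markdown for AI-friendly publishing
converter = DocumentConverter()
result = converter.convert("product-datasheet.pdf")

markdown = result.document.export_to_markdown()
with open("product-datasheet.md", "w") as out:
    out.write(markdown)
```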
6. Track AI Search Performance
Run control questions like ours to assess how AI platforms represent your brand. Adjust content based on whether competitors appear or if citations are accurate.
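If you want to automate these checks, one approach is to script recurring control questions against an AI provider's API and scan the answers for brand mentions. A rough sketch using OpenAI's Python SDK (the questions and model name are placeholders, and this queries the model directly rather than a live AI search product, so treat the output as a rough signal):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Placeholder control questions; replace with your own brand checks
questions = [
    "What are the best WordPress hosting providers?",
    "Who offers managed VPS hosting with good support?",
]

for q in questions:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": q}],
    )
    answer = response.choices[0].message.content
    print(q)
    print("Mentions our brand:", "InMotion" in answer)
    print("---")
```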
If You Choose to Block AI Crawlers (Worth Considering for Group 2)
If you’re a Group 2 business or concerned about unauthorized data use, follow these steps to block AI crawlers:
1. Update Your robots.txt File
Add directives to disallow specific crawlers. Example:
```
# Block AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```
2. Include open-source crawlers like Crawl4ai, Firecrawl, and Docling, which collect data for retrieval-augmented generation (RAG) and AI search. Example directives are shown below.
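Open-source tools don't always send a consistent user-agent string, so treat these directives as illustrative; check your server logs for the exact strings hitting your site:

```
# Block open-source AI scrapers (user-agent strings are illustrative)
User-agent: Crawl4ai
Disallow: /

User-agent: Firecrawl
Disallow: /

User-agent: Docling
Disallow: /
```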
3. Implement Server-Level Blocking
Use a firewall or bot management solution (e.g., Cloudflare) to block crawler IP addresses or user agents. This is effective against rogue crawlers that ignore robots.txt, like some instances of Bytespider.
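For example, on a server you manage, nginx can refuse matching user agents outright. A minimal sketch, blocking by user-agent string rather than IP, since crawler IP ranges change frequently:

```nginx
# Inside the relevant server { } block in nginx.conf
if ($http_user_agent ~* "(GPTBot|ClaudeBot|CCBot|Bytespider)") {
    return 403;  # refuse requests from matching AI crawlers
}
```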
4. Add Meta Tags
Include “noai” and “noimageai” directives in a robots meta tag in your site’s <head> section to signal that your content shouldn’t be used for AI training. Example:
```html
<meta name="robots" content="noai, noimageai">
```
5. Monitor Server Performance
AI crawlers can strain servers, especially for large WordPress sites. Check server logs for high request volumes from bots like GPTBot (569 million requests monthly, per Vercel data) and block aggressive crawlers to maintain site speed.
6. Explore Licensing Options
Consider pay-per-crawl models, like Cloudflare’s beta program, to monetize your content. This allows you to charge AI companies for access while controlling usage.
Common AI Crawlers and Their Roles
Below is a table of common AI crawlers, including their purposes and behaviors:
| Crawler | Description |
|---|---|
| GPTBot (OpenAI) | Collects data to train OpenAI’s LLMs, like ChatGPT. It respects robots.txt but can crawl content-rich sites aggressively. |
| ChatGPT-User (OpenAI) | Fetches real-time data for ChatGPT user queries. It drives minimal traffic but enhances visibility in AI responses. |
| ClaudeBot (Anthropic) | Gathers data to train Anthropic’s Claude model. It’s selective, targeting high-quality content, and usually respects robots.txt. |
| anthropic-ai (Anthropic) | A legacy crawler for Anthropic’s AI training, now retired. It demonstrates how providers use multiple bots for different tasks. |
| CCBot (Common Crawl) | Builds open datasets for AI training, used by many LLMs. It honors robots.txt but crawls broadly across the web. |
| Google-Extended (Google) | Collects data for Google’s AI products, like Gemini. Blocking it doesn’t affect standard SEO or search rankings. |
| Amazonbot (Amazon) | Indexes content for Alexa’s answers and AI applications. It’s less aggressive but still consumes bandwidth. |
| PerplexityBot (Perplexity) | Powers Perplexity’s AI search with real-time data. It has been criticized for ignoring robots.txt on some sites. |
| Crawl4ai (Open Source) | Collects data for RAG and AI searches. Popular in open-source communities, it generally respects robots.txt but must be named explicitly in your directives. |
| Firecrawl (Open Source) | Scrapes data for AI training and searches. It’s lightweight but can strain servers if not managed. |
| Docling (Open Source) | Focuses on rich content like PDFs for AI datasets. It’s emerging as a key player in open-source document processing. |