
AI SEO – Robots.txt, Markdown, and How AI Providers are Crawling Your Sites

Explore how InMotion Hosting’s new AI SEO Helper helps websites stay visible in evolving AI-driven search patterns. Learn how to prepare your site for LLM crawlers and future-proof your SEO strategy.

Written by:
Todd Robinson

Please note: this article documents a vision of a product and a standard we see emerging in the market. It is intended to help both customers and ourselves understand how to respond to and leverage the power of new AI systems and evolving search patterns. It’s a work in progress! With that, our announcement.

We are launching a new service to help our customers and other professional website managers navigate the changes brought on by AI providers increasingly handling search queries. We use a process ourselves that we want to share to help ensure your site is AI-ready. For now, we’re calling it the InMotion AI SEO Helper.

In this post, I will refer to both our website and a set of anonymized websites. As a hosting company, we can see aggregate patterns across many sites and those patterns closely match what is happening on the inmotionhosting.com website.

You will be able to use a partial version of the AI SEO Helper right from our website at inmotionhosting.com/services/ai-seo-helper to get an idea of how it works. If you need more than what that provides, you will need to sign up, for free, to use the full AI SEO Helper. Please note that in times of resource contention, our customers have first priority in the system.

The tool will check your website and, per the current plan, will do the following by Version 2 (Version 1 will cover a subset, of course):

  • Ensure the site has a robots.txt file and identify what is missing
  • Ensure the site has a sitemap.xml and identify what is missing
  • Check for the presence of .md files
  • Check whether the site includes an llms.txt file* (see the note on this in the llms.txt section below)
  • Verify that the site is not unintentionally blocking LLM crawlers

As mentioned above, the tool identifies what may be missing. At this point, no one knows with certainty what needs to be done, as the standard is still evolving.
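As a rough sketch of what those presence checks amount to, the snippet below simply requests the common crawler-guidance files over HTTP. The audit_site() function, the example URL, and the logic are our own illustration, not the actual AI SEO Helper code:

```python
# pip install requests
import requests

FILES_TO_CHECK = ["robots.txt", "sitemap.xml", "llms.txt"]

def audit_site(base_url: str) -> dict:
    """Check whether the common crawler-guidance files exist on a site."""
    results = {}
    for name in FILES_TO_CHECK:
        url = f"{base_url.rstrip('/')}/{name}"
        try:
            response = requests.get(url, timeout=15)
            results[name] = response.status_code == 200
        except requests.RequestException:
            results[name] = False
    return results

if __name__ == "__main__":
    for filename, present in audit_site("https://www.example.com").items():
        print(f"{filename}: {'found' if present else 'missing'}")
```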

Our view of “what should be done” to help the AI crawlers is based on our ongoing experience. We’ll link to supporting resources as they’re published, so pardon the lack of links for now.

 

Crawling, Training, Searching – Plus New Sales

Let’s start with this: sales are already coming in from these new search patterns. People are going to their favorite AI chatbot, doing research with the intent to purchase, and coming to our sites to complete the purchase. I have seen this happen myself. The pattern is not fully understood yet, and it is not clear how much of that purchase flow will shift from Google searches to ChatGPT and similar tools.

The information below outlines what we’re seeing. I am not talking about whether websites, papers, books, and so on should be used to train LLMs without attribution to the sources they were trained on. That is a legitimate concern, and I have views on it that I will publish another time. For this discussion, I am talking about websites that have already accepted that Google and its peers will crawl and ingest their content in exchange for the visitors, and revenue, those searches send their way.

Many “AI companies” are crawling sites right now. Several major players, including OpenAI and Anthropic, have published guidance on how they respect robots.txt and what User-Agent strings they present to your web server. We’ve observed this activity in server logs.
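As an illustration, a robots.txt that explicitly addresses some of the documented AI crawler User-Agents might look like the snippet below. The agent names shown (GPTBot, ChatGPT-User, ClaudeBot, Google-Extended, PerplexityBot) are the ones we understand the providers to have published at the time of writing; verify against each provider’s documentation before relying on the list, and the sitemap URL is a placeholder:

```
# Allow AI crawlers for both training and "right now" lookups (adjust to taste)
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

# Google-Extended controls use of content for Gemini training,
# separate from normal Googlebot search indexing
User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```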

What is not clear is whether crawls for inclusion in training data sets will follow a different pattern than crawls driven by “right now” information needs. The “right now” information needs are defined as:

  • Parallel Page Crawls – when a user asks a service such as Claude or ChatGPT (Deep Research, for example) to perform searches, the process includes visiting many pages in parallel for the LLM to then evaluate.
  • Recent Data Needed – when a user is seeking information that is unlikely to be current in the LLM’s working data set, the LLM will check websites on the fly to collect recent information.
  • Specific Request – when a user specifically asks for certain information like a webpage or video to be ingested by the LLM and summarized for usage.
  • Other reasons

“Right now” crawls happen with a certain level of urgency that manifests itself in rapid parallel page requests to your website. We may wish these services would meter their requests more, but realistically they are trying to meet a user-experience goal, and speeding up data collection is an easy way to do that.

Either way, when a page is crawled the main purpose is to ingest that page and convert it to a machine-ready format. At its simplest, it is converted to Markdown: a text-based representation of the page’s content, including text representations of tables and images. Several popular systems do this, but each crawling tool does it a bit differently. The open-source ones are available for us to evaluate; the ones running behind the scenes at hosted services are less obvious, but we expect them to be using one of the popular libraries.
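As a minimal sketch of that conversion step, the snippet below fetches a page and converts it to Markdown with the open-source html2text library (one of several options; tools like Crawl4AI and Docling take their own approaches). The URL is a placeholder:

```python
# pip install requests html2text
import requests
import html2text

def page_to_markdown(url: str) -> str:
    """Fetch a rendered HTML page and return a Markdown representation."""
    html = requests.get(url, timeout=30).text

    converter = html2text.HTML2Text()
    converter.ignore_links = False   # keep links; crawlers usually want them
    converter.body_width = 0         # do not hard-wrap lines
    return converter.handle(html)

if __name__ == "__main__":
    print(page_to_markdown("https://www.example.com/about-us"))
```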

In addition to single-page crawls, we see crawlers designed to read the sitemap.xml file. From that, a crawler can visit each URL and produce a matching Markdown file, typically just one .md file per crawled page.
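A rough, self-contained sketch of that sitemap-driven pass is below; the sitemap URL and output directory are placeholders, and the flat slug-based filenames are our own illustration rather than any crawler’s actual layout:

```python
# pip install requests html2text
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.parse import urlparse
import html2text
import requests

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_to_markdown(sitemap_url: str, out_dir: str = "md-output") -> None:
    """Read a sitemap.xml and write one .md file per listed URL."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=30).content)

    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        page_url = loc.text.strip()
        markdown = html2text.html2text(requests.get(page_url, timeout=30).text)

        # Build a flat filename from the URL path, e.g. "about-us.md"
        slug = urlparse(page_url).path.strip("/").replace("/", "_") or "index"
        out_path = Path(out_dir) / f"{slug}.md"
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(markdown, encoding="utf-8")

if __name__ == "__main__":
    sitemap_to_markdown("https://www.example.com/sitemap.xml")
```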

For example, let’s take a page called “about-us”. It could be a static page, a page created by a web app, or a page rendered server side, as with WordPress; either way, it ends up rendered in the browser. The page is rich in graphics, colors, layout, images, and so on for a person to read and absorb. For the most common use cases, an LLM needs this rich content translated to Markdown so it can absorb it easily.

Our system will produce some of these as public-facing URLs, likely with the following file structure:

  • /inmotion-ai-helper/openai/directory/about-us.md
  • /inmotion-ai-helper/claude/directory/about-us.md
  • /inmotion-ai-helper/gemini/directory/about-us.md
  • /inmotion-ai-helper/opencrawl/directory/about-us.md
  • /inmotion-ai-helper/crawl4ai/directory/about-us.md
  • /inmotion-ai-helper/docling/directory/about-us.md

As you can see, there are several popular crawlers out there. We will cover a few of them in future technical evaluation videos and posts as our evaluations progress. The main point, though, is that we plan to use each individual crawler to produce a .md file specific to it. That crawler can then simply read its own .md file, which makes things much, much faster and saves each company using that crawler from having to process the same page into Markdown again.
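A crawler-side sketch of how that might work is below. The /inmotion-ai-helper/{crawler}/ path layout comes from the list above (with the page path standing in for “directory”), while the fetch-and-fall-back logic is purely our own illustration:

```python
# pip install requests
import requests
from urllib.parse import urlparse

def fetch_page_for_crawler(page_url: str, crawler: str = "openai") -> str:
    """Try the pre-generated per-crawler .md first, then fall back to the HTML page."""
    parsed = urlparse(page_url)
    slug = parsed.path.strip("/") or "index"
    md_url = f"{parsed.scheme}://{parsed.netloc}/inmotion-ai-helper/{crawler}/{slug}.md"

    response = requests.get(md_url, timeout=30)
    if response.ok:
        return response.text                              # pre-rendered Markdown
    return requests.get(page_url, timeout=30).text        # fall back to raw HTML

if __name__ == "__main__":
    content = fetch_page_for_crawler("https://www.example.com/about-us", crawler="claude")
```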

On our side, we will watch for major updates to the crawlers and can trigger updates to the .md files occasionally. We are still thinking about how often that should happen, and whether the crawler itself could trigger a fresh update of the .md files with a simple API call to our service.
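Nothing like this exists yet. Purely as a hypothetical illustration of that “simple API call,” a refresh trigger might be no more than:

```python
# Hypothetical only: neither this endpoint nor its parameters exist today.
import requests

requests.post(
    "https://ai-helper.example.com/api/v1/refresh",   # placeholder endpoint
    json={"site": "example.com", "path": "/about-us", "crawler": "openai"},
    timeout=30,
)
```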

Of note, we will also be working with the crawler providers themselves to see what might help them out.

 

LLMs.txt vs Robots.txt

A while back, the concept emerged of loading guidance specific to LLMs into a new llms.txt file, similar to the robots.txt file. The debate now is whether a separate file is the right choice. Crawlers are robots, and the well-written ones already respect robots.txt. The idea of an llms.txt made sense to me the first time I read about it, but after thinking it through, it feels like the problem is either solved already by robots.txt or should be solved with some minor additions to it.

Here is an example from our llms.txt on the inmotionhosting.com site. I will stay out of the argument for the moment and let the usage pattern guide us. Currently, access to that file is negligible compared to site traffic and robots.txt requests, so for now let’s call it “not a thing,” but we will keep watching. The idea is right, though, so hopefully crawlers start respecting one or the other.

[Image: Example of InMotion Hosting’s llms.txt file]
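For readers who haven’t seen the format, an llms.txt following the proposed convention is a short Markdown file: a title, a one-line summary, and curated lists of links. The snippet below is an illustrative sketch with placeholder names and URLs, not a copy of our actual file:

```
# Example Company

> Example Company provides web hosting, domains, and managed WordPress services.

## Docs

- [Getting started](https://www.example.com/docs/getting-started.md): How to launch a site
- [Support center](https://www.example.com/support/): Tutorials and troubleshooting guides

## Optional

- [Blog](https://www.example.com/blog/): News and longer-form articles
```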

 

Intentional or Accidental Blocking of Crawlers

It is important to know whether your website is crawlable. If you want to block crawlers, this isn’t the post for that; you can check out this page for possible methods, but in the end it is not really possible to cut off access to public content.

For this post, we are focusing on confirming that your pages are crawlable because you want your content in the major LLMs both during training and during “right now” lookups. For me, a quick spot check is to go into my top four AI chatbots and ask each one to access a page on our site. If it can’t, we have a problem.
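Alongside that manual spot check, you can verify that your own robots.txt isn’t accidentally blocking the documented AI User-Agents. A minimal sketch using Python’s standard library is below; the agent list is the same assumption as earlier, so confirm it against each provider’s documentation:

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "Google-Extended", "PerplexityBot"]

def check_crawlability(site: str, page: str = "/") -> None:
    """Report whether each AI User-Agent may fetch a page according to robots.txt."""
    parser = RobotFileParser()
    parser.set_url(f"{site.rstrip('/')}/robots.txt")
    parser.read()

    for agent in AI_AGENTS:
        allowed = parser.can_fetch(agent, f"{site.rstrip('/')}{page}")
        print(f"{agent:15} {'allowed' if allowed else 'BLOCKED'} for {page}")

if __name__ == "__main__":
    check_crawlability("https://www.example.com", "/about-us")
```

Note that this only reads robots.txt rules; it will not reveal server-side or CDN-level blocking, which is why the chatbot spot check still matters.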

Cloudflare is also trying a few things that I am concerned about. I’ll post more about this and ways to test crawlability.

 

Next Steps and Open Questions

This space is rapidly evolving, and we’re taking an iterative approach. Here are a few questions we’re still working through:

  • Which Markdown outputs should we support?
  • How much of this is already done by the big AI bots? They are likely already caching Markdown for popular sites, but the tools definitely still crawl sites on demand, so for now it matters.
  • Should we think about whether this content should simply be hosted by us, for example at ai-helper-cdn.inmotionhosting.com/sitename/openai/directory/filename.md?
  • llms.txt – we are tracking this and will include it for now. Later we can either double down on it or deprecate it if crawlers stick with robots.txt.
  • When a customer publishes new pages to their site, how often should we audit that and update the .md and .xml files?
  • Should we integrate with a Git-based workflow to make this easier?
  • How can we best support WordPress users? Should this integrate with our Total Cache plugin?

We have a lot to work through, but we wanted to share our direction and raise awareness: sales are already coming in from these tools. They already matter, and they will only grow in importance for years to come.
