The online world is changing, and new rules are emerging to give publishers control over whether and how AI bots use their publicly available content for training. These guidelines, still at the proposal stage, suggest extensions to existing web standards, namely the Robots Exclusion Protocol and Meta Robots tags, so that publishers can block AI crawlers from scraping their websites. Crafted by Krishna Madhavan, Principal Product Manager at Microsoft AI, and Fabrice Canel, Principal Product Manager at Microsoft Bing, the proposal aims to give publishers a straightforward way to set permissions for AI training bots.
Why This Matters for Publishers
AI technology has grown rapidly, and AI bots now routinely crawl the web to gather training data, often without explicit permission from content creators. While search engine crawlers like Google’s and Bing’s honor standard robots.txt directives, AI training bots typically comply only when a rule explicitly addresses them. For publishers, the prospect of their content being used to train AI models without consent has raised legal concerns and frustration. The proposed guidelines would give publishers a simple, unified set of directives to control how mainstream AI crawlers access their websites.
Introducing Three Key Methods for Blocking AI Training Bots
The draft proposal offers publishers three ways to restrict or allow access by AI bots. Each builds on an existing mechanism but adds an explicit signal about whether content may be used for AI training.
1. Updated Robots.txt Protocol
The robots.txt file is a widely recognized tool that allows website owners to control which bots may crawl their sites. This new proposal recommends extending robots.txt to include options explicitly for AI training crawlers. Publishers would have two primary commands to choose from:
– DisallowAITraining – Instructs the AI training bot not to use the website’s data for model training.
– AllowAITraining – Grants AI training bots permission to use the data for training purposes.
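For illustration, a robots.txt file using the proposed directives might look like the sketch below. The bot name, the path-style value, and the directive placement are assumptions made for this example; the draft defines the exact rule syntax.
# Hypothetical example: allow normal crawling but opt out of AI training.
# The path value "/" mirrors the existing Disallow syntax and is an assumption here.
User-Agent: *
Allow: /
DisallowAITraining: /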
2. Meta Robots Tags
Meta Robots tags are HTML elements embedded in web pages that instruct crawlers on what they’re allowed to do. With the proposed additions, publishers could specify AI training permissions directly in their website’s code using:
– <meta name="robots" content="DisallowAITraining">
– <meta name="examplebot" content="AllowAITraining">
These tags make it easier to control AI bots without modifying server files, a helpful option for those less familiar with backend web management.
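As a minimal sketch, the tag sits in a page’s head section like any other meta tag; the surrounding markup below is illustrative only.
<!-- Hypothetical page excerpt: the meta robots tag belongs in the document head -->
<head>
  <title>Example page</title>
  <meta name="robots" content="DisallowAITraining">
</head>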
3. Application Layer Response Headers
Application layer response headers are sent by a server along with its reply to a client, such as a browser or bot. The proposal adds two headers that carry AI training permissions:
– DisallowAITraining – Specifies that data shouldn’t be used for AI training.
– AllowAITraining – Permits data use for AI training models.
By setting these headers, publishers can ensure bots receive clear instructions whenever they request data.
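As an illustration, a server response carrying the proposed header might look like the sketch below; whether the header takes a value, and what that value would be, is left to the draft, so none is shown here.
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
DisallowAITraining: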
The Role of the Internet Engineering Task Force (IETF)
The IETF, a global internet standards organization founded in 1986, has been instrumental in creating voluntary web standards like the Robots Exclusion Protocol, which it formalized as RFC 9309 in 2022. The current proposal builds on that foundation to address new concerns around the use of web content for AI training.
A Big Step Forward for Publisher Control
Historically, AI companies have defended scraping public data for training under the “fair use” doctrine, much as traditional search engines do. Lawsuits from content creators and website owners have challenged that position. These new guidelines would give publishers more specific tools to manage, and potentially restrict, how their content is used to train AI models.
For those interested, the full proposal is available to read on the IETF website under “Robots Exclusion Protocol Extension to Manage AI Content Use.”