Google may expand the list of unsupported robots.txt rules in its documentation based on analysis of real-world robots.txt data collected through HTTP Archive.

Gary Illyes and Martin Splitt described the project on the latest episode of Search Off the Record. The work started after a community member submitted a pull request to Google’s robots.txt repository proposing two new tags be added to the unsupported list.

Illyes explained why the team broadened the scope beyond the two tags in the PR:

“We tried to not do things arbitrarily, but rather collect data.”

Rather than add only the two tags proposed, the team decided to look at the top 10 or 15 most-used unsupported rules. Illyes said the goal was “a decent starting point, a decent baseline” for documenting the most common unsupported tags in the wild.

How The Research Worked

The team used HTTP Archive to study what rules websites use in their robots.txt files. HTTP Archive runs monthly crawls across millions of URLs using WebPageTest and stores the results in Google BigQuery.

The first attempt hit a wall. The team “quickly figured out that no one is actually requesting robots.txt files” during the default crawl, meaning the HTTP Archive datasets don’t typically include robots.txt content.

After consulting with Barry Pollard and the HTTP Archive community, the team wrote a custom JavaScript parser that extracts robots.txt rules line by line. The custom metric was merged before the February crawl, and the resulting data is now available in the custom_metrics dataset in BigQuery.

What The Data Shows

The parser extracted every line that matched a field-colon-value pattern. Illyes described the resulting distribution:

“After allow and disallow and user agent, the drop is extremely drastic.”

Beyond those three fields, rule usage falls into a long tail of less common directives, plus junk data from broken files that return HTML instead of plain text.

Google currently supports four fields in robots.txt. Those fields are user-agent, allow, disallow, and sitemap. The documentation says other fields “aren’t supported” without listing which unsupported fields are most common in the wild.

Google has clarified that unsupported fields are ignored. The current project extends that work by identifying specific rules Google plans to document.

The top 10 to 15 most-used rules beyond the four supported fields are expected to be added to Google’s unsupported rules list. Illyes did not name specific rules that would be included.

Typo Tolerance May Expand

Illyes said the analysis also surfaced common misspellings of the disallow rule:

“I’m probably going to expand the typos that we accept.”

His phrasing implies the parser already accepts some misspellings. Illyes didn’t commit to a timeline or name specific typos.

Why This Matters

Search Console already surfaces some unrecognized robots.txt tags. If Google documents more unsupported directives, that could make its public…


Source link

Disclaimer

We strive to uphold the highest ethical standards in all of our reporting and coverage. We blogs.grocliq.com want to be transparent with our readers about any potential conflicts of interest that may arise in our work. It’s possible that some of the investors we feature may have connections to other businesses, including competitors or companies we write about. However, we want to assure our readers that this will not have any impact on the integrity or impartiality of our reporting. We are committed to delivering accurate, unbiased news and information to our audience, and we will continue to uphold our ethics and principles in all of our work. Thank you for your trust and support.

Website Upgradation is going on for any glitch kindly connect at [email protected]

 

 

Categorized in:

Blog,

Last Update: April 23, 2026