If you've been following Google's recent updates, you'll likely already be aware that they've made a few robots.txt-related announcements. There are a few different components to these updates, so we wanted to break down what they are, why they matter, and how they affect you.
Definitions cheat sheet
You may find it helpful to familiarize yourself with these definitions before diving in!

- Robots Exclusion Protocol (REP): Created by Martijn Koster in 1994 to tell crawlers which parts of a website should and should not be accessed.
- Internet Standard: Defines protocols and procedures for the internet.
- Internet Engineering Task Force (IETF): An open, international community of individuals dedicated to the smooth operation of the internet. They produce technical documents that describe Internet Standards.
- Open Source: Not proprietary; code that is freely available and can be redistributed or modified.
- Request for Comments (RFC): Documents authored by engineers and computer scientists that describe methods and concepts, often for the purpose of being adopted by the IETF as an Internet Standard.
Google wants to make REP an official internet standard
On July 1, 2019, Google announced that they had worked together "with the original author of the protocol, webmasters, and other search engines" to document how the REP should be used on the modern web, so they could submit it to the IETF and have it approved as an official Internet Standard.

The draft they created doesn't change the original REP rules, but, drawing on 20 years of real-world experience with robots.txt, it does spell out specific scenarios and make the protocol applicable to the modern web.

Why is this significant? A few reasons:
- Robots.txt is used by so many websites (~500 million!) that it's frankly strange that it's not already an official internet standard. It's great to see Google prioritizing this.
- Making REP an official standard will help clear up the confusion about what robots.txt can and can't do. This documentation will make it much easier for SEOs and developers to find the information they need about how to create a robots.txt file that suits their needs.
- Google took the REP documentation beyond the basics, adding specific common scenarios that will make it all the more straightforward to figure out the right way to do things.
This news doesn't change anything about how robots.txt files should be formatted, but rather gives clearer direction.

View the IETF spec here.
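For reference, here's what a minimal robots.txt file using the core REP rules looks like. The site, bot name, and paths are hypothetical:

```
# robots.txt for https://www.example.com/
User-agent: *
Disallow: /admin/
Allow: /admin/help

User-agent: otherbot
Disallow: /

# Not part of REP itself, but commonly included and read by most crawlers
Sitemap: https://www.example.com/sitemap.xml
```

Each User-agent group applies to the named crawler (with * as the catch-all), and when rules overlap, the most specific (longest) matching rule wins; pinning down details like that is exactly what the draft does.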
Google makes its robots.txt parser open source
On the same day as the REP news, Google announced that its robots.txt parser is now open source. They explained that, while working to make REP an internet standard was an important step, it also meant extra work for developers who parse robots.txt files. In response, Google open-sourced the library their own systems use to parse robots.txt files.

Why is this significant? A few reasons:
- The open source robots.txt package also includes a testing tool you can use to check your robots.txt rules.
- This is the same code used by Google's crawler to determine which URLs it can access, so it will help developers build tools that better reflect Google's robots.txt parsing and matching (rather than our best guess as to how Google reads these files).
- Google said that this "paves the road for potential search open sourcing projects in the future." The future of search is looking a lot more transparent!
Want the open source robots.txt parser? Find it on GitHub!
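To make "parsing and matching" concrete, here's a short Python sketch. It uses Python's standard-library robots.txt parser rather than Google's open-sourced C++ library, purely to illustrate the question such a parser answers: may this crawler fetch this URL? The robots.txt content, crawler name, and URLs are made up:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A robots.txt matcher answers one question:
# given a user agent and a URL, is crawling allowed?
print(parser.can_fetch("MyCrawler", "https://example.com/private/report"))  # False
print(parser.can_fetch("MyCrawler", "https://example.com/blog/post"))       # True
```

Google's library answers the same question, but with the exact matching behavior Googlebot uses (including details like longest-match precedence), which is why tools built on it beat educated guesses.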
Google ditches unsupported robots.txt rules
The very next day, July 2, Google released more information on robots.txt. This time the update focused on unsupported rules. They said that open-sourcing their parser library allowed them to take a closer look at how robots.txt rules were being used, specifically focusing on usages that weren't supported by the internet draft. Those included:
- Crawl-delay
- Nofollow
- Noindex
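The original announcement was illustrated with an example robots.txt file containing a noindex rule. A hypothetical reconstruction (with made-up paths) might look like this; the last two lines are the kind of rules the internet draft doesn't cover:

```
User-agent: *
# Supported rule
Disallow: /checkout/
# Unsupported rules: Google stops handling these on September 1, 2019
Crawl-delay: 10
Noindex: /forum/
```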
They found that, when rules like noindex were used in robots.txt files, they contradicted other on-site rules "in all but 0.001% of all robots.txt files on the internet." These kinds of conflicting signals can affect a website's performance in search results in ways webmasters never intended.

So, since unsupported robots.txt rules often contradict other rules, and in preparation for future open source releases, Google is retiring all code that handles unsupported and unpublished rules on September 1, 2019.

Why is this significant? A few reasons:
- If you relied on your robots.txt file to noindex pages or sections of your site, that option will no longer work as of September. Your alternatives (use one of the following) are: add a noindex robots meta tag to those pages, remove the pages and serve a 404/410 status code, password protect them, use robots.txt to disallow search engines from crawling them, or use the Google Search Console URL removal tool (see the example after this list).
- Using "disallow" in your robots.txt can prevent crawling, but if a page is still linked to, it may get indexed even with this directive. With this announcement, Google has said they're looking to make disallowed-yet-indexed pages "less visible in the future."
- Many SEOs relied on robots.txt noindex as a band-aid solution when working with clients whose platforms or development resources didn't allow for easy noindexing. Without this option, some organizations may now be forced to deal with their larger platform or resource problems.
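As a concrete illustration of the first alternative, the noindex signal moves from robots.txt onto the page itself as a robots meta tag (the page here is hypothetical):

```html
<!-- In the <head> of each page you no longer want indexed -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the equivalent is returning an X-Robots-Tag: noindex HTTP response header.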
If you're using this unsupported solution, we recommend monitoring activity on your robots.txt noindexed pages in September. For example, if you used robots.txt to noindex /forum*, you can use Botify to monitor page activity specifically in that segment (active pages being those that have generated at least one organic visit within the last 30 days).

If you'd like to learn more about how Botify can help you monitor your site after changes like these happen (and they happen often!), book a demo with us. We'd love to show you around!