The robots exclusion standard is nearly 25 years old, but the security risks created by improper use of the standard are not widely understood. In this blog post I will explain those risks, the common confusion about the standard’s purpose, and how to use it properly so that your sensitive data stays protected.
What is the robots exclusion standard and what is a robots.txt file?
The robots.txt file is used to tell web crawlers and other well-meaning robots a few things about the structure of a website. It is openly accessible and can also be read and understood quickly and easily by humans. The robots.txt file can tell crawlers where to find the XML sitemap file(s), how fast the site can be crawled, and, most famously, which web pages and directories not to crawl.
Before a good robot crawls a web page, it first checks for the existence of a robots.txt file and, if one exists, respects the directives found within. The robots.txt file is one of the first things new SEO practitioners learn about. It seems easy to use and powerful. This set of conditions unfortunately results in well-intentioned but very high-risk use of the file.
In order to tell a robot not to crawl a web page or directory, the robots exclusion standard relies on “Disallow” declarations – in which a robot is “not allowed” to access the page(s).
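To make the behavior of a “good robot” concrete, here is a minimal sketch using Python’s standard-library `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the example.com URLs are all made up for illustration. Note that support for `Crawl-delay` varies between crawlers, so treat that line as a hint rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /drafts/
Sitemap: https://www.example.com/sitemap.xml
"""

# Parse the rules from a string instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def may_crawl(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if a well-behaved crawler would fetch this URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_crawl("https://www.example.com/drafts/post.html"))  # False
    print(may_crawl("https://www.example.com/blog/"))             # True
    print(parser.crawl_delay("ExampleBot"))                       # 10
    print(parser.site_maps())  # available in Python 3.8 and later
```

The important point is that the check is entirely voluntary: nothing on the server enforces it, and a crawler that never calls something like `may_crawl` is free to request any URL it likes.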
The robots.txt security risk
The robots.txt file isn’t a hard directive; it is merely a suggestion. Good robots like Googlebot respect the directives in the file. Bad robots, though, may ignore it completely, or worse. In fact, some nefarious robots and penetration-testing tools specifically look for robots.txt files for the very purpose of visiting the disallowed site sections.
If a villainous actor – whether human or robot – is trying to find private or confidential information on a website, the robots.txt file’s disallow list can serve as a map. It is the first, most obvious place to look. In this way, if a site administrator thinks they are using the robots.txt file to secure their content and keep pages private, it is very likely they are doing the exact opposite.
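To show how little effort this takes, here is a rough sketch of what such a “map reader” could look like. The function name and the sample file are hypothetical; the point is only that a handful of lines turns a disallow list into a list of places to look.

```python
from __future__ import annotations


def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every non-empty Disallow value from robots.txt text."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/new-pricing.html  # not launched yet
Disallow:
"""

if __name__ == "__main__":
    for path in disallowed_paths(SAMPLE):
        print(path)
```

Anyone who can load the robots.txt file can run something like this, which is why the file should never mention a path you are relying on nobody finding.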
There are also many cases in which the files being excluded via the robots exclusion standard are not truly confidential in nature, but it is not desirable for a competitor to find the files. For instance, robots.txt files can contain details about A/B test URL patterns or sections of the website which are new and under development. In these cases, it might not be a true security risk, but still there are risks involved in mentioning these sensitive areas in an accessible document.
Best practices for reducing the risks of robots.txt files
There are a few best practices for reducing the risks posed by robots.txt files.
1) Understand what robots.txt is for – and what it isn’t for
The robots exclusion standard will not help to remove a URL from a search engine’s index, and it won’t stop a search engine from adding a URL to its index. Search engines can still add a URL to their index even if they’ve been instructed not to crawl it, for example when other pages link to it. Crawling and indexing are distinct activities, and the robots.txt file does nothing to stop the indexing of URLs.
2) Be careful when using both noindex and robots.txt disallow at the same time
It is an exceedingly rare case in which a page should have both a noindex tag and a robots.txt disallow directive. In fact, such a use case might not actually exist. A crawler that obeys the disallow directive never fetches the page, so it never sees the noindex tag, and the URL can stay in the index anyway. Google used to show this message in the results for these pages, rather than a description: “A description for this result is not available because of this site’s robots.txt”. Lately this seems to have changed to “No information is available for this page” instead.
3) Use noindex, not disallow, for pages that need to be private yet publicly accessible
By doing this you can ensure that if a good crawler finds a URL that shouldn’t be indexed, it will not be indexed. For content with this required level of security, it is OK for a crawler to visit the URL but not OK for the crawler to index the content. For pages that should be private and not publicly accessible, password protection or IP whitelisting are the best solutions.
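As a rough illustration of the noindex approach, the sketch below is a tiny WSGI application that marks a page as noindex in two ways: with a robots meta tag in the HTML and with an `X-Robots-Tag` response header. The page content and the `app` name are hypothetical, and in practice you would normally set one of these in your CMS or web server configuration rather than by hand.

```python
# Hypothetical page body; the robots meta tag asks crawlers not to index it.
PAGE = b"""<!doctype html>
<html>
  <head>
    <meta name="robots" content="noindex">
    <title>Internal price list</title>
  </head>
  <body>Publicly reachable, but not meant for search results.</body>
</html>
"""


def app(environ, start_response):
    """Serve the page and repeat the noindex signal as a response header."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # Header form of the same instruction; also usable for PDFs, images, etc.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [PAGE]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    # Serves on localhost:8000 until interrupted.
    make_server("127.0.0.1", 8000, app).serve_forever()
```

Either signal only works if the crawler is allowed to fetch the page, so the URL must not also be disallowed in robots.txt.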
4) Disallow directories, not specific pages
By listing specific pages to disallow, you are simply making it that much easier for bad actors to find the pages you do not want them to find. If you disallow a directory, the nefarious person or robot might still be able to find the ‘hidden’ pages within the directory via brute force or the inurl search operator, but the exact map of the pages won’t be laid out for them. Be sure to include an index page, a redirect, or a 404 at the directory index level to ensure your files aren’t accidentally exposed via an “index of” page. If you create an index page for the directory level, certainly do not include links to the private content!
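One way to audit an existing file for this problem is a small check like the one below. It is only a sketch, and it rests on a crude assumption of mine: a disallowed path is treated as a “specific page” when its last segment contains a dot (such as `.html` or `.pdf`). The function name and sample rules are hypothetical.

```python
from __future__ import annotations


def page_level_disallows(robots_txt: str) -> list[str]:
    """Return Disallow values that appear to name a single file."""
    flagged = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        path = value.strip()
        if not sep or field.strip().lower() != "disallow" or not path:
            continue
        # Heuristic (an assumption): a dot in the last segment means a file.
        last_segment = path.rsplit("/", 1)[-1]
        if "." in last_segment:
            flagged.append(path)
    return flagged


# Hypothetical robots.txt content, for illustration only.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Disallow: /private/salaries-2019.pdf
Disallow: /tests/variant-b.html
"""

if __name__ == "__main__":
    for path in page_level_disallows(SAMPLE_RULES):
        print("Consider disallowing the parent directory instead of:", path)
```

The heuristic will miss extensionless page URLs, so treat the output as a starting point for a manual review rather than a complete list.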
5) Set up a honeypot for IP blacklisting
If you want to take your security to the next level, consider setting up a honeypot using your robots.txt file. Include a disallow directive in robots.txt that sounds appealing to bad guys, like “Disallow: /secure/logins.html”. Then, set up IP logging on the disallowed resource. Any IP address that attempts to load “logins.html” should then be blacklisted from accessing any portion of your website moving forward.
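Here is a minimal sketch of that idea as WSGI middleware. It is illustrative only and makes several assumptions: the blacklist is an in-memory set that is lost on restart, `REMOTE_ADDR` is taken to be the real client address (which will not hold behind a proxy or load balancer), and the `HoneypotMiddleware` class name is my own.

```python
# Must match the bait path listed in robots.txt.
HONEYPOT_PATH = "/secure/logins.html"


class HoneypotMiddleware:
    """Block every client that has ever requested the honeypot path."""

    def __init__(self, app, honeypot_path=HONEYPOT_PATH):
        self.app = app
        self.honeypot_path = honeypot_path
        self.blocked_ips = set()  # assumption: in-memory only, not persisted

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        # Touching the bait URL puts the client on the blacklist.
        if environ.get("PATH_INFO") == self.honeypot_path:
            self.blocked_ips.add(ip)
        # Blacklisted clients are refused on every path, including the bait.
        if ip in self.blocked_ips:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return self.app(environ, start_response)
```

You would wrap your existing application with it, for example `application = HoneypotMiddleware(application)`. A production setup would more likely log the hit and feed the address to a firewall or WAF rule, and should make sure no legitimate page links to the honeypot URL.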
The robots.txt file is a critical SEO tool for instructing good robots on how to behave, but treating it as if it were somehow a security protocol is misguided and dangerous. If you have web pages which should be publicly accessible but not appear in search results, the best approach is to use a noindex robots meta tag on the pages themselves (or an X-Robots-Tag response header). Simply adding a list of URLs intended to be private to a robots.txt file is one of the worst ways of trying to keep URLs hidden, and in most cases it results in exactly the opposite of the intended outcome.