For hackers looking to access passwords, user e-mails, retail transactions, and other private data, one of the most useful tools on the Web is also among the most popular: Google. Armed with nothing more than the world’s most popular search engine and some crafty search terms, an attacker can turn up troves of data whose owners probably never intended it for public exposure. Yet, technically speaking, neither Google nor those running these searches are doing any “hacking” at all.
Like most search engines, including Bing and Yahoo, Google continuously updates its index using automated software known as a “crawler” or “spider.” The crawler starts at a site’s entry page and then follows any and all links within that site, building an index of all the content it finds along the way. If documents containing sensitive information lie in the crawler’s path, it may well index those too. Often, Website owners simply don’t realize that this data is available to Google.
We won’t link to them here, but numerous hacker-oriented Websites demonstrate Google queries rigged to find password files, documents such as Word docs and PDFs, and much more. Now for the good news: you can and should protect your sensitive data from search engine snooping.
All major search engine crawlers are designed to respect properly coded “KEEP OUT” signs. You can tell the crawlers to keep out of selected areas of your site, or to skip indexing specific files. All of this is done using the “robots exclusion standard,” more commonly referred to as “robots.txt” because the instructions are contained in a file named robots.txt. (Search engines prefer to think of their indexing software as a robot rather than a type of arachnid.)
Meet robots.txt
Before a search engine crawler (or spider, or robot, depending on your preferred terminology) visits your Website, it has to find your site in the first place. This typically happens in one of two ways. The first is that you tell it to: search engines such as Google and Bing provide pages where you can submit your Website’s URL for indexing. The second is when another site that is already indexed by the search engines links to your site.
Either way, when the search engine indexer crawls its way into your site, the very first thing it looks for is a file called robots.txt. This file must exist in the top-level folder of your site, because that is the only place the crawler will look. If the file does not exist, the search engines interpret this as permission to index everything; in other words, the door is wide open.
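For example, for a site at www.example.com (a placeholder domain used here purely for illustration), the crawler will request:

http://www.example.com/robots.txt

A robots.txt file placed anywhere else, such as inside a subfolder, will simply never be found by the crawler.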
You can create the robots.txt file either manually, using a simple text editor, or using generator software that produces the syntax for you. Most robots.txt files are rather simple, so it makes sense to become familiar with the syntax, even if you ultimately decide to use a generator tool.
Basic syntax
Because the syntax for writing a robots.txt file is so simple, the easiest way to learn is by example. Let’s look at perhaps the simplest example of all:
User-agent: *
Allow: /
The first command, “User-agent,” identifies which search engine crawlers the rules that follow apply to. Each search engine crawler has a unique “user-agent,” or name. For example, the Google crawler’s user-agent is named Googlebot, Microsoft Bing’s is MSNBot, and Yahoo’s is named Slurp. While there is no single definitive source listing every search engine user-agent, published lists of crawler names cover most major search engines.
The asterisk (*) is a wildcard meaning “all,” and the single slash (/) means the top-level folder of your site. So the above example translates to “all search engines are allowed to index the whole site.”
Of course, this rule is the equivalent of having no robots.txt at all. But with one small change we can slam the door shut on all search engines:
User-agent: *
Disallow: /
With this example, all search engines are told that they cannot index anything in your Website. It is very important to understand what “all search engines” really means: all search engines that respect robots.txt. That does include all major search engines, but there is nothing preventing a rogue crawler from simply ignoring these rules. More on this later.
Getting specific with syntax
Of course, it is rare to exclude search engines from your site entirely. After all, search engines can bring traffic, and most sites want traffic. It is also unusual to specify a User-agent more specific than the wildcard, because most rules you write will apply to all search engines. In most cases, you want to exclude search engines only from portions of your site, or from particular files. Consider several examples:
User-agent: *
Disallow: /private/
Disallow: /private.html
Disallow: /private.pdf
Above, all user agents will be blocked from indexing any files within the private/ subfolder. They will also be blocked from indexing the single files private.html and private.pdf, assuming these files reside in the top-level folder.
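Although, as noted above, it is unusual to target one crawler specifically, you could apply rules to a single search engine by naming its user-agent instead of using the wildcard. Here Googlebot and the folder name are used purely as an illustration:

User-agent: Googlebot
Disallow: /private/

Other crawlers would not match this User-agent line, so they would ignore these rules and index the folder normally.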
Suppose we want to block private.pdf anywhere, no matter which folder it resides in:
User-agent: *
Disallow: */private.pdf
Here we’ve introduced a wildcard character into the disallow rule. Again, the asterisk means “anything” (or nothing). So this rule will match private.pdf in the top folder of the site or within any subfolder, including nested subfolders. Support for wildcards in allow/disallow rules is not universal among search engines, but the big three (Google, Microsoft, and Yahoo) do now support them.
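To make that concrete, with some hypothetical paths, the rule above would block all of the following:

/private.pdf
/docs/private.pdf
/docs/archive/private.pdf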
When the search engine compares your rules to the files in your site, it matches the filename from left to right. The above disallow rule would also block indexing of a file named private.pdfx, because “private.pdf” is contained in that filename when read from left to right. You can use another special character, the dollar sign ($), to make your rule even more specific:
User-agent: *
Disallow: */private.pdf$
The dollar sign means “end of match,” so this rule would disallow indexing of “private.pdf” but allow indexing of “private.pdfx”.
You can mix Allow and Disallow rules to create a more sophisticated filter:
User-agent: *
Disallow: */*.doc$
Allow: */public.doc$
The above rules will block indexing of any file that ends with a .doc extension, except for files named public.doc, which will be indexed wherever they appear in the folder hierarchy. Note that robots.txt rules are case-sensitive, meaning the above rules would still block a file named Public.doc, because the capital P does not match the public.doc exception.
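For instance, given a few hypothetical filenames, the rules above would behave like this:

reports/public.doc -- indexed (matches the Allow exception)
reports/Public.doc -- blocked (the capital P does not match public.doc)
reports/budget.doc -- blocked (matches the *.doc rule)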
Syntax generators
You can find (using a search engine!) quite a few “generators” online that will produce a robots.txt file for you. Unfortunately, many of these have poorly designed interfaces and aren’t much easier to use than simply writing the rules yourself. (Plus, some are hosted on ethically questionable sites.) One important exception is Google Webmaster Tools. To use Google’s robots.txt generator, you must have a Google account, log in to the Webmaster Tools site, and add a site to your portfolio. You can then click “Crawler Access” in the “Site Configuration” menu, which provides two very useful tools: “Test robots.txt” and “Generate robots.txt”.
Use the generator tool to construct the rules, and Google will spit out the syntax (like the examples you’ve seen in this tutorial), which you can cut and paste into your real robots.txt file. Even more useful is the testing tool. With it, you can write new rules on the fly and then enter an example URL to see how it would be filtered by those rules (the URL does not need to actually exist).
Maximum security
Truth be told, the most secure files are those that are not accessible at all. In other words, if you have sensitive documents within your Website, they are potentially vulnerable to being snooped. Remember that, as we noted earlier, robots.txt may not be obeyed by rogue search engines. So what good is it? In practice, only the major search engines have the resources to crawl large amounts of the Web, so keeping their hands off your private data does address a big chunk of potential exposure.
Sometimes you have sensitive data that has to be kept within your Website, because you or others need to access it via the Web. A properly coded robots.txt file will keep that information from winding up in public search results. But to take your security a step further, consider placing private Web-accessible files inside a password-protected folder. Precisely how to password-protect folders depends on your Web server and goes beyond the scope of this article. But when a folder is password protected, no search engine will be able to index the files within, regardless of robots.txt.
Aaron Weiss is a networking expert and Wi-Fi enthusiast based in upstate New York.