In the first installment in this series, we looked at creating a robots.txt file to manage how search engine Web crawlers (or “spiders”) index your site. By default, crawlers will try to index every file they can find on your site, which may not be desirable.
Although the basic syntax we looked at for robots.txt will cover most scenarios, there are some additional ways to manage Web crawlers—adjusting crawl rate, using meta tags, and creating a sitemap.
Take it slow
Every time a Web crawler indexes your pages, it gobbles up bandwidth. Depending on the content of your site, this could add up to a lot of data. And depending on your hosting account, large amounts of crawling could slow down your site for “real” visitors, and may even incur charges from your hosting provider if you exceed your allotted bandwidth.
To be clear, these negative scenarios are not typical. After all, search engines do not want to alienate Web sites, and most Web sites want to be found on search engines. For the average Web site, you don’t need to be concerned with how often your data is crawled.
However, if your site has been negatively impacted by a crawler (for example, you find an excessive number of hits from a crawler in your Web server logs), you can tell some crawlers to slow down their rate of indexing.
Two major crawlers, Yahoo (“slurp”) and Bing (“msnbot”), support the “crawl-delay” directive in robots.txt. Using this, you can specify a minimum time between hits, for example by placing the line “Crawl-delay: 3” beneath a “User-agent: *” line.
Using this syntax, a crawler that honors “crawl-delay” will wait at least three seconds between visits. Obviously, the larger the value you use, the more slowly your site will be crawled.
Although we’ve set a generic user agent (“*”) that matches all crawlers, only some—such as Yahoo and Bing—will honor the “crawl-delay” instruction.
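As a rough illustration of how a crawler might read this directive, the sketch below uses Python's standard urllib.robotparser module. The robots.txt content is an assumed example (generic user agent, three-second delay), and read_crawl_delay is a hypothetical helper name, not part of any crawler's real code.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: generic user agent with a three-second delay.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 3
"""


def read_crawl_delay(robots_txt, user_agent):
    """Return the crawl delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


# "msnbot" has no section of its own here, so the "*" section applies to it.
delay = read_crawl_delay(ROBOTS_TXT, "msnbot")
print(delay)
```

A polite crawler would then sleep for at least that many seconds between requests; a crawler that ignores the directive simply never looks it up.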
You’ll notice that we haven’t mentioned the most popular search engine of all, Google. This is because Google does not honor “crawl-delay.” However, you can use an alternate method to adjust Google’s crawl rate. You must create a Google Webmaster account, add your site to this account, and set the crawl rate value in your site configuration settings. Also note that Google will honor this setting for only 90 days.
Whether or not you use a robots.txt file, there is actually another way to issue directives to Web crawlers—using meta tags.
To use meta tags, it helps to have a basic familiarity with HTML. If you’ve ever coded or looked at the code behind a Web page, you may have seen meta tags. These must be placed within the <HEAD> section of an HTML document. For example:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
Above, the meta tag is directed at “robots,” which is the equivalent of the wildcard “*” character in robots.txt. Instructions to the crawler are contained in the “CONTENT” attribute, using a comma to separate multiple commands.
Since crawlers will, by default, index your page, you would use meta tags only to restrict access. Supported commands include:
NOINDEX: The Web crawler will not index this page.
NOFOLLOW: The Web crawler will not follow hyperlinks in this page to find more pages to index.
NOARCHIVE: Not directly related to indexing, but this will prevent search engines, such as Google and Bing, from storing a cached copy of the page.
The opposite commands, INDEX and FOLLOW, are also accepted, but since crawlers index the page and follow links by default, there is little reason to include them except for clarity when reading the source code.
Although the meta tag syntax is a very limited way to control crawlers, it can be a convenient way to prevent a crawler from indexing specific pages without needing to itemize them all in robots.txt.
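To make the mechanics concrete, here is a minimal sketch of how a program could pull the robots commands out of a page, using Python's standard html.parser module. The class and function names (RobotsMetaParser, robots_directives) and the sample page are assumptions for illustration; real crawlers have their own parsing logic.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the commands found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META NAME=...> matches too.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # Commands are comma-separated inside the CONTENT attribute.
        for command in attr.get("content", "").split(","):
            command = command.strip().upper()
            if command:
                self.directives.add(command)


def robots_directives(html):
    """Return the set of robots commands declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page using the tag shown earlier.
PAGE = """<html><head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head><body>Hello</body></html>"""

print(sorted(robots_directives(PAGE)))
```

A page with no robots meta tag yields an empty set, which corresponds to the default behavior of indexing the page and following its links.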
A third way to guide the Web crawler through your site is by using a sitemap. Simply put, a sitemap is a list of files on your site that you want the crawler to index. This file can consist of either a simple text file with a URL on each line, or an XML file with which you can provide some supplemental information about each page.
Note that a sitemap does not prevent the crawler from visiting pages that are not listed in the sitemap—the crawler will attempt to index any pages it finds through hyperlinks, aside from those barred by existing robots.txt rules. However, a sitemap will help the crawler find pages that may not be linked from elsewhere on your site, and if you use the XML format, you can hint to the crawler which pages are more important than others.
A plain text sitemap is quite simple: suppose we created a file called sitemap.txt, containing the full URL of one page on each line.
A single sitemap file can list at most 50,000 URLs. The size limit was originally 10MB; the current sitemaps.org protocol allows up to 50MB uncompressed.
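The following sketch shows one way to produce such a file from a list of URLs while checking it against the limits. The function name build_text_sitemap and the example.com URLs are hypothetical, and the default limits (50,000 URLs, 10MB) are assumptions you may need to adjust for the search engines you target.

```python
MAX_URLS = 50_000              # assumed per-file URL limit
MAX_BYTES = 10 * 1024 * 1024   # assumed size limit (10MB); adjust as needed


def build_text_sitemap(urls, max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    """Return plain text sitemap content: one URL per line."""
    urls = list(urls)
    if len(urls) > max_urls:
        raise ValueError(f"too many URLs: {len(urls)} > {max_urls}")
    body = "\n".join(urls) + "\n"
    # The size limit applies to the encoded file, so measure bytes, not characters.
    if len(body.encode("utf-8")) > max_bytes:
        raise ValueError("sitemap exceeds the size limit")
    return body


# Hypothetical pages on a hypothetical site.
sitemap_text = build_text_sitemap([
    "http://www.example.com/page1.html",
    "http://www.example.com/page2.html",
])
print(sitemap_text, end="")
```

Saving the returned string as sitemap.txt in the root of the site gives a file in the format described above.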
To use the sitemap, you add it to your robots.txt file using the “Sitemap:” instruction followed by the full URL of the sitemap file.
Note that you can add this line anywhere in your robots.txt file; it is not dependent on the “User-agent” section.
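As a quick check that the sitemap line is picked up regardless of where it sits, this sketch parses an assumed robots.txt with Python's standard urllib.robotparser (the site_maps() method requires Python 3.8 or later). The URL, the /private/ rule, and the helper name list_sitemaps are all illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: the Sitemap line sits outside any User-agent section.
ROBOTS_TXT = """\
Sitemap: http://www.example.com/sitemap.txt

User-agent: *
Disallow: /private/
"""


def list_sitemaps(robots_txt):
    """Return the sitemap URLs declared in robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


print(list_sitemaps(ROBOTS_TXT))
```

Moving the Sitemap line to the end of the file should give the same result, since the parser treats it independently of the user agent rules.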
Alternatively, you can provide more information about your pages using an XML sitemap. Although using XML is more complicated than plain text, it is relatively simple to follow by example:
<?xml version='1.0' encoding='UTF-8'?>
Everything outside the <url>…</url> tags is obligatory code that you can simply cut-and-paste into a text file. The real meat of the sitemap is the <url> section—you must create one for each link you want to include in the sitemap.
In the above example, there are two links, page1.html and page2.html. For each link you can specify three additional pieces of metadata:
<lastmod>: Date that the document was most recently updated.
<changefreq>: How often the page content changes—acceptable values include always, hourly, daily, weekly, monthly, yearly, never.
<priority>: Set to a value between 0 and 1, this tells the crawler how important the page is relative to other pages in your site.
Note that none of these values will affect the ranking of your pages in search engines; they only help the crawler figure out which pages in your site are the most relevant when your site turns up in search results.
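If you would rather generate the XML than type it by hand, the sketch below builds a two-page sitemap with Python's standard xml.etree.ElementTree module. The namespace is the one published by sitemaps.org; the function name build_xml_sitemap, the example.com URLs, and the metadata values are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_xml_sitemap(pages):
    """Return sitemap XML for a list of dicts with 'loc' and optional metadata."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        # The three optional pieces of metadata; skip any that are not supplied.
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    # Prepend the declaration by hand so the output is the same on every platform.
    declaration = "<?xml version='1.0' encoding='UTF-8'?>\n"
    return declaration + ET.tostring(urlset, encoding="unicode")


# Hypothetical pages and values.
pages = [
    {"loc": "http://www.example.com/page1.html", "lastmod": "2010-01-15",
     "changefreq": "weekly", "priority": 0.8},
    {"loc": "http://www.example.com/page2.html", "priority": 0.3},
]
xml_text = build_xml_sitemap(pages)
print(xml_text)
```

Because the metadata is optional, the second page here carries only a priority; a page entry with nothing but its URL is also acceptable.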
As with the plain text sitemap, you need to add a “Sitemap:” line to robots.txt to include your sitemap, this time giving the full URL of the XML file.
If your sitemap file is very large, you can compress the .txt or .xml file in gzip format.
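Compression can be done with any gzip tool; as one possible approach, this sketch uses Python's standard gzip module. The helper name gzip_sitemap and the sample content are assumptions for illustration.

```python
import gzip


def gzip_sitemap(sitemap_text):
    """Return the gzip-compressed bytes of a sitemap's text content."""
    return gzip.compress(sitemap_text.encode("utf-8"))


# Hypothetical plain text sitemap content.
SAMPLE = "http://www.example.com/page1.html\nhttp://www.example.com/page2.html\n"
compressed = gzip_sitemap(SAMPLE)

# Decompressing gives back the original text, so nothing is lost.
print(gzip.decompress(compressed).decode("utf-8") == SAMPLE)
```

The resulting bytes would be written to a file such as sitemap.txt.gz (a hypothetical name), and the sitemap line in robots.txt would then point at that compressed file.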
Aaron Weiss is a networking expert and Wi-Fi enthusiast based in upstate New York.