The ultimate guide to bot herding and spider wrangling — Part Two

In Part One of our three-part series, we learned what bots are and why crawl budgets are important. Let’s take a look at how to let the search engines know what’s important and some common coding issues.

How to let search engines know what’s important

When a bot crawls your site, there are a number of cues that direct it through your files.

Like humans, bots follow links to get a sense of the information on your site. But they’re also looking through your code and directories for specific files, tags and elements. Below, we’ll walk through a number of these elements.

Robots.txt

The first thing a bot will look for on your site is your robots.txt file.

For complex sites, a robots.txt file is essential. For smaller sites with just a handful of pages, a robots.txt file may not be necessary — without it, search engine bots will simply crawl everything on your site.
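If you do want a file in place on a small site, a minimal sketch that allows all bots to crawl everything looks like this (the empty Disallow line means “block nothing”):

User-agent: *
Disallow: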

There are two main ways you can guide bots using your robots.txt file.

1. First, you can use the “disallow” directive. This will instruct bots to ignore specific uniform resource locators (URLs), files, file extensions, or even whole sections of your site:

User-agent: Googlebot
Disallow: /example/
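As a sketch of the other cases mentioned above (the paths and file extension here are hypothetical), a single robots.txt file can also block an individual file, every URL ending in a given extension, or an entire section of the site. Google supports the * and $ wildcards used in the second line:

User-agent: Googlebot
Disallow: /private-file.html
Disallow: /*.pdf$
Disallow: /old-catalog/

Note that a directory rule like the last line blocks everything beneath that path, not just the directory index page.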

Although the disallow directive will stop bots from crawling particular parts of your site (thereby saving crawl budget), it will not necessarily stop those pages from being indexed and showing up in search results.
