5 things all newbies need to know about crawlers and robots.txt files

Difficulty: Intermediate

Spend some time on the internet and it won’t be long until you hear mention of “robots”, “crawlers” and “spiders”. But what exactly are they, what do they do, and do you want them on your site?

Robots, crawlers and spiders are interchangeable names for the same thing. Put simply, these internet critters are programs that trawl the web automatically in order to index website content. Thousands of web crawlers are at work every day, and they serve many different purposes. Most commonly, search engines such as Google use robots to index websites and organise their rankings. These search engine crawlers scan the web for new pages, content and sites so that search results pages are kept up to date. However, not all robots are good. Spammers can use them maliciously to harvest user and customer information such as email addresses.

Almost everything that is publicly accessible on the internet will eventually be crawled. You want search engine spiders to be able to find your website and pages. Without them, you will never appear on Google or other search engines. However, there may be circumstances when you do not want everything to show up on Google. For example, you might have content or pages on your website that you do not want to appear for a search query. That's where robots.txt files come in.

Robots.txt files are instructions for crawlers and control how crawlers interact with your website. If there are pages on your website that you don’t want robots to access and crawl, your robots.txt file communicates this. A robots.txt file is the first thing a well-behaved crawler looks for when it visits your site.

However, you must also keep in mind that your robots.txt file is not binding. The malicious robots mentioned earlier, for example, can simply ignore your robots.txt instructions. Additionally, robots.txt files are publicly available; anyone can find your robots file and you can find anyone else's. Therefore, a robots.txt file should never be used to hide information, only to provide directions to crawlers.

Here are 5 essential things you need to know about robots.txt files:

1. How to find your robots.txt file

You can check whether any website has a robots.txt file by simply adding “/robots.txt” to the end of its domain, for example yourwebsite.com/robots.txt.
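As a rough illustration, the same lookup can be scripted. This Python sketch uses only the standard library; the helper name `robots_url` and the example address are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host: robots.txt sits at the root of the domain,
    # so any deeper path, query string or fragment is dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page deep inside a site.
print(robots_url("https://yourwebsite.com/images/gallery/photo.html"))
# prints: https://yourwebsite.com/robots.txt
```

Whichever page you start from, the result always points at the root of the domain.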

2. What to put in a robots.txt file

The syntax of a robots.txt file is very straightforward. The most basic file includes two main instructions: User-agent and Disallow.

The User-agent command determines which web crawlers listen to your instructions, and it is possible to design your robots.txt rules to address only specific crawlers. For example, using the syntax “User-agent: Googlebot” will instruct only Google's crawler to follow the rules that come after it in your file.

The User-agent directive should be the first line of each set of rules in your robots.txt file, so that crawlers know which instructions are meant for them. If you want your rules to be a blanket command to all crawlers and spiders that may visit your website, without having to specify each bot in turn, use an asterisk “*” as the value.

User-agent: * (applies to all robots)

The next entry in your robots.txt file is the “Disallow” command. This tells robots that a URL path should not be crawled and that you want it blocked. Each Disallow line takes one path, but you can add as many Disallow lines as you like, and they will all apply to the crawlers specified above them.
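To see how a rule-following crawler might read these two instructions, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The rules and paths (`/private/`, `/drafts/`, `/blog/`) are invented for illustration, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of two folders.
BASIC_RULES = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""

basic_parser = RobotFileParser()
basic_parser.parse(BASIC_RULES.splitlines())

# can_fetch() answers: may this user agent crawl this path?
private_ok = basic_parser.can_fetch("*", "/private/report.html")
blog_ok = basic_parser.can_fetch("*", "/blog/post.html")
print(private_ok, blog_ok)  # prints: False True
```

Paths that match no Disallow line stay open to crawling by default.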

The chart below, from Google, gives a fantastic guide to writing different types of disallow commands:

If there is a subdirectory URL that you do want visible within a disallowed parent URL, the “Allow” command lets you unblock specific locations on your website.

These directives (User-agent, Disallow, Allow) together are classed as a single entry in the robots.txt file. You can include as many additional entries as you need, for example to target specific crawlers and outline their restrictions in turn.
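The sketch below shows two such entries being evaluated with Python's standard `urllib.robotparser`. The paths and the crawler name `OtherBot` are hypothetical. One assumption to note: Python's parser applies the first rule that matches, so the Allow line is placed before the broader Disallow line; Google documents that it uses the most specific matching path instead, so behaviour can differ between parsers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file with two entries: one for Googlebot, one for everyone else.
GROUPED_RULES = """\
User-agent: Googlebot
Allow: /shop/sale/
Disallow: /shop/

User-agent: *
Disallow: /admin/
"""

grouped_parser = RobotFileParser()
grouped_parser.parse(GROUPED_RULES.splitlines())

# Googlebot follows only its own entry.
sale_ok = grouped_parser.can_fetch("Googlebot", "/shop/sale/item.html")
cart_ok = grouped_parser.can_fetch("Googlebot", "/shop/cart.html")

# Any other crawler falls back to the "*" entry.
other_admin_ok = grouped_parser.can_fetch("OtherBot", "/admin/login.html")
other_cart_ok = grouped_parser.can_fetch("OtherBot", "/shop/cart.html")

print(sale_ok, cart_ok, other_admin_ok, other_cart_ok)
# prints: True False False True
```

A crawler that finds an entry addressed to it by name uses that entry instead of the “*” entry, so any restriction meant for every bot has to be repeated in each named entry.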

Remember that URL paths are case sensitive. The capitalisation of your Allow and Disallow paths must exactly match the URL, otherwise a crawler will not apply your rule to it. Crawler names in the User-agent line are generally matched without regard to case, but it is still good practice to write them exactly as the crawler's operator documents them.
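A quick way to see the effect of capitalisation is to test two spellings of the same path. This is a small sketch with Python's standard `urllib.robotparser`; the `/Private/` folder is an invented example.

```python
from urllib.robotparser import RobotFileParser

case_parser = RobotFileParser()
# The rule is written with a capital "P".
case_parser.parse(["User-agent: *", "Disallow: /Private/"])

exact_match_ok = case_parser.can_fetch("*", "/Private/notes.html")
lowercase_ok = case_parser.can_fetch("*", "/private/notes.html")
print(exact_match_ok, lowercase_ok)  # prints: False True
```

Only the path whose capitalisation matches the rule is blocked; the lowercase variant is treated as a different URL and stays crawlable.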

3. What NOT to put in them

Do not try to use a robots.txt file to keep sensitive data away from crawlers. Password protection should always be used to restrict access to secure data. Crawlers can ignore the directives in a robots.txt file, and malicious bots can still access disallowed areas of your site.

You should also never instruct a full disallow of your website: “Disallow: /”. This blocks your entire site to every crawler and spider that follows the rules. As a result, none of your pages will be crawled and your website is unlikely to appear in any search engine's results.
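If you want a safety net against an accidental full disallow, a small check along these lines could be run before a file is published. It is only a sketch: the function name `homepage_is_blocked` is hypothetical, and it uses Python's standard `urllib.robotparser` to test whether the homepage path “/” is blocked, which is what a blanket “Disallow: /” causes.

```python
from urllib.robotparser import RobotFileParser


def homepage_is_blocked(robots_text: str, user_agent: str = "*") -> bool:
    """Return True if the given rules stop user_agent from crawling "/"."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not parser.can_fetch(user_agent, "/")


# Hypothetical files: the first blocks everything, the second one folder only.
print(homepage_is_blocked("User-agent: *\nDisallow: /\n"))          # prints: True
print(homepage_is_blocked("User-agent: *\nDisallow: /private/\n"))  # prints: False
```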

4. How do you make and code robots.txt files?

There are many tools available online to help you create your robots.txt file. These free online resources are fantastic if you're unsure about coding or what exactly to put in the file. Furthermore, Google Search Console also includes a robots.txt generator. However, you can of course create a robots.txt file yourself as a text file on your computer. Your robots.txt file must be saved in plain text format and must be called robots.txt.
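If you would rather script the file than type it by hand, a minimal Python sketch might look like this. The helper name `write_robots_txt`, the example rules and the temporary folder standing in for your site's root directory are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical rules; replace them with your own.
EXAMPLE_RULES = "User-agent: *\nDisallow: /private/\n"


def write_robots_txt(site_root: Path, rules: str) -> Path:
    """Save rules as a plain text file named robots.txt inside site_root."""
    target = site_root / "robots.txt"
    target.write_text(rules, encoding="utf-8")
    return target


# A temporary folder stands in for the real root directory of a website.
with tempfile.TemporaryDirectory() as tmp:
    saved = write_robots_txt(Path(tmp), EXAMPLE_RULES)
    print(saved.name)  # prints: robots.txt
    print(saved.read_text(encoding="utf-8"))
```

The file name is fixed on purpose, because crawlers only look for a file called exactly robots.txt.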

5. Where to put them on a website

Your robots.txt file must be located in the top-level (root) directory of your domain.

For example, your file must be accessible via yourwebsite.com/robots.txt, not yourwebsite.com/images/gallery/robots.txt.
