We all want our web pages to rank in search engine result pages, and to rank them, search engines deploy bots that crawl websites and record information about them. A robots.txt file is a text file that webmasters use to gain better control over how their websites are crawled, because it tells bots how they should crawl a site: which files, folders, or directories matter and which ones they can ignore. It is part of the robots exclusion standard, also called the robots exclusion protocol (REP).
What is the main purpose of robots.txt?
The main purpose of a robots.txt file is to keep a website from being overloaded with crawl requests, because too many requests slow down the site and degrade the user experience. To understand the importance of a robots.txt file and the benefits it offers, we need a clear idea of a couple of crawling terms.
Crawl Demand: Crawl Demand is how often bots want to crawl a website.
It depends on two factors:
- URL Popularity: URLs that are popular on the internet tend to get crawled more often so that the search engine can keep them fresh in its index.
- URL Staleness: Bots avoid recrawling URLs that have gone stale or are rarely updated.
These two factors impact the crawl demand of URLs.
The volume of requests that crawlers send to a site is what we call crawl traffic, and Google does not want that traffic to overload a website and slow it down. Because of that, Google came up with a crawl rate limit that stops its bots from making too many requests at once and degrading the site's performance.
Together, the crawl rate limit and crawl demand create the crawl budget of a website. The crawl budget is the number of URLs in a website that bots can and want to crawl within a given time frame.
In the crawling process of a website, the crawl budget plays an important role, so we must use it wisely.
This is where robots.txt comes to the rescue: a text file that manages crawler traffic to your site and prevents crawl waste. It helps ensure that only important, high-quality pages get crawled and lets you "disallow" pages that you do not want Google to crawl, freeing up your crawl budget for the pages that matter most.
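As a rough illustration, a robots.txt along these lines tells cooperating bots to skip low-value areas and spend their crawl requests on the rest of the site (the directory names /search/ and /cart/ here are only hypothetical examples of sections you might not want crawled):

User-agent: *
Disallow: /search/
Disallow: /cart/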
Can we use a robots.txt file to prevent a URL from getting indexed?
Since September 1, 2019, Google's bots no longer obey robots.txt directives related to indexing. They used to, but not anymore. So we must not rely on a robots.txt file to stop a search engine from adding a URL to its index. If you want to block a page from being indexed and appearing in the SERP, use other methods, such as password protection or a noindex directive.
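For example, a common way to keep a page out of the index while still letting bots reach it is a robots meta tag placed in the page's <head> (the same noindex rule can also be sent as an X-Robots-Tag HTTP header):

<meta name="robots" content="noindex">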
Is it mandatory to have a robots.txt file for all websites?
No, it is not mandatory for every website to have a robots.txt file. If a website does not have one, bots will simply crawl the site as they normally do.
A robots.txt file is worth having if a website has a large number of web pages. With its help, the server will not be clogged up with crawl requests from Google's crawler, which would hamper the user experience. So it is best to use this text file to manage crawl traffic and avoid crawling unimportant or duplicate pages.
How does a robots.txt file work?
Suppose a bot is about to crawl a page, for example http://www.abc.com/home.html. Before crawling, it looks in the site's main (root) directory for the robots.txt file by stripping the path component after the URL's first slash. In this example, it replaces "home.html" with "robots.txt", ends up with "http://www.abc.com/robots.txt", and then crawls the website according to the instructions in that file.
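Spelled out step by step, the derivation looks something like this (http://www.abc.com is only an illustrative domain):

Page to be crawled: http://www.abc.com/home.html
Strip the path after the first slash: http://www.abc.com/
Append the filename: http://www.abc.com/robots.txt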
Is it extremely important to put the robots.txt file in the right place?
Yes, it is extremely important to place the robots.txt file in the right place. But what is the right place? It is a location that is easy to find, because the file is the first thing most search engine crawlers look for before crawling a website. Hence, it must be placed in the main (root) directory, because that is where they look for it. If they cannot find it there, they will not scan the entire website for it; they will simply assume that the site has no robots.txt file and index everything they find on the website.
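To make that concrete, assuming an illustrative domain, only the root location is consulted by crawlers:

Found by crawlers: http://www.abc.com/robots.txt
Not consulted: http://www.abc.com/pages/robots.txt

A robots.txt file placed in a subdirectory is simply ignored, so the file must always sit at the root of the host.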
Important points that we must remember about robots.txt:
- Always use all lowercase for the filename: “robots.txt”. Robots.TXT is wrong.
- Do not use a robots.txt file to hide information, because the file is publicly available and anyone can see it.
- Never rely on it to protect sensitive data from being crawled, because search engines are not obliged to follow it.
- A robots.txt file is like a “Do Not Disturb” sign at the door of your website. Generally, bots follow the instructions in the file, like polite visitors who do not open a door with a “Do Not Disturb” sign. Unfortunately, thieves do not care about such signs, and likewise some bots do not cooperate with the robots.txt file and ignore its instructions. They may even start with the very portions of the website they were told to stay away from.
- To see the robots.txt file of a website, all you need to do is append “/robots.txt” to the domain name, like “http://www.abc.com/robots.txt”. If the website has a robots.txt file, a text file will open up showing the instructions that tell search engines how to crawl that website. Otherwise, no such file will be found.
Structure of a robots.txt file
The basic structure of a robots.txt file is:
User-agent:
Disallow:
Here, “User-agent:” names the specific robot the rules apply to, and “Disallow:” tells that robot which pages or directories of the website it is not supposed to visit or crawl.
Here are some examples:

- Exclude all bots from the entire website

User-agent: *
Disallow: /

(The ‘*’ in the User-agent field is a special value meaning “any robot”.)
(The ‘/’ in the Disallow field means the entire site is disallowed.)

- Give all bots complete access

User-agent: *
Disallow:

(Alternatively, we can create an empty “/robots.txt” file, or not use one at all.)

- Stop all bots from crawling part of the website

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

- Exclude a single bot

User-agent: Slurp
Disallow: /

(Slurp is the user-agent token of Yahoo!’s crawler.)

- Permit a single bot

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

(The first group gives Googlebot full access; the second blocks every other bot from the entire site.)