Search Engine Robots or Web Crawlers
Most ordinary users rely on search engines to find the information they need. But how do search engines provide this information, and where do they collect it from? Each search engine maintains its own database of information, covering the sites available on the web along with details of the individual pages within them. Search engines do this background work using robots that collect data and keep the database up to date. They catalog the gathered information and then present it publicly, or at times for private use.
In this article, we will discuss these entities that roam the global internet environment: the web crawlers that move around in net space. We will learn:

· What they are all about, and what purpose they serve
· Pros and cons of using these entities
· How we can keep our pages away from crawlers
· Differences between the standard crawlers and robots

The rest of the article is divided into the following two sections:

I. Search Engine Spider: Robots.txt
II. Search Engine Robots: Meta-tags Explained

I. Search Engine Spider: Robots.txt
What is a robots.txt file?
A web robot is a program or search engine software that visits sites regularly and automatically, crawling through the web's hypertext structure by fetching a document and recursively retrieving all the documents it references. Sometimes site owners do not want all of their pages to be crawled by web robots. For this reason, they can exclude some of their pages from crawling by following an agreed convention: most robots abide by the 'Robots Exclusion Standard', a set of constraints that restricts robots' behavior.
The 'Robots Exclusion Standard' is the protocol a site administrator uses to control the movement of robots. When search engine robots come to a site, they look for a file named robots.txt in the site's root domain (http://www.anydomain.com/robots.txt). This plain text file implements the protocol by allowing or disallowing specific files and directories for named robot user-agents. The site administrator can, for example, deny robots access to CGI, temporary, or private directories.
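Python's standard library includes a parser for this protocol, which makes the check a polite crawler performs easy to demonstrate. A minimal sketch, assuming the example domain above actually serves a robots.txt file:

from urllib import robotparser

# Fetch and parse the site's robots.txt (anydomain.com is the
# illustrative domain from the example above).
rp = robotparser.RobotFileParser()
rp.set_url("http://www.anydomain.com/robots.txt")
rp.read()

# Ask whether a given robot may fetch a given URL before crawling it.
if rp.can_fetch("googlebot", "http://www.anydomain.com/private/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")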
The format of the robots.txt file is straightforward. It consists of records, each made up of two kinds of fields: a User-agent line and one or more Disallow lines.
What is a User-agent?
A user-agent is the name by which a program identifies itself in the worldwide networking environment; in the robots.txt file, it names the specific search engine robot a record applies to.
For example:
User-agent: googlebot
We can also use the wildcard character "*" to specify all robots:
User-agent: *
This addresses all robots: the rules that follow apply to every robot that visits the site.
What is Disallow?
In the robots.txt file, the second field is known as Disallow. These lines tell robots which files and directories must not be crawled. For example, to prevent robots from downloading email.htm, the syntax will be:
Disallow: /email.htm
To prevent crawling through directories, the syntax will be:
Disallow: /cgi-bin/
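Under the standard, a Disallow value is compared against the beginning of the URL path, so a trailing slash blocks a whole directory while a file path blocks just that file. A minimal sketch of this matching, with illustrative rules:

# Illustrative Disallow values; real ones come from robots.txt.
blocked = ["/email.htm", "/cgi-bin/"]

def is_blocked(path):
    # A URL is off-limits when its path starts with any Disallow value.
    return any(path.startswith(rule) for rule in blocked)

print(is_blocked("/cgi-bin/form.cgi"))  # True: inside a blocked directory
print(is_blocked("/email.htm"))         # True: a blocked file
print(is_blocked("/index.html"))        # False: crawling allowed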
White Space and Comments:
Any line in the robots.txt file that begins with # is treated as a comment only. By convention, a comment at the beginning of robots.txt, like the following example, states which site the file belongs to.
# robots.txt for www.anydomain.com
Entry Details for robots.txt:
1) User-agent: *
Disallow:
The asterisk (*) in the User-agent field denotes "all robots." As nothing is disallowed, all robots are free to crawl through everything.
2) User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private/
All robots may crawl all files except those in the /cgi-bin/, /temp/, and /private/ directories.
3) User-agent: dangerbot
Disallow: /
Dangerbot is not allowed to crawl any of the directories. "/" stands for the entire site: all directories and files.
4) User-agent: dangerbot
Disallow: /

User-agent: *
Disallow: /temp/
A blank line indicates the start of a new User-agent record. Except for dangerbot, all other robots can crawl all directories except the "temp" directory.
5) User-agent: dangerbot
Disallow: /links/listing.html

User-agent: *
Disallow: /email.html
Dangerbot is not allowed to crawl the listing.html page in the links directory. All other robots may crawl every directory, but not the email.html page.
6) User-agent: abcbot
Disallow: /*.gif$
To block all files of a specific file type (e.g., .gif) for that robot, we use the above robots.txt entry.
7) User-agent: abcbot
Disallow: /*?
To restrict web crawlers from crawling dynamic pages (any URL containing a "?"), we use the above robots.txt entry.
Note: The Disallow field may contain "*" to match any series of characters and may end with "$" to indicate the end of the URL. These wildcards are extensions honored by major crawlers rather than part of the original standard (a sketch of the matching follows the next example).
E.g., to exclude all .gif files from Google's image crawling while allowing other image file types:
User-agent: Googlebot-Image
Disallow: /*.gif$
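How a crawler evaluates such wildcard patterns can be made concrete with a short sketch. The rule_matches helper below is our own illustration, not a library function; it translates a Disallow pattern into a regular expression under the rules just described:

import re

def rule_matches(pattern, path):
    # Hypothetical helper: "*" matches any series of characters,
    # and a trailing "$" anchors the match at the end of the URL.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.match(regex, path) is not None

print(rule_matches("/*.gif$", "/images/photo.gif"))   # True: blocked
print(rule_matches("/*.gif$", "/images/photo.jpeg"))  # False: allowed
print(rule_matches("/*?", "/page.php?id=7"))          # True: dynamic page blocked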
Disadvantages of robots.txt:
Problem with the Disallow field:
Disallow: /css/ /cgi-bin/ /images/
Different spiders will read the above line differently: some will ignore the spaces and read it as /css//cgi-bin//images/, while others may consider only /css/ or only /images/ and ignore the rest, as the sketch below demonstrates. Only one path is allowed per Disallow line.
The correct syntax should be:
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/
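Python's urllib.robotparser can be used to demonstrate the hazard: it treats the whole malformed line as one path, so none of the three directories is actually blocked. A small sketch:

from urllib import robotparser

bad = ["User-agent: *",
       "Disallow: /css/ /cgi-bin/ /images/"]
good = ["User-agent: *",
        "Disallow: /css/",
        "Disallow: /cgi-bin/",
        "Disallow: /images/"]

rp = robotparser.RobotFileParser()
rp.parse(bad)
print(rp.can_fetch("anybot", "/css/site.css"))   # True: nothing was blocked

rp = robotparser.RobotFileParser()
rp.parse(good)
print(rp.can_fetch("anybot", "/css/site.css"))   # False: correctly blocked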
All Files listing:
Listing every file name within a directory is another commonly made mistake:
Disallow: /ab/cdef.html
Disallow: /ab/ghij.html
Disallow: /ab/klmn.html
Disallow: /op/qrst.html
Disallow: /op/uvwx.html
The above portion can be written as:
Disallow: /ab/
Disallow: /op/
A trailing slash matters a great deal: it marks the whole directory as off-limits.
Capitalization:
USER-AGENT: REDBOT
DISALLOW:
Though field names are not case sensitive, values such as directory and file names are case sensitive.
Conflicting syntax:
User-agent: *
Disallow: /

User-agent: Redbot
Disallow:
What will happen? Redbot's own record permits it everything, but does that permission override the global Disallow: /, or does the global rule win? Under the exclusion standard, a robot should obey the first record matching its name and fall back to the "*" record only when none does, so Redbot would be allowed; but not every crawler resolves such conflicts the same way, so this pattern is best avoided.
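One way to see how a real parser settles this is to feed the conflicting records to Python's urllib.robotparser, which applies the record matching the robot's name and uses "*" only as a fallback. A minimal sketch (the URL is illustrative):

from urllib import robotparser

rules = ["User-agent: *",
         "Disallow: /",
         "",
         "User-agent: Redbot",
         "Disallow:"]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Redbot matches its own record, where nothing is disallowed ...
print(rp.can_fetch("Redbot", "http://www.anydomain.com/page.html"))    # True
# ... while every other robot falls back to the "*" record.
print(rp.can_fetch("otherbot", "http://www.anydomain.com/page.html"))  # False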
II. Search Engine Robots: Meta-tags Explained
What is a robot meta tag?
Besides robots.txt, site owners have another tool to control crawlers: the robots META tag, which tells a web spider whether to index a page and whether to follow the links on it. It may be more helpful in some cases, as it can be applied page by page. It is also beneficial if you don't have the requisite permission to access the server's root directory, where the robots.txt file must live.
We place this tag within the header portion of the HTML document.
Format of the Robots Meta tag:
In the HTML document, it is placed in the HEAD section:
<HTML>
<HEAD>
<META NAME="robots" CONTENT="index,follow">
<META NAME="description" CONTENT="Welcome to…">
<TITLE>…</TITLE>
</HEAD>
<BODY>
Robots Meta Tag options:
Four options can be used in the CONTENT portion of the robots Meta tag. These are: index, noindex, follow, nofollow.
This tag allows search engine robots to index a specific page and follow all of its links. If the site admin doesn't want a page to be indexed or its links followed, they can replace "index,follow" with "noindex,nofollow".
According to the requirements, the site admin can use the robots Meta tag with the following options:
<META NAME="robots" CONTENT="index,follow"> Index this page, follow links from this page.
<META NAME="robots" CONTENT="noindex,follow"> Don't index this page, but follow links from this page.
<META NAME="robots" CONTENT="index,nofollow"> Index this page, but don't follow links from this page.
<META NAME="robots" CONTENT="noindex,nofollow"> Don't index this page, don't follow links from this page.
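To see how a crawler might read these directives, here is a minimal sketch using Python's standard html.parser module; the RobotsMetaParser class is our own illustration, not a library API:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    # Illustrative parser: collects the directives found in the
    # page's <META NAME="robots" CONTENT="..."> tags.
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives.update(t.strip().lower() for t in content.split(","))

page = '<HTML><HEAD><META NAME="robots" CONTENT="noindex,follow"></HEAD></HTML>'
parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)               # {'noindex', 'follow'}
print("noindex" in parser.directives)  # True: skip indexing this page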