Can I prevent spiders from indexing pages or directories within my site?
Yes. To block all other spiders from indexing your site while still allowing FusionBot to enter, place a robots.txt file (all lower case) in your root directory, or log in to your FusionBot account and populate your 'Exclusions Form' (click the 'Spider' tab, then select the 'Exclude Pages & Directories' option). Either one should contain, at a minimum, the following information:
User-Agent: *
Disallow: /
User-Agent: fusionbot
Disallow:
- OR -
User-Agent: *
Disallow: /
User-Agent: fusionbot
Disallow: /cgi-bin/
Disallow: /info/secret.htm
Disallow: /info/brochure.pdf
The initial example above will DISALLOW all other spiders (*) from indexing any directory within your site while DISALLOWING NO directories to the FusionBot spider. The end result is a successful index build for your FusionBot implementation while still preventing other spiders from indexing your content.
The second example is similar, yet shows you how to also prevent FusionBot from indexing particular directories and pages within your site. In this example, no other spider will index any content, and the FusionBot spider will index all of your content except that which is within the /cgi-bin directory and any of its sub-directories, and the files 'secret.htm' and 'brochure.pdf' in the 'info' directory.
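As a rough way to sanity-check rules like these before publishing them, the second example can be run through Python's standard urllib.robotparser module. This is only a sketch: it uses Python's generic robots.txt parser rather than FusionBot's own, and the user-agent name "otherbot" and the www.yoursite.com URLs are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The second example from above, one record per spider.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /

User-Agent: fusionbot
Disallow: /cgi-bin/
Disallow: /info/secret.htm
Disallow: /info/brochure.pdf
"""

SITE = "https://www.yoursite.com"  # placeholder host, not a real site

# Parse the rules from the string instead of fetching them over HTTP.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the given spider may fetch SITE + path."""
    return parser.can_fetch(agent, SITE + path)


if __name__ == "__main__":
    checks = [
        ("fusionbot", "/products/widget.php"),
        ("fusionbot", "/cgi-bin/form.cgi"),
        ("fusionbot", "/info/secret.htm"),
        ("otherbot", "/products/widget.php"),  # "otherbot" is a made-up spider
    ]
    for agent, path in checks:
        print(agent, path, allowed(agent, path))
```

Note that this standard parser only understands plain path prefixes, so it is not suitable for checking FusionBot-specific extensions.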
Note that if you have implemented BOTH a robots.txt file AND populated your Robots Exclusion Form in your FusionBot account, FusionBot will combine their contents when determining which pages or directories should be excluded from your index.
Also, there may be times when you include another site for indexing within your mini-portal but want to specify which sections of that site should not be indexed. FusionBot will always adhere to the contents of that site's own robots.txt file, but the site's owner may not have implemented one, or you may want to specify additional or alternate pages and directories that our crawler should omit.
To do this, simply add additional Disallow instructions with absolute URLs in your own robots.txt file or FusionBot Exclusion Form.
For example, if a site named www.widgets.com is part of your mini-portal and you wish to exclude the contents of its /cgi-bin directory, add the last line shown below (the Disallow entry with the absolute URL) to the end of your robots.txt file:
User-Agent: fusionbot
Disallow: /cgi-bin/
Disallow: /info/secret.htm
Disallow: /info/brochure.pdf
Disallow: https://www.widgets.com/cgi-bin
You may also implement a robots.txt file for your mini-portal sites without specifying any pages or directories to be omitted within your own site.
For a detailed explanation of how the robots.txt file can be implemented, please visit https://www.robotstxt.org/robotstxt.html.
In addition to adhering to the robots.txt standards outlined in the previous link, FusionBot also extends the functionality of the robots.txt standard by allowing you to populate your DISALLOW statements with wildcards (*). In this manner, you can instruct, for example, the FusionBot crawler to not crawl any pages on your site that contain a specific value within the pagename / querystring.
For example, you may have various pages that offer one version optimized for viewing in a browser and another version optimized for printing. Having FusionBot crawl both versions would result in a number of duplicate / unnecessary pages being crawled and indexed.
In this scenario, the only characteristic that differentiates one version from the other is often an additional querystring variable. For example, the browser optimized page may have a URL such as:
https://www.yoursite.com/products/widget.php
While the print optimized page's URL would appear as:
https://www.yoursite.com/products/widget.php?print=1
Many different pages throughout a site may utilize this same syntax, where "print=1" indicates the "print" version of the same document. In this case, using wildcards in your robots.txt file or FusionBot online exclusions form, you can instruct FusionBot to not crawl your print-only pages by including the following lines:
User-Agent: fusionbot
Disallow: *print=1*
In this manner, any document / link on your site that contains "print=1", anywhere within the pagename / url, will be omitted from your index. Use this same syntax for any documents that contain a common querystring variable, anywhere within the URL, that should be omitted, when present.
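To make the matching behavior concrete, here is a minimal sketch of how a wildcard rule such as *print=1* could be evaluated against a full URL. It assumes that each * stands for any run of characters (including none) and that every other character is matched literally; FusionBot's actual matching code is not documented here, and the function name wildcard_match is hypothetical.

```python
import re


def wildcard_match(pattern, url):
    """Return True if url matches pattern, where '*' means any run of characters.

    Assumption for illustration: all characters other than '*' are literal.
    """
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url) is not None


if __name__ == "__main__":
    pattern = "*print=1*"
    urls = [
        "https://www.yoursite.com/products/widget.php",
        "https://www.yoursite.com/products/widget.php?print=1",
        "https://www.yoursite.com/products/widget.php?print=10",
    ]
    for url in urls:
        print(url, wildcard_match(pattern, url))
```

Under this reading, the rule also matches a URL containing "print=10", because "print=1" appears inside it. If your site uses such values, a more specific pattern may be needed.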
Another feature of the robots.txt syntax unique to FusionBot is the use of the Noindex: directive.
The Noindex: directive works in much the same way as the NOINDEX Meta Tag placed within the <head> section of a particular page that you DO NOT want indexed, but whose links you DO want our spider to follow.
The reason you may want to consider using the Noindex: directive in your robots.txt file or Exclusion Form within your FusionBot account is twofold:
- Placing a Noindex: directive in your robots.txt/Exclusions Form allows you to affect multiple pages via a single entry, rather than having to manually tag (modify) EVERY single page on your site that you want to apply the noindex (but follow links) behavior upon.
- Using a traditional Disallow: directive blocks the entire folder or page from being downloaded in the first place. While you may not want a page or folder indexed, what if the only location where you link to additional content on your site is within a page / folder that you've wholesale disallowed? FusionBot will never find this content and therefore it won't be included in the search results when using the disallow directive. The NOINDEX directive will prevent this from happening.
It should be noted that, when using the Noindex: directive, you MUST use wildcard logic for the FusionBot crawler to match upon. As a result, be careful when constructing your directive(s), so that you don't unintentionally apply this behavior to other pages on your site.
Following are a few sample entries along with an explanation of their impact:
User-Agent: fusionbot
Noindex: *print=1*
Noindex: */brochures/*
The first example will allow FusionBot to download all pages that have "print=1" somewhere in their URL. However, the pages will only be analyzed for links to other pages, and none of their content will be searchable / included in your results.
The second example will apply the same behavior to any content that has a "brochures" folder somewhere in the URL, which could be any of the following:
https://yoursite.com/files/brochures/
https://yoursite.com/brochures/
But not:
https://yoursite.com/brochures
https://yoursite.com/brochures.php
The above two URLs won't be affected as they do NOT have the trailing forward slash specified in the NOINDEX: directive.
As you can see via the second example, be VERY careful when constructing your wildcard NOINDEX directives (they MUST be wildcarded), so that you don't end up with unintended effects.
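The difference between Disallow: and Noindex: can be sketched as a three-way decision per URL: skip it entirely, download it for its links only, or download and index it. The example below is an illustration under stated assumptions, not FusionBot's implementation: it uses Python's fnmatch module as a stand-in for the wildcard matching, the */cgi-bin/* Disallow pattern and the classify function are hypothetical, and it assumes a Disallow match takes precedence over a Noindex match.

```python
from fnmatch import fnmatchcase

# Hypothetical wildcard Disallow rule, added only to show the contrast.
DISALLOW = ["*/cgi-bin/*"]
# The two Noindex entries from the sample above.
NOINDEX = ["*print=1*", "*/brochures/*"]


def classify(url, disallow=DISALLOW, noindex=NOINDEX):
    """Return 'skip', 'follow-only', or 'index' for a URL.

    fnmatchcase treats '*' as any run of characters; it also gives special
    meaning to '?' and '[' in patterns, which these patterns do not contain.
    """
    if any(fnmatchcase(url, pattern) for pattern in disallow):
        return "skip"          # never downloaded, so its links are never seen
    if any(fnmatchcase(url, pattern) for pattern in noindex):
        return "follow-only"   # downloaded for its links, content not indexed
    return "index"             # downloaded, links followed, content indexed


if __name__ == "__main__":
    for url in [
        "https://yoursite.com/files/brochures/",
        "https://yoursite.com/brochures/",
        "https://yoursite.com/brochures",
        "https://yoursite.com/brochures.php",
        "https://yoursite.com/products/widget.php?print=1",
        "https://yoursite.com/cgi-bin/form.cgi",
    ]:
        print(url, classify(url))
```

Running candidate patterns against a list of real URLs from your site in this way is one inexpensive method of spotting unintended matches before you publish the directives.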
Also, please reference our FAQ concerning details on implementing robots exclusion syntax within the <HEAD> section of each page on your site.