Crawler Configurations

Overview

The EETS crawler makes indexing your site data quick. Set up the target URL and let it go. The crawler indexes an average of about 20 pages per minute, so search results become available quickly. Actual crawl speed depends on page size, content complexity, and server response times.

Targeting Your Site

Define one or more sites to crawl by entering URLs. The crawler starts with the site(s) entered and follows links until all references are exhausted. In other words, if you can navigate to a page by clicking links on your site, our crawler will find and index it.

The crawler ignores any links that lead outside the domain(s) of the target URL(s), preventing unwanted pages from being indexed. To crawl multiple domains, specify a target page from each domain.
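The same-domain rule can be sketched with a short stdlib helper. This is an illustration only, not the actual EETS crawler logic; the function and variable names are hypothetical:

```python
from urllib.parse import urlparse

def should_follow(link: str, target_urls: list[str]) -> bool:
    """Follow a link only if its host matches a target domain.

    Illustrative sketch only -- not the actual EETS crawler logic.
    """
    link_host = urlparse(link).netloc.lower()
    target_hosts = {urlparse(u).netloc.lower() for u in target_urls}
    return link_host in target_hosts

targets = ["https://www.example.com/products"]
print(should_follow("https://www.example.com/datasheets/a1.pdf", targets))  # same domain: True
print(should_follow("https://partner.example.org/page", targets))           # outside domain: False
```

Under this rule, crawling a second domain requires adding one of its pages to the target list.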

Sitemaps are valid targets.

Crawler User Agent

Mozilla/5.0 (compatible; bz-crawler/1.0; +http://www.bigzeta.com/keyword-search)
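If you need to allow or restrict this crawler in your robots.txt, you would match on its user-agent token. The token "bz-crawler" is inferred from the string above and is an assumption; the paths shown are placeholders for your own site:

```
# Hypothetical robots.txt rules; "bz-crawler" token and paths are assumptions
User-agent: bz-crawler
Disallow: /internal/
Allow: /
```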

Crawler Frequencies

By default, the rate at which the crawler requests a page is set to 'auto.' This setting uses an algorithm that adjusts each page's recrawl frequency based on how often its content changes.

There are also options to set a regular crawl interval.
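The 'auto' behavior can be approximated by a change-driven scheduler. The sketch below is an assumed illustration, not the actual EETS algorithm: it backs off when a page's content is unchanged and recrawls sooner when the content changes.

```python
import hashlib

# Illustrative change-driven recrawl scheduling -- NOT the actual EETS
# 'auto' algorithm. Intervals are in minutes; bounds are assumptions.
MIN_INTERVAL, MAX_INTERVAL = 60, 60 * 24 * 7

def next_interval(current: int, old_html: str, new_html: str) -> int:
    """Back off when content is stable; recrawl sooner when it changes."""
    old_hash = hashlib.sha256(old_html.encode()).hexdigest()
    new_hash = hashlib.sha256(new_html.encode()).hexdigest()
    if old_hash == new_hash:
        return min(current * 2, MAX_INTERVAL)  # unchanged: double the interval
    return MIN_INTERVAL                        # changed: recrawl soon

print(next_interval(120, "<p>same</p>", "<p>same</p>"))  # 240
print(next_interval(960, "<p>old</p>", "<p>new</p>"))    # 60
```

A fixed crawl interval simply replaces this logic with a constant.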

Rate Limit

The crawler is rate limited to a maximum of 1 page per second.
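A one-request-per-second cap corresponds to a simple pacing loop. This is an illustrative sketch of the concept, not the crawler's implementation:

```python
import time

class RateLimiter:
    """Pace requests to at most one per `min_interval` seconds (sketch only)."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.last_request = 0.0

    def wait(self) -> None:
        # Sleep just long enough to keep requests at least min_interval apart.
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()

limiter = RateLimiter()
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # a real crawler would fetch a page here
elapsed = time.monotonic() - start  # roughly 2 seconds for 3 paced requests
```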

Handling HTTP Codes

The crawler follows a defined procedure for handling HTTP response codes.

Types of Pages Crawled

The crawler extracts HTML content from any standard HTML webpage and also indexes hosted PDF files, including datasheets and part catalogs. This behavior is standard for all engines and cannot be configured.

CSS Targeting

Use CSS selectors to guide the crawler toward or away from on-page content. Only one selector may be used to target content. This selector gives the crawler a hint about where the main page content is located so that headers, footers, navigation, and similar elements are not crawled.

Excluding certain elements, such as headers, footers, and navigation, often yields better results. There is no limit on the number of exclusion selectors.

CSS selectors include HTML elements, IDs, and classes. Inputs must use the appropriate prefix character for each selector type:

  • no prefix for HTML elements

  • pound character (#) for IDs

  • period character (.) for classes
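For example, each selector type would be entered as follows (the names here are placeholders for your own markup):

```
main            /* HTML element: targets the <main> element */
#content        /* ID: targets the element with id="content" */
.article-body   /* class: targets elements with class="article-body" */
```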

The goal of targeting and excluding content is to index only the data that you want searchable for each page. The indexed data is queried against by users; if some indexed keywords don't match the page, the relevance of results to a query may suffer, degrading the overall search experience.

Key Property

A page's URL is the key indexing property. Multiple pages with the same title and content will each be indexed as long as their URLs differ.

Pages with dynamic content often have changing parameters that are reflected in the URL. This can result in unwanted soft-duplicate pages being indexed and returned to users in the SERP. Solve this by using CSS selectors to exclude the dynamic content from the crawl, or by using URL filtering to block all but one variation of the page from being crawled.
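To illustrate why soft duplicates arise from the URL key, the sketch below normalizes URLs by stripping dynamic query parameters before keying. The parameter list and function are hypothetical, and this is not how EETS URL filtering works; it only demonstrates the underlying idea:

```python
from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

# Hypothetical list of dynamic parameters to ignore when keying pages.
IGNORED_PARAMS = {"session", "utm_source", "utm_campaign", "sort"}

def index_key(url: str) -> str:
    """Normalize a URL so soft duplicates share one index key (sketch only)."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    return urlunparse(parts._replace(query=urlencode(sorted(kept))))

a = index_key("https://example.com/parts?id=42&session=abc123")
b = index_key("https://example.com/parts?session=zzz&id=42")
print(a == b)  # True: both variants collapse to the same index key
```

Without such normalization, each parameter variation is a distinct URL and therefore a distinct indexed page.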

Limitations

The crawler cannot index content from images or from dynamically generated pages that do not have unique URLs.

The crawler also cannot pick up meta tag content.
