Crawler Configurations
Overview
The EETS crawler makes indexing your site data quick. Set up the target URL and let it go. The crawler averages an indexing speed of about 20 pages per minute, so search results become available quickly. Actual crawl speeds depend on page size, content complexity, and server response times.
Targeting Your Site
Define one or more sites to crawl by entering URLs. The crawler starts with the site(s) entered and follows links until all references are exhausted. In other words, if you can navigate to a page by clicking through your site, the crawler will find and index it.
The crawler ignores any links leading outside the domain of the given target URL(s), which prevents unwanted pages from being indexed. To crawl multiple domains, specify a target page from each domain.
Sitemaps are valid targets.
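The same-domain rule can be sketched as follows. This is an illustrative Python snippet, not the crawler's actual code: it keeps only the discovered links whose domain matches the target URL's domain.

```python
from urllib.parse import urljoin, urlparse

def same_domain_links(page_url, hrefs):
    """Keep only links that stay on the target URL's domain,
    mirroring the crawler's behavior of ignoring links that
    lead outside that domain. (Sketch, not the real crawler.)"""
    target_domain = urlparse(page_url).netloc
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).netloc == target_domain:
            kept.append(absolute)
    return kept
```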
Crawler User Agent
Crawler Frequencies
By default, the rate at which the crawler requests a page is set to 'auto.' This setting uses an algorithm that adjusts each page's recrawl frequency based on how often its content changes.
There are also options to set a regular crawl interval.
10 minutes
1 hour
1 day
2 days
3 days
4 days
5 days
7 days
14 days
30 days
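The 'auto' setting's algorithm is not documented, but the idea behind adaptive recrawling can be sketched as below. This is a hypothetical illustration, not the product's actual algorithm: crawl a page sooner if it changed since the last visit, back off if it did not, within the same 10-minute / 30-day bounds as the fixed intervals.

```python
def next_interval(current_minutes, content_changed,
                  min_minutes=10, max_minutes=30 * 24 * 60):
    """Illustrative adaptive recrawl schedule (not the product's
    actual algorithm). Halve the interval when the page changed
    since the last crawl, double it when it did not, clamped to
    the 10-minute and 30-day bounds."""
    if content_changed:
        interval = current_minutes // 2   # revisit sooner
    else:
        interval = current_minutes * 2    # back off
    return max(min_minutes, min(interval, max_minutes))
```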
Rate Limit
The rate limit of the crawler is a maximum of 1 page per second.
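A one-page-per-second cap like this is typically enforced by spacing requests a minimum interval apart. The sketch below shows one common way to do it (an assumption about the mechanism, not the crawler's implementation):

```python
import time

class RateLimiter:
    """Enforce a minimum delay between requests, e.g. one page per
    second. Illustrative sketch, not the crawler's actual code."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = time.monotonic()
        elapsed = now - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```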
Handling HTTP Codes
The crawler has a procedure to handle HTTP response codes:
2xx: Proceed with crawl
3xx: Follow the redirect and index the final page
4xx/5xx: Do not crawl
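The table above can be expressed as a simple dispatch on the status-code class. A minimal sketch of that logic:

```python
def crawl_action(status_code):
    """Map an HTTP status code to the crawler's documented behavior."""
    if 200 <= status_code < 300:
        return "index"            # 2xx: proceed with crawl
    if 300 <= status_code < 400:
        return "follow_redirect"  # 3xx: index the final page
    return "skip"                 # 4xx/5xx: do not crawl
```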
Types of Pages Crawled
The crawler extracts HTML content from any standard HTML webpage and from hosted PDF files, including datasheets and part catalogs. This behavior is standard for all engines and cannot be configured.
CSS Targeting
Use CSS selectors to guide the crawler toward or away from on-page content. Only one selector may be used for targeting content; it gives the crawler a hint about where the main page content lives so that headers, footers, navigation, and similar elements are not crawled.
Excluding certain elements, such as headers, footers, and navigation, often yields better results. There is no limit on the number of exclusion selectors.
CSS selectors include HTML elements, IDs, and classes. Inputs must use the appropriate prefix character for each selector type:
no prefix for HTML elements
pound character (#) for IDs
period character (.) for classes
Common exclusion candidates include:
Header
Related Products
Footer
Other Products
Forms
Common Information Across Pages
Breadcrumbs
Side Bar Tools & Navigation
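The prefix rules above determine how a selector string is interpreted. The selector names below are hypothetical examples, not values the product requires; substitute the ones that match your site's markup:

```python
def selector_type(selector):
    """Classify a CSS selector input by its prefix character,
    following the prefix rules described above."""
    if selector.startswith("#"):
        return "id"       # pound prefix, e.g. "#main-content"
    if selector.startswith("."):
        return "class"    # period prefix, e.g. ".breadcrumbs"
    return "element"      # no prefix, e.g. "footer"

# Hypothetical exclusion selectors for a typical site:
examples = ["footer", "#site-header", ".breadcrumbs", ".related-products"]
```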
Key Property
A page's URL is the key indexing property. Multiple pages with the same title and content will each be indexed as long as their URLs differ.
Pages with dynamic content often have changing parameters that are reflected in the URL. This can result in unwanted soft-duplicate pages being indexed and returned to users in the SERP. Solve this by using CSS selectors to exclude the dynamic content from the crawler, or by using URL filtering to block all but one variation of the page from being crawled.
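One way to see why these URL variants count as distinct pages is to canonicalize them yourself. The sketch below strips presentation-only query parameters so soft duplicates collapse to one URL; the parameter names are hypothetical, and this is an illustration of the problem, not a product feature:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonical_url(url, drop_params=("session", "sort", "view")):
    """Collapse soft-duplicate URLs by removing query parameters
    that only control dynamic presentation. The parameter names
    in drop_params are hypothetical; substitute your site's own."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in drop_params]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```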
Limitations
The crawler cannot index content from images or from dynamically generated pages that lack unique URLs. It also cannot pick up meta tag content.