The SearchStax Site Search solution’s Crawler add-on explores the pages of a website, beginning at a start URL. From there it follows the links embedded in each page rather than the hierarchical structure of the website.
The crawl is constrained by three limits:
- The Crawler will not travel outside of the DNS domain specified in the start URL. For instance, if the start URL is “https://my.company.com/bios/”, the Crawler will confine itself to pages within “my.company.com”.
- You can set the “crawl depth” of the run. The Crawler will confine itself to pages that are no more than N links away from the start URL.
- The Crawler has configurable Exclusions. These rules stop the Crawler from crawling any page whose URL contains a specified substring. For instance, a rule can exclude every page that has the string “/internal/” in its URL.
Note that the exclusion rules are case-sensitive, so “/internal/” will not exclude “/Internal/”.
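The domain and exclusion limits described above can be sketched as a small bash check. This is only an illustration of the behavior as described here, not the Crawler's actual implementation: the function names `url_host`, `is_excluded`, and `would_crawl` are hypothetical, and the sketch assumes that "same DNS domain" means an exact match on the host name. Crawl depth is not modeled.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. These helper names are hypothetical and are
# not part of SearchStax or the Crawler.

# Print the host portion of a URL,
# e.g. https://my.company.com/bios/ -> my.company.com
url_host() {
  local rest="${1#*://}"        # drop the scheme
  printf '%s\n' "${rest%%/*}"   # drop everything from the first slash
}

# Succeed if the URL contains any of the exclusion substrings.
# The quoted pattern is matched literally and case-sensitively.
is_excluded() {
  local url="$1" rule
  shift
  for rule in "$@"; do
    if [[ "$url" == *"$rule"* ]]; then
      return 0
    fi
  done
  return 1
}

# Succeed if the URL is on the start URL's host (assumed: exact host match)
# and matches none of the exclusion substrings.
would_crawl() {
  local start_url="$1" url="$2"
  shift 2
  [[ "$(url_host "$url")" == "$(url_host "$start_url")" ]] || return 1
  ! is_excluded "$url" "$@"
}
```

With the start URL “https://my.company.com/bios/” and the single exclusion “/internal/”, `would_crawl` rejects “https://my.company.com/internal/x” but accepts “https://my.company.com/Internal/x”, which mirrors the case-sensitivity caveat.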
Exclusions are easy to configure, but it isn’t always obvious which branches of the namespace the Crawler has mistakenly included. Here is one way to look at the URLs of the crawled pages.
If your Site Search App uses security tokens:
curl -H "Authorization: Token <read-only token>" "https://searchcloud-1-us-west-2.searchstax.com/12345/crawler-1234/select?q=url:*&wt=json&indent=true&fl=url&rows=10&start=1"
If your Site Search App uses Basic Auth credentials:
curl -u <read-only user>:<read-only password> "https://searchcloud-1-us-west-2.searchstax.com/12345/crawler-1234/select?q=url:*&wt=json&indent=true&fl=url&rows=10&start=1"
Entered into a Linux terminal window, this /select query returns a list of URLs from the Site Search index, similar to this:
"response":{
  "numFound":368,
  "start":1,
  "numFoundExact":true,
  "docs":[
    {
    {
    {
    {
You can adjust the &rows and &start parameters to see different portions of the list.
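To walk the whole list rather than one window of it, the two parameters can be stepped in a loop. The following is a minimal sketch, not an official SearchStax script: the function names `select_url` and `page_urls` are hypothetical, and it assumes the endpoint follows standard Solr paging, where &start is a zero-based offset (so start=0 begins at the first document).

```shell
#!/usr/bin/env bash
# Illustrative sketch only. Function names are hypothetical.

# Build the /select URL for one page of results.
select_url() {
  local endpoint="$1" rows="$2" start="$3"
  printf '%s?q=url:*&wt=json&indent=true&fl=url&rows=%s&start=%s\n' \
    "$endpoint" "$rows" "$start"
}

# Print one URL per page needed to cover num_found documents.
# Assumes &start is a zero-based offset, as in standard Solr.
page_urls() {
  local endpoint="$1" num_found="$2" rows="$3" start
  (( rows > 0 )) || return 1    # avoid an endless loop
  for (( start = 0; start < num_found; start += rows )); do
    select_url "$endpoint" "$rows" "$start"
  done
}

# Usage (endpoint and token are the placeholders from the commands above;
# 368 is the numFound value reported by a first query):
#
#   page_urls "https://searchcloud-1-us-west-2.searchstax.com/12345/crawler-1234/select" 368 100 |
#     while IFS= read -r url; do
#       curl -s -H "Authorization: Token <read-only token>" "$url"
#     done
```

Generating the URLs separately from fetching them makes it easy to inspect the requests before running them, and the same list works with either authentication style.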
Questions?
Do not hesitate to contact the SearchStax Support Desk.