By default, Sitemap Creator will only scan the primary host you specify, for example http://example.com.
If you need to copy non-HTML resources from other domains (e.g. a CDN), this is normally handled automatically by the Download all resources option. However, if you want to crawl HTML that isn't located on a sub- or sibling domain, you can configure Sitemap Creator to download HTML from additional domains.
Important
Some project settings are ignored when crawling additional domains, for example crawling above the root URL.
Configuring additional domains
- From the Project Properties dialogue, select the Additional Hosts category
- Enter each additional host you want to crawl, one host per line. Do not enter protocol or path information; only include the domain name. You can use regular expressions if required (see the example after this list).
- Click OK to save your changes. When you next crawl this website, any URLs belonging to the hosts you specify will no longer be skipped, but will be crawled as though they were part of the primary project URL.
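For illustration, assuming hypothetical hosts cdn.example.net and static.example.org serve additional HTML you want crawled alongside the primary site, the Additional Hosts field might contain the following entries, one per line:

cdn\.example\.net
static\.example\.org

The dots are escaped with a backslash so each entry matches only the literal host name; see the note on escaping below.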
Important
If your expression includes any of the ^, [, ., $, {, *, (, \, +, ), |, ?, < or > characters and you want them to be processed as plain text, you need to "escape" the character by preceding it with a backslash. For example, if your expression was application/epub+zip, this would need to be written as application/epub\+zip, otherwise the + character would have a special meaning and no matches would be made. Similarly, if the expression was example.com, this should be written as example\.com, as . means "any character", which could lead to unexpected matches.
See Also
Configuring the Crawler
Working with local files
- Extracting inline data
- Remapping extensions
- Remapping local files
- Updating local time stamps
- Using query string parameters in local filenames
Controlling the crawl
- Content types
- Crawling multiple URLs
- Crawling outside the base URL
- Including sub and sibling domains
- Limiting downloads by file count
- Limiting downloads by size
- Limiting scans by depth
- Limiting scans by distance
- Scanning data attributes
- Setting speed limits
- Working with Rules
JavaScript
Security
- Crawling private areas
- Manually logging into a website
- TLS/SSL certificate options
- Working with Forms
- Working with Passwords
Modifying URLs
Advanced
- Aborting the crawl using HTTP status codes
- Cookies
- Defining custom headers
- HEAD vs GET for preliminary requests
- HTTP Compression
- Modifying page titles
- Origin reports
- Overwriting read only files
- Redirects
- Saving link data in a Crawler Project
- Setting the web page language
- Specifying a User Agent
- Specifying accepted content types
- Using Keep-Alive