Sitemap Creator requires two pieces of user-provided information before a website can be crawled: the primary address to copy, and the location where downloaded files will be stored.
The crawl process can be configured in many ways, from technical settings that control the HTTP protocol to rules that determine which content is downloaded and which is ignored. The following topics detail these options.
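Conceptually, crawl rules act as filters evaluated against each discovered URL to decide whether it is downloaded or skipped. The sketch below is purely illustrative and is not part of Sitemap Creator itself; the patterns and sample URLs are assumptions used only to show how exclusion rules of this kind typically behave.

```python
import re

# Hypothetical exclusion rules: a URL matching any of these
# regular expressions is skipped rather than downloaded.
EXCLUDE_RULES = [
    re.compile(r"/login"),     # skip authentication pages
    re.compile(r"\.zip$"),     # skip large archive downloads
    re.compile(r"[?&]sort="),  # skip duplicate sorted views of the same page
]

def should_download(url):
    """Return True if no exclusion rule matches the URL."""
    return not any(rule.search(url) for rule in EXCLUDE_RULES)

# Example: only the first URL would be downloaded.
for url in [
    "https://example.com/articles/intro.html",
    "https://example.com/login?return=/articles",
    "https://example.com/downloads/site-backup.zip",
]:
    print(url, "->", "download" if should_download(url) else "skip")
```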
Important
Web crawling is not an exact science, and while the default crawl settings should work for many websites, some customisation and knowledge of how the website to be copied is structured and built may be required.
To display the project properties dialog
- From the Project menu, click Project Properties.
Configuring the Crawler
Working with local files
- Extracting inline data
- Remapping extensions
- Remapping local files
- Updating local time stamps
- Using query string parameters in local filenames
Controlling the crawl
- Content types
- Crawling above the root URL
- Crawling additional hosts
- Crawling additional root URLs
- Including sub and sibling domains
- Limiting downloads by file count
- Limiting downloads by size
- Limiting scans by depth
- Limiting scans by distance
- Scanning data attributes
- Setting speed limits
- Working with Rules
JavaScript
Security
Modifying URLs
Advanced
- Aborting the crawl using HTTP status codes
- Defining custom headers
- Following redirects
- HEAD vs GET for preliminary requests
- HTTP Compression
- Modifying page titles
- Origin reports
- Overwriting read only files
- Saving link data in a Crawler Project
- Setting cookies
- Setting the web page language
- Specifying a User Agent
- Specifying accepted content types
- Using Keep-Alive