Downloading a website can take considerable time, and sometimes you may wish to quickly check the structure of a site in order to fine-tune which documents to copy and which to exclude. WebCopy's Quick Scan functionality can help with this.
Quick scan functionality is deliberately restricted to limit the ability to scan an entire website - use the Scan or Download options for unrestricted scanning.
The quick scan currently processes each page only once. For example, page1.html#fragment, page1.html?valuea=string, and page1.html?valueb=number would all appear as page1.html in the results, and only the first occurrence would be crawled.
The Maximum depth and Maximum pages per host options in the Quick scan settings group apply only to the quick scan, and have no effect on full scans or downloads. In addition, these settings are not saved with the project.
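To illustrate how limits like these might bound a crawl, here is a hedged Python sketch; the `quick_scan` function, its callbacks, and the limit values are illustrative assumptions, not WebCopy's implementation:

```python
# Hypothetical sketch of a depth- and host-limited scan; link extraction
# is stubbed out via callbacks and the limit values are examples only.
from collections import deque

MAX_DEPTH = 3             # analogous to the Maximum depth setting
MAX_PAGES_PER_HOST = 100  # analogous to the Maximum pages per host setting

def quick_scan(start_url, get_links, get_host):
    seen, pages_per_host = set(), {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        host = get_host(url)
        if url in seen or depth > MAX_DEPTH:
            continue  # already processed, or deeper than the depth limit
        if pages_per_host.get(host, 0) >= MAX_PAGES_PER_HOST:
            continue  # host quota reached; skip further pages on this host
        seen.add(url)
        pages_per_host[host] = pages_per_host.get(host, 0) + 1
        for link in get_links(url):
            queue.append((link, depth + 1))
    return seen

# Toy usage with a stubbed link graph
links = {"a.com/": ["a.com/1", "a.com/2"], "a.com/1": ["a.com/"], "a.com/2": []}
found = quick_scan("a.com/", get_links=lambda u: links.get(u, []),
                   get_host=lambda u: u.split("/", 1)[0])
print(sorted(found))  # ['a.com/', 'a.com/1', 'a.com/2']
```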
Once the scan is complete, a diagram of all found documents is displayed. Each node is colour-coded to show how it will be processed when the website is copied.
| Colour Key | Description |
| ---------- | ----------- |
| Green      | The document will be copied |
| Yellow     | The document is a non-HTML resource located on a different domain to the website being copied; however, it will be copied with the website because the Copy all resources option is set |
| Red        | The document will not be copied |
After the initial scan has completed, additional controls are displayed allowing you to select how you want the website to be crawled. Changing an option here automatically updates the diagram to indicate how the new mode would affect a copy.
Although you can configure the Limit crawl depth setting, it is temporarily overridden by the Maximum depth field in the Quick scan settings group.
The Everything option is not recommended and should only be used on sites which are self-contained, or where rules are used to explicitly exclude addresses. Use of this option may cause WebCopy to become unstable.
By default, WebCopy won't crawl any domain that doesn't match the primary host. Changing the crawl mode settings allows you to expand the crawl to include subdomains or sibling domains, or linked resources no matter where they are located.
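The following Python sketch illustrates, under assumed mode names, how such host matching might be decided; it is a simplification and not WebCopy's actual logic:

```python
# Hypothetical sketch of crawl-mode host matching; mode names loosely
# mirror the UI options and the parent-domain split is deliberately naive.
def host_allowed(primary: str, candidate: str, mode: str) -> bool:
    if mode == "everything":
        return True                      # follow links anywhere
    if candidate == primary:
        return True                      # always follow the primary host
    if mode == "subdomains":
        return candidate.endswith("." + primary)  # e.g. blog.example.com
    if mode == "siblings":
        # Share a parent domain, e.g. a.example.com and b.example.com;
        # naive split for illustration only.
        parent = primary.split(".", 1)[-1]
        return candidate.endswith("." + parent)
    return False                         # default: primary host only

print(host_allowed("example.com", "blog.example.com", "subdomains"))    # True
print(host_allowed("www.example.com", "shop.example.com", "siblings"))  # True
print(host_allowed("example.com", "other.net", "primary"))              # False
```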
WebCopy includes the ability to specify additional domains to include in the crawl. You can easily configure these from the Quick Scan window: right-click the root node on the diagram for the domain you wish to include and click the Exclude option to toggle its status. The domain should now be highlighted in green, indicating it will be copied.
If you change your mind, repeat the process and the domain will be excluded.
To include or exclude any page, right-click the relevant diagram node and click Exclude to toggle its status.
Click OK to update your project with the specified configuration changes and close the Quick Scan window.