At the most basic level, a rule is a pattern that is matched against a URI, plus one or more action flags. Currently WebCopy only supports regular expressions for the pattern.
When a website is crawled, all enabled rules are run for each detected URI. If the pattern for a rule matches the URI, then the action flags are processed. This could mean that the URI is excluded from the crawl, or that additional processing is performed on the URI.
If a rule fails to execute, for example due to an invalid regular expression, the rule will be automatically disabled to allow the remainder of the site to copy.
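
The behaviour described above can be sketched roughly as follows. This is only an illustration, not WebCopy's actual implementation: it uses Python's `re` module as a stand-in for WebCopy's regular expression engine, and the `Rule` class, the flag names, and the `matching_flags` function are hypothetical names invented for the example. It also assumes a pattern may match anywhere in the URI.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical stand-in for a rule: a pattern plus action flags."""
    pattern: str          # regular expression source
    flags: frozenset      # illustrative flag names, e.g. {"exclude"}
    enabled: bool = True


def matching_flags(rules, uri):
    """Return the action flags of every enabled rule whose pattern matches uri."""
    matched = []
    for rule in rules:
        if not rule.enabled:
            continue
        try:
            hit = re.search(rule.pattern, uri)  # assumption: unanchored search
        except re.error:
            # The rule failed to execute (invalid expression): disable it
            # so the remaining rules and URIs can still be processed.
            rule.enabled = False
            continue
        if hit:
            matched.append(rule.flags)
    return matched


rules = [
    Rule(r"/private/", frozenset({"exclude"})),
    Rule(r"([unclosed", frozenset({"exclude"})),  # deliberately invalid
    Rule(r"\.pdf$", frozenset({"exclude", "crawl content"})),
]
result = matching_flags(rules, "/private/report.pdf")
```

After this runs, `result` holds the flags of the first and third rules, and the second rule has been disabled because its expression could not be compiled.
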
Option | Description |
---|---|
Expression | A regular expression which is matched against a URL while analysing or copying a website. If a match is found, and the rule is enabled, then the attributes below are processed. |
Exclude | Specifies that the URL should be excluded |
Crawl Content | Specifies that although the URL is excluded, its contents should still be scanned |
Include | Specifies that the URL should be included |
Don't Crawl Content | Specifies that although the URL is included, its contents should not be scanned |
Use Full URI | By default, the pattern is only matched against the path and query string of the URL. If this option is specified, the pattern is checked against the entire URL, including the scheme, domain etc. |
Enable this rule | Specifies if the rule is enabled or not |
Stop processing more rules | By default, WebCopy will try to process all rules. You can use this flag to change that; if it is set and the rule matches, no further rules will be processed |
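
As a rough illustration of how the Use Full URI and Stop processing more rules options might interact, here is a minimal Python sketch. It is not WebCopy's implementation: the `Rule` class, `match_target`, and `applied_rules` are hypothetical names, `example.com` is a placeholder domain, and the exact form of the path and query string (for instance, whether it begins with a slash) is an assumption.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class Rule:
    """Hypothetical model of the options in the table above."""
    expression: str
    use_full_uri: bool = False
    stop_processing: bool = False
    enabled: bool = True


def match_target(rule, url):
    """Return the text the rule's expression is tested against."""
    if rule.use_full_uri:
        return url  # entire URL, including scheme and domain
    parts = urlsplit(url)
    # Assumption: the default target is the path plus "?query" if present.
    return parts.path + ("?" + parts.query if parts.query else "")


def applied_rules(rules, url):
    """Return the expressions of the rules that matched, in order."""
    applied = []
    for rule in rules:
        if not rule.enabled:
            continue
        if re.search(rule.expression, match_target(rule, url)):
            applied.append(rule.expression)
            if rule.stop_processing:
                break  # no further rules are processed
    return applied


rules = [
    Rule(r"^https://cdn\.example\.com/", use_full_uri=True),
    Rule(r"^/blog/", stop_processing=True),
    Rule(r"\?page=\d+$"),
]
url = "https://www.example.com/blog/index.html?page=2"
applied = applied_rules(rules, url)
```

Here the first rule is tested against the whole URL and does not match, the second matches the path and stops processing, and so the third rule is never evaluated even though its expression would have matched.
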
The "Do not allow children to inherit this rule" and "Reverse" flags have been deprecated and will be removed in a future version of WebCopy.