At the most basic level a rule is a pattern to match a given input (either a URI or the content type of the URI) and one or more flags to control behaviour. Currently WebCopy only supports regular expressions for the pattern.
When a website is crawled, all enabled rules are run against each detected URI. If the pattern for a rule matches the URI or content type, the rule's behaviour actions are processed.
If a rule fails to execute, for example due to an invalid regular expression, the rule will be automatically disabled.
| Field | Description |
|-------|-------------|
| Compare | Allows you to choose what the pattern will be matched against, for example just the query string of a URI, or the content type |
| Expression | Specifies an expression that will be matched against the input. You can use regular expressions for more advanced control |
> The Expression field uses regular expressions. If your search phrase includes any of the special regular expression characters (such as `\`, `^`, `$`, `.`, `|`, `?`, `*`, `+`, `(`, `)`, `[`, `]`, `{`, `}`) and you don't want them to be processed as part of the expression, you need to "escape" the character by preceding it with a backslash. For example, if your expression was `application/epub+zip`, this would need to be written as `application/epub\+zip`; otherwise the `+` character would have a special meaning and no matches would be made.
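To see why the escaping matters, the snippet below uses Python's `re` module purely for illustration (WebCopy's regex engine may differ in details, but both treat `+` as a "one or more" quantifier). The unescaped pattern fails to match the literal content type, while the escaped one succeeds:

```python
import re

content_type = "application/epub+zip"

# Unescaped: "+" means "one or more of the preceding character", so this
# pattern matches "...epubzip", "...epubbzip", etc. - never a literal "+".
print(re.search(r"application/epub+zip", content_type))   # None - no match

# Escaped: "\+" matches a literal plus sign.
print(re.search(r"application/epub\+zip", content_type))  # match found

# re.escape() escapes every special character in a string for you.
print(re.escape("application/epub+zip"))                  # application/epub\+zip
```

If a pattern is meant to match a fixed piece of text rather than act as a template, escaping the whole string up front avoids this class of surprise.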
This table outlines the different compare options available.

| Option | Description |
|--------|-------------|
| Authority | The authority (domain) of the URI |
| Authority, Path, and Query String | The domain, path and query string of the URI |
| Content Type | The detected content type of the URI |
| Entire URL | The complete URL |
| Path | The path of the URI, including file names if applicable |
| Path and Query String | The path and query string of the URI |
| Query String | The query string of the URI |
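To make the options concrete, here is roughly how a sample address breaks down into those components. The address is hypothetical and the parsing is done with Python's `urllib.parse` for illustration; WebCopy's own splitting may differ in edge cases:

```python
from urllib.parse import urlsplit

# Hypothetical sample address, used only to illustrate the compare options.
url = "https://www.example.com/shop/list.php?category=books"

parts = urlsplit(url)
print(parts.netloc)                    # Authority:      www.example.com
print(parts.path)                      # Path:           /shop/list.php
print(parts.query)                     # Query String:   category=books
print(parts.path + "?" + parts.query)  # Path and Query String
print(url)                             # Entire URL
```

A rule comparing against Query String would therefore only ever see text like `category=books`, so a pattern targeting the domain would never match it.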
| Option | Description |
|--------|-------------|
| Matches | Specifies the rule will be processed if the given input matches the rule expression |
| Does Not Match | Specifies the rule will be processed if the given input does not match the rule expression |
| Option | Description |
|--------|-------------|
| Enable this rule | Specifies if the rule is enabled or not. Disabled rules will be ignored |
| Exclude | Specifies that the URL should be excluded |
| Include | Specifies that the URL should be included. This allows you to have a wider rule to exclude content, and then a narrower rule to include specific content |
| Crawl Content | Specifies that although the URL is excluded, its contents should still be scanned (applies to HTML documents only). This means that although a permanent copy of the URL is not downloaded, a temporary copy is still made in order to scan for additional URLs to crawl |
| Don't Crawl Content | Specifies that although the URL is included, its contents should not be scanned (applies to HTML documents only). This means that while a permanent copy of the URL is created, it will not be scanned for additional URLs to crawl |
| Stop processing more rules | By default, all rules are processed sequentially. If this flag is set and the rule is matched, no further rules will be processed |
| Download Priority | Allows the download priority for URLs matching the rule to be changed. High priority means the URL will be downloaded immediately, while Low means the URL will be downloaded when all other URLs have been processed ¹ |

¹ The Download Priority option is only supported for rules that match against a URL; it is ignored for rules matching against content types.
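The sequential evaluation described above can be sketched as a small model. This is an illustrative approximation, not WebCopy's actual implementation; the field names and the "last matching rule wins" assumption are mine:

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    # Field names mirror the options above, not WebCopy's object model.
    pattern: str          # regular expression
    match: bool = True    # True = "Matches", False = "Does Not Match"
    enabled: bool = True  # "Enable this rule"
    exclude: bool = False # True = "Exclude", False = "Include"
    stop: bool = False    # "Stop processing more rules"

def should_download(url: str, rules: list[Rule], default: bool = True) -> bool:
    """Run every enabled rule in order; the last matching rule decides,
    unless a matching rule sets the stop flag."""
    decision = default
    for rule in rules:
        if not rule.enabled:
            continue
        matched = bool(re.search(rule.pattern, url))
        if matched == rule.match:
            decision = not rule.exclude
            if rule.stop:
                break
    return decision

# A wider exclude rule followed by a narrower include rule, as described above.
rules = [
    Rule(pattern=r"/private/", exclude=True),
    Rule(pattern=r"/private/index\.html$", exclude=False),
]
print(should_download("https://example.com/private/data.zip", rules))    # False
print(should_download("https://example.com/private/index.html", rules))  # True
```

The example shows why rule order matters: the narrow include only takes effect because it runs after the broad exclude, and a stop flag on the first rule would have prevented it from ever being considered.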