At the most basic level, a rule is a regular expression to match a given input (either a URL or the content type of the URL) and one or more flags to control behaviour.
When a website is crawled, all enabled rules are run for each detected URL. If a rule's expression matches the URL or content type, its behaviour actions are processed.
If a rule fails to execute, for example due to an invalid regular expression, the rule is automatically disabled.
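As a rough illustration of this behaviour, the Python sketch below runs every enabled rule against an input and disables any rule whose expression fails to execute. The `Rule` class and function names are invented for the example; this is not the application's actual implementation.

```python
import re

class Rule:
    """Illustrative stand-in for a crawler rule: a pattern plus behaviour flags."""

    def __init__(self, pattern, enabled=True):
        self.pattern = pattern
        self.enabled = enabled

def evaluate_rules(rules, value):
    """Run every enabled rule against an input (a URL or a content type)."""
    for rule in rules:
        if not rule.enabled:
            continue  # disabled rules are ignored
        try:
            if re.search(rule.pattern, value):
                print(f"{rule.pattern!r} matched; processing its behaviour actions")
        except re.error:
            # A rule that fails to execute (e.g. an invalid expression)
            # is automatically disabled.
            rule.enabled = False

evaluate_rules([Rule(r"example\.com"), Rule(r"(unclosed")],
               "http://www.example.com/folder/products")
```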
Option | Description |
---|---|
Compare | Allows you to choose what the pattern will be matched against, for example just the query string of a URL, or the content type |
Expression | Specifies an expression that will be matched against the input. You can use regular expressions for more advanced control |
If your expression includes any of the characters `^` `[` `.` `$` `{` `*` `(` `\` `+` `)` `|` `?` `<` `>` and you want them to be processed as plain text, you need to "escape" the character by preceding it with a backslash. For example, if your expression was `application/epub+zip`, it would need to be written as `application/epub\+zip`, otherwise the `+` character would have a special meaning and no matches would be made. Similarly, if the expression was `example.com`, this should be written as `example\.com`, as `.` means "any character", which could lead to unexpected matches.
This table outlines the different compare options available. The example match is based on the following sample address:
`http://www.example.com/folder/products?sort=name&order=asc`
Option | Description | Example |
---|---|---|
Authority | The URL domain | www.example.com |
Authority, Path, and Query String | The domain, path and query string of the URL | www.example.com/folder/products?sort=name&order=asc |
Content Type | The detected content type of the URL | text/html |
Entire URL | The complete URL | http://www.example.com/folder/products?sort=name&order=asc |
Path | The path of the URL, including file names if applicable | folder/products |
Path and Query String | The path and query string of the URL | folder/products?sort=name&order=asc |
Query String | The query string of the URL | sort=name&order=asc |
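To make the compare options concrete, here is a minimal Python sketch deriving each input from the sample address using `urllib.parse`. This is an illustration only; the exact strings the application compares against may differ slightly, and the content type comes from the server response rather than the URL itself.

```python
from urllib.parse import urlsplit

url = "http://www.example.com/folder/products?sort=name&order=asc"
parts = urlsplit(url)

print(parts.netloc)                                 # Authority: www.example.com
print(f"{parts.netloc}{parts.path}?{parts.query}")  # Authority, Path, and Query String
print(url)                                          # Entire URL
print(parts.path.lstrip("/"))                       # Path: folder/products
print(f"{parts.path.lstrip('/')}?{parts.query}")    # Path and Query String
print(parts.query)                                  # Query String: sort=name&order=asc
```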
Operand | Description |
---|---|
Matches | Specifies that the rule will be processed if the given input matches the rule expression |
Does Not Match | Specifies that the rule will be processed if the given input does not match the rule expression |
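In effect, Does Not Match simply inverts the result of the comparison; a minimal sketch (the operand strings and function name are illustrative):

```python
import re

def rule_applies(pattern, value, operand):
    matched = re.search(pattern, value) is not None
    # "Matches" processes the rule on a match; "Does Not Match" on a non-match.
    return matched if operand == "Matches" else not matched

print(rule_applies(r"example\.com", "http://www.example.com/", "Matches"))         # True
print(rule_applies(r"example\.com", "http://www.example.com/", "Does Not Match"))  # False
```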
Option | Description |
---|---|
Enable this rule | Specifies whether the rule is enabled. Disabled rules will be ignored |
Exclude | Specifies that the URL should not be downloaded |
Exclude, but scan for additional content | Specifies that although the URL is excluded and will not be permanently saved, its contents should still be scanned (applies to HTML documents only). This means that although a permanent copy of the URL is not downloaded, a temporary copy is still made in order to scan for additional URLs to crawl. |
Download | Specifies that the URL should be downloaded. This allows you to have a wider rule to exclude content, and then a narrower rule to include specific content. |
Download, but do not scan for additional content | Specifies that although the URL is downloaded, its contents should not be scanned. This means that while a permanent copy of the URL is created, it will not be scanned for additional URLs to crawl. This option applies to HTML documents only |
Stop processing more rules | By default, all rules are processed sequentially. You can use this flag to control this behaviour; if it is set and the rule matches, no further rules will be processed |
Download Priority | Allows the download priority for URLs matching the rule to be changed. High means the URL will be downloaded immediately, while Low means it will be downloaded once all other URLs have been processed¹. |
¹ The Download Priority option is only supported for rules that match against a URL; it is ignored for rules matching against content types.
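Putting the flags together, sequential processing with the "stop" flag might look like the following sketch. The rule structure and action names are assumptions based on the descriptions above, not the application's actual code. Note how a wide Exclude rule can be followed by a narrower Download rule, as mentioned earlier:

```python
import re

def decide_action(rules, value, default="download"):
    """Process rules in order; a later matching rule overrides an earlier one,
    unless a matching rule sets stop_processing."""
    action = default
    for rule in rules:
        if not rule["enabled"]:
            continue  # disabled rules are ignored
        if re.search(rule["pattern"], value):
            action = rule["action"]
            if rule["stop_processing"]:
                break  # "Stop processing more rules": no further rules run
    return action

rules = [
    # Wide rule: exclude everything on example.com...
    {"enabled": True, "pattern": r"example\.com", "action": "exclude",
     "stop_processing": False},
    # ...then a narrower rule re-includes the products pages.
    {"enabled": True, "pattern": r"/products", "action": "download",
     "stop_processing": True},
]
print(decide_action(rules, "http://www.example.com/folder/products"))  # download
```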