In our previous tutorial we described how to define rules. This example follows on from it and describes how you can use rules to crawl an entire website, but only save images.
To do an image-only copy of a website, we need to configure a number of rules.
| Expression | Options |
| --- | --- |
| `.*` | Exclude, Crawl Content |
| `\.png` | Include, Stop Processing |
| `\.gif` | Include, Stop Processing |
| `\.jpg` | Include, Stop Processing |
The first rule instructs WebCopy to not download any files at all to the save folder, but to still crawl HTML files. This is done by using the expression `.*` to match all URLs, and the rule options Exclude and Crawl Content.
Each subsequent rule adds a regular expression to match a specific image extension, for example `\.png`, and then uses the Include option to override the previous rule and cause the file to be downloaded. Once a match is made there is no need to continue checking rules, so the Stop Processing option is also set. Alternatively, you could have a single rule which matches multiple extensions, for example `\.(?:png|gif|jpg)`.
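If you want to verify an expression before adding it as a rule, you can test it with any regular expression tool. The short Python sketch below (only the expression itself comes from this tutorial; the URLs are made up for illustration) confirms that the combined pattern matches the same extensions as the three separate rules:

```python
import re

# The combined expression from the tutorial; matches .png, .gif and .jpg URLs
pattern = re.compile(r"\.(?:png|gif|jpg)")

# Hypothetical URLs used only to demonstrate what the pattern matches
urls = [
    "https://example.com/images/logo.png",
    "https://example.com/banner.gif",
    "https://example.com/photo.jpg",
    "https://example.com/index.html",
]

for url in urls:
    print(url, "->", "match" if pattern.search(url) else "no match")
```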
With these rules in place, when you copy a website WebCopy will scan all HTML files, but only download the files matching the specified extensions to the save folder.
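To make the interaction between these options easier to follow, here is a simplified model of how an ordered rule list like the one above could be evaluated. This is only a sketch under assumed semantics (the defaults when nothing matches, and the way Crawl Content feeds into the crawl decision, are guesses for illustration), not WebCopy's actual implementation:

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    expression: str               # regular expression tested against the URL
    include: bool                 # True = Include, False = Exclude
    crawl_content: bool = False   # still scan the document for further links
    stop_processing: bool = False # no later rules are evaluated after a match

# The rule set from the table above
rules = [
    Rule(r".*", include=False, crawl_content=True),   # exclude everything, but keep crawling
    Rule(r"\.png", include=True, stop_processing=True),
    Rule(r"\.gif", include=True, stop_processing=True),
    Rule(r"\.jpg", include=True, stop_processing=True),
]

def evaluate(url: str) -> tuple[bool, bool]:
    """Return (download, crawl) for a URL under this simplified model."""
    download, crawl = True, True  # assumed defaults when no rule matches
    for rule in rules:
        if re.search(rule.expression, url):
            download = rule.include
            crawl = rule.include or rule.crawl_content
            if rule.stop_processing:
                break
    return download, crawl

print(evaluate("https://example.com/index.html"))  # (False, True) - crawled, not saved
print(evaluate("https://example.com/logo.png"))    # (True, True)  - saved
```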
Add new rules with different extensions to copy different files, for example `zip`, `exe` or `msi` to download programs.
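The same single-rule shortcut works here too; the pattern below is an illustrative example (not part of the tutorial) that you could check in the same way:

```python
import re

# Illustrative combined expression for program downloads
program_pattern = re.compile(r"\.(?:zip|exe|msi)")

print(bool(program_pattern.search("https://example.com/setup.msi")))    # True
print(bool(program_pattern.search("https://example.com/readme.html")))  # False
```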