In our previous tutorial we described how to define rules. This example follows on from that tutorial and describes how you can use rules to crawl an entire website but save only images.
To make an image-only copy of a website, we need to configure a number of rules.
| Expression | Options |
| --- | --- |
| `.*` | Exclude, Crawl Content |
| `\.png` | Include, Stop Processing |
| `\.gif` | Include, Stop Processing |
| `\.jpg` | Include, Stop Processing |
The first rule instructs WebCopy not to download any files to the save folder, but to still crawl HTML files. This is done by using the expression `.*` to match all URIs, together with the rule options Exclude and Crawl Content.
Each subsequent rule adds a regular expression that matches a specific image extension, for example `\.png`, and then uses the Include option to override the previous rule and cause the file to be downloaded. Once a match is made there is no need to continue checking rules, so the Stop Processing option is also set. Alternatively, you could use a single rule that matches multiple extensions, for example `\.(?:png|gif|jpg)`.
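WebCopy evaluates its rules internally, so the following is only a rough Python sketch of the behaviour described above, where a later matching rule overrides an earlier one and Stop Processing ends the search. The `Rule` class and `should_download` function are hypothetical names invented for this illustration, not part of WebCopy.

```python
import re
from typing import NamedTuple

class Rule(NamedTuple):
    pattern: str   # regular expression tested against the URI
    include: bool  # True = Include (download), False = Exclude (skip)
    stop: bool     # True = Stop Processing (no later rules are checked)

# The four rules from the table above, in the same order.
RULES = [
    Rule(r".*",    include=False, stop=False),  # exclude everything, but still crawl content
    Rule(r"\.png", include=True,  stop=True),   # include PNG images
    Rule(r"\.gif", include=True,  stop=True),   # include GIF images
    Rule(r"\.jpg", include=True,  stop=True),   # include JPEG images
]

def should_download(uri: str) -> bool:
    """Return True if the rules mark this URI for download."""
    download = False
    for rule in RULES:
        if re.search(rule.pattern, uri):
            download = rule.include  # a later match overrides an earlier one
            if rule.stop:
                break                # Stop Processing: ignore any remaining rules
    return download

if __name__ == "__main__":
    for uri in (
        "http://example.com/index.html",
        "http://example.com/images/logo.png",
        "http://example.com/downloads/setup.zip",
    ):
        print(uri, "->", "download" if should_download(uri) else "skip")
```

Running the sketch prints download for the `.png` URI only; in WebCopy the HTML page would still be crawled for links (because of Crawl Content) even though it is not saved.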
With these rules in place, when you copy a website WebCopy will scan all HTML files but will only download files matching the specified extensions to the save folder.
Tip - add new rules with different extensions to copy other types of file, for example `zip`, `exe` or `msi` to download programs.
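If you prefer the single combined rule shown earlier, the same idea applied to program downloads might use an expression such as `\.(?:zip|exe|msi)`. The short sketch below simply tests that pattern against a couple of sample URIs; the `PROGRAMS` name and the example addresses are illustrative only.

```python
import re

# Hypothetical combined pattern for program downloads, in the same
# style as the \.(?:png|gif|jpg) example used for images.
PROGRAMS = re.compile(r"\.(?:zip|exe|msi)", re.IGNORECASE)

for uri in ("http://example.com/tools/setup.msi", "http://example.com/about.html"):
    print(uri, "->", "match" if PROGRAMS.search(uri) else "no match")
```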