In our previous tutorial we described how to define rules. This example follows on from that tutorial and describes how you can use rules to crawl an entire website while saving only images.

Copying only images from a website

To do an image-only copy of a website, we need to configure a number of rules.

Expression   Options
.*           Exclude, Crawl Content
\.png        Include, Stop Processing
\.gif        Include, Stop Processing
\.jpg        Include, Stop Processing

The first rule instructs WebCopy not to download any files to the save folder, but to continue crawling HTML files. This is done by using the expression .* to match all URLs, combined with the rule options Exclude and Crawl Content.

Each subsequent rule adds a regular expression to match a specific image extension, for example \.png, and then uses the Include option to override the previous rule and cause the file to be downloaded. Once a match is made there is no need to continue checking rules, so the Stop Processing option is also set. Alternatively, you could use a single rule that matches multiple extensions, for example \.(?:png|gif|jpg).
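To make the evaluation order concrete, here is a minimal Python sketch that models a rule list like the one above. The tuple layout, the should_download function and the sample URLs are illustrative assumptions for this tutorial, not WebCopy's actual implementation, and the sketch only models the download decision (the Crawl Content behaviour is noted in a comment).

    import re

    # Illustrative model of the rule table above; not WebCopy's
    # actual implementation. Each rule is (expression, include,
    # stop_processing).
    RULES = [
        (r".*",    False, False),  # Exclude, Crawl Content: skip download, keep crawling
        (r"\.png", True,  True),   # Include, Stop Processing
        (r"\.gif", True,  True),
        (r"\.jpg", True,  True),
    ]

    def should_download(url: str) -> bool:
        """Apply rules in order; the last matching rule decides,
        unless a matching rule has Stop Processing set."""
        download = True  # default when no rule matches
        for expression, include, stop in RULES:
            if re.search(expression, url):
                download = include
                if stop:
                    break  # Stop Processing: ignore any remaining rules
        return download

    for url in ("http://example.com/index.html",
                "http://example.com/images/logo.png",
                "http://example.com/files/archive.zip"):
        print(url, "->", "download" if should_download(url) else "skip")

Running the sketch reports "skip" for the HTML page and the zip archive, but "download" for the PNG image, mirroring the rule table: the HTML page is still crawled for links, it just isn't saved.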

With these rules in place, when you copy a website WebCopy will scan all HTML files but will only download files matching the specified extensions to the save folder.

Tip

Add new rules with different extensions to copy other types of file, for example zip, exe or msi to download programs.
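As a quick illustration, the following Python snippet tests one combined expression covering several program extensions, in the same style as the \.(?:png|gif|jpg) example earlier. The pattern, the case-insensitive flag and the sample URLs are assumptions for demonstration only.

    import re

    # Illustrative only: one expression matching several program
    # extensions, analogous to \.(?:png|gif|jpg) for images.
    PROGRAM_RULE = re.compile(r"\.(?:zip|exe|msi)", re.IGNORECASE)

    for url in ("http://example.com/downloads/setup.msi",
                "http://example.com/downloads/notes.html"):
        print(url, "->", "matches" if PROGRAM_RULE.search(url) else "no match")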
