Cyotek has provided a demonstration website, demo.cyotek.com, that can be used to test many WebCopy features. A corresponding project, demo.cwp, ships with WebCopy and is designed to work with this site and demonstrate functionality.
This topic describes the customisations made to the project as a guide to how you may wish to use similar functionality when creating your own copy jobs.
The website has a faux protected area that you can log into using a form. The Capture tool was used to create a definition for automatically logging into the website before copying starts.
The presence of features/authenticationprofile.html in the downloaded website indicates the login was successful.
Challenge-based authentication is a little old fashioned these days, as many websites prefer form-based authentication as described above, but some websites still issue 401 challenges. The project has two passwords defined: a global one which will be used for any challenge by the site, and another which is tied to a specific page.
If you copy the website and open statuscodes/4xx/401basic.html you will note that it references guest2, whilst statuscodes/4xx/401digest.html will reference guest3, showing that the different credentials were used accordingly.
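The choice between a global password and a page-specific one can be pictured as a lookup with a fallback. The following is a minimal sketch of that idea only, not WebCopy's implementation; the host, path, user names, and passwords are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical credentials, for illustration only.
GLOBAL_CREDENTIAL = ("global-user", "global-secret")

# Hypothetical page-specific credentials, keyed by URL path.
PAGE_CREDENTIALS = {
    "/private/report.html": ("page-user", "page-secret"),
}


def pick_credential(url):
    """Return the page-specific credential if one exists, else the global one."""
    path = urlsplit(url).path
    return PAGE_CREDENTIALS.get(path, GLOBAL_CREDENTIAL)


print(pick_credential("https://example.com/private/report.html"))
print(pick_credential("https://example.com/private/other.html"))
```

The second call falls back to the global credential because no entry is tied to that path.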
Rules can be used to exclude parts of a website. In this demo, a rule has been created to exclude the /features/downloadtest.php URL, as the files it creates aren't actually valid.
Another rule also exists to exclude /features/authenticationlogout.php, as it is somewhat pointless logging into a website only to log right back out again!
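As a rough illustration of how exclusion rules like these two behave, the sketch below tests a URL's path against a list of patterns. It assumes the rules are expressed as regular expressions matched against the path, which is an assumption made for the example rather than a description of how WebCopy evaluates rules internally.

```python
import re
from urllib.parse import urlsplit

# The two excluded paths named above, written as anchored regular expressions.
# Treating rules as regexes over the URL path is an assumption of this sketch.
EXCLUDE_PATTERNS = [
    re.compile(r"^/features/downloadtest\.php$"),
    re.compile(r"^/features/authenticationlogout\.php$"),
]


def is_excluded(url):
    """Return True if the URL's path matches any exclusion pattern."""
    path = urlsplit(url).path
    return any(pattern.search(path) for pattern in EXCLUDE_PATTERNS)


print(is_excluded("https://demo.cyotek.com/features/authenticationlogout.php"))
print(is_excluded("https://demo.cyotek.com/features/cookies.html"))
```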
The Download All Resources option is used to allow non-HTML resources to be downloaded from any location, unless explicitly disabled by rules. By default, this option is set for new projects; however, the demo project has it disabled, which stops the crawl from hitting third-party sites.
On occasion there may not be a direct link to a resource you want to copy. Several pages on the demonstration site are only accessible via JavaScript, so that a default crawl does not display a bunch of errors (it is a testing website after all!). The project therefore includes a pair of additional URLs to hit the 401 challenge pages mentioned above.
The free text field also demonstrates the use of comments and whitespace to break up the URL list.
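To show what a commented, spaced-out URL list amounts to once the comments and blank lines are ignored, here is a minimal sketch. The `#` comment marker and the example.com URLs are assumptions made for illustration; check the project itself for the syntax WebCopy actually accepts.

```python
# Hypothetical URL list; the '#' comment marker is an assumption of this sketch.
SAMPLE_LIST = """
# Pages that are only reachable via script

https://example.com/protected/basic
https://example.com/protected/digest
"""


def parse_url_list(text, comment_prefix="#"):
    """Return the URLs in text, skipping blank lines and comment lines."""
    urls = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(comment_prefix):
            continue
        urls.append(stripped)
    return urls


print(parse_url_list(SAMPLE_LIST))
```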
Some websites use data attributes to contain links to other resources, typically images. The project defines one custom attribute, data-original, used to pick up a pair of images in the website that otherwise wouldn't be detected.
The files assets/img/background3.png and assets/img/background4.png will be present in the copied website if this attribute is defined.
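The effect of a custom attribute can be sketched with Python's standard HTML parser: a scan that only looks at href and src misses the image, while adding data-original to the list of attributes picks it up. The markup and the placeholder image name below are hypothetical, and this is not how WebCopy itself is implemented.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real image is only referenced via data-original.
SAMPLE_HTML = (
    '<img src="assets/img/placeholder.png" '
    'data-original="assets/img/background3.png">'
)


class AttributeLinkParser(HTMLParser):
    """Collect the values of selected attributes from every start tag."""

    def __init__(self, attribute_names):
        super().__init__()
        self.attribute_names = set(attribute_names)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases attribute names before passing them here.
        for name, value in attrs:
            if name in self.attribute_names and value:
                self.links.append(value)


def find_links(html, attribute_names=("href", "src")):
    """Return attribute values that would be treated as links."""
    parser = AttributeLinkParser(attribute_names)
    parser.feed(html)
    parser.close()
    return parser.links


print(find_links(SAMPLE_HTML))
print(find_links(SAMPLE_HTML, ("href", "src", "data-original")))
```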
Request headers can be used to direct the server how to respond, for example to use a different language or enable specific compression options. Custom headers are also supported, which could be used for various purposes - for example an API key to access a resource. The project defines a single custom header, X-Transport-Version, to send with every request.
If you copy the website and open features/requestheaders.html, the custom header and value will be referenced in this page.
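For readers who want to see what attaching a custom header to every request looks like in code, the following sketch builds (but does not send) a request using Python's standard library. The header value is hypothetical, as the value used by the demo project is not stated here.

```python
import urllib.request

# The header name comes from the demo project; the value is hypothetical.
CUSTOM_HEADERS = {"X-Transport-Version": "1.0"}


def build_request(url, extra_headers=CUSTOM_HEADERS):
    """Create a request object carrying the custom headers (nothing is sent)."""
    request = urllib.request.Request(url)
    for name, value in extra_headers.items():
        request.add_header(name, value)
    return request


request = build_request("https://demo.cyotek.com/")
# urllib normalises the capitalisation of stored header names, so compare
# names case-insensitively, as HTTP itself does.
headers = {name.lower(): value for name, value in request.header_items()}
print(headers["x-transport-version"])
```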
Cookies are often used for storing session information. WebCopy allows cookies to be defined in the project, or read from an external file - useful for working around authentication issues where JavaScript or MFA is involved. The project defines a pair of example cookies, one using simple name=value pairing, the second including a maximum age.
If you copy the website and open features/cookies.html, the custom cookies (along with the ones provided by the site itself) will be listed on this page.
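The difference between a plain name=value cookie and one carrying a maximum age can be illustrated with Python's standard cookie parser. The cookie names and values below are hypothetical, and the `Max-Age` attribute syntax is borrowed from HTTP cookie headers as an assumption; it may not match the exact format WebCopy expects.

```python
from http.cookies import SimpleCookie


def parse_cookie_definition(definition):
    """Parse a cookie string into {name: {"value": ..., "max_age": ...}}."""
    cookie = SimpleCookie()
    cookie.load(definition)
    return {
        name: {
            "value": morsel.value,
            # An unset max-age is stored as an empty string; report it as None.
            "max_age": morsel["max-age"] or None,
        }
        for name, morsel in cookie.items()
    }


# Hypothetical cookies: one simple pair, one with a maximum age in seconds.
print(parse_cookie_definition("theme=dark"))
print(parse_cookie_definition("visitor=abc123; Max-Age=3600"))
```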