Cyotek has provided a demonstration website, demo.cyotek.com, that can be used to test many WebCopy features. A corresponding project, demo.cwp, ships with WebCopy and is designed to work with this site and demonstrate functionality.
This topic describes the customisations made to the project as a guide to how you may wish to use similar functionality when creating your own copy jobs.
Forms
The website has a faux protected area that you can log into using a form. The Capture tool was used to create a definition for automatically logging into the website before copying starts.
The presence of features/authenticationprofile.html in the downloaded website indicates the login was successful.
Passwords
Although 401 challenges are a little old-fashioned these days, as many websites prefer form-based authentication as above, some websites still issue them. The project has two passwords defined: a global one that is used for any challenge by the site, and another that is tied to a specific page.
If you copy the website and open statuscodes/4xx/401basic.html you will note that it references guest2, whilst statuscodes/4xx/401digest.html will reference guest3, showing that the different credentials were used accordingly.
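The idea of a global password plus a page-specific one can be sketched as a simple lookup. This is only an illustration of the concept, not how WebCopy itself stores or selects credentials; the Credential type, the pick_credential function, and the user names and paths in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Credential:
    username: str
    password: str
    # Hypothetical field: None means "use for any challenge on the site".
    path: Optional[str] = None


def pick_credential(url: str, credentials: list[Credential]) -> Optional[Credential]:
    """Prefer a credential tied to the requested page, else the global one."""
    path = urlsplit(url).path
    fallback = None
    for cred in credentials:
        if cred.path == path:
            return cred  # a page-specific entry always wins
        if cred.path is None and fallback is None:
            fallback = cred  # remember the first global entry
    return fallback
```

For example, given a global entry and an entry tied to /private/report.html (both made up), a challenge from that page would receive the page-specific credential, while a challenge from any other page would fall back to the global one.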
Rules
Rules can be used to exclude parts of a website. In this demo, a rule has been created to exclude the /features/downloadtest.php URL, as the files it creates aren't actually valid.
Another rule also exists to exclude /features/authenticationlogout.php, as it is somewhat pointless logging into a website only to log right back out again!
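As a rough illustration of how exclusion rules of this kind behave, the sketch below tests a URL's path against a list of patterns. The exact expressions used in demo.cwp are not shown here, so the regular expressions and the is_excluded helper are assumptions made for the example.

```python
import re
from urllib.parse import urlsplit

# Assumed patterns: one per excluded URL described above.
EXCLUDE_PATTERNS = [
    re.compile(r"^/features/downloadtest\.php$"),
    re.compile(r"^/features/authenticationlogout\.php$"),
]


def is_excluded(url: str) -> bool:
    """Return True if the URL's path matches any exclusion pattern."""
    path = urlsplit(url).path
    return any(pattern.search(path) for pattern in EXCLUDE_PATTERNS)
```

A crawler following this approach would check is_excluded before requesting each URL and skip any that match.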
Download All Resources
The Download All Resources option allows non-HTML resources to be downloaded from any location, unless explicitly disabled by rules. By default this option is enabled for new projects; however, the demo project has it disabled, which stops the crawl from hitting third-party sites.
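The effect of the option can be pictured as a small decision function. This is a deliberately simplified sketch that assumes "any location" means "any host" and ignores rules entirely; should_download and its parameters are hypothetical names, not WebCopy settings.

```python
from urllib.parse import urlsplit


def should_download(url: str, start_url: str, is_html: bool,
                    download_all_resources: bool) -> bool:
    """Decide whether a discovered URL would be fetched (simplified)."""
    same_host = urlsplit(url).hostname == urlsplit(start_url).hostname
    if same_host:
        return True  # content on the site being copied is always in scope
    # Off-site content is only fetched when the option is on,
    # and only if it is a non-HTML resource such as an image or stylesheet.
    return download_all_resources and not is_html
```

With the option disabled, as in the demo project, every off-site URL is rejected, so the crawl stays on the demonstration site.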
Additional URLs
On occasion there may not be a direct link to a resource you want to copy. Several pages on the demonstration site are only accessible via JavaScript, so that a default crawl does not display a bunch of errors (it is a testing website after all!). The project therefore includes a pair of additional URLs to hit the 401 challenge pages mentioned above.
The free text field also demonstrates the use of comments and whitespace to break up the URL list.
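A free text URL list with comments and blank lines is straightforward to process. The sketch below shows one way such a list might be read; the "#" comment marker is an assumption for illustration, as are the parse_url_list name and the sample URLs.

```python
def parse_url_list(text: str, comment_prefix: str = "#") -> list[str]:
    """Return the URLs in a free text list, skipping comments and blank lines."""
    urls = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(comment_prefix):
            continue  # whitespace and comments only exist to aid readability
        urls.append(line)
    return urls


# Hypothetical list contents, grouped with a comment and a blank line.
SAMPLE = """
# Pages that are only reachable via script
https://example.com/hidden/first.html

https://example.com/hidden/second.html
"""
```

Calling parse_url_list(SAMPLE) yields just the two URLs, in the order they were written.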
Custom Attributes
Some websites use data attributes to contain links to other resources, typically images. The project defines one custom attribute, data-original, used to pick up a pair of images in the website that otherwise wouldn't be detected.
The files assets/img/background3.png and assets/img/background4.png will be present in the copied website if this attribute is defined.
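To see why a custom attribute changes what a crawl discovers, consider a link scanner that only inspects a configured set of attribute names. The sketch below uses Python's standard html.parser; the sample markup and base URL are invented for the example and are not taken from the demonstration site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AttributeLinkParser(HTMLParser):
    """Collect absolute URLs from a configurable set of attributes."""

    def __init__(self, base_url: str, attribute_names: list[str]):
        super().__init__()
        self.base_url = base_url
        # HTMLParser reports attribute names in lower case.
        self.attribute_names = {name.lower() for name in attribute_names}
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.attribute_names and value:
                self.links.append(urljoin(self.base_url, value))


def find_links(html: str, base_url: str, attribute_names: list[str]) -> list[str]:
    parser = AttributeLinkParser(base_url, attribute_names)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup: the images are only referenced via data-original.
SAMPLE_HTML = (
    '<div data-original="assets/img/background3.png"></div>'
    '<img src="placeholder.png" data-original="/assets/img/background4.png">'
)
```

Scanning SAMPLE_HTML with only "src" configured finds the placeholder image alone; adding "data-original" to the list is what brings the two background images into the results.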
Custom Headers
Request headers can be used to direct the server how to respond, for example to use a different language or enable specific compression options. Custom headers are also supported and could be used for various purposes, for example to supply an API key to access a resource. The project defines a single custom header, X-Transport-Version, to send with every request.
If you copy the website and open features/requestheaders.html, the custom header and value will be referenced in this page.
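For readers unfamiliar with custom request headers, the sketch below attaches one to a request using Python's standard urllib. The header value "1.0" and the URL are placeholders, since the value configured in the project is not stated here, and build_request is a hypothetical helper. Nothing is sent over the network.

```python
from urllib.request import Request

# The header name comes from the project; the value is a placeholder.
CUSTOM_HEADERS = {"X-Transport-Version": "1.0"}


def build_request(url: str, custom_headers: dict[str, str]) -> Request:
    """Create a request object with every custom header attached."""
    request = Request(url)
    for name, value in custom_headers.items():
        request.add_header(name, value)
    return request


request = build_request("https://example.com/", CUSTOM_HEADERS)
# urllib stores header names capitalised, e.g. "X-transport-version";
# HTTP header names are case-insensitive, so the server sees the same header.
stored_value = request.get_header("X-transport-version")
```

Any server-side page that echoes request headers, as the demonstration page does, would then list the custom header alongside the standard ones.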