The New Project Wizard is designed to bootstrap a copy project by asking a series of questions.

Opening the New Project Wizard

  • From the File menu, open the New sub-menu and then choose New Project Wizard

Selecting the URLs to copy

The What do you want to download? page allows you to select the source data you want to copy.

Enter the main address of the page to copy. You can also enter multiple URLs if required.

Important

WebCopy may not behave as expected if secondary URLs are located on a different domain to the primary address. This will be corrected in a future update to the software.

When you attempt to move to the next page in the Wizard, WebCopy will examine the URL you have entered. If it includes a path below the root, for example https://demo.cyotek.com/features, the Wizard will prompt you to enable crawling above the root. If this option is not set, WebCopy will ignore any URLs outside the entered path, for example https://demo.cyotek.com/html.
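To illustrate the path comparison, the following Python sketch checks whether a discovered URL sits at or beneath the entered path. The function is invented for this example and is not WebCopy's actual implementation.

    from urllib.parse import urlparse

    def is_within_base(url: str, base_url: str) -> bool:
        # True if url is at or beneath the path of base_url on the same host.
        base = urlparse(base_url)
        target = urlparse(url)
        if target.netloc.lower() != base.netloc.lower():
            return False  # different host entirely
        base_path = base.path.rstrip("/") + "/"
        target_path = target.path.rstrip("/") + "/"
        return target_path.startswith(base_path)

    # With https://demo.cyotek.com/features as the entered URL:
    print(is_within_base("https://demo.cyotek.com/features/html",
                         "https://demo.cyotek.com/features"))  # True
    print(is_within_base("https://demo.cyotek.com/html",
                         "https://demo.cyotek.com/features"))  # False, outside the root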

Choosing local file options

The Where do you want to download to? page allows you to choose the folder where downloaded files will be saved, and how WebCopy processes the files and folders within.

In the Save folder field, enter or select the local path where you want the copied files to be stored. If the Create folder for domain option is checked, copied files will be stored in a sub-folder named after the primary domain.

Checking the Flatten website folder option will cause all downloaded files to be placed in the same folder. Normally WebCopy will create sub-folders that mirror the structure of the website being copied.
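To make the effect of these options concrete, here is a minimal Python sketch of how a remote URL could map to a local path. The helper is hypothetical and ignores details such as query strings, file-name collisions and remapped extensions that WebCopy also has to handle.

    from pathlib import Path
    from urllib.parse import urlparse

    def local_path(url: str, save_folder: str,
                   create_domain_folder: bool = True,
                   flatten: bool = False) -> Path:
        parsed = urlparse(url)
        root = Path(save_folder)
        if create_domain_folder:
            root = root / parsed.netloc           # e.g. ...\demo.cyotek.com
        relative = Path(parsed.path.lstrip("/"))  # e.g. features/html/index.html
        if flatten:
            return root / relative.name           # everything in one folder
        return root / relative                    # mirror the site structure

    url = "https://demo.cyotek.com/features/html/index.html"
    print(local_path(url, "C:/Downloads"))                # mirrored structure
    print(local_path(url, "C:/Downloads", flatten=True))  # flattened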

If the Empty website folder before copy option is checked, any files located in the save folder will be deleted prior to copying. Setting this option is not recommended if you want to copy only new and changed files in subsequent scans.

You can optionally tell WebCopy to put deleted files in the Recycle Bin by checking the Use Recycle Bin option. If this option is unchecked, files will be permanently erased.

Important

The WebCopy GUI application will prompt to continue if the Empty website folder before copy option is set and files are present in the destination folder. The console CLI client will not prompt.
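The behaviour described above can be pictured with the following Python sketch. It is not WebCopy's code; the Recycle Bin step uses the third-party send2trash package purely as a stand-in.

    import shutil
    from pathlib import Path

    def empty_website_folder(folder: str, use_recycle_bin: bool = True) -> None:
        for entry in Path(folder).iterdir():
            if use_recycle_bin:
                from send2trash import send2trash  # assumed dependency
                send2trash(str(entry))             # recoverable deletion
            elif entry.is_dir():
                shutil.rmtree(entry)               # permanent, no undo
            else:
                entry.unlink()                     # permanent, no undo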

Choosing how to copy the website

The How do you want to crawl the website? page allows you to choose which domains WebCopy will crawl.

  • Site Only: only crawls URLs that match the host name specified in the crawl URL
  • Sub domains: includes any sub domains of the host URL
  • Sibling domains: includes both sub domains and sibling domains of the host URL
  • Everything: crawls any discovered HTTP or HTTPS URL unless excluded via other settings

Regardless of the setting above, if the Download all resources option is checked then WebCopy will still query resources on other domains and download any non-HTML content, unless the URL is excluded by custom rules.

The Everything option is not recommended; it should be used only on sites that are self-contained, or where rules are used to explicitly exclude addresses. Using this option may cause WebCopy to become unstable.
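The four crawl modes can be sketched as a simple classification function. This is an approximation written for illustration; in particular, the registered domain used for sibling matching is naively taken as the last two host labels, which real public-suffix handling does not do.

    from urllib.parse import urlparse

    def in_scope(url: str, crawl_url: str, mode: str) -> bool:
        host = urlparse(url).hostname or ""
        base = urlparse(crawl_url).hostname or ""
        if mode == "everything":
            return urlparse(url).scheme in ("http", "https")
        if mode == "site_only":
            return host == base
        base_root = ".".join(base.split(".")[-2:])  # crude registered domain
        if mode == "sub_domains":
            return host == base or host.endswith("." + base)
        if mode == "sibling_domains":
            return host == base_root or host.endswith("." + base_root)
        raise ValueError(mode)

    crawl = "https://demo.cyotek.com/"
    print(in_scope("https://demo.cyotek.com/page", crawl, "site_only"))           # True
    print(in_scope("https://static.demo.cyotek.com/a.css", crawl, "sub_domains")) # True
    print(in_scope("https://www.cyotek.com/", crawl, "sibling_domains"))          # True
    print(in_scope("https://example.com/", crawl, "site_only"))                   # False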

The Limit crawl depth option controls how deep a crawl is performed, based on the number of path segments in each crawled URL.

The Limit distance from root URL option can be used to copy only files that are within a given number of jumps from the start page. This can be very useful for copying just the first level of links from a particular page without having to configure rules.

Important

This feature is currently experimental and not feature complete.
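The difference between the two limits can be pictured as follows: depth is a static property of each URL, while distance is tracked during the crawl itself. Both functions are illustrative sketches, and links_from is a hypothetical callable standing in for the crawler's link extraction.

    from collections import deque
    from urllib.parse import urlparse

    def crawl_depth(url: str) -> int:
        # Depth: number of path segments, so .../features/html has depth 2.
        return len([s for s in urlparse(url).path.split("/") if s])

    def pages_within_distance(start, links_from, max_distance):
        # Distance: link hops from the start page, via a breadth-first walk.
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            url, dist = queue.popleft()
            yield url, dist
            if dist < max_distance:
                for link in links_from(url):
                    if link not in seen:
                        seen.add(link)
                        queue.append((link, dist + 1))

    site = {"start": ["a", "b"], "a": ["c"], "b": [], "c": []}
    for page, dist in pages_within_distance("start", site.__getitem__, 1):
        print(page, dist)  # start 0, a 1, b 1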

Choosing the type of files to download

The What files do you want to download? page allows you to fine tune what type of content will be downloaded. You can either choose to download all files (including HTML documents), or groups of files based on their type, such as images, video or documents.

Note

Anything you select here will be converted into a series of rules; you can then adapt these rules to cover any additional content types you require.
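For example, a file-type group such as images might expand into a pattern-based rule along these lines; the extension list is an example rather than WebCopy's built-in grouping.

    import re

    IMAGE_EXTENSIONS = ["png", "jpg", "jpeg", "gif", "webp", "svg"]

    # One rule covering the whole group; add extensions to widen it.
    image_rule = re.compile(r"\.(%s)$" % "|".join(IMAGE_EXTENSIONS), re.IGNORECASE)

    print(bool(image_rule.search("/assets/logo.PNG")))    # True
    print(bool(image_rule.search("/downloads/file.pdf"))) # False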

Excluding content

The What do you want to exclude? page allows you to enter paths or document names you want to exclude from the crawl.
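As a minimal illustration, exclusion can be thought of as matching each discovered path against the entries you provide. The sketch below assumes simple substring matching; WebCopy's actual rules are more expressive than this.

    EXCLUDED = ["/private/", "login.html"]

    def is_excluded(path: str) -> bool:
        # A URL is skipped if any excluded fragment appears in its path.
        return any(fragment in path for fragment in EXCLUDED)

    print(is_excluded("/private/report.html"))  # True
    print(is_excluded("/public/index.html"))    # False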

Scanning JavaScript pages

If you need to copy websites whose content is generated via JavaScript, you can configure this behaviour from the Do you need to support JavaScript enabled pages? page.

Important

Enabling JavaScript support is experimental functionality and may not work as expected. Please review the overview topic for details and limitations of this feature.

Reviewing settings

The final page of the Wizard displays a summary of your chosen options. Click Finish to generate a new project.
