The New Project Wizard is designed to bootstrap a copy project by asking a series of questions.
Opening the New Project Wizard
- From the File menu, open the New submenu and then choose New Project
Selecting the URLs to copy
The What do you want to download? page allows you to select the source data you want to copy.
Enter the main address of the page to copy. You can also enter multiple URLs if required.
WebCopy may not behave as expected if secondary URLs are located on a different domain to the primary address. This will be corrected in a future update to the software.
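One way to spot this situation ahead of time is to compare the host name of each secondary URL against the primary address. The sketch below is an assumption about how such a check could look (plain host comparison, no public-suffix handling), not WebCopy's actual behaviour:

```python
from urllib.parse import urlsplit

def same_host(primary: str, secondary: str) -> bool:
    """Return True if both URLs share the same host name.

    Illustrative sketch only: compares host names directly, without
    public-suffix or sub-domain awareness.
    """
    return urlsplit(primary).hostname == urlsplit(secondary).hostname

# Flag any secondary URL on a different domain before starting a copy.
primary = "https://demo.cyotek.com/"
secondary = ["https://demo.cyotek.com/features", "https://example.com/page"]
outside = [u for u in secondary if not same_host(primary, u)]
```

Running a check like this before starting a copy makes it obvious which secondary URLs fall outside the primary domain.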
When you attempt to move to the next page in the Wizard, WebCopy will examine the URL you have entered. If it includes a nested path, for example
https://demo.cyotek.com/features, the Wizard will prompt you to enable crawling above the root. If this option is not set, WebCopy will ignore any URLs above the entered path.
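The "above the root" idea can be illustrated with a path-prefix comparison. This is a sketch of the concept only (same-host URLs, ignoring query strings), and is an assumption rather than WebCopy's implementation:

```python
from urllib.parse import urlsplit

def is_above_root(root_url: str, candidate: str) -> bool:
    """True if candidate sits above the entered root path.

    Sketch only: compares path prefixes for URLs on the same host;
    URLs on other hosts are handled by the crawl-mode settings instead.
    """
    root, cand = urlsplit(root_url), urlsplit(candidate)
    if root.hostname != cand.hostname:
        return False
    root_path = root.path.rstrip("/") + "/"
    return not (cand.path + "/").startswith(root_path)
```

With a root of https://demo.cyotek.com/features, the site root https://demo.cyotek.com/ is "above" the entered path, while https://demo.cyotek.com/features/page.html is within it.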
Choosing local file options
The Where do you want to download to? page allows you to choose the folder where downloaded files will be saved, and how WebCopy processes the files and folders within.
In the Save folder field, enter or select the local path where you want the copied files to be stored. If the Create folder for domain option is checked, copied files will be stored in a sub-folder named after the primary domain.
Checking the Flatten website folder option will cause all downloaded files to be placed in the same folder. Normally WebCopy will create sub-folders that mirror the structure of the website being copied.
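The combined effect of the Create folder for domain and Flatten website folder options can be sketched as a mapping from remote URL to local path. The function below is illustrative only; the names and exact behaviour are assumptions, not WebCopy's actual implementation:

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

def local_path(url: str, save_folder: str,
               create_domain_folder: bool, flatten: bool) -> str:
    """Sketch of how a remote URL could map onto a local save path."""
    parts = urlsplit(url)
    base = PurePosixPath(save_folder)
    if create_domain_folder:
        base = base / parts.hostname           # e.g. demo.cyotek.com
    name = PurePosixPath(parts.path).name or "index.html"
    if flatten:
        return str(base / name)                # everything in one folder
    return str(base / parts.path.lstrip("/"))  # mirror site structure
```

For example, with both options available, https://demo.cyotek.com/features/page.html maps to out/demo.cyotek.com/features/page.html normally, or out/demo.cyotek.com/page.html when flattened.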
If the Empty website folder before copy option is checked, any files located in the save folder will be deleted prior to the copy. Setting this option is not recommended if you want to copy only new and changed files in subsequent scans. You can optionally tell WebCopy to put deleted files in the Recycle Bin by checking the Use Recycle Bin option. If this option is unchecked, files will be permanently erased.
The WebCopy GUI application will prompt to continue if the Empty website folder before copy option is set and files are present in the destination folder. The console CLI client will not prompt.
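The difference between the two delete behaviours can be sketched as follows. This is a conceptual illustration only; the assumption here is that recycling is delegated to the third-party send2trash package, which is not part of WebCopy:

```python
import os
import shutil

def empty_folder(path: str, use_recycle_bin: bool) -> None:
    """Sketch of the 'Empty website folder before copy' behaviour.

    Assumption: recoverable deletes use the third-party send2trash
    package; otherwise entries are permanently erased.
    """
    for entry in os.listdir(path):
        full = os.path.join(path, entry)
        if use_recycle_bin:
            from send2trash import send2trash  # pip install send2trash
            send2trash(full)                   # recoverable delete
        elif os.path.isdir(full):
            shutil.rmtree(full)                # permanent delete
        else:
            os.remove(full)
```

The important distinction is that only the Recycle Bin route leaves the files recoverable.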
Choosing how to copy the website
The How do you want to crawl the website? page allows you to choose the scope of the crawl.
| Mode | Description |
| ---- | ----------- |
| Site Only | Only crawls URLs that match the host name specified in the crawl URL |
| Sub domains | Includes any sub domains of the host URL |
| Sibling domains | Includes both sub domains and sibling domains of the host URL |
| Everything | Crawls any discovered HTTP or HTTPS URL unless excluded via other settings |
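The four modes in the table above amount to progressively wider host-name checks. The sketch below illustrates the idea with plain string comparison; real crawlers also need exclusion rules and public-suffix awareness, so treat this as an assumption, not WebCopy's logic:

```python
def in_scope(mode: str, root_host: str, host: str) -> bool:
    """Sketch of the four crawl modes as host-name checks."""
    if mode == "everything":
        return True                            # any HTTP/HTTPS host
    if mode == "site_only":
        return host == root_host               # exact host match
    if mode == "sub_domains":
        return host == root_host or host.endswith("." + root_host)
    if mode == "sibling_domains":
        parent = root_host.split(".", 1)[1]    # e.g. cyotek.com
        return host == parent or host.endswith("." + parent)
    raise ValueError(mode)
```

For a root of demo.cyotek.com, Sub domains admits a.demo.cyotek.com, while Sibling domains also admits forums.cyotek.com.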
If the Download all resources option is checked, any non-HTML content will be downloaded even if it is found on a different domain, unless the content is excluded by custom rules.
The Limit crawl depth option can be used to control how deep a crawl is performed, based on the number of path segments in each crawled URL.
The Limit distance from root URL option can be used to only copy files that are within a given number of jumps from the start page. This can be very useful for copying only the first level of links from a particular page without having to configure rules.
This feature is currently experimental and not feature complete.
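The two limits measure different things: depth counts path segments in a URL, while distance counts link jumps from the start page. The sketch below illustrates both measures under those assumptions; it is not WebCopy's implementation:

```python
from collections import deque
from urllib.parse import urlsplit

def crawl_depth(url: str) -> int:
    """Number of path segments, e.g. /features/page.html -> 2."""
    return len([s for s in urlsplit(url).path.split("/") if s])

def distance_from_root(links: dict, root: str, url: str) -> int:
    """Fewest link jumps from the start page to url, via breadth-first
    search over a {page: [linked pages]} map; -1 if unreachable.
    """
    seen, queue = {root}, deque([(root, 0)])
    while queue:
        page, dist = queue.popleft()
        if page == url:
            return dist
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1
```

Limiting distance to 1 keeps only pages linked directly from the start page, which matches the "first level of links" use case described above.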
Choosing the type of files to download
The What files do you want to download? page allows you to fine tune what type of content will be downloaded. You can either choose to download all files (including HTML documents), or groups of files based on their type, such as images, video or documents.
Anything you select here will be converted into a series of rules; you can then adapt these rules to cover any additional content types you require.
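The "selection becomes editable rules" step can be pictured as building one include pattern per chosen group. The extension groups below are assumptions for illustration; the wizard's actual lists and rule syntax may differ:

```python
import re

# Assumed extension groups -- illustrative, not the wizard's actual lists.
GROUPS = {
    "images": [".png", ".jpg", ".gif", ".svg"],
    "documents": [".pdf", ".docx"],
}

def rules_for(groups):
    """Build one case-insensitive include pattern per selected group."""
    return [re.compile(r"\.(" + "|".join(e.lstrip(".") for e in GROUPS[g]) + r")$", re.I)
            for g in groups]

rules = rules_for(["images"])
# The resulting rule list can later be edited to cover more content types.
matches = any(r.search("/assets/logo.PNG") for r in rules)
```

Because the output is an ordinary rule list, extending it to a new content type is just a matter of adding another pattern.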
Choosing what to exclude
The What do you want to exclude? page allows you to enter paths or document names you want to exclude from the crawl.
Completing the Wizard
The final page of the Wizard displays a summary of your chosen options. Click Finish to generate a new project.
- Configuring the crawler
- Specifying the web site to copy
- Crawling additional root URLs
- Configuring the output location
- Including sub and sibling domains
- Limiting scans by depth
- Working with Rules
- Crawling above the root URL