WebCopy can create report files that identify the original location of each downloaded resource.
No origin reports
To disable origin report generation
- From the Project Properties dialogue, select the Advanced category
- Set the Origin report field to None
- Ensure the Add to source HTML option is not checked
Embedded origin reports
To embed the origin in the source document
- From the Project Properties dialogue, select the Advanced category
- Set the Origin report field to None. Alternatively, to create file-based reports in addition to embedding, select another option
- Check the Add to source HTML option
Note
Currently only HTML documents support embedded origins
Creating one origin report per URL
To create one origin report for each unique URL
- From the Project Properties dialogue, select the Advanced category
- Set the Origin report field to Create a single file for each URL
- Optionally, check the Add to source HTML option to include embedded origin reports where applicable, in addition to the file-based reports
When a file is downloaded, its origin will be written to a file with the same name as the local file, but with a .origin.txt suffix. For example, the origin report for index.html would be index.html.origin.txt.
Each report includes the remote URL, the fully qualified local file name, and the content type of the resource.
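If you want to work with these per-URL reports outside of WebCopy, the documented naming scheme makes them straightforward to locate. The following is a minimal sketch (not part of WebCopy) that pairs each downloaded file with its origin report and prints the report contents; the save folder path is an assumption, and no particular field layout inside the report is assumed.

```python
# Minimal sketch: list each downloaded file alongside its per-URL origin report.
# Relies only on the documented naming scheme: <local file name> + ".origin.txt".
from pathlib import Path

save_folder = Path(r"C:\Downloaded Web Sites\example")  # assumed save folder path

for report in save_folder.rglob("*.origin.txt"):
    # The local file is the report name with the ".origin.txt" suffix removed
    local_file = report.with_name(report.name[: -len(".origin.txt")])
    print(f"Local file : {local_file}")
    print(f"Report file: {report}")
    # Each report records the remote URL, the fully qualified local file name
    # and the content type; print it verbatim rather than assuming a layout.
    print(report.read_text(encoding="utf-8", errors="replace"))
    print("-" * 40)
```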
Creating one origin report for the project
To create a single origin report containing all details for the project
- From the Project Properties dialogue, select the Advanced category
- Set the Origin report field to Create a single report for the entire project
- Optionally, check the Add to source HTML option to include embedded origin reports where applicable, in addition to the file-based report
After the site has been downloaded, an origin report containing all processed URLs will be written to a file named webcopy-origin.txt, located in the save folder.
Each entry in the report includes the remote URL, the fully qualified local file name, and the content type of the resource.
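The project-wide report can also be inspected with a short script to get a quick overview of everything the crawl processed. The sketch below is illustrative only: the save folder path is an assumption, and rather than relying on a specific entry layout it simply extracts anything that looks like a remote URL from webcopy-origin.txt.

```python
# Minimal sketch: summarise the remote URLs recorded in the project-wide
# origin report. Only the documented file name and location are relied upon.
import re
from pathlib import Path

save_folder = Path(r"C:\Downloaded Web Sites\example")  # assumed save folder path
report = save_folder / "webcopy-origin.txt"

text = report.read_text(encoding="utf-8", errors="replace")

# Pull out anything that looks like a remote URL; this avoids assuming a
# specific field layout, only that each entry records the remote URL.
urls = re.findall(r"https?://\S+", text)
print(f"{len(urls)} remote URLs recorded in {report.name}")
for url in urls:
    print(url)
```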
See Also
Configuring the Crawler
Working with local files
- Extracting inline data
- Remapping extensions
- Remapping local files
- Updating local time stamps
- Using query string parameters in local filenames
Controlling the crawl
- Content types
- Crawling multiple URLs
- Crawling outside the base URL
- Downloading all resources
- Including additional domains
- Including sub and sibling domains
- Limiting downloads by file count
- Limiting downloads by size
- Limiting scans by depth
- Limiting scans by distance
- Scanning data attributes
- Setting speed limits
- Working with Rules
JavaScript
Security
- Crawling private areas
- Manually logging into a website
- TLS/SSL certificate options
- Working with Forms
- Working with Passwords
Modifying URLs
Creating a site map
Advanced
- Aborting the crawl using HTTP status codes
- Cookies
- Defining custom headers
- HEAD vs GET for preliminary requests
- HTTP Compression
- Redirects
- Saving link data in a Crawler Project
- Setting the web page language
- Specifying a User Agent
- Specifying accepted content types
- Using Keep-Alive