Note
Currently the only supported browser engine is Internet Explorer. Support for Chromium and Gecko will be added in a future update.
WebCopy now includes limited support for crawling websites that are constructed using JavaScript. An external browser engine is used to crawl the website, working alongside the existing WebCopy spider.
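For example, a page built this way may serve markup that contains no links at all and only create them with script. The snippet below is a minimal, hypothetical illustration (the element id, product names and URLs are invented); a crawler that only parses the served HTML finds no anchors, whereas a browser engine that executes the script does.

```typescript
// Hypothetical client-side script; the "#product-links" element and the
// product URLs are invented for this illustration. The raw HTML served by
// the site contains only an empty <ul id="product-links"></ul>.
const products: { name: string; href: string }[] = [
  { name: "Widgets", href: "/products/widgets.html" },
  { name: "Gadgets", href: "/products/gadgets.html" },
];

document.addEventListener("DOMContentLoaded", () => {
  const list = document.querySelector("#product-links");
  if (!list) {
    return;
  }

  // These anchors only exist after the script has run in a browser, so a
  // crawler that does not execute JavaScript never discovers them.
  for (const product of products) {
    const item = document.createElement("li");
    const anchor = document.createElement("a");
    anchor.href = product.href;
    anchor.textContent = product.name;
    item.appendChild(anchor);
    list.appendChild(item);
  }
});
```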
Important
This feature is currently experimental and is not feature complete:
- The only browser engine currently supported is Internet Explorer
- Website crawling may be substantially slower
- Custom user agents are not supported
- Websites will be able to track you and will use system cookies
- Malicious scripts served by compromised servers or rogue adverts will execute
- The content that WebCopy downloads may differ if the website uses browser sniffing to reduce functionality for Internet Explorer users (see the sketch after this list)
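The sketch below illustrates the browser sniffing point. It is a hypothetical Node.js handler (the route, markup and user-agent check are invented); a site using this pattern serves Internet Explorer, and therefore WebCopy's current browser engine, different content than it serves other browsers.

```typescript
import * as http from "http";

// Hypothetical server: the markup and the cut-down "basic" page are
// invented for this illustration of user-agent based browser sniffing.
http.createServer((request, response) => {
  const userAgent = request.headers["user-agent"] ?? "";
  // "MSIE" matches older Internet Explorer versions, "Trident/" matches IE 11.
  const isInternetExplorer = /MSIE|Trident\//.test(userAgent);

  response.writeHead(200, { "Content-Type": "text/html" });

  if (isInternetExplorer) {
    // Reduced page for Internet Explorer; this is the version a crawl
    // using the Internet Explorer engine would receive and save.
    response.end("<html><body><p>Basic version of the page</p></body></html>");
  } else {
    // Full page, including the scripts that build the rest of the content.
    response.end('<html><body><div id="app"></div><script src="/app.js"></script></body></html>');
  }
}).listen(8080);
```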
Unsupported features
Interactive actions are not supported, for example clicking nodes in a dynamic tree or scrolling a page that uses infinite scroll. There are no plans to add support for dynamically performing user interactions.
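As a hypothetical illustration of why such pages cannot be fully captured, the snippet below loads more items only when the user scrolls near the bottom of the page (the endpoint and element id are invented); without that interaction, the additional links are never created and so cannot be crawled.

```typescript
// Hypothetical infinite-scroll script; the "/api/items" endpoint and the
// "#items" container are invented for this illustration.
let nextPage = 2;

window.addEventListener("scroll", async () => {
  const nearBottom =
    window.innerHeight + window.scrollY >= document.body.scrollHeight - 200;
  if (!nearBottom) {
    return;
  }

  // Each page of items, and the links it contains, is only fetched in
  // response to the user scrolling, so a non-interactive crawl never sees it.
  const response = await fetch(`/api/items?page=${nextPage}`);
  const items: { title: string; url: string }[] = await response.json();
  const container = document.querySelector("#items");

  for (const item of items) {
    const anchor = document.createElement("a");
    anchor.href = item.url;
    anchor.textContent = item.title;
    container?.appendChild(anchor);
  }

  nextPage += 1;
});
```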
Note that this does not provide privileged access to a web server; it is still not possible to download the raw source code of a website or its back-end databases unless the website specifically allows it.
Future feature support
As noted, this is currently experimental. Future versions of WebCopy will include options for choosing between specific versions of Chromium or Gecko.
Enabling JavaScript support
To crawl websites with support for JavaScript execution enabled:
- From the Project Properties dialogue, select the Web Browser option group.
- Check the Use web browser option.
See Also
Configuring the Crawler
Working with local files
- Extracting inline data
- Remapping extensions
- Remapping local files
- Updating local time stamps
- Using query string parameters in local filenames
Controlling the crawl
- Content types
- Crawling above the root URL
- Crawling additional hosts
- Crawling additional root URLs
- Downloading all resources
- Including sub and sibling domains
- Limiting downloads by file count
- Limiting downloads by size
- Limiting scans by depth
- Limiting scans by distance
- Scanning data attributes
- Setting speed limits
- Working with Rules
Security
Modifying URLs
Creating a site map
Advanced
- Aborting the crawl using HTTP status codes
- Defining custom headers
- Following redirects
- HEAD vs GET for preliminary requests
- HTTP Compression
- Origin reports
- Saving link data in a Crawler Project
- Setting cookies
- Setting the web page language
- Specifying a User Agent
- Specifying accepted content types
- Using Keep-Alive