The HTTP protocol supports a number of "verbs" for issuing commands to a remote server. The most common is GET, which requests a representation of the specified resource. A less common verb is HEAD, which is similar to GET but returns only the metadata about the resource, not the resource itself. If supported by the remote host, this can reduce load by avoiding the download of resources that would subsequently go unused.
WebCopy uses this metadata (such as the content type) to determine whether a resource should be fully downloaded or skipped, and so attempts to use HEAD requests by default. The HTTP specification states that if a server does not support HEAD, it should return status 405 (Method Not Allowed), but some servers return a misleading code such as 404 (Not Found) or 401 (Unauthorised).
When a new host is encountered during a crawl and header checking is enabled, WebCopy tests the host by attempting to request the root document via HEAD. If this succeeds, HEAD requests are enabled for that host; if it fails, HEAD requests are automatically disabled for that domain.
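The per-host detection can be pictured with the following simplified sketch (assumed behaviour for illustration only, not WebCopy's implementation); a real crawler would also handle redirects, timeouts and other failure modes:

```python
import urllib.request
from urllib.error import HTTPError, URLError
from urllib.parse import urlsplit

head_supported = {}  # host -> bool, filled in the first time a host is seen

def verb_for(url):
    """Decide whether HEAD or GET should be used for a URL, probing the
    host's root document the first time that host is encountered."""
    parts = urlsplit(url)
    host = parts.netloc
    if host not in head_supported:
        probe = urllib.request.Request(f"{parts.scheme}://{host}/", method="HEAD")
        try:
            with urllib.request.urlopen(probe, timeout=10):
                head_supported[host] = True
        except (HTTPError, URLError):
            # Any failure (a correct 405, or a misleading 404/401) disables
            # HEAD for this host for the remainder of the crawl.
            head_supported[host] = False
    return "HEAD" if head_supported[host] else "GET"

print(verb_for("https://example.com/page.html"))
```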
Unfortunately, some web servers support HEAD only in piecemeal fashion. In one support case it was discovered that HEAD worked correctly for requests that returned HTML, but requests that returned images responded with 404.
WebCopy allows you to disable header checking at the project level. If header checking is disabled, all automatic detection is turned off and all requests to retrieve resources will use GET.
To enable or disable header checking
- From the Project Properties dialogue, select the Advanced category
- Check or uncheck the Use header checking option
See Also
Configuring the Crawler
Working with local files
- Extracting inline data
- Remapping extensions
- Remapping local files
- Updating local time stamps
- Using query string parameters in local filenames
Controlling the crawl
- Content types
- Crawling multiple URLs
- Crawling outside the base URL
- Downloading all resources
- Including additional domains
- Including sub and sibling domains
- Limiting downloads by file count
- Limiting downloads by size
- Limiting scans by depth
- Limiting scans by distance
- Scanning data attributes
- Setting speed limits
- Working with Rules
JavaScript
Security
- Crawling private areas
- Manually logging into a website
- TLS/SSL certificate options
- Working with Forms
- Working with Passwords
Modifying URLs
Creating a site map
Advanced
- Aborting the crawl using HTTP status codes
- Cookies
- Defining custom headers
- HTTP Compression
- Origin reports
- Redirects
- Saving link data in a Crawler Project
- Setting the web page language
- Specifying a User Agent
- Specifying accepted content types
- Using Keep-Alive