The HTTP protocol supports a number of "verbs" for issuing commands to a remote server. The most common is GET, which requests a representation of the specified resource. A less common verb is HEAD, which is similar to GET but returns only the metadata about the resource, not the resource itself. If supported by the remote host, this can reduce load, as a resource that will not be used is never transferred in full.
WebCopy uses the metadata (such as content type) to determine whether a resource should be fully downloaded or skipped, and so attempts to use HEAD requests by default. The HTTP specification states that if a server does not support HEAD, it should return status 405 (Method Not Allowed), but some servers return a misleading code such as 404 (Not Found) or 401 (Unauthorized).
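As a rough illustration of the idea, and not WebCopy's actual implementation, the sketch below issues a HEAD request with Python's standard library and decides from the Content-Type header whether a full GET would be worthwhile. The function names and the list of wanted types are hypothetical.

```python
import urllib.error
import urllib.request

# Hypothetical policy: only fetch these media types in full.
WANTED_TYPES = ("text/html", "text/css", "image/")


def wanted(content_type):
    """Return True if the media type matches one of the wanted prefixes."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith(WANTED_TYPES)


def head_content_type(url, timeout=10):
    """Issue a HEAD request and return the Content-Type, or None on an HTTP error."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.headers.get("Content-Type", "")
    except urllib.error.HTTPError:
        # A 405, or a misleading 404 or 401, ends up here; the caller may retry with GET.
        return None
```

A caller would pass the result of head_content_type to wanted, and fall back to a plain GET whenever None comes back.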
When a new host is encountered during a crawl and HEAD checking is enabled, WebCopy tests the host by attempting to request the root document via HEAD. If this succeeds, HEAD requests are enabled for the host. If it fails, WebCopy automatically disables HEAD requests for that host.
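The per-host detection described above could be sketched along these lines. This is a minimal illustration under stated assumptions, not WebCopy's own code: the class name, the injected probe callable, and the choice to treat only 2xx responses as success are all hypothetical.

```python
from urllib.parse import urlsplit


class HeadSupport:
    """Remembers, per host, whether HEAD requests appear to work."""

    def __init__(self, probe):
        # probe(url) -> HTTP status code; a hypothetical callable supplied by the caller.
        self._probe = probe
        self._hosts = {}

    def method_for(self, url):
        """Return "HEAD" or "GET" for the given URL, probing each new host once."""
        parts = urlsplit(url)
        host = parts.netloc.lower()
        if host not in self._hosts:
            # First time this host is seen: test HEAD against the root document.
            status = self._probe(f"{parts.scheme}://{parts.netloc}/")
            # Assumption: only a 2xx status counts as a successful probe.
            self._hosts[host] = 200 <= status < 300
        return "HEAD" if self._hosts[host] else "GET"
```

Injecting the probe keeps the caching logic separate from the network call, so the decision can be exercised without contacting a real server.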
Unfortunately, some web servers support HEAD only piecemeal. In one support request, it was discovered that HEAD worked for requests that returned HTML, but requests for images returned a 404.
WebCopy allows you to disable HEAD checking at the project level. If this option is set, all automatic detection is disabled and all requests to retrieve resources will use GET.