HTTP supports a number of "verbs" (request methods) for issuing commands to a remote server. The most common is GET, which requests a representation of the specified resource. A less common one is HEAD, which is similar to GET but returns only the metadata about the resource (the response headers), not the resource itself. If the remote host supports it, this can reduce load by avoiding the download of an entire resource that is subsequently not used.

WebCopy uses this metadata (such as content type) to determine whether a resource should be fully downloaded or skipped, and so attempts to use HEAD requests by default. The HTTP specification states that if a server does not support HEAD, it should return status 405 (Method Not Allowed), but some servers return a misleading code such as 404 (Not Found) or 401 (Unauthorized).
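As a rough illustration of the idea (not WebCopy's actual implementation), the sketch below issues a HEAD request with Python's standard http.client module and then decides from the Content-Type header whether a full download is worthwhile. The function names head_metadata and should_download, and the list of wanted content types, are assumptions made for this example only.

```python
import http.client


def head_metadata(host, path="/", timeout=10.0):
    """Send a HEAD request and return (status, content_type).

    Only the response headers are transferred; a HEAD response has no body.
    """
    conn = http.client.HTTPSConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        response.read()  # returns b"" for HEAD, finishes the response
        return response.status, response.getheader("Content-Type", "")
    finally:
        conn.close()


def should_download(content_type, wanted=("text/html", "image/")):
    """Return True if the media type matches one of the wanted prefixes.

    The 'wanted' prefixes are hypothetical; a real crawler would take
    them from its own configuration.
    """
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return any(media_type.startswith(prefix) for prefix in wanted)
```

A caller would typically run head_metadata first and only issue a GET when should_download returns True for the reported content type.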

When a new host is encountered during a crawl, and header checking is enabled, WebCopy tests the host by attempting to request the root document via HEAD. If this is successful, HEAD requests will be enabled for the host. If not, WebCopy automatically disables HEAD requests for that host.
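The per-host test described above might be sketched as follows. This is a minimal illustration, not WebCopy's code: the class name HeadSupportCache, the injected send_head callable, and the rule that any 2xx or 3xx status counts as "successful" are all assumptions.

```python
from urllib.parse import urlsplit


class HeadSupportCache:
    """Remembers, per host, whether a HEAD request for the root document worked."""

    def __init__(self, send_head):
        # send_head(url) -> int status code. It is injected so the sketch
        # can be exercised without real network access (hypothetical helper).
        self._send_head = send_head
        self._supported = {}

    def supports_head(self, url):
        parts = urlsplit(url)
        host = parts.netloc.lower()
        if host not in self._supported:
            # First time this host is seen: probe its root document once.
            root_url = f"{parts.scheme}://{parts.netloc}/"
            try:
                status = self._send_head(root_url)
            except OSError:
                status = None  # treat connection failures as "not supported"
            # Assumption: 2xx and 3xx responses count as success.
            self._supported[host] = status is not None and 200 <= status < 400
        return self._supported[host]
```

Because the result is cached by host, each host is probed only once per crawl, however many of its URLs are visited.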

Unfortunately, some web servers support HEAD only in piecemeal fashion. In one support request, it was discovered that HEAD worked correctly for requests that returned HTML, but the server returned 404 for requests for images.

WebCopy allows you to disable header checking at the project level. If header checking is disabled, all automatic detection is also disabled and all requests to retrieve resources will use GET.
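The effect of the project-level option can be pictured with the small sketch below. The function choose_method and its parameters are hypothetical names for this example; it simply shows that when header checking is off, no host detection runs and GET is always used.

```python
def choose_method(use_header_checking, probe_host):
    """Pick the HTTP method for the first request to a resource.

    use_header_checking: the project-level setting (assumed boolean).
    probe_host: zero-argument callable returning True if the host
                accepted a HEAD request (hypothetical helper).
    """
    if not use_header_checking:
        # Option disabled: skip automatic detection and always use GET.
        return "GET"
    # Option enabled: use HEAD only if the host was found to support it.
    return "HEAD" if probe_host() else "GET"
```

Passing the probe as a callable means it is never invoked when the option is disabled, which mirrors the statement that automatic detection is switched off as well.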

To enable or disable header checking

  1. From the Project Properties dialogue, select the Advanced category
  2. Check or uncheck the Use header checking option

See Also

Configuring the Crawler

Working with local files

Controlling the crawl

JavaScript

Security

Modifying URLs

Creating a site map

Advanced

Deprecated features

© 2010-2024 Cyotek Ltd. All Rights Reserved.
Documentation version 1.10 (buildref #185.15779), last modified 2024-03-31. Generated 2024-03-31 14:04 using Cyotek HelpWrite Professional version 6.19.1