Sometimes you may wish to adjust a URL that Sitemap Creator has detected when crawling a page, before the URL is further processed. For example, you might want to convert simple JavaScript navigation links into URLs, or bypass a middle-man redirection page.
To configure the URL transforms
- From the Project Properties dialogue, expand the Advanced category and select the URL Transforms option
Adding a new transform
- Click the Add button
- In the Expression field, enter the search pattern using a regular expression
- Enter the text to replace the matched pattern with in the Replacement field
- Optionally, if you only wish the transform to be executed for specific pages, enter the URL into the URL Expression field. This field supports regular expressions.
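A transform is simply a regular-expression search and replace applied to each detected URL. The snippet below sketches the idea in Python's `re` module; the pattern, replacement, and URLs are illustrative only, and Sitemap Creator's own regex engine may use `$1`-style group references in the Replacement field rather than Python's `\1`.

```python
import re

# Hypothetical transform: strip a middle-man redirect page so the
# crawler records the real destination directly.
expression = r"https://example\.com/redirect\?url=(.*)"
replacement = r"\1"  # keep only the captured destination URL

url = "https://example.com/redirect?url=https://example.com/page.html"
transformed = re.sub(expression, replacement, url)
print(transformed)  # https://example.com/page.html
```

A URL that does not match the Expression is left unchanged, which is why an optional URL Expression can be used to restrict which pages a transform runs against at all.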
Important
If your expression includes any of the characters ^ [ . $ { * ( \ + ) | ? < > and you want them to be processed as plain text, you need to "escape" the character by preceding it with a backslash. For example, if your expression was application/epub+zip, this would need to be written as application/epub\+zip, otherwise the + character would have a special meaning and no matches would be made. Similarly, if the expression was example.com, this should be written as example\.com, as . means "any character", which could lead to unexpected matches.
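The effect of escaping can be seen with a quick check in any regex engine; the Python sketch below uses the two example expressions from above.

```python
import re

# Unescaped "." matches any character, so "example.com" also matches
# unrelated text such as "examplexcom".
print(bool(re.search("example.com", "examplexcom")))    # True  (unwanted match)
print(bool(re.search(r"example\.com", "examplexcom")))  # False (literal dot only)

# Escaping "+" lets the expression match the literal MIME type.
print(bool(re.search(r"application/epub\+zip", "application/epub+zip")))  # True
```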
Deleting a transform
- Select the transform you wish to remove from the list
- Click the Delete button
Updating a transform
- Select the transform you wish to edit from the list. The Expression, Replacement and URL Expression fields will be updated to match the selection
- Enter new values as appropriate. The selected item in the list will be updated with the changes you specify
Changing the order in which transforms are processed
- Select the transform you wish to move
- Click the Move Up and Move Down buttons to re-order the transform list
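Order matters because transforms run top to bottom and each one sees the output of the previous. The sketch below illustrates this with two hypothetical transforms; the patterns are examples, not anything built into Sitemap Creator.

```python
import re

# Transforms applied in list order; each receives the previous result.
transforms = [
    (r"^http://", "https://"),                              # upgrade the scheme first...
    (r"https://old\.example\.com", "https://example.com"),  # ...then retarget the domain
]

url = "http://old.example.com/page"
for expression, replacement in transforms:
    url = re.sub(expression, replacement, url)
print(url)  # https://example.com/page
```

If the two entries were swapped, the domain transform would never match (the URL still starts with http:// at that point), so only the scheme would change.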
See Also
Configuring the Crawler
Working with local files
- Extracting inline data
- Remapping extensions
- Remapping local files
- Updating local time stamps
- Using query string parameters in local filenames
Controlling the crawl
- Content types
- Crawling above the root URL
- Crawling additional root URLs
- Including additional domains
- Including sub and sibling domains
- Limiting downloads by file count
- Limiting downloads by size
- Limiting scans by depth
- Limiting scans by distance
- Scanning data attributes
- Setting speed limits
- Working with Rules
JavaScript
Security
- Crawling private areas
- Manually logging into a website
- TLS/SSL certificate options
- Working with Forms
- Working with Passwords
Modifying URLs
Advanced
- Aborting the crawl using HTTP status codes
- Cookies
- Defining custom headers
- Following redirects
- HEAD vs GET for preliminary requests
- HTTP Compression
- Modifying page titles
- Origin reports
- Overwriting read only files
- Saving link data in a Crawler Project
- Setting the web page language
- Specifying a User Agent
- Specifying accepted content types
- Using Keep-Alive