17.3.7. HTTP indexing and search

The HTTP indexing/search configuration file is an .ini file, located at C:\Documents and Settings\<username>\ap\httpconfig.ini. This file is generated during installation.

The .ini file follows the GLib key file format.

The file has the following structure:

[Text]
Whitelist=text/.*;.*json.*;
Blacklist=text/css;application/javascript;text/xslt;.*xml.*;

[Html]
Attributes=href;name;value;title;id;src;
StrippedTags=script;object;style;noscript;embed;video;audio;canvas;svg;

The elements of Whitelist and Blacklist are treated as regular expressions, while the elements of Attributes and StrippedTags are treated as plain strings. List elements are separated by the ";" character. If a regular expression contains ";", escape it as "\;".
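The separator and escaping rule can be illustrated with a short Python sketch. It is only an illustration of the rule described above, not the parser that the application uses; the function name split_list is hypothetical, and the sketch assumes that empty elements (such as the one after a trailing ";") are ignored.

```python
def split_list(value: str) -> list[str]:
    """Split a ';'-separated value, treating '\\;' as a literal semicolon."""
    items: list[str] = []
    current: list[str] = []
    i = 0
    while i < len(value):
        ch = value[i]
        # An escaped separator becomes part of the current element.
        if ch == "\\" and i + 1 < len(value) and value[i + 1] == ";":
            current.append(";")
            i += 2
            continue
        if ch == ";":
            # An unescaped separator closes the current element.
            items.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if current:
        items.append("".join(current))
    # Assumption: empty elements (for example after a trailing ';') are dropped.
    return [item for item in items if item]


print(split_list("text/.*;.*json.*;"))
print(split_list(r"text/plain\;charset=.*;.*json.*;"))
```

The second call shows a pattern that contains an escaped ";": it stays a single element instead of being split in two.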

The [Text] section contains general parameters that define which text is indexed/searched. The [Html] section controls how content with the text/html Content-Type is handled: certain parts of the HTML content, such as the tags themselves or even the entire content of certain tags, are stripped before indexing.

  • Whitelist: Regular expressions that are matched against the HTTP Content-Type. Content with a matching MIME type is included in the indexing/search.

  • Blacklist: Regular expressions that are matched against the HTTP Content-Type. Content with a matching MIME type is excluded from indexing/search.

  • Attributes: The values of the HTML attributes in this list are indexed/searched.

  • StrippedTags: The tags in this list and their content are not indexed/searched. However, if an attribute of such a tag is in the Attributes list, the attribute is still indexed/searched.
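The interaction of the four parameters can be sketched in Python as follows. This is a simplified model under stated assumptions, not the actual implementation: it assumes that each pattern must match the whole MIME type, that the Blacklist takes precedence over the Whitelist (which would explain why text/css is excluded although text/.* is whitelisted), and that Content-Type parameters such as charset are ignored. The names is_indexed, IndexableText and extract_indexable are hypothetical.

```python
import re
from html.parser import HTMLParser

WHITELIST = ["text/.*", ".*json.*"]
BLACKLIST = ["text/css", "application/javascript", "text/xslt", ".*xml.*"]
ATTRIBUTES = {"href", "name", "value", "title", "id", "src"}
STRIPPED_TAGS = {"script", "object", "style", "noscript", "embed",
                 "video", "audio", "canvas", "svg"}
VOID_TAGS = {"embed"}  # stripped tags that never have a closing tag


def is_indexed(content_type: str) -> bool:
    """Decide whether content of this Content-Type would be indexed."""
    # Assumption: parameters such as '; charset=utf-8' are ignored.
    mime = content_type.split(";", 1)[0].strip().lower()
    # Assumption: the Blacklist wins over the Whitelist.
    if any(re.fullmatch(pattern, mime) for pattern in BLACKLIST):
        return False
    return any(re.fullmatch(pattern, mime) for pattern in WHITELIST)


class IndexableText(HTMLParser):
    """Collect listed attribute values and text outside stripped tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: list[str] = []
        self._skip_depth = 0  # > 0 while inside a stripped tag

    def handle_starttag(self, tag, attrs):
        # Listed attributes are kept even on stripped tags.
        for name, value in attrs:
            if name in ATTRIBUTES and value:
                self.chunks.append(value)
        if tag in STRIPPED_TAGS and tag not in VOID_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in STRIPPED_TAGS and tag not in VOID_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text is kept only when no stripped tag is currently open.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_indexable(html: str) -> list[str]:
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return parser.chunks


page = ('<p title="greeting">Hello <a href="/docs" class="x">docs</a></p>'
        '<script src="app.js">var secret = 1;</script>')
if is_indexed("text/html; charset=utf-8"):
    print(extract_indexable(page))
```

In this sketch the class attribute and the body of the script tag are dropped, while the src attribute of the script tag is kept because src is in the Attributes list.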

After you save the changes to the .ini file, new jobs use the modified file and are indexed accordingly.

Note

If httpconfig.ini is missing or syntactically incorrect, the default .ini file is used instead. This is also reported in the log messages.
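The fallback behavior can be modeled with a short Python sketch. It uses Python's configparser module, which reads a similar but not identical dialect to the GLib key file format, so it is only an approximation. The function name load_http_config and the DEFAULT_INI constant are hypothetical, and the sketch assumes that the default configuration equals the structure shown earlier in this section.

```python
import configparser
import logging

# Assumption: the default configuration matches the structure shown above.
DEFAULT_INI = """
[Text]
Whitelist=text/.*;.*json.*;
Blacklist=text/css;application/javascript;text/xslt;.*xml.*;

[Html]
Attributes=href;name;value;title;id;src;
StrippedTags=script;object;style;noscript;embed;video;audio;canvas;svg;
"""


def _new_parser() -> configparser.ConfigParser:
    # Disable interpolation so '%' in patterns is taken literally,
    # and keep key names case-sensitive.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str
    return parser


def load_http_config(path: str) -> configparser.ConfigParser:
    """Read the .ini file, falling back to the defaults on any error."""
    parser = _new_parser()
    try:
        with open(path, encoding="utf-8") as handle:
            parser.read_file(handle)
    except (OSError, UnicodeDecodeError, configparser.Error) as exc:
        # Missing or syntactically incorrect file: log it and use defaults.
        logging.warning("Cannot use %s (%s); using default settings", path, exc)
        parser = _new_parser()
        parser.read_string(DEFAULT_INI)
    return parser


config = load_http_config("httpconfig.ini")
print(config["Text"]["Whitelist"])
```

A fresh parser is created in the error branch so that a partially read, invalid file cannot leak settings into the defaults.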

Note

Recursive indexing is not possible: if PSM has already indexed a trail, AP will not reindex it.

The following parameters can be tokenized during search: url, hostname and content-type from headers. Both urlencoded GET and POST data are tokenized as key=value tokens. In the HTML content, the following metadata are indexed from the meta tags: description, keywords, author and application.
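As an illustration of the key=value tokenization, the following Python sketch splits a request into tokens. It is not the tokenizer of the product: the function name tokenize_request is hypothetical, and the sketch assumes that percent-encoding and "+" are decoded before the tokens are built.

```python
from urllib.parse import parse_qsl, urlsplit


def tokenize_request(url: str, body: str = "") -> list[str]:
    """Return the URL, the hostname, and key=value tokens of a request."""
    parts = urlsplit(url)
    tokens = [url]
    if parts.hostname:
        tokens.append(parts.hostname)
    # urlencoded GET parameters from the query string
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        tokens.append(f"{key}={value}")
    # urlencoded POST parameters from the request body
    for key, value in parse_qsl(body, keep_blank_values=True):
        tokens.append(f"{key}={value}")
    return tokens


print(tokenize_request(
    "http://example.com/search?q=audit+trail&page=2",
    "user=alice&lang=en",
))
```

With these inputs, the GET parameters yield the tokens q=audit trail and page=2, and the POST body yields user=alice and lang=en.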