Parsing web server access logs

If you operate web servers, you want insight into your traffic. Traditional solutions for processing access logs include:

  • creating nightly reports with scripts and tools like AWStats,
  • running a JavaScript snippet on each page load, like Google Analytics,
  • or combining the two methods, like Piwik.

But if you want to use your logs in operations, you are better off using syslog-ng and message parsing, which gives you a lot more flexibility.

Access logs have a columnar data format, where a space acts as the delimiter between the fields of the log message. Each message carries the same information: the client address, the authenticated user, the time, and so on.

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Logs without parsing are not really useful in syslog-ng, since you can only forward or store them for subsequent processing. But if you parse your web server logs in real time instead of relying on daily or hourly reports, you can react to events as they happen.

The apache-accesslog-parser() of syslog-ng creates a new name-value pair for each field of the log message, and does some additional parsing to extract more information.

The apache-accesslog-parser()

When you have generic columnar logs (for example, a list of tab-separated or comma-separated values), you can parse them using the CSV parser in syslog-ng. For Apache access logs (or logs from any other web server that uses the Common or Combined Log Format), you can use the Apache Access Log Parser. It has been fine-tuned to handle access logs correctly, so use it instead of the generic parser to save yourself some time.
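
As a rough illustration of the generic approach, a CSV parser for comma-separated lines might look like the sketch below. The parser name p_csv and the column names are made up for this example, so adjust them to your own data.

parser p_csv {
  csv-parser(
    # hypothetical column names, in the order the fields appear in the message
    columns("csv.date", "csv.host", "csv.message")
    delimiters(",")
    # treat text between double quotes as a single field
    quote-pairs('""')
    flags(strip-whitespace)
  );
};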

Make sure that you are running at least syslog-ng version 3.8, and that the following line is included in your syslog-ng.conf:

@include "scl.conf"

(scl.conf refers to the syslog-ng configuration library, or SCL. You can read more about how SCL lets you reuse configuration blocks in the documentation.)

Using the apache-accesslog-parser()

Let's look at the following example. The optional prefix() parameter lets you configure the prefix used in front of the freshly created name-value pairs. By default it is ".apache.". The format-json template function replaces the leading dot with an underscore. You can change the prefix if you are forwarding logs to an application where fields beginning with an underscore have a special meaning, for example, Elasticsearch.

parser parser_name {
    apache-accesslog-parser(
        prefix("apache.")
    );
};
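
Once the parser has run, the fields are ordinary name-value pairs, so you can use them elsewhere in the configuration. As a minimal sketch, assuming the "apache." prefix from the example above and the response field created by the parser, a filter that matches only server errors could look like this (the filter name is made up):

filter f_server_error {
  # response codes starting with 5 (500, 502, 503, ...)
  match("^5[0-9][0-9]$" value("apache.response"));
};

In a log path, such a filter has to come after the parser, so that the field already exists when the filter is evaluated.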

Log sources

Traditionally, access logs reach syslog-ng through file sources. Logging to files is the default in both the Apache and Nginx web servers. The drawback of this solution is that log messages are stored twice: once by the web server and once by syslog-ng. You also need to rotate the log files. Fortunately, there are other methods that help you avoid this overhead.

Apache httpd supports writing log messages into a pipe, and syslog-ng can read from pipes. In this case, instead of using an intermediary file, Apache sends the logs directly to syslog-ng through the pipe.

Nginx can use the old BSD syslog protocol to send logs through a UDP connection. It is not state of the art, and can lead to message loss if your web server has high traffic. Still, it can simplify your logging infrastructure considerably.
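
A possible setup for this, assuming Nginx and syslog-ng run on the same host and that UDP port 514 on the loopback interface is free (the source name, address and port are choices made for this sketch):

# Nginx side, shown as a comment because it is not syslog-ng syntax:
#   access_log syslog:server=127.0.0.1:514 combined;
source s_nginx {
  network(
    ip("127.0.0.1")
    port(514)
    transport("udp")
  );
};

Because Nginx sends a regular syslog message in this case, the access log line should end up in the message part, and no special source flags are needed.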

Note that when you use a file or pipe source, the message arrives without a syslog header. This means that you have to use the flags(no-parse) option in the source, otherwise syslog-ng tries to interpret it as a syslog message and you will get unexpected results.

source s_access {
  file("/var/log/httpd/access_log" flags(no-parse));
};
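
The same flag applies when reading from a pipe. A sketch, assuming a named pipe at a hypothetical path that Apache has been configured to write its access log to:

source s_access_pipe {
  # hypothetical FIFO path; create it with mkfifo if needed and point Apache's CustomLog at it
  pipe("/var/run/httpd-access.pipe" flags(no-parse));
};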

Using virtual hosts

The method above works perfectly if you only have a single website. If you have multiple websites (virtual servers) that use the same web server, then there is a problem: the name of the virtual server is not included in the log message. You either need to define many log files both on the web server and in syslog-ng (if you are using syslog-ng Premium Edition, you can simply use wildcards in the source path), or you lose a critical piece of information: the name of the virtual host. Alternatively, you can define your own log format.

In the case of Apache httpd, add "%v" to the description of your log format to include the virtual host name in the logs. For details and other possibilities, check the Apache documentation about logging.

Obviously, if you have a new field in your log file, you also need to add it to the parser configuration. You can find the Apache parser in the SCL directory. On openSUSE, the file is /usr/share/syslog-ng/include/scl/apache/apache.conf, and the location should be similar on other distributions. Add the new field name to this part of the configuration, at the position that matches the field order of your Apache log format:

# field names match of that of Logstash
columns("clientip", "ident", "auth",
  "timestamp", "rawrequest", "response",
  "bytes", "referrer", "agent"));

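For instance, if you put %v at the very beginning of the Apache LogFormat, the column list could be extended as sketched below. The field name "vhost" is an arbitrary choice for this illustration, not a name defined by syslog-ng. Keep in mind that changes to files under the SCL directory may be overwritten by package updates.

columns("vhost", "clientip", "ident", "auth",
  "timestamp", "rawrequest", "response",
  "bytes", "referrer", "agent"));
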
Example configuration

Here is a complete example syslog-ng configuration. It reads the web server logs from a file, parses them with the apache-accesslog-parser() and sends the results to Elasticsearch. There is also a JSON file destination, commented out in the log path, which can be used for debugging.

# source: apache access log file
source s_access {
  file("/var/log/httpd/access_log" flags(no-parse));
};
 
# destination: elasticsearch server
destination d_elastic {
  elasticsearch2 (
    cluster("syslog-ng")
    client_mode("http")
    index("syslog-ng")
    type("test")
    template("$(format-json --scope rfc5424 --scope nv-pairs --exclude DATE --key ISODATE)")
  );
};
 
# destination: JSON format with same content as to Elasticsearch
destination d_json {
  file("/var/log/test.json"
    template("$(format-json --scope rfc5424 --scope nv-pairs --exclude DATE --key ISODATE)\n\n"));
};
 
# parser for apache access log
parser p_access {
  apache-accesslog-parser(
    prefix("apache.")
  );
};
 
# magic happens here: all building blocks connected together
log {
  source(s_access);
  parser(p_access);
  # destination(d_json);
  destination(d_elastic);
};

If you want to try this on your web server, install syslog-ng 3.8.1 or later. If it is not available in your distribution, you can download it from here.
