Channel: Jared Quinn

Sensible Logging with ErK (Elasticsearch, rsyslogd & Kibana)


System admins have long understood the importance of centralised logging and accurate timestamps across the environments they manage; ever since there have been networks we’ve been shipping logfiles around and collating them together to better understand our environments.  But times are changing, and there’s an emerging trend towards structured, searchable logging.  In this article I’ll discuss some of my experiences with Elastic’s ELK stack (Elasticsearch, Logstash and Kibana) for providing structured log indexing and searching, while using the standard Linux rsyslogd for transporting our log data.

[Image: an elk – by Leupold Jim, U.S. Fish and Wildlife Service. Public Domain.]

As more people need to slice and dice data in new and different ways, we need a way of centrally logging complex environments that is better suited to these requirements than the traditional progressively appended text logfile.  Furthermore, we want to be able to give that power to anybody who needs it without teaching them the arcane invocations of awk, grep and perl.  Enter the ELK stack, which is made up of several components:

elasticsearch – the auto-clustering, efficient, searchable data-object indexer based on the Apache Lucene engine.  I’ve used Lucene in various forms for the last 10 years and it does a great job of efficient free-text searching – tokenisation of text data and indexing of words, along with more advanced fuzzy searching – but elasticsearch takes that engine to a whole new level.

logstash – a log store and shipping service for moving log files around – I’ll discuss this in more detail later.

kibana – the visualisation and query tool for Elasticsearch.

We are, however, going to drop logstash from discussion in this article and instead focus on a tool found on every Debian box that every Linux sysadmin will be comfortable with in its stead – rsyslogd.

Background

In this article I hope to demonstrate some of the power of elasticsearch and its visualisation tool kibana, and provide some guidance I believe will help you get the most out of this solution.

Before you jump in the deep end, it’s important to understand the key difference between using Elasticsearch and plain text files for logging.  Elasticsearch is best suited to indexing structured data, while log files tend to be unstructured – a stream of data from applications, the system and anything else that can produce a record of what it’s currently up to.  Structured data is best carved up and made easily ingestible, so as part of evaluating the right logging solution for your needs you need to think about the kind of data you’re looking at and how you may want to deal with it.

If you’ve not had much exposure to full-text indexing or Elasticsearch specifically, there are some key terms you’ll need to be aware of.

Cluster – a group of elasticsearch nodes.  Elasticsearch is designed to play in groups – sharing load and data, and ensuring redundancy.  A group of elasticsearch nodes talking together is a cluster.

Node – an individual elasticsearch instance participating in an elasticsearch cluster.

Index – a collection of records, much like a database table.

Getting Started

Before you go much further: if you want to play along at home, it’s best you have an elasticsearch cluster up and running.  Elasticsearch works better with friends, so we’re setting up a proper elasticsearch cluster.  I’ll be using two VMs in my Proxmox cluster, each configured with 4 cores, a 32GB root disk and 200GB mounted at /data/elastic, running Debian 8.  So get cracking with your orchestration tool of choice and spin them up!  In this demo they’re named galahad and gawain (knights searching for the Holy Grail) and will sit in an elasticsearch cluster named pyspace.  There are some great elasticsearch docker buildfiles too, which I use in another environment – but I’ll leave that as an exercise for the reader.

Tip – Security

Elasticsearch & Kibana do not provide security or authentication on their own, and I believe this is a good thing as it allows you to wrap them in your own security mechanisms and choose the best solution for you.  For the case of this example all hosts are on a private LAN to which I have a routed VPN connection from my main work/development laptop.  A public-facing reverse proxy server utilising HTTPS and requiring authentication allows internet-facing access to the kibana instance; however, both the elasticsearch cluster and kibana are accessible by all internal network nodes (including those on VPN) without authentication.  You should carefully consider all your own security requirements before deploying any technology solution.

You’ll want a couple of basics on each of the boxes, so apt-get both apt-transport-https and default-jre-headless to get you started, then add the required bits to grab the packages from the elastic apt repository (as root):

apt-get install apt-transport-https
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add -
echo "deb https://packages.elastic.co/elasticsearch/2.x/debian stable main" | tee -a /etc/apt/sources.list.d/elastic.list
apt-get update
apt-get install default-jre-headless elasticsearch

You’ll want to quickly configure elasticsearch – there isn’t much to do to get the basics up and running these days; so jump into /etc/elasticsearch/elasticsearch.yml with your favourite text editor.

cluster.name: pyspace
path.data: /data/elastic
network.host: _eth0_, _lo_
discovery.zen.ping.unicast.hosts: ["galahad.int.pydev.space", "gawain.int.pydev.space"]

This same elasticsearch configuration can be used on both your hosts.  It specifies the cluster.name, which needs to be the same on all your cluster nodes; the path to store data in (which we mounted as part of the system/container build above); and what to listen on (note _eth0_ and _lo_ in my case, to allow the same config to work on all nodes – eth0 is my internal network and not public-facing; see the security note above!).  Lastly we need to get our zen disco (discovery) up and running by providing some nodes that may know about the cluster – in this case my two nodes’ FQDNs.

Collecting Logs

Tip – Classify your Logging

The first mistake I believe most people make when considering logging with elasticsearch is just throwing all their data in and hoping for the best.  This is certainly what I did when I first installed elasticsearch for the purpose of capturing system log data, but it’s a condition I have recovered from.  The age-old adage applies here: garbage in, garbage out.

Not all logging belongs in your elasticsearch cluster; data should only be stored in your cluster if you plan to use it and have some idea how you may want to use it.  Some logging will always be best as a stream of text, but I do strongly recommend those streams of text are also centralised!

So before you dump *.* into your cluster without a second thought: STOP.  Don’t do it – at least until you understand why you want to do it.  Start simple and get some useful data in.

For the demonstration here, we’re going to use apache access logs.  The reason we want them in Elasticsearch is to do our own analysis on them, and potentially even traffic accounting; and while we’ll be using rsyslogd to transport them, we’ll do it in such a way that it doesn’t interfere with our existing system logs.

Apache access logs are nicely structured for our needs and suit our example perfectly (with thanks to an overseas scriptkiddie who spent hours today trying to log in to a wordpress instance hosted on this box, providing this example).

downunderderby.com:443 91.200.12.83 - - [20/Aug/2016:10:02:08 -0400] "POST /wp-login.php HTTP/1.1" 200 3974 "https://downunderderby.com/wp-login.php" "Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1; 125LA; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)"

Now, on the servers I wish to collect logs from (my web servers are all named after chemical elements and operated in a load-balanced environment, so we’re on sodium for this example), I’m utilising the following log format and configuration:

LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
CustomLog ${APACHE_LOG_DIR}/other_vhosts_access.log vhost_combined

Next we’ll want to configure rsyslogd to watch this file as an input; so in /etc/rsyslog.d/10_apache_other_vhost.conf

module(load="imfile")

input(type="imfile"
      File="/var/log/apache2/other_vhosts_access.log"
      stateFile="/var/run/rsyslog/vhosts.state"
      Tag="apache"
)

You can repeat this for any subsequent logfiles you also wish to read from, being aware that you can only load the module (the first line) once.  This configuration allows rsyslog to read those log files and process any new lines as new system log messages to be handled by the rest of the log rules.  imfile is the file input module for rsyslog; we specify the File we want to read, the stateFile which records the current read position, and a Tag which we use later when interacting with the records read by this log input.
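For example, watching a second (hypothetical) log file alongside the access log might look like the following; note that the module(load="imfile") line is not repeated, and the Tag is kept distinct so later rules can tell the inputs apart:

```
input(type="imfile"
      File="/var/log/apache2/error.log"
      stateFile="/var/run/rsyslog/apache_error.state"
      Tag="apache-error"
)
```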

Next we’ll want to define some actions to take for any messages we receive that are tagged with apache, in order to transform them into structured data and transport them off to elasticsearch.  So, in a new configuration file /etc/rsyslog.d/99_apache_es.conf, we’ll first load two required modules: mmnormalize, which converts logs to JSON utilising a rule, and omelasticsearch, the elasticsearch output module.

We’ll also define some templates to use, and a rule and actions to take:

module(load="mmnormalize")
module(load="omelasticsearch")


template(name="logstash-index"
  type="list") {
    constant(value="logstash-")
    property(name="timereported" dateFormat="rfc3339" position.from="1" position.to="4")
    constant(value=".")
    property(name="timereported" dateFormat="rfc3339" position.from="6" position.to="7")
    constant(value=".")
    property(name="timereported" dateFormat="rfc3339" position.from="9" position.to="10")
}

template(name="logstash-accesslog" type="list" option.json="on") {
        constant(value="{")
        constant(value="\"@timestamp\":\"")             property(name="timereported" dateFormat="rfc3339")
        constant(value="\",\"message\":\"")             property(name="msg")
        constant(value="\",\"host\":\"")                property(name="fromhost-ip")
        constant(value="\",\"@source_host\":\"")        property(name="hostname")
        constant(value="\",\"tag\":\"")                 property(name="syslogtag")
        constant(value="\",\"vhost\":\"")               property(name="$!vhost")
        constant(value="\",\"vport\":\"")               property(name="$!vport")
        constant(value="\",\"bytes\":")               property(name="$!bytesend")
        constant(value=",\"clientip\":\"")            property(name="$!ip")
        constant(value="\",\"auth\":\"")    	        property(name="$!auth")
        constant(value="\",\"method\":\"")              property(name="$!method")
        constant(value="\",\"request\":\"")             property(name="$!url")
        constant(value="\",\"pversion\":\"")            property(name="$!pver")
        constant(value="\",\"referrer\":\"")            property(name="$!referrer")
        constant(value="\",\"useragent\":\"")           property(name="$!useragent")
        constant(value="\",\"status\":\"")              property(name="$!status")
        constant(value="\"}")
}

if $syslogtag == 'apache' then {
   action(type="mmnormalize" userawmsg="off" rulebase="/etc/rsyslog.d/apache_accesslog.rule")
   action(type="omelasticsearch"
          server="elasticsearch.int.pydev.space"
          serverport="9200"
          template="logstash-accesslog"
          searchIndex="logstash-index"
          dynSearchIndex="on"
          searchType="logstash-index"
          bulkmode="on"
          queue.type="linkedlist"
          queue.size="5000"
          queue.dequeuebatchsize="300"
          action.resumeretrycount="-1"
          errorFile="/var/log/rsyslog.es-error.log")
    stop
}

The two templates we’ve defined will be used to mimic logstash-style messages.  The logstash-index template generates a correctly formatted string to be used for the elasticsearch index, e.g. logstash-YYYY.MM.DD.  The logstash-accesslog template takes the fields that are received from mmnormalize and maps them into the correct fields and datatypes to match what logstash would usually provide.  The key one here is @timestamp, as kibana does better magic with timestamped data.
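As a sanity check on what the logstash-index template produces, here is a quick sketch (in Python, purely illustrative – not part of the rsyslog configuration) of the same character-position slicing applied to an RFC 3339 timestamp:

```python
def logstash_index(rfc3339_ts: str) -> str:
    """Mimic the rsyslog logstash-index template: slice the year,
    month and day out of an RFC 3339 timestamp by position."""
    year = rfc3339_ts[0:4]    # positions 1-4 in rsyslog's 1-based terms
    month = rfc3339_ts[5:7]   # positions 6-7
    day = rfc3339_ts[8:10]    # positions 9-10
    return "logstash-{}.{}.{}".format(year, month, day)

print(logstash_index("2016-08-21T10:02:08-04:00"))  # logstash-2016.08.21
```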

Paying attention to the omelasticsearch definition: we define the server and port number used to communicate with our elasticsearch cluster.  I use a DNS round-robin pointing to each node of the cluster, which takes advantage of the built-in redundancy provided by elasticsearch.  The template option specifies the output template to use, and searchIndex and dynSearchIndex point to the template used to generate the name of our index, as mentioned above.  Some of the real power of logstash comes from its bulk shipping and guaranteed delivery of logs, but rsyslogd can do this for us too; so we enable bulkmode and set up a queue of messages to send to elasticsearch with the next few options.
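To give a feel for what bulkmode means on the wire: omelasticsearch batches records into Elasticsearch’s _bulk API format, which pairs an action line with each document as newline-delimited JSON.  A rough Python sketch of that payload shape (the field values are illustrative, not what rsyslog literally emits):

```python
import json

def bulk_payload(index, doc_type, docs):
    """Build an Elasticsearch _bulk request body: one action line
    plus one document line per record, newline-delimited."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # a _bulk body must end with a newline

docs = [{"@timestamp": "2016-08-21T10:02:08-04:00", "status": "200"}]
payload = bulk_payload("logstash-2016.08.21", "events", docs)
```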

Update: An issue was found in my original post;  and has been fixed.  You’ll notice the “bytes” and “clientip” line above for constructing the JSON data template; I’ve removed the backslashed quotes to ensure the bytes value is passed as a number (so I can do aggregation and calculations on the resulting elasticsearch records).

The stop is useful for preventing these entries from appearing in your normal system logs (messages and syslog); but for debugging purposes you may want to leave the stop out.

We’re almost done with the rsyslog configuration; lastly you’ll need to provide a rule file for mmnormalize to use when parsing the text into the first set of JSON variables used by the logstash-accesslog template.  This is straightforward, and I’ve placed it in /etc/rsyslog.d/apache_accesslog.rule

version=2

rule=:%vhost:char-to::%:%vport:number% %ip:word% %ident:word% %auth:word% [%timereported:char-to:]%] "%method:word% %url:word% HTTP/%pver:char-to:"%" %status:number% %bytesend:number% "%referrer:char-to:"%" "%useragent:char-to:"%"
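The rule amounts to a fairly ordinary pattern match.  As an illustrative cross-check (this is not how mmnormalize works internally), here is a Python regular expression that extracts the same fields from the sample log line shown earlier, with the user-agent shortened for brevity:

```python
import re

LINE = ('downunderderby.com:443 91.200.12.83 - - '
        '[20/Aug/2016:10:02:08 -0400] "POST /wp-login.php HTTP/1.1" '
        '200 3974 "https://downunderderby.com/wp-login.php" "Mozilla/4.0"')

# Field names mirror those used in the rule file above.
PATTERN = re.compile(
    r'^(?P<vhost>[^:]+):(?P<vport>\d+) (?P<ip>\S+) (?P<ident>\S+) '
    r'(?P<auth>\S+) \[(?P<timereported>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) HTTP/(?P<pver>[^"]+)" '
    r'(?P<status>\d+) (?P<bytesend>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"')

fields = PATTERN.match(LINE).groupdict()
# e.g. fields["vhost"] == "downunderderby.com", fields["status"] == "200"
```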

Note: There appears to be a better way of handling the input-file and mmnormalize steps using rsyslogd’s inbuilt input templates, directly creating a logstash-style output message; however, I am yet to have much luck implementing it in a way that is as reliable as the approach I’ve documented here.  I’ll post again if I solve this in a way I am happy with.

Now restart rsyslogd with service rsyslog restart and hit the webserver whose logs you’re monitoring.  If you’ve left the stop out you should see the access log appear in your system log; if everything went well the log entry should have magically appeared in your elasticsearch server.  elasticsearch provides a semantic, well-formed REST API, allowing easy querying from any system that can talk basic hypertext transport and some JSON.

curl -XGET 'http://elasticsearch.int.pydev.space:9200/logstash-2016.08.21'

and you’ll get a JSON representation of the index that has been created; or, if things didn’t go so well, an error about the index not existing.
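Once the index exists you can go beyond fetching its metadata.  As a sketch of the kind of query-DSL body you could POST to /logstash-2016.08.21/_search (the field names are the ones defined in our template above; treat the exact query as illustrative), here is one that counts POST requests per vhost:

```python
import json

# Query-DSL body: match POST requests, bucket counts per vhost,
# and return no individual hits (size 0) since we only want the buckets.
query = {
    "query": {"term": {"method": "POST"}},
    "aggs": {"per_vhost": {"terms": {"field": "vhost"}}},
    "size": 0,
}
body = json.dumps(query)
```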

Kibana

Now the data’s getting into elasticsearch, it’s time to check out kibana to visualise and query your data in a nice, user-friendly way.  As I mentioned earlier, kibana and elasticsearch allow you to build your required security mechanisms around them and don’t get in the way; they use HTTP/HTTPS so they can be proxied, authenticated and secured in any number of standard ways – I encourage you to do this!

I’m installing kibana on one of my cluster nodes; you can install it on all of them, or on your dedicated java application server, but for the purpose of this demonstration the first cluster node we built – galahad – will do fine.

If you haven’t installed elasticsearch on the node, you’ll need to make sure you have default-jre-headless and the GPG key for the elasticsearch repositories, along with apt-transport-https for pulling from the elastic apt repository over https.  You can follow the steps outlined above in the elasticsearch section for these.

echo "deb https://packages.elastic.co/kibana/4.5/debian stable main" | tee -a /etc/apt/sources.list.d/elastic.list
apt-get update
apt-get install kibana

If you are installing on a host running as part of your elasticsearch cluster you don’t need to do anything further; otherwise you may need to specify the cluster URL in /opt/kibana/config/kibana.yml.  You should then be able to open the host’s port 5601 in your web browser (e.g. http://galahad.int.pydev.space:5601).
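For instance, pointing kibana at the cluster should be a one-line change in kibana.yml (the setting name here is the one used by kibana 4.x; the hostname is from this article’s example cluster):

```yaml
# /opt/kibana/config/kibana.yml – where kibana should find elasticsearch
elasticsearch.url: "http://galahad.int.pydev.space:9200"
```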

 

When you first enter kibana you’ll need to set up your first index pattern.

[Screenshot: configuring the index pattern]

Kibana expects to receive standard logstash-style indexes and timestamped data, so if all went well you should be able to create the index pattern for kibana to use automatically.  You can use multiple index patterns in Kibana, which means you can separate various kinds of data, or different environments’ data, into different indexes as your use case may require.  You can now use the Discover tab to view your data in real time (using the refresh and time options available at the top right), with a range of filters based on your data on the left.

[Screenshot: the Discover view showing log data]

In Conclusion

I strongly encourage you to get familiar with the Elasticsearch API and queries, but starting with kibana allows you to quickly and easily start putting your log data to powerful use in ways you hadn’t thought of before.  Be sure to check out the Visualise tools as well for graphing your data.

There’s a whole range of tutorials and help available elsewhere online for learning how to use kibana, and its usage is far outside the intended scope of this document.

So while logstash and fluentd may be useful tools for collecting log data in many situations, sometimes we don’t need to go much further than the standard tools that are already part of the operating system.  I’m a fan of lean, minimalist systems; having a full JRE running on servers that don’t otherwise require it, just to handle logging, had previously turned me off some solutions, and fluentd seemed like overkill for many of the tasks I needed.

Finally, once you’ve got your head around access logs, start bringing other system logs in.  You’ll find you don’t need to structure all your logs the way you do access logs, as many logs will just be a stream of data.

My key things to remember:

  • If you bring in data from multiple systems to the same index, ensure your data is adequately tagged with which system it’s from, so that you can pull that data out.
  • If you bring in multiple applications to the same index, do so because you want to use that data together.  Make sure you can identify the application in your JSON data.
  • Timestamps!  I can’t overstate the importance of timestamps.
  • There’s nothing to fear in running elasticsearch yourself, and there’s no need to pay a fortune to a third-party vendor and ship them your logs either.
