Thu Jul 9 12:24:37 CEST 2009

apache logs

So..  This is the first thing I've dealt with in a long time where
performance really matters.  Reading the database from disk in text
format, either directly from the Apache logfile syntax or preprocessed
into Scheme syntax, is really too slow.  Some other database storage
mechanism needs to be used.

Now.. Can standard methods be used?  Using an SQL database with proper
shared data might really be enough.  The basic idea is to have the
graph structure represented in a form that is easily accessed.  In the
end, it is nothing but a bunch of integers.

So I wonder: does using standard methods have _any_ advantage here?
One advantage would be that external tools could use the database.
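A minimal sketch of what that standard-methods route could look like, assuming SQLite and a shared string table so each hit is just a row of integer ids (the table and column names here are hypothetical, not an existing schema):

```python
import sqlite3

# Hypothetical schema: one interned-string table, plus a hits table
# holding nothing but integers that point into it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE token (id INTEGER PRIMARY KEY, text TEXT UNIQUE);
CREATE TABLE hit (
    date     INTEGER,   -- unix timestamp
    ip       INTEGER,   -- token id
    request  INTEGER,   -- token id
    referrer INTEGER,   -- token id
    agent    INTEGER    -- token id
);
""")

def intern(text):
    """Return the integer id for a string, inserting it if new."""
    conn.execute("INSERT OR IGNORE INTO token (text) VALUES (?)", (text,))
    return conn.execute("SELECT id FROM token WHERE text = ?",
                        (text,)).fetchone()[0]

row = (1247135077, intern("10.0.0.1"), intern("GET / HTTP/1.1"),
       intern("-"), intern("Mozilla/5.0"))
conn.execute("INSERT INTO hit VALUES (?, ?, ?, ?, ?)", row)
print(conn.execute("SELECT COUNT(*) FROM hit").fetchone()[0])  # 1
```

The shared `token` table is what gives the "proper shared data": repeated agents, referrers and requests are stored once, and the hits themselves stay pure integers.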

Let's get the numbers right first.  How many table entries for a
year's worth of logs?

291 logfiles
1065149 entries

206M   uncompressed
17.7M  bzip2, one file
25.6M  gzipped, one file / multiple files

What strikes me is that zcat is very fast reading the files.

tom@zni:/opt/apache-logs/access$ time bash -c 'zcat * | wc -l'

real	0m1.147s
user	0m1.072s
sys	0m0.072s

Why is parsing in the current implementation so incredibly slow?  I
guess I'm seriously underestimating regex complexity.
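For comparison, a sketch of what a precompiled, anchored parser for the Apache "combined" log format looks like (in Python here; the group names are my own choice). Compiling the pattern once and anchoring it at `^` keeps the per-line cost roughly constant; recompilation and backtracking are the usual reasons regex-based log parsing crawls:

```python
import re

# Compile once, outside the per-line loop.  The character classes
# ([^\]]+, [^"]*) avoid the backtracking that .* would cause.
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

line = ('10.0.0.1 - - [09/Jul/2009:12:24:37 +0200] "GET / HTTP/1.1" '
        '200 1234 "-" "Mozilla/5.0"')
m = LINE.match(line)
print(m.group("ip"), m.group("status"))  # 10.0.0.1 200
```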

Anyways, what's needed is:

date     32 bits
ip       32 bits
request  32 bits
referrer 32 bits
agent    32 bits

Five 32-bit fields is 20 bytes per row; that's about 21 MB of indexed
data = quite close to the compressed sizes!
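The back-of-envelope arithmetic, checked with `struct` (the little-endian layout is just an assumption for the sketch):

```python
import struct

# One row = five 32-bit unsigned ids: date, ip, request, referrer, agent.
ROW = struct.Struct("<5I")
entries = 1065149                    # a year's worth of log entries

print(ROW.size)                      # 20 bytes per row
print(ROW.size * entries / 1e6)      # ~21.3 MB for the whole year
```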

Ok. Regex. Maybe the main problem is that my expressions are not
anchored?  No.. I feed it lines.

Hmm.. Let's start encoding it into something that loads fast.  It
looks like a specific structure for representing the logfile is going
to be way more interesting than a dumb regexp-based approach,
especially since requests themselves are so structured.  Basic idea:

  - turn the logfile into a graph with as little redundancy as
    possible.  this will show the "real" structure of the data.
    basically: define what a "token" is.

  - write queries on the graph structure
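The two steps above can be sketched as follows (in Python, with a made-up `Interner` class): every token is interned into one shared table, so each log entry collapses into a tuple of small integers, and queries then run over integers instead of strings:

```python
# Intern every token into a shared table; an entry becomes a tuple of
# small integer ids, with all redundancy factored out.
class Interner:
    def __init__(self):
        self.ids = {}      # token -> id
        self.tokens = []   # id -> token

    def intern(self, token):
        if token not in self.ids:
            self.ids[token] = len(self.tokens)
            self.tokens.append(token)
        return self.ids[token]

tab = Interner()
entries = [tuple(tab.intern(t) for t in line.split("\t"))
           for line in ["10.0.0.1\tGET /\tMozilla/5.0",
                        "10.0.0.1\tGET /about\tMozilla/5.0"]]
print(entries)  # [(0, 1, 2), (0, 3, 2)] - shared ip and agent ids
# Queries run over the integer tuples; tab.tokens maps ids back to text.
```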