Thu Jul 9 12:24:37 CEST 2009
So.. This is the first thing I've dealt with in a long time where
performance really matters. Reading the database from disk in text
format, either directly from the Apache logfile syntax or preprocessed
into Scheme syntax, is really too slow. Some other database storage
mechanism needs to be used.
Now.. Can standard methods be used? Using an SQL database with properly
shared data might really be enough. The basic idea is to have the
graph structure represented in a form that is easily accessed. In the
end, it is nothing but a bunch of integers.
So I wonder: does using standard methods have _any_ advantage here?
One would be that external tools could use the database.
Let's get the numbers right first. How many table entries for a
year's worth of logs?
17.7M bzip2, one file
25.6M gzipped, one file / multiple files
What strikes me is that zcat is very fast reading the files.
tom@zni:/opt/apache-logs/access$ time bash -c 'zcat * | wc -l'
Why is parsing in the current implementation so incredibly slow? I
guess I'm seriously underestimating regex complexity.
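One quick sanity check on the regex question: an anchored, precompiled
pattern applied per line. The log line below is made-up example data in
Apache common log format, and the pattern is my own sketch, not the
expression from the current implementation:

```python
import re

# One made-up line in Apache "common" log format.
line = '127.0.0.1 - - [09/Jul/2009:12:24:37 +0200] "GET /index.html HTTP/1.1" 200 2326'

# Anchored and compiled once; compiling per line (or leaving the
# pattern unanchored with leading wildcards) is the usual regex
# performance trap to rule out first.
pat = re.compile(r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d+) (\S+)$')

m = pat.match(line)
print(m.group(1), m.group(6))  # 127.0.0.1 200
```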
Anyways, what's needed is:
At 20 bytes per row, that's about 20MB of indexed data, which is quite
close to the compressed sizes!
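Sanity-checking that estimate (the row count of roughly one million
lines per year is implied by the 20MB figure, not measured here):

```python
# Back-of-the-envelope check of the 20 bytes/row estimate.
# Assumes ~1 million log lines per year, which is what the
# 20MB total implies; the real count would come from wc -l.
rows = 1_000_000
bytes_per_row = 20
total_mb = rows * bytes_per_row / 1e6
print(total_mb)  # 20.0
```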
Ok. Regex. Maybe the main problem is that my expressions are not
anchored? No.. I feed it lines.
Hmm.. Let's start encoding it into something that loads fast. It
looks like a specific structure for representing the logfile is going
to be way more interesting than a dumb regexp based approach.
Especially since requests themselves are so structured. Basic idea:
- turn the logfile into a graph with as little redundancy as
possible. this will show the "real" structure of the data.
basically: define what a "token" is.
- write queries on the graph structure
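The tokenize-and-deduplicate step could be sketched like this; the
interning scheme is my own guess at what "as little redundancy as
possible" means, with each distinct token stored once and every line
reduced to a tuple of integer ids:

```python
# Sketch: intern each token once, then represent each log line as a
# tuple of integer ids. Distinct tokens appear exactly once in `table`,
# so shared structure (common methods, URLs, status codes) is deduplicated.
table = {}  # token -> integer id

def intern(tok):
    # setdefault assigns the next fresh id on first sight of a token.
    return table.setdefault(tok, len(table))

# Made-up, pre-tokenized example lines.
lines = [
    "GET /index.html 200",
    "GET /about.html 200",
]
encoded = [tuple(intern(t) for t in line.split()) for line in lines]
print(encoded)  # [(0, 1, 2), (0, 3, 2)]
```

Queries then operate on small integer tuples instead of re-parsing
text, which is the point of loading fast.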