Fri Jul 3 09:26:12 CEST 2009

apachelogs - - [12/Apr/2009:00:47:02 +0200] "GET /shop_content.php?coID=47&XTCsid=v6ldqlfhml4a9vv90igsmbj6sis9n77u HTTP/1.0" 404 16 "-" "msnbot/2.0b"

The current parser chokes on this input, and I don't understand why.
Probably best to use regular expressions.

Ok, that seems to work fine.  I've created 2 parsers, one for the
combined and one for the common log format.
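
A minimal sketch of what such a pair of regex parsers could look like,
in Python.  This is not the code used here: the names COMMON, COMBINED
and parse_line, and the group names, are made up for illustration.
The combined format is treated as the common format plus two quoted
fields (referrer and agent).

```python
import re

# Common Log Format: host ident user [date] "request" status size
# (group names are illustrative assumptions, not from the original parser)
COMMON = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" '   # allows backslash-escaped quotes
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Combined Log Format: common format followed by "referrer" "agent"
COMBINED = re.compile(
    COMMON.pattern
    + r' "(?P<referrer>(?:[^"\\]|\\.)*)" "(?P<agent>(?:[^"\\]|\\.)*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line matches neither format."""
    line = line.rstrip("\n")
    # Try the longer format first; a combined line never fully matches COMMON.
    m = COMBINED.fullmatch(line) or COMMON.fullmatch(line)
    return m.groupdict() if m else None
```

Returning None for unmatched lines makes it easy to collect the lines
a parser chokes on instead of aborting the whole run.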

Parsing takes _way_ too long.

Ok, I tried converting it to Scheme reader format and it's even worse.
I guess that's why one would use databases: the data is stored
already parsed.

So, how does one go about managing such datasets?  One thing is that
these logs are very redundant.  Simply replacing every field value
with a 32-bit integer index into a table of unique values would
already do wonders.  This would allow the graph structure to be
represented separately from the node data.
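
A rough sketch of that indexing idea in Python, under assumptions: the
class names Interner and IndexedLog are hypothetical, and array("I")
is an unsigned int that is 32 bits wide on common platforms but is
only guaranteed to be at least 16.  Each distinct field value is
stored once; rows hold only integer ids.

```python
from array import array


class Interner:
    """Maps each distinct value to a small integer id (hypothetical helper)."""

    def __init__(self):
        self.ids = {}      # value -> id
        self.values = []   # id -> value

    def intern(self, value):
        idx = self.ids.get(value)
        if idx is None:
            idx = len(self.values)
            self.ids[value] = idx
            self.values.append(value)
        return idx


class IndexedLog:
    """Stores records as flat rows of ids, one Interner per field."""

    def __init__(self, fields):
        self.fields = tuple(fields)
        self.tables = {f: Interner() for f in self.fields}
        # Unsigned ints; typically 4 bytes each, but platform dependent.
        self.rows = array("I")

    def add(self, record):
        for f in self.fields:
            self.rows.append(self.tables[f].intern(record[f]))

    def __len__(self):
        return len(self.rows) // len(self.fields)

    def get(self, i):
        """Rebuild record i by looking its ids up in the value tables."""
        base = i * len(self.fields)
        return {f: self.tables[f].values[self.rows[base + k]]
                for k, f in enumerate(self.fields)}
```

The rows array is the compact part that links records together; the
per-field tables hold the actual strings and can be saved separately.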

What am I interested in?
  - date
  - ip
  - referrer
  - agent (for bot filtering)
  - request
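
The list above could be captured as a small record type.  A sketch in
Python, assuming the raw fields arrive as a dict of strings with the
keys "host", "date", "referrer", "agent" and "request" (those key
names, Hit, to_hit, is_bot and BOT_MARKERS are all made up here).  The
bot check is a crude substring guess, not a real bot list, and
strptime's %b depends on the locale having English month names.

```python
from datetime import datetime
from typing import NamedTuple, Optional

# Assumption: substrings that commonly appear in crawler user agents.
BOT_MARKERS = ("bot", "crawler", "spider")


class Hit(NamedTuple):
    date: datetime
    ip: str
    referrer: Optional[str]
    agent: str
    request: str


def to_hit(raw):
    """Keep only the interesting fields and parse the timestamp."""
    return Hit(
        # Apache timestamp, e.g. 12/Apr/2009:00:47:02 +0200
        date=datetime.strptime(raw["date"], "%d/%b/%Y:%H:%M:%S %z"),
        ip=raw["host"],
        # "-" means the field was absent in the log
        referrer=None if raw["referrer"] == "-" else raw["referrer"],
        agent=raw["agent"],
        request=raw["request"],
    )


def is_bot(hit):
    """Heuristic bot filter based on the user agent string."""
    agent = hit.agent.lower()
    return any(marker in agent for marker in BOT_MARKERS)
```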