Tue Nov 1 17:41:21 EDT 2011
regexp -> pregexp
Apache log parser broke. The trouble seems to be related to the use of
`regexp'. With `pregexp', at least some things work.
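For reference, one well-known way the two syntaxes differ. This is only an illustration and not necessarily what broke the parser here: in `regexp' syntax a backslash followed by a letter matches that letter literally, while in `pregexp' syntax \d is the digit class. The sample string is made up.

```racket
#lang racket/base
;; Same pattern text, compiled with the two syntaxes.
(define sample "13/Mar/2011")

;; regexp syntax: \d is a literal "d", and the sample contains none.
(define rx-result (regexp-match (regexp "\\d+") sample))

;; pregexp syntax: \d is a digit, so the leading day number matches.
(define px-result (regexp-match (pregexp "\\d+") sample))

(list rx-result px-result)
```

A pattern written with \d or {n} in mind fails silently under `regexp' instead of raising an error, which fits the "hard to debug" complaint.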
Overall, the problem is that this is really too brittle and hard to
debug. Let's just make something that reads one item at a time in a
more automatic way. It seems quite straightforward to dispatch on the
first character:
- " string
- [ date
- anything else: space-separated word
It works pretty well with just "read", but as I remembered, that was
horribly slow, which was the main reason I used regexps in the first
place. So let's stick to that decision and find a better way to debug.
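A minimal sketch of the one-item-at-a-time reader described above, dispatching on the first character. The names read-until, read-item and parse-line are hypothetical, this is not the code in apachelog.ss, the sample line is made up, and escaped quotes inside strings are not handled.

```racket
#lang racket/base

;; Read characters up to the delimiter `stop` (or EOF) and consume the delimiter.
(define (read-until stop in)
  (let loop ([acc '()])
    (define c (read-char in))
    (if (or (eof-object? c) (char=? c stop))
        (list->string (reverse acc))
        (loop (cons c acc)))))

;; Read one item, dispatching on its first character.
(define (read-item in)
  (define c (peek-char in))
  (cond
    [(eof-object? c) c]
    [(char=? c #\space) (read-char in) (read-item in)]   ; skip separators
    [(char=? c #\") (read-char in) (list 'string (read-until #\" in))]
    [(char=? c #\[) (read-char in) (list 'date (read-until #\] in))]
    [else (list 'word (read-until #\space in))]))

;; Collect all items of one log line, tagged by kind.
(define (parse-line line)
  (define in (open-input-string line))
  (let loop ([items '()])
    (define item (read-item in))
    (if (eof-object? item)
        (reverse items)
        (loop (cons item items)))))

(parse-line "127.0.0.1 - [01/Nov/2011:17:41:21 -0400] \"GET / HTTP/1.1\" 200")
```

Each item comes back tagged with its kind, so a malformed line shows up as an unexpected tag sequence rather than a failed match of one big pattern.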
What I really want is composable regular expressions with named
variable binding (not position mapping).
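A rough sketch of what that could look like, assuming each field is a (name . pattern) pair whose pattern has exactly one capture group. The names make-matcher, common-fields and request-fields are hypothetical, and the patterns and sample line are illustrative only.

```racket
#lang racket/base
(require racket/string)

;; Build a matcher from a list of (name . pattern) pairs. Fields are
;; separated by a single space; the result maps names to matched text.
(define (make-matcher fields)
  (define rx
    (pregexp (string-append "^" (string-join (map cdr fields) " "))))
  (define names (map car fields))
  (lambda (line)
    (define m (regexp-match rx line))
    (and m
         (for/hash ([n (in-list names)]
                    [v (in-list (cdr m))])
           (values n v)))))

;; Field lists compose with plain `append`.
(define common-fields
  (list (cons 'host "([^ ]+)")
        (cons 'date "\\[(.*?)\\]")))
(define request-fields
  (list (cons 'request "\"([^\"]*)\"")
        (cons 'status "(\\d+)")))

(define match-line (make-matcher (append common-fields request-fields)))

(define result
  (match-line "127.0.0.1 [01/Nov/2011:17:41:21 -0400] \"GET / HTTP/1.1\" 200"))

(hash-ref result 'date)
```

Since lookups go by name, adding or reordering fields does not shift any positions, and each field pattern can be tested on its own through a one-element field list.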
Anyway, I fixed the bug in the match string and went on to use the
(test) routine, which filled up memory after about 4M lines, apparently
using about 1k per line. That's a bit over the top.
The total number of lines is 4080622, which takes a couple of minutes to
parse. I estimate about 25k lines/second (4080622 / 25000 is about 163
seconds). That's fairly reasonable.
Now what about hashing it, so the main table can be dumped out as a
table of indices and we don't need to let the database do this?
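A minimal sketch of the hashing idea, assuming "hashing" here means interning: each distinct string gets the next free index, and the main table stores only indices. make-interner and intern! are hypothetical names, and the sample strings are made up.

```racket
#lang racket/base

;; Returns an interning procedure plus the underlying table.
;; A string seen for the first time gets the current table size as its index.
(define (make-interner)
  (define table (make-hash))
  (define (intern! s)
    (hash-ref! table s (lambda () (hash-count table))))
  (values intern! table))

(define-values (intern! table) (make-interner))

;; Rows of the main table become lists of small integers.
(define row (map intern! '("GET /" "GET /index.html" "GET /")))

row
```

The side table (string to index) can then be dumped once, separately from the main table of indices.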
tom@zoo:~/plt/lib/x$ time mzscheme -it apachelog.ss -e '(begin(test)(exit))'
tom@zoo:~/plt/lib/x$ /opt/apache-logs/sorted | mzscheme -t apachelog.ss -e '(test)'
848597 date: (#(struct:exn:fail "find-secs: non-existent date (inputs: 0 17 2 13 3 2011)" #<continuation-mark-set>) #"00" #"17" #"02" #"13" #"Mar" #"2011")
Looks like a bug:
(find-seconds 1 17 2 13 3 2011)
find-secs: non-existent date (inputs: 1 17 2 13 3 2011)
Let's see if it's in the current snapshot.
Still there. Sent email to list. See next post.
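EDIT: it was a daylight saving thing. US clocks jumped from 02:00 to
03:00 on 13 Mar 2011, so 02:17 local time really does not exist that day.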
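One way to sidestep this, sketched under two assumptions: that the installed version accepts the optional local-time? argument of find-seconds (current Racket does), and that the zone offset from the log entry is parsed as well. The date match above only captures six fields, so the offset handling is hypothetical, as are the helper name log-time->seconds and the -0500 offset in the example.

```racket
#lang racket/base
(require racket/date)

;; Convert log fields to epoch seconds without consulting the machine's
;; local timezone. `offset-minutes` is the zone field of the log entry,
;; e.g. -300 for "-0500".
(define (log-time->seconds sec minute hour day month year offset-minutes)
  (- (find-seconds sec minute hour day month year #f)  ; #f: read fields as UTC
     (* 60 offset-minutes)))

;; The timestamp that failed above, with an assumed offset of -0500.
(log-time->seconds 0 17 2 13 3 2011 -300)
```

Every field combination is a valid UTC time, so the skipped hour can no longer trigger the "non-existent date" error, and the result no longer depends on where the parser runs.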