Tue Nov 1 17:41:21 EDT 2011

regexp -> pregexp

The Apache log parser broke.  The trouble seems to be related to the
use of `regexp'.  With `pregexp', at least some things work.

Overall, this approach is really too brittle and hard to debug.  Let's
just make something that reads one item at a time in a more automatic
way.  It seems quite straightforward to dispatch on the first
character:

  - "  string
  - [  date
  - ?  space-separated word
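
A minimal sketch of that dispatch, written against current Racket
rather than the mzscheme of the time.  The names `read-item' and
`parse-line' are made up for illustration and are not the code in
apachelog.ss.  It assumes quoted strings contain no escaped quotes.

```racket
#lang racket/base
;; Sketch only: read one log field at a time, dispatching on the
;; first character.  All names here are hypothetical.

;; Collect characters up to the delimiter `end`, consuming the delimiter.
(define (read-until end in)
  (let loop ([acc '()])
    (define c (read-char in))
    (if (or (eof-object? c) (char=? c end))
        (list->string (reverse acc))
        (loop (cons c acc)))))

;; Collect a run of non-whitespace characters.
(define (read-word in)
  (let loop ([acc '()])
    (define c (peek-char in))
    (if (or (eof-object? c) (char-whitespace? c))
        (list->string (reverse acc))
        (begin (read-char in) (loop (cons c acc))))))

;; Return a (tag . text) pair for the next item, or eof.
(define (read-item in)
  (define c (peek-char in))
  (cond
    [(eof-object? c) c]
    [(char-whitespace? c) (read-char in) (read-item in)]
    [(char=? c #\") (read-char in) (cons 'string (read-until #\" in))]
    [(char=? c #\[) (read-char in) (cons 'date (read-until #\] in))]
    [else (cons 'word (read-word in))]))

;; Split a whole line into tagged items.
(define (parse-line line)
  (define in (open-input-string line))
  (let loop ([acc '()])
    (define item (read-item in))
    (if (eof-object? item)
        (reverse acc)
        (loop (cons item acc)))))
```

Each item comes back tagged, so a mismatch shows up as a wrong tag on
one field instead of a failed match on the whole line.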

It works pretty well with just `read', but as I remembered, that is
horribly slow, which was the main reason I used regexps in the first
place.  So let's stick with regexps and find a better way to debug
them.

What I really want is composable regular expressions with named
variable binding (not position mapping).
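
A rough sketch of what that could look like, in current Racket.  The
helpers `field' and `match-fields' are hypothetical, not an existing
library.  The assumption is that every fragment contains exactly one
capturing group, so names and groups line up.

```racket
#lang racket/base
(require racket/string)

;; A field pairs a name with a pregexp fragment that has exactly
;; one capturing group (assumption of this sketch).
(define (field name pattern) (cons name pattern))

;; Join the fragments into one anchored pattern and keep the names
;; in the same order as the groups.
(define (compile-fields fields [sep " "])
  (values (pregexp (string-append "^" (string-join (map cdr fields) sep)))
          (map car fields)))

;; Return an association list of (name . matched-text), or #f.
(define (match-fields fields line)
  (define-values (rx names) (compile-fields fields))
  (define m (regexp-match rx line))
  (and m (map cons names (cdr m))))

;; Example fragments; they can be tested one at a time.
(define ip-field      (field 'ip "(\\S+)"))
(define date-field    (field 'date "\\[([^]]+)\\]"))
(define request-field (field 'request "\"([^\"]*)\""))
```

With this, (match-fields (list ip-field date-field request-field) line)
gives results looked up by name, and a failing line can be narrowed
down by matching a shorter list of fields.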

Anyway, I fixed the bug in the match string and went on to use the
(test) routine, which filled up memory after about 4 million lines,
apparently using about 1 kB per line.  That's a bit over the top.

The total number of lines is 4080622, which takes a couple of minutes
to parse.  I estimate about 25k lines/second.  That's fairly
reasonable.  Now, what about hashing the values so the main table can
be dumped out as a table of indices, so we don't need to let the
database do this?
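
One way to read that idea, sketched in current Racket: intern each
distinct value in a hash table and store only its index in the main
table.  `make-interner' is a made-up name, and this is not what
apachelog.ss does.

```racket
#lang racket/base
;; Sketch only: map each distinct value to a small integer index.

;; Returns two values: an intern! procedure and the backing table.
(define (make-interner)
  (define table (make-hash))            ; equal?-based, so strings work
  (define (intern! s)
    ;; On first sight of s, assign the next free index
    ;; (the table size before insertion).
    (hash-ref! table s (lambda () (hash-count table))))
  (values intern! table))
```

A row of the main table is then just a list of indices, and the hash
table itself can be dumped once as the lookup table.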

tom@zoo:~/plt/lib/x$ time mzscheme -it apachelog.ss -e '(begin(test)(exit))'

tom@zoo:~/plt/lib/x$ /opt/apache-logs/sorted | mzscheme -t apachelog.ss -e '(test)'
848597 date: (#(struct:exn:fail "find-secs: non-existent date (inputs: 0 17 2 13 3 2011)" #<continuation-mark-set>) #"00" #"17" #"02" #"13" #"Mar" #"2011")

Looks like a bug:
(find-seconds 1 17 2 13 3 2011)
find-secs: non-existent date (inputs: 1 17 2 13 3 2011)

Let's see if it's in the current snapshot.

Still there.  Sent email to list.  See next post.

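EDIT: this was a daylight saving thing.  02:17 on March 13, 2011 falls
in the hour that is skipped when US daylight saving time starts, so it
does not exist as a local time.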
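
A possible way around it, assuming a Racket version whose
`find-seconds' (from racket/date) accepts the optional local-time?
argument: interpret the fields as UTC and apply the zone offset from
the log entry by hand.  `log-date->seconds' and its offset parameter
are hypothetical.

```racket
#lang racket/base
(require racket/date)

;; Sketch only: convert log date fields to seconds since the epoch
;; without going through the local time zone.
;; offset-minutes is the zone offset of the log entry, e.g. -300
;; for "-0500" (an assumed example value).
(define (log-date->seconds sec min hour day month year offset-minutes)
  ;; #f = treat the fields as UTC, so no local time can be "skipped".
  (- (find-seconds sec min hour day month year #f)
     (* 60 offset-minutes)))
```

Since UTC has no daylight saving transitions, every date in the log
maps to a valid number of seconds this way.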