Mon Oct 31 17:18:46 EDT 2011

DB normalization

For abstracting my Apache logs I'd like to do something like this:

  (url, ip, client)

but represent it as this:

  (idu, idi, idc)

  (idu, url)
  (idi, ip)
  (idc, id-client)

  1. many of the urls & ips & clients will be duplicated
  2. i'm interesting in the set of unique url, ip, client, ...
  3. implementation: does it search faster, use less storage?

Is this non-normalized?  I'm not sure, though putting them back
together is a join[1]:

  If columns in an equijoin have the same name, SQL/92 provides an
  optional shorthand notation for expressing equi-joins, by way of the
  USING construct:

  FROM   employee
         INNER JOIN department
            USING (DepartmentID);

Thinking about this a bit, it makes sense to replace fields with IDs
in this way for the reasons above whenever the only useful operation
on that field is just equality.  Equality distributes over keys.
Other fields like data, number of bytes, ... that support other
operations (i.e. range) should stay in the main table.

[1] http://en.wikipedia.org/wiki/Join_(SQL)