Getting to know (My)SQL For a brief summary, see this[1] post. [1] entry://20111111-104648 Entry: Queries Date: Fri Nov 4 00:19:08 EDT 2011 select date,ip,req from entry NATURAL JOIN ip NATURAL JOIN req where id_ip = (select (id_ip) from ip where ip = '201.39.162.161'); Entry: SQL cheat sheet Date: Fri Nov 4 00:35:12 EDT 2011 # apt-get install mysql-server # mysql CREATE database ; SELECT * from ; USE ; CREATE TABLE agent ( agent_id MEDIUMINT NOT NULL AUTO_INCREMENT, agent TEXT NOT NULL, PRIMARY KEY (agent_id) ); INSERT INTO agent (agent) values ('me'), ('my dog'); SELECT COUNT(*) FROM
; Entry: Regular expressions Date: Mon Nov 7 20:28:18 EST 2011 select * from ref where ref regexp 'ycombinator'; select count(*) from ref where ref regexp 'pic18'; [1] http://www.go4expert.com/forums/showthread.php?t=2337 Entry: Histograms Date: Mon Nov 7 22:38:51 EST 2011 Simple histograms need a "truncate" function like year, month, day below. Then the summary is a GROUP BY, optionally followed bya an ORDER BY. SELECT year(date) AS year, count(*) AS hits FROM entry GROUP BY year(date) ORDER BY year; SELECT month(date) AS month, count(*) AS hits FROM entry GROUP BY month(date) ORDER BY month; SELECT day(date) AS day, count(*) AS hits FROM entry GROUP BY day(date) ORDER BY day; [1] http://stackoverflow.com/questions/830814/sql-server-non-standard-date-based-histogram [2] http://stackoverflow.com/questions/6201992/optimal-way-to-create-a-histogram-frequency-distribution-in-oracle Entry: Known bots Date: Mon Nov 7 22:55:19 EST 2011 [1] http://www.user-agents.org/ Entry: The SQL notes Date: Fri Nov 11 10:46:48 EST 2011 I'm starting these notes as a SQL/DB beginner. I was never interested in databases or SQL until I read a chapter in CTM[1] a couple of years ago. There the relational model of programming is developed in a way that is closer linked to declarative & logic programming, a connection I never saw before but which caught my attention, being interested in language design, interpreters, compilers, ... Recently I re-read that chapter in [1] when a practical need has come up to write reports for a couple of years of Apache log files. At first, I attempted to write an ad-hoc database query engine in Scheme (Racket) but soon realized that what I wanted to do was perfectly possible in SQL. Hence the Scheme part is limited to a parser that translates a collection of log files to an .sql file that can be fed into MySQL to produce a DB schema. From there on the idea is to get to know SQL, to see what's possible and what's reasonably (fast) and possibly move back to Racket bindings for the reporting layer. [1] http://www.info.ucl.ac.be/~pvr/book.html [2] entry://../plt Entry: Regexp and apache logs Date: Sat Nov 12 10:06:09 EST 2011 I put all my logs together so I can share some information across vhosts, such as bots. Trouble is that I don't have vhost infor recorded for each entry which might complicate things a bit. Anyway, it seems that because requests need to be fully qualified, there is no real problem here. What might be the case is that I need some precomputed request sets that cache the regular expression matches for certain requests into separate tables or views. Hmm.. actually it's not. For zwizwa.be I can see: 255209 | GET /darcs/darcs/zwizwa.be/ HTTP/1.0 Entry: Views Date: Sat Nov 12 10:28:35 EST 2011 DROP VIEW v_entry; CREATE VIEW v_entry AS select * from entry NATURAL JOIN vhost NATURAL JOIN ip NATURAL JOIN stat NATURAL JOIN req NATURAL JOIN ref NATURAL JOIN agent; DROP VIEW v_blog; CREATE VIEW v_blog AS select * from req where req regexp 'GET (.*)/blog/\\?p='; Entry: Getting at wordpress blog posts Date: Sat Nov 12 11:20:38 EST 2011 This view is a subset of the req table which refer to blog posts. DROP VIEW v_blog; CREATE VIEW v_blog AS select * from req where req regexp 'GET (http://.*)*/blog/\\?p='; How to create a table from that that identifies per blog post? Entry: Dumping SQL inside mysql binary Date: Thu Feb 23 11:00:07 EST 2012 I'd like to do something that mysqldump does, but from inside the mysql binary. I have access to a remote (huge) DB table but I'd like to work on part of the table locally, downloading the data only once. Looking around online, the simplest seems to be a tab-separated value file (TSV). #!/bin/bash cat <out.tsv SELECT * FROM $TABLE where ...; EOF Entry: Expanding text field into table using regular expressions Date: Thu Feb 23 14:48:40 EST 2012 I have a table with a text field that has data in it that I want to extract into table. I.e. suppose the text field has 'AA:BB:CC' then I want to separate this out into 3 fields 'AA','BB',CC'. The way to do this is something like: SELECT fn(column) where ...; where the fn is a function that parses the string. Entry: Regexps Date: Thu Feb 23 15:39:38 EST 2012 Seems that regex_replace is quite a hassle in mysql, so let's just do it externally using sed. echo "Debug : Sender device 48859 - '357005032445536 ' - id 129CE8 - flags 0002 - data 85510100" echo "foo1" | sed -e 's/foo\(\.*\)/\1/g' Entry: Representatives Date: Thu Feb 23 23:55:52 EST 2012 I have a table of (x,y,z) where I'm mostly interested in unique (y,z) combinations. I know how to do unique single columns (group by) but does this also work with multiple columns? Entry: Query DB in bash Date: Tue Feb 28 15:06:29 EST 2012 What's the simplest way to map a bash function over a table? It turns out to be simplest to just put the function in a script and use xargs. However this makes maintaining state a bit hard. Bottom line: use a real programming language, or implement the procedure in straight SQL. Entry: Some things I learned these days Date: Wed Feb 29 09:40:07 EST 2012 Playing with MySQL logfiles for work. Note that I'm an amateur at this and some approaches might be convoluted. Basic idea: in bash it's simplest to work with TSV (tab separated value) files, iterate over them with a "while read RECORD; do ... ; done" and use the sed/awk/xargs duct tape tools. * Adding data to the database. Basically, just generate SQL from any record stream and pipe it into mysql command line client. Also useful for custom dumps (write a script that does TSV -> SQL conversion using "while read RECORD; do ... ; done". * Mirroring a database + regexp conversion. I wanted a sandbox to create more tables so didn't work in the read-only production DB. Copying databases can usually be done using "mysqldump". However I did not find a way to do a selective dump, so I use the TSV->SQL approach. * Put individual SQL queries in a .sql files and load them in the mysql command line client using the "source" command. * Use views to combine different tables. I.e. I could find 3 different kinds of data: - log files - log file annotation from constant data (i.e. ELF symbol tables) - manual annotation These are fundamentally different. The first is the original raw data, second is data derved from original raw data in a different form (ELF binaries in this case). The latter is data updated during the manual log processing, which is dynamic in nature and reflects the state of the current analysis. Entry: Database cursors Date: Wed Feb 29 10:04:32 EST 2012 So I've done queries 101 (data filtering), next is updates (for loops / database cursors). I want to run an UPDATE procedure for all the results in a SELECT. Looks like this is possible within the UPDATE clause itself: UPDATE table SET column = value, ... WHERE ... Though that does an iteration over the table itself (the iterated table is the same as the updated one). That's not what I want. I want to iterate over table1 and insert/update values in table2. for rec in table1; do if predicate(rec); then replace(table2, field, "value") end end Entry: Apache logs in database Date: Fri Mar 2 20:25:46 EST 2012 After de-noobing a bit at work this week, I'm thikning that there might be some simpler ways to do the key indirection I'm using a Racket script for. Basic problem: the logs are one big table, but the entries in the table have a lot of duplicates. Find a way to do automatic key generation for the entries: if an entry doesn't exist yet, make up a new key, otherwise reuse old. The problem is that while (I think) I have some grasp about the functional part of SQL (queries), I have no clue about the procedural ones (updates). The first thing is probably to find out how procedures work. Wait.. This can be done functionally by creating a view. ( As Gerard says: Can't you create a view? ) 1. Just add the entry. 2. Insert if not exists each column. 3. Create a view that joins the hashed/indexed entries ( Which is the *inverse* of the operation I'm using to build the indirect table in the first place! ) Then of course, this can be done simpler. 1. INSERT IF NOT EXISTS each entry into its respective table. 2. SELECT to find id 3. insert main entry Entry: MySQL procedures Date: Fri Mar 2 20:48:09 EST 2012 DELIMITER //GetAllProducts() CREATE PROCEDURE GetAllProducts() BEGIN SELECT * FROM products; END // DELIMITER ; From [1], the delimiter is necessary to be able to store multiple statements in a procedure. (weird!). So simple sequencing is simple. Next: variable binding. First thing that comes to mind: what's the difference between procedures and functions? DELIMITER // CREATE PROCEDURE addsome() BEGIN insert if not exists SELECT * FROM products; END // DELIMITER ; [1] http://www.mysqltutorial.org/getting-started-with-mysql-stored-procedures.aspx [2] http://dev.mysql.com/doc/refman/5.0/en/create-procedure.html Entry: Playing with INSERT Date: Sat Mar 3 09:15:21 EST 2012 DROP TABLE if exists foo; CREATE TABLE foo( id MEDIUMINT NOT NULL AUTO_INCREMENT, val TEXT, PRIMARY KEY (id) ); INSERT IGNORE foo (val) VALUES ("lalala"); From [1]: The logic is that the SELECT statement generates a single row of data with the required values, but only when the values are not already found. That wasn't immediately useful. Dicking around a bit I come to this: INSERT INTO foo(val) (SELECT (val) from foo WHERE NOT EXISTS (SELECT * FROM foo WHERE val = "lalala")); INSERT INTO foo(val) (SELECT "lalala" from DUAL WHERE NOT EXISTS (SELECT * FROM foo WHERE val = "lalala")); Turning this into a procedure doesn't work.. I asked Yvonne, she probably knows. [1] http://stackoverflow.com/questions/913841/mysql-conditional-insert Entry: Understanding SQL Date: Sat Mar 3 09:39:33 EST 2012 I still don't quite get it.. Especially not the single value vs table stuff. Representing tables as lists, in Haskell syntax: f : (a,b,c) -> (d,e) g : [(a,b,c)] -> [(e,e)] h : [(a,b,c)] -> (d,e) Here g = map f, where this mapping seems to happen automatically in SQL depending on context? And h is a projection from a table to a single record, something like COUNT or EXISTS. COUNT is called an aggregate function. First find of the day: SELECT without FROM returns a 1 element table: mysql> SELECT 123; +-----+ | 123 | +-----+ | 123 | +-----+ 1 row in set (0.00 sec) which is enough to pass a value to a function that expects a table. This is useful for testing out predicates, i.e.: SELECT ("lalala" IN (select (val) from foo)); SELECT (EXISTS (select (val) from foo where val = "lalala")); "SELECT ... FROM " requires a table reference. It's possible to abstract queries as tables (views). How to add a query directly? In SqlServer this requires an alias[1]. Also works in Mysql: SELECT * from (SELECT * from foo) alias1; How to save the result of a query as a table? This is SELECT ... INTO, but that's not standard. Mysql supports INSERT INTO[2] which requires an existing table. [1] http://bytes.com/topic/sql-server/answers/142990-select-subquery [2] http://dev.mysql.com/doc/refman/5.0/en/ansi-diff-select-into-table.html Entry: MySQL regexp replace or just regexp match Date: Sat Mar 3 13:14:27 EST 2012 It seems that I need access to regexp replace, but on second thought there isn't really any "real" datamining going on (no processing on the data embedded *inside* a REFERRER or a REQUEST field: the embedded data just serves as input to predicate functions used for selecting and grouping. This seems like a very important point. It seems plausible to think that this is the reason why MySQL has matching but not replacing: don't store what can be computed. Some useful views: - google searches - specific blog page hits - Though... I have some trouble with things like this: GET /blog/?p=1089&cpage=1 HTTP/1.0 GET /blog/?p=1089 HTTP/1.0 GET /blog/?p=1089 HTTP/1.1 How to project them down to one representative? Maybe it's a good idea to parse the REQ and REF fields a bit more. Looks like I do need a regexp/replace function, but probably can do without saving tables (i.e. keeping just views). Entry: Regexp-replace Date: Sat Mar 3 16:19:58 EST 2012 sudo apt-get install libpcre3-dev libmysqlclient-dev Get library source[3], untar and ./configure && make && sudo make install Libraries have been installed in: /usr/lib/mysql/plugin [1] http://www.mysqludf.org/lib_mysqludf_preg/ [2] http://stackoverflow.com/questions/986826/how-to-do-a-regular-expression-replace-in-mysql [3] http://www.mysqludf.org/lib_mysqludf_preg/lib_mysqludf_preg-1.0.1.tar.gz Entry: External processing Date: Sat Mar 3 16:29:34 EST 2012 Considering the size of the tables, it might make sense to just split them externally using regexps. Or even to do this in the preprocessing step directly. mysql> select count(*) from req; +----------+ | count(*) | +----------+ | 354134 | +----------+ 1 row in set (0.00 sec) mysql> select count(*) from ref; +----------+ | count(*) | +----------+ | 42338 | +----------+ 1 row in set (0.00 sec) Anyways, I installed the PCRE wrapper for mysql and I find that the next step is to weed out the pletora of hack attempts ;) I found there are about 30 ways to request the same page, most of them look like wordpress breakin attempts. Entry: Nested queries Date: Sat Mar 3 20:20:25 EST 2012 So.. With the regular expression matcher it's possible to create partitions in the set of requests (i.e. projecting onto blog number). (remember from math class: a projection defines a partition!) So what I want to do now is to map each request to a blog number, and count the number of occurrences. select * from entry natural join req_blog # project request to id_blog group by id_blog order by id_blog limit 10; Queries start to take a little longer. Entry: Caching predicates / membership Date: Sat Mar 3 22:20:22 EST 2012 Products (subqueries, joins) are expensive. However in many cases the subquery is an "IN" constraint. Isn't it simpler to cache this membership function into an explicit (t, bool) table and join with that?