Experiments with the PLT Scheme webserver. The stable part of this code supports the http://zwizwa.be/ramblings log file formatter, used for most of my code projects. It is based on a simple graph based document model with lazy parsing. Additionally, I'm thinking about and experimenting with a "PHP for Scheme" model for low-level parametric page generation, memory image based web application development (without database, only using store for snapshot backups) and what the role of objects are in a continuation based web server without database. Darcs archive: http://zwizwa.be/darcs/sweb Entry: fun with the plt web server Date: Sat Sep 8 14:59:18 CEST 2007 the stuff i got running now is a website generated from a single xml file with content. this works pretty well, but doesn't really require any dynamic stuff. next: ramblings.txt parser this uses streams.ss, a copy paste from the brood project. Entry: ramblings Date: Wed Sep 12 01:47:46 CEST 2007 parsing a ramblings file seems to be pretty straightforward. need to fix some small organization problems for 'wikifying' my syntax. i guess i've been able to avoid most parsing problems using lisp and forth.. time to read some theory. anyways. one important node for indexing is to make the article url stable, such that if i edit the original ramblings file, all references stay unless the article is deleted. the only thing that's reliable is the date. since i'm using standard 'date' output, this can be parsed back into some simpler date rep. just was watching http://video.google.com/videoplay?docid=2159021324062223592 and one of the remarks david weinberger is making is that it is simply too expensive to delete everything. he also mentioned some 'meta-wikipedia' where experts link to a subset of frozen articles that they can endorse. made me thing of these ramblings files. instead of cleaning them up, i can just make a decent article here and there and archive the lot. then if i want to link to some historic thought, i just can.. that's why i do need a proper indexing system. the exact dates are probably good enough. i don't even need a project name. in fact, i can index everything i run into by standard date encoding.. maybe that's the 'pool' i've been thinking about? Entry: ramblings ok Date: Sun Sep 16 14:26:55 CEST 2007 i think i got most of it working now, based on - simple http:// link highlighting - indexing through date string (using ad hoc parser for 'date' cmd output) - nodes stored in a hash in memory: left-right linked + indexed. looks +- elegant, except for the use of some global vars. next: - put this online, maybe add another level of abstraction: per project ramblings - make a parser for pawfaliki format - create new site about purrr docs remember: the idea is to have a STABLE link structure. i prefer never to change the linking, which will be: http://zwizwa.be//.... Entry: scheme! Date: Mon Sep 17 14:49:32 CEST 2007 the funny thing is that now i'm used to pure functions, working with mutation looks terribly dirty. i run into this with the web server's data structure. building it seems strange: explicitly dealing with inconsistent intermediate states is something you don't really miss. and persistence is so convenient.. the most conventient structure for me is a hierarchical structure of hash tables (file system), which is further cross-referenced according to need. 
so: - hierarchical tree -> everything is reachable - cross referenced graph -> some things can be accessed fast this brings me to the following: why not stick with functional programming? the reason is of course that functional data structures are not trivial once you go beyond simple lists. but. is it really so hard to have an efficient hash? and, given the problem, isn't a mere association list enough? let's go back to a simplistic view. the things i need to aim for are concurrency and persistence. both are a lot easier when functional data structures are used. so the sloganesk view: - functional data structures for concurrency and persistence (data includes code here.. what is data other than dumb code?) - all mutation is hidden in caching mechanisms or declarative abstractions. Entry: real parser Date: Thu Sep 20 23:12:27 CEST 2007 first thing to find out is that an ad-hoc syntax likely requires separate parser/lexers. for example: i'm using a header/body separation in ramblings.txt which have completely different lexemes. so... i can leave the preprocessor as is. it sort of works, but i get some errors i don't understand. time to call it a day. Entry: pawfaliki Date: Fri Sep 21 15:53:06 CEST 2007 it looks like the wiki grammar is regular, so i could do with a lexer only. i need to be careful though not to turn this into a big pile of hacks.. it's already a small pile. about parsing: i've been largely able to ignore this because 1. forth is regular 2. all other ad-hoc things i've used were also regular 3. the only other thing i use is s-expressions (with a simple recursive decent parser) need to read a bit more about LR grammers. currently i don't understand how to resovle the shift/reduce conflict in the ramblings parser. Entry: graphs Date: Mon Sep 24 16:27:20 CEST 2007 plt scheme has a graph.ss library. there is some talk about it on the plt list: http://groups.google.com/group/plt-scheme/browse_thread/thread/391b57f756c75678 hmm.. it's mrlib. Entry: a new wiki standard Date: Fri Sep 28 14:01:20 CEST 2007 basicly what i want is just one format to concentrate on for development of high--quality texts. i already have that, and it's called latex. especially together with tex2page, which includes scheme as a meta--langage, there the sky is the limit. Entry: web server debug Date: Sat Oct 6 17:28:15 CEST 2007 i need to change some things so debugging becomes easier. - data structure reload without explicit sync (plt file date) Entry: parsing Date: Sun Oct 7 13:51:40 CEST 2007 hmm.. it's quite a mess, because the syntax is so ad-hoc. the thing is i need several lexers and parsers, instead of one. Entry: speedup Date: Sun Jan 27 17:01:40 CET 2008 i don't know exactly what the problem is, but it's REALLY slow. it looks like parsing the articles can better be postponed until when they are actually accessed. fixed: adding a delay around the text->xexpr translation does the trick. the index loads pretty fast now. TODO: hide this delay in object accessors. Entry: forth as markup Date: Wed Jan 30 11:41:28 CET 2008 hmm... in the ramblings body parser i had made a reservation for the ( and ) characters to be able to handle s-expressions. but, given that it's text, it might be a lot easier to use a forth-style approach: words separated by spaces, most words are just text, but some reserved words are used for formatting.. Entry: @chunk broken Date: Thu Jan 31 14:40:49 CET 2008 i took that out of brood.. so i need to find a replacement. 
the reason for taking out @chunk in brood was the move to delimited parsing, which means no "put-back" or no "peek". Entry: move to plt v4 Date: Tue Feb 19 16:33:00 CET 2008 EDIT: just changing plt-web-server-text to plt-web-server in the sweb/start script did the trick. i don't know what ghosts i saw last week that made it fail.. some things changed in the implementation of the webserver: the old sweb version doesn't work any more. a good occasion to fix some things. what i'd like to have is a repl attached to the server to inspect data structures while it's running. but first, start with a working server that's started entirely from a script. there's /start.ss as a first attempt. (enter! "start.ss") loads the start module when in a repl. the first questions: * how to create a servlet dispatcher? * what to do with the config dir? i.e. the ordinary file server? i'm reading a bit on the history of PLT scheme, and it looks like it's probably best i switch to learning mode. get it to know a bit better.. after briefly reading about delimited continuations and mixins, i do wonder how to use them to implement the current graph-based web-app structure a bit better. continuations OR objects, or continuations AND objects? time to get practical though.. not ready for theory atm. the procedure -> dispatch lifter http://pre.plt-scheme.org/docs/html/web-server/dispatch-lift_ss.html looks like that's what i need if i don't want servlets. Entry: plt 4 problems Date: Wed Feb 20 14:25:59 CET 2008 tom@giebrok:~/sweb$ ./start /usr/local/bin/plt-web-server -p 8181 -f configuration-table "../web-config-unit.ss" broke the contract (-> path-string? #:listen-ip (or/c false/c string?) #:make-servlet-namespace (->* () (#:additional-specs (listof any/c)) namespace?) #:port (or/c false/c number?) unit?) on configuration-table->web-config@; expected a procedure that accepts 1 ordinary argument and the mandatory keywords #:listen-ip #:make-servlet-namespace #:port, given: #web-config@> === context === /usr/local/mz-3.99.0.12/collects/scheme/private/contract-guts.ss:200:0: raise-contract-error on zzz with plt-3.99.0.12 it doesn't give this error. maybe plt vs. mz packaging? i sent a mail to the plt list. Entry: /last servlet Date: Wed Feb 20 18:27:32 CET 2008 http://zwizwa.be/last/ is a redirect point for cached dynamically generated files, such ad .pdf files from .tex documents, and tarballs from the darcs source projects. the main goal is to have a fixed entry point (service) that provides these documents. ideally, this would be demand-driven, but probably best to just use cron jobs now. next action: figure out how to redirect in scheme. http://pre.plt-scheme.org/docs/html/web-server/response-structs_ss.html Entry: roll your own stream processing library Date: Wed Feb 20 20:28:39 CET 2008 i'm wondering if this is actually a good idea.. in brood, it's a core component (syntax streams), but maybe it's better to use something standard? the reason is: i was apparently the first one running into a bug with SRFI-45.. so what's the popular solution? is there a canonical one + operations on those streams?? Entry: sweb contains data Date: Thu Feb 21 16:13:40 CET 2008 the idea was to separate data from functionality, but i guess there's no point, since funcitonality is so specialized. what might happen later is that i spin off some scheme web facade library for other projects (including the stream stuff..) but that's not for now. 
CONVENTION: data used for dynamic content specific to the site sits in the db/ directory Entry: darcs meta? Date: Thu Feb 21 16:15:56 CET 2008 what about serving the darcs archives through sweb, and providing some standardized frontend + making all projects comply to a standard building process? Entry: separating out library code Date: Thu Feb 21 17:07:23 CET 2008 instead of copy-pasting or ad-hoc linking of scheme files between different projects it might be better to start working on my "personal" scheme code library, merely as a practical way of organizing different projects. atm, there is: [BROOD] -> [LIB] <- [SWEB] let's call it zwizwa-plt and have it live in collects/ Entry: web server debugging Date: Thu Feb 21 18:30:47 CET 2008 http://groups.google.com/group/plt-scheme/browse_thread/thread/8cab8d9142780cb9 Entry: classes in sweb Date: Fri Feb 22 11:02:56 CET 2008 what about moving the 'graph' thing to a class? if i'm to use objects, a page should clearly be an object. Entry: zwizwa cache cleanup Date: Fri Feb 22 11:05:22 CET 2008 let's try to move to automate the generation of cached objects: * papers * darcs src + bin tarballs first, the papers. what needs to be done is: 1. check consistency of document 2. if not consistent, initiate a compilation (on different host!) 3. retreive the result security-wise, the webhost can only send messages to the compilation host. the compilation host responds to this by placing a file in a cache directory (or exporting one?) toplevel archive organization. is the builder part of sweb? yes. so it's best to put it in one archive. let's standardize the builder as a Makefile. builder takes archives from giebrok. the builder runs parameterized with the source host's name, fix it in the makefile for now. Entry: caches Date: Tue Feb 26 17:08:06 CET 2008 the thing about call-by-need and makefiles is probably best seen as a filesystem based object cache. i already have something like that: cached-file-object let's put it in a separate module, since it might be useful. so.. this is not completely the same as 'make': since Makefiles have rules that are opaque, the only way to figure out wether something needs to be recompiled, is to re-run make. what might be interesting, is a sort of rate-limited make: call make only if last pull was >> than the time it takes to run make. it looks like these things are better solved in the archives themselves. an 'update' target in the toplevel makefile should perform all the necessary operations to sync from central darcs and recompile. also generally useful. ok.. i separate it out in 2 targets: update: sync with authoritative zwizwa.be archive all: build everything both are polling operations. one wonders wether they can be made synchronous. another problem that pops up is for latex: building latex requires multiple iterations per file to ensure convergence. how to do that automatically? -> fixed (see latex-bibtex in darcs/papers) Entry: ssh + scheme? Date: Tue Feb 26 20:42:35 CET 2008 maybe it's time to write a dispatcher for ssh logins. rpc over SSH. this can then be used to send messages between remote hosts over secure but limited channels. this needs only parsing of arguments + environment variables: getenv putenv current-command-line-arguments ok. it's pretty simple. making sure to use the right identity file: ssh -i -o "IdentitiesOnly yes" .ssh/authorized_keys can be configured to dispatch the identity to a program which interprets the environment variable SSH_ORIGINAL_COMMAND, a string without zero bytes. 
so glueing together scheme programs over ssh is really simple. this basicly solves the problem of building stuff on klimop: only allow a single command (i.e. 'build') that produces a tar file with the result. no other functionality is needed, other than the trust in klimop. Entry: no objects? Date: Wed Feb 27 16:52:50 CET 2008 i cleaned up the code a bit, and hid the forcing/cache triggering in the node/graph object. now it looks reasonably nice, with the main operations on the data struture being 'n!' and 'n@'. adding nodes is possible using 'delay' or 'dynamic'. (the latter produces just a tagged thunk). all promises/thunks are forced whenever a node is accessed. maybe this graph is good enough as data structure? sweb is just a linked frontend to a couple of dynamic or cached pages. in fact. that's really most of an object structure, where object methods do not take parameters. (a 'button' object). Entry: continuation experiments Date: Wed Feb 27 19:51:57 CET 2008 /usr/local/bin/plt-web-server -p 8181 -f configuration-table Servlet (@ /servlets/ct) exception: "current-continuation-marks: no corresponding prompt in the continuation: #" === context === /usr/local/mz-3.99.0.13/collects/web-server/lang/abort-resume.ss:132:0: send/suspend ...r/private/servlet.ss:32:19 ...r/private/servlet.ss:32:19 i suspect this has to do with the expiration based manager, see the note here: http://pre.plt-scheme.org/docs/html/web-server/none_ss.html#(def~20((collects~20web-server~20managers~20none..ss)~20create-none-manager)) "if you are considering using this manager, also consider using the Web Language. (See Web Language Servlets.)" that's not it. it looks as if the default loader doesn't do web language servlets.. ha, i got something working here. after looking at the source mz/collects/web-server/dispatchers/dispatch-lang.ss i found that a path contains 2 values: a path and a search path list, so i made my fake path generator generate an empty list as 2nd value. now the dispatcher works. apparently web-serverl/lang/stuff-url.ss stores the continuation in a file in ~/.urls looks like this stuff is still quite experimental.. it does work, though i'm not sure why these continuations need extra data. thought there was no server state? Entry: scheme/control Date: Thu Feb 28 12:24:14 CET 2008 using 'prompt' and 'abort' for 404 escapes. currently i use call/ec + a parameter. using a specific prompt tag should simplify this. indeed. works perfectly. Entry: wowri + sweb update Date: Sat Mar 29 12:34:30 EDT 2008 sweb: link to current scat rambligns wowri: figure out where to put the data. sweb code should be on giebrok, but all the site data, managed by different people, should go on kurk. big choice: use sweb or just use wordpress or other cms? since this is not just for melissa, it's probably best to stick to something popular to limit the support cost.. what's important in a wiki - themes + customizable design - data backend for backups maybe i should go for a ruby on rails wiki, to get a chance to jump on that train? i'd like to do stuff in the scheme webserver, but that creates more dependencies.. maybe check on planet for a wiki? checked with the boss: she's the only one updating, so i can go on experimenting. was thinking about using scribble instead of xml.. 
Entry: 2 servers Date: Sun Mar 30 10:32:51 EDT 2008 going to try something with 2 servers * giebrok: has sweb running * kurk: has data what i need is a way for giebrok to get data from kurk, but with modification time so it can rebuild its caches appropriately. i tried nfs readonly but i can't get that to work (mount gives permission denied). http objects have mtime, so i need a way to access that from scheme. maybe it's easier to just roll a quick file server that gives mtime data. ok.. got something working. but do i really need tcp servers? probably easier to use ssh + a single script that reads arguments from the SSH_ORIGINAL_COMMAND. what solves what problem? ssh: authentication + security program: data I/O it's best to have a daemon to eliminate startup time. unix socket daemon with socat? Entry: server channels Date: Sun Mar 30 14:16:24 EDT 2008 it's best to go back to point-to-point message qeueus. those have the simplest properties to integrate. all the nuts and bolts can be solved on system level (ssh/socat). send binary data as efficient as possible. this means to prefix it with a header. i've had enough trouble with quoting/unquoting strings between different lisps now. cons atom a atom b cons atom a cons atom b null cons binary 5 xxxxx null made the simple server thing, split into 3 parts: * tcp server * function -> interpreter * path access probably the path access is better with a guard. currently, it's a bit limited. Entry: browser events Date: Mon Jun 2 14:20:14 CEST 2008 an annoying problem is two way synchronization of generated documents and a browser: * browser should be notified whenever a source document changes so that it can refresh the doc. * whenver a browser refreshes / requests a doc, the server checks if the cache (compilation) is still valid. i'd like to solve this problem for serving Scribble documentation. it can probably be solved with a bit of javascript. the main question is how to move events from server -> client? something i never understood properly... also: ramblings needs rss support. Entry: taking out stream.ss Date: Sun Aug 10 19:36:58 CEST 2008 The stream lib seems to complicated. It's only used in parsing the ramblings files into a stream, while a list would do just fine. The lazy part is implemented using an explicit delay around the article parser anyway. Going to try a local fork that simply deletes the stream.ss files and works up from there. Deleted split.ss Then moved from there. Most adaptation was in parse-ramblings.ss but quite straightforward to solve. Entry: images and attachments Date: Sun Aug 10 22:40:53 CEST 2008 I'd like to add images and other attachments. The main problem is where to put them. Currently, all is contained in a single ramblings file, which is quite convenient. All ramblings files are part of a darcs archive, maybe they should use just relative directories? Relative is good. Part of darcs archive is too. Entry: blogging software.. Date: Sun Aug 10 23:01:17 CEST 2008 ai ai ai.. she's not happy with Wordpress any more! So.. What is necessary? * login * posts + reply * spam filtering * rss I'd like to get rid of MySQL and PHP too.. 
So an alternatives using sqlite: platform python+django http://en.wikipedia.org/wiki/Byteflow perl http://en.wikipedia.org/wiki/Movable_Type python/zope http://en.wikipedia.org/wiki/Plone_%28software%29 python/django http://en.wikipedia.org/wiki/PyLucid ruby/rails http://en.wikipedia.org/wiki/Radiant_%28software%29 ruby/rails http://en.wikipedia.org/wiki/Typo_%28software%29 ntry: apache logs Date: Mon Aug 11 08:48:37 CEST 2008 Apparently the standard debian install has a 'read' able apache log. Wrote some stuff on top of it, to get unique ips and user agents. It is rather slow.. Maybe find a better way to represent it? So, what to do with it? Add some filters, like bots. I'm not really interested in those. Let's just grep for 'bot' first. Entry: forms + login Date: Wed Aug 13 20:09:47 CEST 2008 Let's try to figure out how to do this. Login using some special cookie (no passwords) and edit a site's content. Entry: instaweb Date: Wed Aug 13 20:38:47 CEST 2008 Maybe switch to instaweb, since I'm only interested in running a single servlet. On the other hand, it might not be necessary. What does it bring? Simplified configuration + 'root' servlets. It's easy to add next time I add something: #lang scheme/base (require (planet schematics/instaweb/instaweb)) (instaweb #:servlet-path "servlets/ramblings" with some minor tweaks in the url handling code. EDIT: this won't work. the approach i take is to make a single webserver for all networked applications, and run it behind apache to expose only public servlets. Entry: disorganized Date: Sun Aug 17 15:45:12 CEST 2008 I'd like to make some 'stuff' available through my intranet. What this 'stuff' is mostly about location of files and terminals, and trying to organize it all a bit better. Entry: Electric project pages Date: Sun Aug 17 16:00:26 CEST 2008 For example, the Staapl homepage could use some parametricity. What I need is a mechanism to format pages from darcs projects. Let's call it prj-www, and let's use a Scribble based frontend. It's quite straightforward. However, a non-interpreted language using only scheme/reader is still tedious: funcitonality needs to be provided through an interpreter. Let's use one of the scribble languages that expands to 'doc. The approach I'm trying now is 'executable xhtml' : the servlet frontend will load the module into a namespace and strip all the scribble markings, leaving just the s-expressions generated by the code. This leaves the responsability to generate proper xhtml at the servlet side. It's a bit raw ``PHP style'', but somehow makes sense for quick & dirty apps. Ok. Formatting one document (staapl homepage) it's pretty clear that this is a bit too much hassle. Better switch to the structure imposed by scribble, and render it. Entry: continuations Date: Tue Aug 19 11:22:25 CEST 2008 Ok, it seems quite straightforward. Leaving the continuation management at the default, here's an example: #lang scheme/base ;; -*- scheme -*- (require scheme/pretty web-server/servlet web-server/servlet/web) ;; plt servlet interface (provide interface-version timeout start) (define interface-version 'v1) (define timeout +inf.0) (define (start req) (for ((i (in-naturals))) (send/suspend (page i)))) (define ((page i) url) (printf "~a\n" url) `(xhtml () ,(format "Page ~a. " i) (a ((href ,url)) "[next]"))) So, when to use continuations, and when to use objects? I.e. a shopping cart is an object: it should collect items from different parallel threads. 
Entry: web server tutorial Date: Wed Aug 20 19:48:51 CEST 2008 http://docs.plt-scheme.org/continue/index.html Entry: databases Date: Fri Aug 29 22:09:20 CEST 2008 I think I'm simplifying things too much.. A significant service a database provides is a guarantee of consistency: A database transaction, by definition, must be atomic, consistent, isolated and durable. These properties of database transactions are often referred to by the acronym ACID. http://en.wikipedia.org/wiki/Database_transaction In relation to persistence, this is not trivial. Entry: snooze Date: Tue Sep 2 19:56:09 CEST 2008 http://planet.plt-scheme.org/display.ss?package=snooze.plt&owner=untyped http://planet.plt-scheme.org/package-source/untyped/snooze.plt/2/1/planet-docs/snooze/index.html tom@sornfit:~/sweb$ mzscheme lib/db.ss date-test.ss:67:43: compile: unbound variable in module in: make-srfi:date setup-plt: error: during making for /untyped/unlib.plt/2/5 (unlib) setup-plt: date-test.ss:67:43: compile: unbound variable in module in: make-srfi:date Entry: OpenDocument Date: Fri Sep 5 15:57:31 CEST 2008 Trying to typeset Marycela's book. Making some routines to process .fodt XML files. Load and save work. This could save as a nice frontend for Melissa's writer website. Instead of asking people to create HTML files, maybe they should just stick with wordprocessors? This looks like it's doable, if the style used is a bit restricted. Marycela uses a lot of `space and tab' formatting, which is difficult to recover properly. Let's try to artificially create some documents and see how they look in XML. Hmm.. Coming from the structured documents side, it seems odd that text is not structured. Text is a sequence of paragraphs, where some paragraphs might be tagged with a "Heading" style. This is then collected to make table-of-contents etc.. .odt is a JAR (ZIP) file. Apparently the flat XML is not the standard one. And paragraphs and headers are separated: 'text:p and 'text:h tags. ZIP is supported in PLT Scheme through dherman/zip http://planet.plt-scheme.org/display.ss?package=zip.plt&owner=dherman Works like a charm: (define (load-odt filename) (unzip-entry filename (read-zip-directory filename) #"content.xml" (lambda (name dir port) (xml->xexpr (document-element (read-xml port)))))) Entry: guessterpreter Date: Mon Sep 8 01:04:26 CEST 2008 I'm done hacking my way through the xml file to convert it to something with minimal markup. However, current structure allows only for one tag / style, while a style should be modeled as an expression transformer. OK. Works. Next step: figure out how to retreive indentation information + add a parsing step to recover stanzas. 
Entry: 4.1.4 Date: Mon Mar 9 12:36:53 CET 2009 Startup breaks again with update to 4.1.4 This time it's a strange error procedure application: expected procedure, given: #f; arguments were: "/servlets/ramblings" #f #f === context === /usr/local/plt/collects/scheme/private/contract-arrow.ss:1347:3 /usr/local/plt/collects/scheme/private/contract-arrow.ss:1347:3 /usr/local/plt/collects/scheme/private/contract-arrow.ss:1347:3 /usr/local/plt/collects/scheme/private/contract-arrow.ss:1347:3 /usr/local/plt/collects/web-server/dispatchers/dispatch-passwords.ss:33:2 /usr/local/plt/collects/scheme/private/contract-arrow.ss:1347:3 /usr/local/plt/collects/scheme/private/contract-arrow.ss:1347:3 /usr/local/plt/collects/scheme/private/more-scheme.ss:175:6: loop /usr/local/plt/collects/scheme/private/contract-arrow.ss:1347:3 /usr/local/plt/collects/scheme/private/contract-arrow.ss:1347:3 /usr/local/plt/collects/scheme/private/more-scheme.ss:175:6: loop /usr/local/plt/collects/scheme/private/more-scheme.ss:175:6: loop /usr/local/plt/collects/scheme/private/contract-arrow.ss:1347:3 /usr/local/plt/collects/scheme/private/contract-arrow.ss:1347:3 /usr/local/plt/collects/scheme/private/contract-arrow.ss:1347:3 /usr/local/plt/collects/web-server/private/dispatch-server-unit.ss:62:2: connection-loop Maybe it's time to move to a single-servlet server? I don't understand the red tape, and I currently don't need anything fancy.. Let's try to build it anew from the documentation. OK. got something that works: start.ss: #lang scheme (require web-server/servlet web-server/servlet-env (rename-in "servlets/ramblings" (start servlet-start))) (serve/servlet servlet-start #:port 8181 #:command-line? #t #:servlet-path "/servlets/ramblings" #:servlet-regexp (regexp "^/servlets/ramblings") ) Entry: aggregation Date: Mon Mar 30 14:45:41 CEST 2009 I'd like to add some aggregation mechanism for navigating all different log entries to track my time usage. Having all this time available again after working out of the house on a full time basis leads to chaos. Most of my projects are linked in some way and insights in one lead to changes in others.. I can't keep track any more. Entry: parsing Date: Tue Mar 31 17:40:16 CEST 2009 I think I'm just going to remove the lex/yacc parser. Anything I want to do with syntax in ramblings files is going to be completely ad-hoc: datamining based on existing text files instead of any kind of sane structure. So let's just use a direct tokenizer. What the current lexer produces is a list of strings and xhtml elements in xexpr form. OK: replaced with whitespace/workd tokenizer state machine + individual word matcher. Entry: implementing aggregation Date: Wed Apr 1 11:17:52 CEST 2009 problems: - update: currently when file changes it gets recomputed - mutation is used to index the list of articles. solutions: - separate indexing (the node datastructure) from a pure sequential list of articles OK.. Using the straightforward approach creates problems. I'm not sure which evaluations get triggered again and again.. TODO: eliminate whole-datastructure traversals. Entry: links Date: Thu Apr 2 14:44:55 CEST 2009 Things like [1] ... [1] http:// Now that I have the body parser in a separate file this should be straightforward to implement. But, how to distinguish declaration and reference? An FIR style filter would be nice: one that slides a function over a window of the stream and builds a new stream. (define (ll-fir ll fn) ...) 
Got reference registration + evaluation working: now figure out how to bind urls. Entry: databases Date: Sun Apr 12 21:37:47 CEST 2009 So, what's the conclusion for the ramblings? It costs too much. It's probably best to separate data storage and website logic after all: the data isn't so dynamic, and the site is quite slow when updating from the large text files. Suggested split: * data cached in a database (SQLite probably, with untyped/snooze) * offline compiler to convert text -> database. Entry: what to use a db for? Date: Thu Apr 16 16:10:07 CEST 2009 This might seem a stupid question, but I'm thinking about using a database as a data cache only. Is this a good idea? Only if data flows unidirectional from a different store format (ramblings.txt) to the db, and never in the other direction (i.e. post comments). I'm biased towards text files: "source code" for data. I don't have a good feel for dbs due to lack of experience. How to design tables? In snooze [1] [2], is it ok to sometimes add tables or should we use a different table for things that might associated to objects? Let's stick to the real problems: * ramblings should be sped-up. reload is too slow with huge text files. * I'd like to add some kind of comment posts facility to the ramblings posts. [1] http://planet.plt-scheme.org/display.ss?package=snooze.plt&owner=untyped [2] http://planet.plt-scheme.org/package-source/untyped/snooze.plt/2/6/planet-docs/snooze/index.html Entry: indexing ramblings files Date: Thu Apr 16 16:16:50 CEST 2009 The ramblings files don't tend to change much. Let's invent an indexing format which stores post offsets. Entry: back to multiple servlets Date: Tue May 5 09:52:56 CEST 2009 Time to get some more functionality behind sweb. Basicly I'd like to turn it into an OS for all kinds of network applications with redundancy. Now, I did make an apache log analyzer before. Where dit it go? Ha! it's right here in lib/apache.ss Entry: Data structure of the day: reference counter. Date: Tue May 5 13:15:41 CEST 2009 This macro builds a data structure for reference counting and histogram building. (define-struct entry-ref (object refs) #:mutable) (define (add-ref! hash object entry) (let ((er (hash-ref hash object (lambda () (let ((er (make-entry-ref object '()))) (hash-set! hash object er) er))))) (set-entry-ref-refs! er (cons entry (entry-ref-refs er))) er)) (define-syntax (define-ref-struct stx) (define (fmt fmt-string . a) (datum->syntax (car a) (string->symbol (apply format fmt-string (map syntax->datum a))))) (syntax-case stx () ((_ name (fieldname ...)) (let ((fieldnames (syntax->list #'(fieldname ...)))) (syntax-case (list* (fmt "make-~s" #'name) (fmt "make-rc-~s" #'name) (for/list ((f fieldnames)) (list (fmt "set-~s-~s!" #'name f) (fmt "~s-ref" f)))) () ((make-instance make-rc-instance (set-field! field-param) ...) #`(begin (define-struct name (fieldname ...) #:mutable) (define field-param (make-parameter (make-hash))) ... (define (make-rc-instance fieldname ...) (let ((instance (make-instance fieldname ...))) (set-field! instance (add-ref! (field-param) fieldname instance)) ... instance))))))))) So I removed this completely (the names complicate it greatly) and created this function to work on tables instead: (define (table-share table) ;; Register the object and put a shared instance in the vector. (define ((register! vec) hash object i) (vector-set! vec i (reflist-object (add-ref! 
hash object vec)))) (let* ((n (vector-length (car table))) (hashes (build-vector n (lambda _ (make-hash))))) (values (for/list ((entry table)) (let ((v (make-vector n))) (for ((column entry) (hash hashes) (i (in-naturals))) ((register! v) hash column i)) v)) hashes))) Now I've updated this to generated a memoized expression by wrapping it in a 'let expression that produces the table with shared data when 'eval ed. This might come in handy somewhere else.. (aka memoization aka common subexpression elimination). box> (table->let (list (vector "foo" "a") (vector "foo" "b") (vector "foo" "c"))) (let ((0:0 '"foo") (1:2 '"c") (1:0 '"a") (1:1 '"b")) (list (vector 0:0 1:0) (vector 0:0 1:1) (vector 0:0 1:2))) Entry: log analyzer Date: Tue May 5 17:55:16 CEST 2009 main features: - bot filter - cross-referencing unique identifiers i do wonder if this is better than putting everything in a database. let's see if the cross-referencer can be abstracted better without so much mutable state. What is it? Converts a table of object relations by wrapping each object in a table entry in a reference list that links it to other table entries in which it occurs. So, this is better abstracted as a map: table -> table w. shared data set of hashes in fact, the hashes are a bit of a side-effect of introducing sharing. Entry: servlets Date: Wed May 6 12:23:29 CEST 2009 Goal: - run all scheme web apps in a single server (on giebrok + redundant on zwizwa and zzz) - solve data dependencies of dotp: data is on kurk only. Entry: databases Date: Wed May 6 16:25:18 CEST 2009 2 things to do: * apache logs * ramblings Let's try apache logs first since they can be easily separated. Now installing untyped/snooze. Let's have a look at its dependencies first: cobbe/contract-utils jaymccarthy/sqlite access to sqlite databases ryanc/require library indirection schematics/sake building automation schematics/schemeunit unit testing soegaard/galore functional data structures untyped/unlib misc untyped cce/scheme scheme programming utilities Entry: faster ramblings parsing Date: Thu May 7 10:51:41 CEST 2009 It might help to work on byte files instead of character files. Ok. This works fine: read the whole file as a byte string, then perform a regexp-split to segment articles, then lines, then parse the attributes and lazily parse the body. Entry: speeding up inner parser Date: Thu May 7 14:24:32 CEST 2009 http://localhost:8181/servlets/ramblings/staapl/20090212-184818 cpu time: 1228 real time: 1588 gc time: 108 cpu time: 2056 real time: 2455 gc time: 52 Now that's simply ridiculous. Let's see.. I think I need to have a proper look at the regexp syntax instead of writing all these adhoc tokenizers. Ok. All representation uses bytes now + _compiled_ regular expressions. This is what i get: cpu time: 19 real time: 20 gc time: 0 cpu time: 12 real time: 12 gc time: 0 A little better ;) Using strings instead of bytes makes it not more than twice as expensive. Entry: index Date: Thu May 7 18:03:52 CEST 2009 Now.. what about indexing the files? Ok. I've added machinery to perform registration of links and words. Care needs to be taken however to properly transport them across promises, since they are parameters. Then, I wanted to force articles in a separate thread to build an index while booting up. But apparently there is no built-in way to synchronize on a promise being forced.. Entry: composing regexps Date: Thu May 7 19:45:12 CEST 2009 It has to be possible to use some kind of abstraction to compose regular expressions. 
Let's have a look at the reference manual again.. TEST: (foo) (http://foo) <- parens are valid in http:// urls 'http://foo' "http://foo" {http://foo} http://zwizwa.be Entry: forcing + sync Date: Fri May 8 15:33:55 CEST 2009 How to make sure that an expression that needs an update won't get updated twice by two different threads? Let's do this for force-dynamic instead of force. A node can be in these states: * not accessed * cache check * cache update It's probably simplest to use a semaphore for this. syn Entry: viewing scribble docs Date: Sat May 16 01:02:12 CEST 2009 one of the things i keep finding annoying is to have to press refresh in a browser window. how to fix? i'm not versed in this xmlrpc stuff, but is it possible for a server to send a message to the client without it polling for new data? probably. Entry: scribble Date: Tue Jun 2 15:37:14 CEST 2009 Maybe the blog parser should be able to parse scribble docs? They are fairly readable as text in case a parser is not available.. Given a string representing a module, how do you evaluate it into xhtml that can be straigth embedded? Entry: library Date: Fri Jun 12 10:24:37 CEST 2009 The problem: my growing collection of locally cached electronics papers and books is getting quite large. I'd like to construct an interface for it + solve the problem of making sure it is available everywhere. Rationale for locally cached library: * Not all content is available on the web. * I'm not always online. * The total size is managable. Rationale for specific storage structure: * Data is read-only * The only operations are add and delete. * Want to avoid (exact) duplicates. * Not all machines are always on. (reason for distributed system) Ideally I'd like this to be a reference pool so it's easier to add references to papers. This is something that can grow however. It's best to get meta-data automatically from the web, and focus on caching the data, and linking the metadata. Some problems: * A web interface * Meta data format? * Organization + Search? Practical problems and solutions: * Only single files are indexed. This works best for ps.gz, pdf and djvu. Multiple files, use tar.bz2 archive + figure out how to unpack this in the viewer. For storage, maybe check here[1]. I'd like to move to an implementation where each file is indexed as an MD5 file. This would make it possible to tap into MD5 content hash databases. [1] http://en.wikipedia.org/wiki/Content-addressable_storage Entry: Datasheets Date: Fri Jun 12 12:58:41 CEST 2009 Testing my new library referencing in sweb[1]. The PIC18F252 datasheet[2]. [1] http://zwizwa.be/darcs/sweb [2] lib://85708a5accb49b829264d8556bcc853e Entry: library update Date: Fri Jun 12 14:31:45 CEST 2009 what works now: - "library" package with some scripts to modify and view the store - papers/md5.txt for description of md5-addressed content - an "About" button that googles for the ID of a ramblings post todo: an easy way to view a file in the local library cache. since this is actually allowing the execution of a command on the local machine it needs to be done with a bit of care.. the simplest way is to define a file type that can be passed to firefox. or a protocol handler. i can't get those to work though.. i don't understand firefox: it's gotten quite closed down over the years. Ok. Editing the mimetypes.rdf file in my firefox config tree, disabling the system defaults and telling it to ask made it work. Apparently this is a bug[1]. 
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=428658 Entry: multiple tags Date: Sat Jun 13 13:39:04 CEST 2009 It might be best to allow both the standard 8-digit date based indexing, and any of ISBN or MD5 hash indexing. Because of different NB digits these won't clash. Entry: Ramblings format features Date: Sat Jun 13 13:44:05 CEST 2009 - WYSIWYG: stick to fixed-width text which allows ad-hoc formatting (useful talking about code) and is easy to use in emacs. - Add a reference mechanism to allow URIs to be tucked away at the bottom of a post, not disturbing the flow of text. However, they are still visible instead of hidden as in HTML. There is too much information in the visual representation of a URI to not show it. - A ramblings file is a sorted list of entries. "^Entry:" is the split tag, and it is not allowed to be part of a post text. There is no escape mechanism, so you can't quote the description of an entry within the body of an entry. The header of an entry is a collection of "^: $" pairs. These have some meaning in the parser and mainly direct indexing. - Ad-hoc body parsing. There is room for extension, but the main idea is that standard URIs take precedence. The '{' and '}' tokens are still free for hierarchical grouping, but they are currently not used. The unit of parsing is the word. This means that URIs with spaces are not supported (replace them with %20). For extensions, it's probably best to use reserved words that don't clash with english or any code that would be present in a post body. Entry: LaTeX Date: Sat Jun 13 16:40:04 CEST 2009 I'd like to revive my math blog, but in a way that preserves the LaTeX layout. I.e. to have the convenience of the ramblings format, but with latex formatting. Let's see about some math blogs on the web. One is the unapologetic mathematician[1]. I'd like to do it in the following way: have a default www output, but allow for the destillation of a PDF. Hmm.. I tried several applications (html2tex, tth, tex2ht, latex2rtf) but none of them give proper output and they are slow. I'm thinking it's best to go back to the basics: latex + dvipng. (dvipng is a separate debian package). dvipng foo.dvi -o foo.png -T bbox Conclusion: with a bit of massaging this could be made into a simple way to display formatted tex on a page. Basicly, what you want is a single long page, and slice it into segments so browsers will display it properly. Then put the .tex source also on the page so it can be indexed properly. [1] http://unapologetic.wordpress.com/ Entry: Image test Date: Sat Jun 13 20:35:42 CEST 2009 Inline images now work. img://mathworld.wolfram.com/images/gifs/Rule110Big.jpg Next: how to display the .tex stuff? It would be nice to do this as a service, but this will lead into url length limits. So.. Maybe try to turn the latex|dvipng command into a script and integrate it with sweb for inline image generation. Let's split it in two: allow to extract a post as a plain file, then create a transformer for such extranctions (restricting all conversions to locally defined content to prevent use/abuse). Ok.. I didn't use the ? thingy in the url yet.. The subdirectory structure is nice for giving context to the requests, but per post there can be different operations tagged to this ? construct.. Let's have a look at the scheme docs. Ok. got queries as: (require web-server/http/request-structs) (require net/url-structs) (url-query (request-uri req)))) Next: capture dependencies + store intermediates.. 
The question is probably, do we store the .png in the memory image, or leave it on disk? In general, the problem is a disk-caching strategy: it's ok to have a live program manage dependencies between data, but the data itself can be stored externally. The same could be done for the parsed form of the ramblings file. Instead of keeping the parsed intermediates in memory, place it in disk storage. Also, it would be interesting to be able to split the problem in two: external programs take data from memory, and place them back into memory, but the actual location of the files could be made hidden. So. it's not just a cache, it's also a database of (mime-tagged) binary data. In short: * write a transparent storage mechanism that uses scheme objects on one side, and a directory+file structure on the other side. * this database should be accessible over http (without sweb running) meaning it needs a fully consistent on-disk representation for forced data, and a simple way to display promises. Entry: juggling binary objects -> caching continuations Date: Mon Jun 15 12:18:59 CEST 2009 Using the latex + dvipng the problem isn't really computation time. Rendering is fast. So why cache it? What about this: create an abstraction that map the operation of creating a directory with files onto the creation of a list of objects. I.e. a hash table. Then, in sweb when these files get transferred to the client, they can be garbage-collected. It would even be better if the .dvi could be cached in memory (since it's not that large), but the dvi2png can extract individual pages. Actually, this completely solves the problem. The hairy part is the fact that a .dvi or .tex represents a _collection_ of pages, but http requests are always about individual items.. So in modeling data structure, you need to think about dependencies and then apply memoization there. Simply put: dvipng can use indexed addressing easily. Then the 2nd problem: doing it this way the filesystem storage used as a scratchpad during the execution of a program can be abstracted completely. This is what makes things a whole lot less messy. Conclusion: * HTTP requests are about objects. This should be reflected in the in-memory model. * Some documents have a _logical_ hierarchical structure. I.e. html + embedded images. This can be reflected in the in-memory model using dependencies. * Intermediates can produce multiple objects which are requested asynchronously through http. The problem is the asynchronous nature of http. Because it has no concept of containment (which sucks if you ask me) this containement needs to be modeled elsewhere. Because of the production of multiple objects, some memoization is a good idea. As long as the memoized data isn't too large (in this case .dvi files are only slightly larger than .tex files) it can be kept around in memory. If not, some disk caching strategy might be necessary. The real insidiuous problem here is that you can't really garbage-collect anything: the client might request sub-documents, or might not. This is the central problem in server-side continuation management for instance.. You really want to transfer all this information to the client. Hey.. Can this be done for intermediate data also? I.e. instead of keeping the memoized .dvi around, can't we just dump the .dvi in its entirety to the client, then ask the client to give us the .dvi it wants rendered? It's the same thing as you'd want to do with continuation storage. 
The problem is really that continuations themselves tend to be large, and passing them back and forth between client and server is not a good idea.. So, caching is in order, and that is where the problems start. So essentially to solve the web continuation problem you just need a caching approach that works. That's all. But definitely not trivial as it's hard to define what a good caching strategy is.. [1] entry://../compsci/20090615-131905 Entry: calling latex and dvipng Date: Mon Jun 15 13:47:07 CEST 2009 Problem: given binary data represented in-image, call a script that processes it (possibly using local filesystem cache) and transfer the resulting file(s) back into image. Core idea: file systems do not support garbage collection (by design, since references to files cannot be tracked due to symbolic representation / lack of type information) so they need to be abstracted as local state[1]. So what's the essential abstraction? A 1-1 map between scheme objects and filesystem object. Whenever the object disappears, the filesystem object will be deleted. In PLT Scheme this is handled by finalizers[2][3]. > (require scheme/foreign) > (define (finalize it) (printf "finalizing: ~a\n" it)) > (register-finalizer (cons 1 2) finalize) > (collect-garbage) finalizing: (1 . 2) NEXT: find an api to make this easy. I.e. define a standard file system interface and some scheme functions to create and reference objects. [1] entry://../compsci/20090615-134849 [2] http://groups.google.com/group/plt-scheme/browse_thread/thread/9ae6c5a6c331431b [3] http://www.cs.brown.edu/pipermail/plt-scheme/2005-May/008898.html [4] http://download.plt-scheme.org/doc/4.1.5/html/foreign/foreign_pointer-funcs.html#(def._((lib._scribblings/foreign/unsafe-foreign..ss)._register-finalizer)) Entry: Flat view of ramblings Date: Mon Jun 15 16:03:26 CEST 2009 Shouldn't be too difficult to implement. Add a virtual "aggregate" topic which can 1. access all posts by simply searching them. 2. create a table of contents, sorted by key. Ok. 1. is implemented. 2. shouldn't be so difficult but is for another time. Now I'm wondering, maybe this should really be the default? Get rid of sections in the data representation, but use them only for the index + in-topic navigation. The constraint (make sure keys are unique) doesn't seem too restrictive. In-topic navigation could be implemented differently. Since this is just a filtered list an interpretation should be simpler. This however requires to move from a graph to a tree structure: posts no longer have a unique previous/next post... But.. From a human p.o.v. keeping sections is a good idea.. If not only to separate the sane from the unsane, the edited from the raw. One interesting property is that posts can move without breaking outside references, by writing aggregate indexing as a http redirect. Entry: tex -> png Date: Tue Jun 16 14:03:56 CEST 2009 Implemented. Currently images get rendered on demand. It's probably better to render them all at once so a wrapper xhtml file can be generated (it needs nb of pages). Entry: pdf Date: Wed Jun 17 11:21:43 CEST 2009 So.. now that -> png works for quick display, let's add a PDF button too, so google can index it. Entry: Making web links explicit Date: Wed Jun 17 11:27:54 CEST 2009 The problem: images and other embedded objects are handled separately by the server. Is there a way to somehow unify this view? I.e. construct an object that can be traversed by the server to pick out individual objects? 
The idea is that the server's addressing mechanism is of no concern to the document/node compiler. So: * document compiler buils graph structure * server queries graph structure. Server needs to provide addressing information to the graph compiler so proper addresses can be embedded in the documents. Entry: Hash Tables Of Closures Date: Wed Jun 17 11:50:38 CEST 2009 I'm trying to make explicit the design used in sweb. The document structure is a graph, where each node (accessed by a symbol) represents either a server response object or something else. It is a combination of the following principles: * late binding * lazyness + caching (= "non-functional" lazyness) * protype-based programming This is a common[1][2] pattern in web programming. Message sending happens by abstracting both nodes and messages as objects (hash tables). [1] http://www.paulgraham.com/noop.html [2] http://lispy.wordpress.com/2007/07/09/closures-hash-tables-as-much-oop-as-youll-ever-need/ Entry: Latex Rendering Date: Wed Jun 17 16:31:45 CEST 2009 Type: tex \LaTeX $ $ rendering is now impelemented, including references which are fished out of the comments. %[1] http://zwizwa.be/darcs/sweb Entry: Latex Multiple Pages Date: Thu Jun 18 17:03:48 CEST 2009 It might be handy to split a latex post into multiple (html) pages: * there is no easy way around formatting quirks when concatenating images top to bottom. * 2-column display might be feasible. * .tex file needs a single namespace, so splitting it into different posts doesn't work well. * for large papers, loading all the images is wasteful Entry: the md5:// links in firefox Date: Sat Jun 20 10:50:23 CEST 2009 Currently it's necessary to edit mimetypes.rdf manually. ... In addition you need to add it in the "about:config" panel. network.protocol-handler.external.md5 = true network.protocol-handler.expose.md5 = true network.protocol-handler.app.md5 = ... Doesn't work.. What a piece of opaque crap. Ok I got it to go again but in a frustrated manner so I don't know exactly what happened, but at this moment the configuration above, with alwaysAsk set to true did work after I made sure that the script was executable. I don't know if the about:config stuff is still necessary. It's filled in my current config, but didn't have any effect when I filled it in. [1] https://bugzilla.mozilla.org/show_bug.cgi?id=428658 Entry: Split .txt files Date: Sat Jun 20 15:58:14 CEST 2009 Make it possible to gather a single ramblings section from multiple text files, to separate "constant" from "editable" data. Currently it takes too long to update large files (around 5000 lines it becomes problematic on my 1.8Ghz Pentium-M). Entry: auto-discover ramblings Date: Sat Jun 20 16:00:59 CEST 2009 Make ramblings files auto-discoverable: take the first line of the file as the description and re-read the directory on each request. Funny how this seems all quite trivial, but is really about learning how to manage events. Entry: separate indexing from storage Date: Wed Jun 24 15:14:15 CEST 2009 Before it's possible to create a single store for all the posts, it's probably better to separate the indexing from the post store. post -> section -> index -> next/prev Solved with a "toc" object that can perform tag indexing + relative offset. Entry: removed random link generator Date: Wed Jun 24 16:12:51 CEST 2009 ;; Add a random page generator (dynamic node) (n! 
page-links 'random (dynamic (list-ref-random node-list))) (define (list-ref-random lst) (list-ref lst (inexact->exact (floor (* (random) (length lst)))))) Entry: moving toward a single pool Date: Thu Jun 25 10:28:14 CEST 2009 What's necessary is to separate indexing from posts. However, in order to be able to navigate from one post to another using any indexing mechanism, the posts need a link to the index. First: replace nodelist->grap with nodelist->toc Next: separate out index building. done. Entry: atexit Date: Sat Jun 27 10:42:50 CEST 2009 http://list.cs.brown.edu/pipermail/plt-scheme/2009-March/031530.html looks like there is no simple solution, so it might be best to just stick with the current one. however, running a script in tail positiion in atext() could be the simplest solution. Entry: apachelogs Date: Fri Jul 3 09:26:12 CEST 2009 65.55.106.157 - - [12/Apr/2009:00:47:02 +0200] "GET /shop_content.php?coID=47&XTCsid=v6ldqlfhml4a9vv90igsmbj6sis9n77u HTTP/1.0" 404 16 "-" "msnbot/2.0b" The current parser chokes on this input, and I don't understand why. Probably best to use regular expressions. Ok, that seems to work fine. I've created 2 parsers, for combined + common log formats. parsing takes _way_ too long.. Ok, i tried to convert it to scheme reader format and it's even worse. I guess that's why one would use databases: the data is parsed. So, how does one go about managing such datasets? One thing is that these logs are very redundant. Simply indexing every field using 32bit integers would already do wonders. This would allow the graph structure to be represented separately from the node data. What am I interested in? - date - ip - referrer - agent (for bot filtering) - request Entry: logfile to graph Date: Sun Jul 5 11:27:50 CEST 2009 I wrote some code for this before. Let's have a look at it. It's in refs.ss Moving it to plt.ss Entry: apache logs Date: Thu Jul 9 12:24:37 CEST 2009 So.. This is the first thing I've dealt with in a long time where performance really matters. Reading the database from disk in text format, either directly from the Apache logfile syntax or preprocessed into scheme syntax is really too slow. Some other database storage mechanism needs to be used. Now.. Can standard methods be used? Using an sql database with proper shared data might really be enough. The basic idea is to have the graph structure represented in a form that is easily accessed. In the end, it is nothing but a bunch of integers. So I wonder: does using standard methods have _any_ advantage here? One would be external tools could use the database. Let's get the numbers right first. How many table entries for a year's worth of logs? 291 logfiles 1065149 entries 206M uncompressed 17.7M bzip2, one file 25.6M gzipped, one file / multiple files What strikes me is that zcat is very fast reading the files. tom@zni:/opt/apache-logs/access$ time bash -c 'zcat * | wc -l' 1065149 real 0m1.147s user 0m1.072s sys 0m0.072s Why is parsing in the current implementation so incredibly slow? I guess I'm seriously underestimating regex complexity. Anyways, what's needed is: bits date 32 ip 32 request 32 referrer 32 agent 32 20 bytes per row that's about 20MB of indexed data = quite close to the compressed sizes! Ok. Regex. Maybe the main problem is that my expressions are not anchored? No.. I feed it lines. Hmm.. Let's start encoding it into something that loads fast. It looks like a specific structure for representing the logfile is going to be way more interesting than a dumb regexp based approach. 
Entry: git for deployment Date: Thu Jul 9 19:54:33 CEST 2009 I've been using darcs for a while now, and I like it. I don't use many advanced features, but it's mostly invisible, except for not intelligently merging my ramblings files (for which it is not really to blame). But.. It's slow. I'd like to give git a try, but only for deploying the zwizwa website, to see if fast rollback is possible: if it doesn't work, roll back immediately and possibly automatically. Entry: dotp Date: Thu Jul 9 22:30:46 CEST 2009 Re-included dotp logic in sweb so i don't need two instances running. Entry: regexp matching Date: Sun Jul 12 10:02:45 CEST 2009 Changing the regexps so all matches are minimal matches dramatically sped up the search. Entry: Databases Date: Wed Jul 15 09:55:57 CEST 2009 I don't understand databases[1]: In the database world, developers are sometimes tempted to bypass the RDBMS, for example by storing everything in one big table with two columns labeled key and value. While this allows the developer to break out from the rigid structure imposed by a relational database, it loses out on all the benefits, since all of the work that could be done efficiently by the RDBMS is forced onto the application instead. Queries become much more convoluted, the indexes and query optimizer can no longer work effectively, and data validity constraints are not enforced. This is exactly what I would do. Some terminology[2]: - a table represents a relation (subset of a product space) - a functional dependency A -> B means that given parameter A, there is a unique B (the relation is a function of A). - a candidate key is a minimal superkey. a superkey is a collection of attributes that uniquely determines a row (there is a functional dependency key -> row). About performance[3]: In [4] it is mentioned that, first, you should start from a normalized[2] design and optimize it; then you can start using indexing for columns that are frequently used as important selection criteria, sort criteria, and/or used in joins. [1] http://en.wikipedia.org/wiki/Inner-platform_effect [2] http://en.wikipedia.org/wiki/Database_normalization [3] http://en.wikipedia.org/wiki/Index_%28database%29 [4] http://www.15seconds.com/Issue/040115.htm [5] http://www.troubleshooters.com/littstip/ltnorm.html Entry: Passwords Date: Wed Jul 15 11:11:09 CEST 2009 How to handle authentication in the PLT webserver? Entry: firefox on unix sockets Date: Thu Jul 16 12:53:46 CEST 2009 http://support.mozilla.com/tiki-view_forum_thread.php?locale=cs&comments_parentId=97358&forumId=1 Yes.. It would make a lot of authentication problems a lot simpler.. Entry: tired of bad quality scans Date: Thu Jul 16 13:50:35 CEST 2009 I want to make a document viewer with custom image postprocessing in a web browser (serving .png files). The only part that's missing right now is random page access to pdf and djvu files. What I wonder is how pdfs that encapsulate bitmaps can be convinced to expose these bitmaps in non-resampled form. All pdf rendering i've seen asks for a DPI setting. Entry: md5 -> isbn index Date: Thu Jul 16 14:31:11 CEST 2009 move all books in the library to an isbn index. better still, use a single protocol handler to reference all kinds of extensions to firefox. i.e. stuff://isbn/123...
these could then be replaced by somewhat helpful links for the external view, and directly link to functionality in the internal view. basic idea is that it's a pain to add protocol handlers or type handlers to firefox, so doing it once for all types is maybe best. Entry: check-syntax Date: Thu Jul 16 17:56:49 CEST 2009 Is it possible atm (i mean: with little work) to convert a bunch of source files to a cross-linked datastructure or html document? Entry: Databases Date: Fri Jul 24 11:51:03 CEST 2009 Persistence. Even though I like the idea of using ad-hoc data structures to represent data, it does seem that SQL-based storage is not going anywhere, so I'm going to spend a couple of hours getting something set up in a standard way. Probably best to start with a simple interface to reduce db ignorance before using the more involved untyped ORM `snooze'. I'm using [1] as a guide. PLT Scheme + SQLite (require (planet "sqlite.ss" ("jaymccarthy" "sqlite.plt")) (planet "sqlite.ss" ("soegaard" "sqlite.plt" 1 0))) Mini SQLite cheatsheet: .help .tables .dump select * from
; Ok... This should be enough to try something with the apache db. It looks like it is a quite straightforward bridge between SQLite's SQL dialect and Scheme. Now I just need to learn SQL. [1] http://scheme.dk/blog/2007/01/introduction-to-web-development-with.html [2] http://www.sqlite.org/sqlite.html Entry: PLaneT Date: Fri Jul 24 12:01:36 CEST 2009 I'd like to figure out how to: - install offline - find dependencies From the sqlite deps, what is sake: build tool (like `make') Entry: Reverse parsing of ramblings files Date: Sun Aug 30 15:24:56 CEST 2009 A lot of code would be simpler to specify lazily if only the ramblings files would have the more recent articles at the top. The application is really only interested in recent entries, with the tails computed on-demand. However, maybe this isn't so important as the initial indexing can be performed quite fast in a strict way: a simple regexp search for "^Entry:" will do. Entry: Removed database files Date: Tue Sep 29 09:27:08 CEST 2009 hunk ./web-root/lib/db.ss 1 -#lang scheme -(require (planet untyped/snooze:2:6) - (planet untyped/snooze:2:6/sqlite3/sqlite3)) - -(define-snooze-interface - (make-snooze (make-database (string->path "/home/tom/test.db")))) ; TODO: arguments...))) - [_$_] -(provide (all-from-out (planet untyped/snooze:2:6)) - (snooze-interface-out) - (all-defined-out)) - - [_$_] rmfile ./web-root/lib/db.ss hunk ./web-root/lib/ramblings-dispatch.ss 58 - (with-input-from-file path - (lambda () (read-line)))) + (with-handlers ((exn:fail:filesystem? false)) + (with-input-from-file path + (lambda () (read-line))))) hunk ./web-root/lib/ramblings-dispatch.ss 68 - (b blurbs)) + (b blurbs) + #:when b) hunk ./web-root/lib/sqlite.ss 1 -#lang scheme/base -(require - (planet "sqlite.ss" ("jaymccarthy" "sqlite.plt")) ;; McCarthy + Welsh - (planet "sqlite.ss" ("soegaard" "sqlite.plt"))) - - -(define db-file "/tmp/test.sqlite") - -(define db #f) - -(define (create-db!) [_$_] - (begin - (with-handlers ((void void)) (delete-file db-file)) - (set! db (open (string->path db-file))))) - -(define (create-table-entries) - (exec/ignore - db - #< ALTER TABLE post ADD date;" - -(define-persistent-struct post [_$_] - ((title type:string) - (date type:time-utc) - (body type:string) - )) - [_$_] -(define (all) - (let-alias ((P post)) - (db (find-all (sql:select #:from P))))) - -(all) rmfile ./web-root/lib/test-db.ss Entry: Ramblings File Representation Date: Tue Sep 29 09:32:44 CEST 2009 Instead of building the index page statically, it might be better to generate it from an enumerator/stream. Ok, aggregation itself isn't so difficult, the problem is with propagating cache refresh signals up: the cache mechanism needs a compose operation. Hmm.. I suspect this is probably more a problem with my representation. It would be nice to figure out a way to detect structural changes (i.e. add/delete article) as opposed to content changes, and then simply updating this structure instead of regenerating it from scratch. What I really want is each article to be a separate entity. Can this be abstracted in that way? Alternatively, content could be kept completely on-disk, with expensive operations (like tex->png) cached on top of that, i.e. based on some hash of the text. Roadmap: abstract ramblings files as objects with the following interface: - header - node-list - node ref (symbol -> key.value hash) where data always comes directly from disk, and internal indexing is regenerated when the file changes. 
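As a sketch of that roadmap (the names are made up here, and parse-ramblings stands in for whatever parser ends up doing the segmentation): a ramblings file abstracted as a closure that re-reads and re-indexes itself when the file's modification time changes.

(define (make-ramblings-file path)
  (let ((stamp  #f)    ;; last seen modification time
        (header #f)
        (index  #f))   ;; symbol -> key/value hash
    (define (refresh!)
      (let ((t (file-or-directory-modify-seconds path)))
        (unless (equal? t stamp)
          (set! stamp t)
          (let-values (((h ix) (parse-ramblings path)))  ;; assumed parser
            (set! header h)
            (set! index ix)))))
    (lambda (msg . args)
      (refresh!)
      (case msg
        ((header)    header)
        ((node-list) (hash-map index (lambda (k v) k)))
        ((node)      (hash-ref index (car args) #f))))))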
Entry: SXML planet library Date: Tue Oct 20 15:27:34 CEST 2009 [1] http://planet.plt-scheme.org/package-source/lizorkin/sxml.plt/1/1/doc.txt Entry: restructuring Date: Thu Dec 24 10:21:08 CET 2009 I'd like to restructure the app as a functional program: i.e. eliminate the graph structure of links, and make this implicit. Most code that actually does something is already structured as memoized/lazy/cached data. Entry: restructuring cont.. Date: Wed Jan 6 08:23:03 CET 2010 This is not simple. Essentially, this requires getting rid of the n! node assignment. Conceptually the design is ok, but it should be made clear which links are static (single assignment), and which are not. The model I'd like to use is still that of a graph, but constructed using lexical variables, so that at least the binding information is known at run-time. Currently it's just a bunch of nodes -- quite unstructured. The feeling I'm trying to obtain is that I can make an isolated object (an article) that can be represented without being linked to the rest. So one has: - raw article - article views (html, rendered .tex) - list of articles (a ramblings.txt file's default order) - index - navigation Entry: more restructuring.. Date: Sat Jan 9 09:03:21 CET 2010 I don't want to kill the current ramblings implementation. I wonder if a full rewrite following a different model isn't a better approach. The graph n@ n! code is embedded quite deeply. The problems ramblings.ss actually solves: - article parsing - article views (html, rendered .tex) - list of articles (a ramblings.txt file's default order) - index - navigation - data management (memoization or dependency management) I also miss a decent emacs interface, and a definition of a single aggregated context object (i.e. text + embedded non-text objects), and a coupling to version control. Also, authentication would be nice. I need an object model and a concrete data store. The store should be simple to update, and hopefully human-readable. Large text files are nice, but not quite manageable. So, what is an article? Essentially an object with different views. Wait.. Can the current representation be converted to single-assignment form? I like the memoization & cache approach, but a static name structure would be simpler. Also, indexing needs to be separated from construction. Maybe that's the place to start first? Add one layer of indirection there? Yes, this is really too complicated. The cross-linked objects model seems to work in practice. However, I can't prevent myself from thinking it's a ball of mud. You can't cleanly separate out a part, but maybe that's the point? This is really for later.. I need a clearly motivated goal to tackle this. Entry: Relational Lenses: Updatable Views Date: Thu Jan 21 15:35:10 CET 2010 [1] http://www.cis.upenn.edu/~bcpierce/papers/dblenses-pods.pdf Entry: sweb and emacs Date: Sat Feb 27 15:10:09 CET 2010 I'd like to tie sweb to emacs. In emacs: find the first occurrence of 'Date:' and pass it to sweb, which will turn it into an ID. This needs: * index by ID only (not section/id). * index by `date` string Entry: Hmm.. not happy Date: Sun Mar 7 10:10:20 CET 2010 This graph business doesn't seem right. Isn't there a way to represent the code such that it is purely functional? Entry: The graph business. Date: Sun Mar 21 12:01:13 CET 2010 I've been whining about it for a while now. Time to get it over with. What is the problem? I'm building a full graph (pages linked together) but I would prefer data to take the form of a tree or directed acyclic graph.
I.e. to structure most of the data management code as a function (a compiler) instead of a database. What does the sweb site actually do? - parses ramblings files - creates indexes - wraps a tex->png/pdf renderer What are the problems? The graph/node implementation is too stateful. Components can't be isolated easily. Solution: get rid of the graph/node implementation. Roadmap: start with ramblings parsing, and propagate the graph/node code dependencies up the module hierarchy. Starting point: parse-ramblings-fast.ss depends on graph.ss Concrete strategy: create a module entry.ss that abstracts all the concrete structure in a ramblings file. This seems to work: I can separate function and data (parsing and representation of content) from UI presentation (web page structure). Entry: Tex rendering Date: Sun Mar 21 14:06:35 CEST 2010 The low-level parsing works. Now let's make the tex parsing into a functional compilation step. Currently the tex formatting uses self-access of attributes, i.e. OO style. A tex file is created from a source file and produces: dvi, png, pdf. There are a few complications: * dvi is an intermediate for both pdf and png * There are multiple png and xhtml files; they need to be represented and cross-linked. * where to put the laziness and caching (recompile) ? Laziness can be embedded in the function. Caching is probably best handled on a higher level. The remaining problem is to mesh external references with internal representations. This is a problem that's best handled generally since I can foresee it popping up elsewhere (I've seen it in the past). So the goal is simple: create a tex->png/pdf renderer independent of the external access methods: LINKER (INDEXER) COMPILER (RENDERER) The compiler needs to be parameterized by an index method. It produces html that refers to other html and png files. Currently this is handled in a woven way: see `node-query->link'. Entry: Linking Date: Sun Mar 21 17:22:58 CEST 2010 So, how do you compose a compiler that constructs a graph, and a link mechanism for that graph? ( Actually, there are 2 problems: external link mechanism + internal storage management ) More concretely. The result is a collection of XHTML files with holes, and a collection of leaf nodes (i.e. png images), where the holes represent link elements. Higher order syntax? Let's unify nodes. Each object has a unique id (both xhtml and opaque nodes are represented as type-tagged binary blobs). The result is then: * repr :: id -> (mime, stream) -- provided by compiler * link :: id -> url -- provided by linker Constraint: the urls need to be permanent (don't break external links) and pretty/meaningful. Currently the tex->html/png compiler has: math/20090619-105812?page=4 math/20090619-105812?png=4 Entry: Stateful crap! Date: Sun Mar 21 19:54:21 CET 2010 Hmm. I can't get it back to work. Something is wrong with the permissions or path or whatever. The latex hangs, probably waiting for input, but when I start it manually all just works fine. There is also nothing in the logs. Ok. it works manually; running ../start as sweb user. Not through runit though: all works fine except latex hangs. How to get at its stdout? It seems the problem is that the output is not read. That's weird. Something wrong with the logger? The runit/log/run file had wrong permissions. Weird.. Who touched it? Entry: Lazy vs. Reactive Date: Tue Mar 23 15:45:20 CET 2010 So.. What do we have? 3 kinds of nodes: - lazy evaluation: eval once when needed - reactive: re-eval when needed (dependency-based) - thunk: re-eval always
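A minimal sketch of those three kinds, not the sweb code, just to pin down the distinction:

(define (thunk-node f)        ;; re-eval always
  (lambda () (f)))

(define (lazy-node f)         ;; eval once when needed
  (let ((p (delay (f))))
    (lambda () (force p))))

(define (reactive-node f)     ;; re-eval when needed: cached, but externally invalidatable
  (let ((cache #f)
        (valid? #f))
    (lambda (msg)
      (case msg
        ((get) (unless valid?
                 (set! cache (f))
                 (set! valid? #t))
               cache)
        ((invalidate!) (set! valid? #f))))))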
The lazy scheme works quite well. The dependency-based caching however is a mess. It currently doesn't compose; it should propagate requests to its leaves and trigger recompile if necessary. How to make it composable? It is a simple issue of making dependencies explicit. ( The main point for sweb is that you can't place a reactive component inside a lazy one. ) Functional reactive programming (FRP) usually takes one of 2 approaches: - pull: each query propagates to primitive sources and recomputes as necessary. - push: each event pushes through the dependency network to create new high-level events. Since the web is a pull architecture, I'm not sure if push is so interesting. However, if push events are available, one could re-compute to lower latency on the pull side. Ideally, at an exit node, you want to know _all_ its dependencies and check if it needs to be recomputed. What you want is for dependency information to travel separately from the value path. Specifying it explicitly is tedious. How to automate this in Scheme? Probably having some dynamic environment to record the evaluation trace is a good approach. Can we assume the trace to be static? That way it can be computed the first time a full evaluation is made. Alternatively: can it be done in syntax? (fully static dependency). [1] http://en.wikipedia.org/wiki/Functional_reactive_programming [2] http://en.wikipedia.org/wiki/Incremental_computing Entry: Pull FRP Date: Fri Mar 26 08:36:58 CET 2010 What about a special form that expands to a dynamic "trigger", i.e. check types at run time and recompute values if necessary. Problem: make the up-to-date check 1-shot. To avoid exponential explosion re-checking dependencies down the tree, in a single evaluation, only check once for every node + cache the result. This means it should be clear what a "pull" event is: it's a global concept. In other terms: a pull, and all its child pulls, happen at a single instant in time. Given a reactive network N, and a set of output nodes O, compute the value of the output nodes by recomputing each node in the network at most once. The question seems to be: how to make the input nodes transactional? I.e. if the pull happens at T0, any changes that happen after T0 do not affect the value of the output. The problem is that this information is not completely available; i.e. the network is not transactional, but there is a definite order of events (i.e. updates of the file system). So, let's summarize: * SHARING is important: you do not want to update a node multiple times, as this can lead to exponential complexity. * How do you know if a node is up-to-date? I.e. an input could change during a network computation, and be queried both before and after the update. This suggests you need some kind of TRANSACTION or logical time. I suppose in the `make' utility it is assumed that the inputs do not change during the evaluation of the network. So the question seems to be, how can sweb be instrumented to guarantee glitch-free operation? In [1] it is suggested to combine - push: limited-range invalidation; push invalid only when valid. - pull: lazy evaluation. This requires access to the input events. [1] http://en.wikipedia.org/wiki/Reactive_programming#Evaluation_models_of_Reactive_Programming Entry: Reactive programming Date: Sat Mar 27 17:52:59 CET 2010 So, the code in rp.ss [1] seems complete. It implements a generic dataflow network with bi-directional linking (one direction for pull functional dependencies and the other for push invalidation propagation). The main idea to make this embed well in Scheme seems to be: use ordinary lexical scope for the functional dependencies, and weak hashes for the inverse dependencies.
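A rough sketch of that shape (hypothetical names, not the actual rp.ss code): each node caches the value of a thunk and keeps weak back-references to its dependents, so evaluation pulls downward while invalidation pushes upward.

(define-struct rv (thunk value valid? dependents) #:mutable)

(define (rv-node thunk)
  (make-rv thunk #f #f (make-weak-hasheq)))

;; pull: recompute if invalid; a node pulling its input passes itself as
;; `parent' so the inverse dependency gets recorded in the weak hash.
(define (rv-pull r [parent #f])
  (when parent (hash-set! (rv-dependents r) parent #t))
  (unless (rv-valid? r)
    (set-rv-value! r ((rv-thunk r)))
    (set-rv-valid?! r #t))
  (rv-value r))

;; push: invalidate r and everything that (transitively) depends on it.
(define (rv-invalidate! r)
  (when (rv-valid? r)
    (set-rv-valid?! r #f)
    (hash-for-each (rv-dependents r)
                   (lambda (dep _) (rv-invalidate! dep)))))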
[1] entry://../plt/20100327-160826 Entry: Moving the code to reactive implementation Date: Sat Mar 27 21:08:35 CET 2010 This means: - all functionality should have a purely functional core, including the indexing. OK - reactivity is added to the mix to provide the stateful web app. the only state is cache. OK - root objects should have file system notification so invalidation can propagate. OK - abstract url generation and dispatching (tree linking) - navigation (cross-linking) In essence: trees + (lazy) cross-references. It seems that the itch I have is due to the cross-referencing in the current implementation. It's all about binding; the rest is simple rewriting. How to do the file system notifications? Use jao's mzfam[1]. So what about indexing? It is just a function. Care needs to be taken to still perform a partial parsing step: parse article headers and construct a suspended body parse. The index then depends only on the headers. Url generation is also simple. Storage is a tree of data structures. Cross linking (navigation: next/prev) is separate. This can be delegated to the dispatcher: if it has an index, it can add re-directs for all symbolic (non-permanent) references. [1] http://planet.plt-scheme.org/display.ss?package=mzfam.plt&owner=jao Entry: FAM & Swindle Date: Sun Mar 28 10:42:11 CEST 2010 Hmm.. mzfam[1] doesn't work on current plt.. (it's from 2007). Ok, replacing (lib "swindle" "swindle.ss") with `swindle' seems to do the trick. In fam-base.ss: (module fam-base swindle (provide (all-from swindle swindle/misc)) (require (lib "async-channel.ss") swindle/misc) Nope.. then `mappend!' can't be found.. adding the swindle/misc doesn't seem to help. I'm confused and tired, for next time.. Ok, this is the mutable lists problem.. Shall I fix it? Replaced mappend! -> mappend and sort! -> sort. Patch sent to jao. [edit: Jao fixed it; new version in planet.] [1] http://planet.plt-scheme.org/display.ss?package=mzfam.plt&owner=jao Entry: Cross linking Date: Sun Mar 28 11:59:41 CEST 2010 It seems that most of the core problems are easy to solve for a tree-based datastructure. The missing component is cross-linking. The problem is that this violates hierarchical composition: a network no longer has a recursive "collection of lower level entities" semantics; it's a big ball of mud instead. Can the linking be orchestrated by a single global linker entity? linker :: doctree -> graph That doesn't work as the doctree itself has (open) embedded crosslinks. So, essentially we need to view a doctree as an open term. linker :: docenv -> doctree -> graph So, how to represent open terms? Two things need to be unified: 1. the original source has some representation of links (i.e. standard http:// links or custom entry:// links) that needs to be mapped to the identifiers used by the naming scheme. 2. identifiers need to be mapped to raw links. Let's start the lib/link.ss module to implement this behaviour. The first modification is in the article body parser. Following the remark above we need to distinguish the ramblings syntax from the link mechanism we want to abstract.
Essentially, the ramblings syntax has two kinds of links: external ones (standard http or transformed links like isbn or img) and internal cross links. The latter ones are important to capture. It seems this is not really necessary for the ramblings docs. The references used are already compatible with a tree + up-dir representation. So maybe this is only for indexing? Side note: see [1] for links about representing graphs in functional languages. The first one is a zipper rep. [1] http://lambda-the-ultimate.org/node/3195 Entry: Cross-linking re-take: ".." introduces graph structure Date: Mon Mar 29 19:59:16 CEST 2010 It's a strange problem, especially since you don't really notice it with all these urls floating around in documents. I.e. why is something like entry://../foo/123 not a good idea? Can we count ".." as a tree reference, or is it a graph reference? This is used so much that you really don't think about it. It's definitely a graph (check i.e ../plt/../plt/../ etc). It breaks encapsulation. I.e. it assumes the current context is part of some larger context. So, the conclusion is: compiling to a structure that has ".." references is possible (i.e. a set of html files). But, since this is _already_ a graph structure, it might be wise to abstract also those kinds of links generally, and fall back on the ".." links for off-line compilation. Entry: Scaling (why jango sucks) Date: Mon Mar 29 22:42:04 CEST 2010 Lots of babble, but this is what seems interesting: - normalization is good, denormalization sometimes makes queries faster. automate the denorm (cached db views?) - sessions: stick it in a cookie (you really don't want to keep track of user state for big apps) - "SELECT *" is fast - JOIN doesn't scale [1] http://www.youtube.com/watch?v=i6Fr65PFqfk Entry: Cheap thrills Date: Thu Apr 8 10:32:44 EDT 2010 Something to kick-start the intrinsic motivation; something to have a taste of control.. Some practical issues: PLT scheme and server restarts. Entry: Representing links Date: Thu Apr 8 12:58:58 EDT 2010 Is there anything useful to know about graphs in general from my perspective? I.e. I don't really care about structure that much, more about representation and direct connectivity. i.e. neighbours of one page, not neighbours of neighbours. It seems that the basic idea is closure. If a document contains a link, this can either be pointing into a closable neighborhood, or to an outside resource. We'd like to keep track of both: - represent neighborhood references such that wrong links can be detected statically. - gather open links to enable link checking and queries over links Let's have a look at this[1] zipper-based rep. It mentions that in a functional representation, for every cycle there has to be a point of decoupling which can be solved either by mutable cells or a finite map combined with unique identifiers. I suppose this is true if the graph can't be constructed in a single recursive `let'. [1] http://www.cs.tufts.edu/~nr/pubs/zipcfg-abstract.html Entry: To SQL or not to SQL? Date: Sun Apr 11 17:07:50 EDT 2010 I have no good intuition about dealing with databases. Let's look at the design space when using databases: 1. query language (SQL, Scheme, ...) 2. storage medium (memory, disk, server, ...) 3. update vs. query patterns Where 1. isn't a real issue (as long as you can express what you want to know it's fine) and 2. is pure implementation. However, 3. is quite a constraint. 
If there are no updates, data representation can be heavily optimized (compiled) to fit the needs of the queries (like in ramblings.ss). If there are updates, consistency becomes a serious problem and needs to be solved properly, i.e. the ACID[1] principle, while caches are usually too complicated to keep up-to-date. The reason you want to squeeze things into a single SQL query is exactly that. So that's one axis: mutable vs. immutable (where ramblings.ss is mutable but the representation uses a cache). Another axis is relational vs. non-relational. See next post. [1] http://en.wikipedia.org/wiki/ACID Entry: NoSQL? Date: Sun Apr 11 18:04:52 EDT 2010 From [1]: NoSQL is a movement promoting a loosely defined class of non-relational data stores that break with a long history of relational databases. These data stores may not require fixed table schemas, usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage. From [2]: So a lot of this NoSQL movement can be boiled down to 'avoid schemas that require joins'. It seems there is no real benefit to using NoSQL if there is no scaling involved, except for the ``sloppiness'' it allows. The main issue, as in all _scalability_ issues, is locality of reference (stacks: cache recent access & streams: use predictable access). Joins are not local. [1] http://en.wikipedia.org/wiki/NoSQL [2] http://www.reddit.com/r/programming/comments/b7b1c/ask_proggit_why_the_movement_away_from_rdbms/ Entry: Reactive Ramblings Date: Fri Apr 16 07:12:12 EDT 2010 When does this work? To re-iterate one of the previous posts: this approach is about _caching_. * If there is _significant computation_ between the model data and the query results, and this data can be re-used, caching (lazy evaluation) is a valuable option. * If in addition the model data changes _infrequently_, the lazy evaluation can be combined with a reverse-dependent invalidation step to get a _reactive_ program. Note that when updates are frequent wrt. queries, it probably makes little sense to keep an intermediate (cached) representation. Let's move on with the reactive model for the ramblings.txt display. The basic idea is this: given a set of files as input, and a functional (data-flow) network that eventually ends in server queries, construct a servlet that manages this structure. Separate concerns: - abstraction of reactive network (naive pull/push implementation: plt/rp) - abstraction of functional dependencies - file alteration monitor - controller for setup Problem: what to do with create/delete? Current assumptions are that the dependency network is static. Is there a way to propagate "not-found" upwards? Maybe invalidation is enough + handling of "open" errors by the lazy eval. Problem: how to encode the 2-step parsing? I.e. I'm splitting a file into a collection of atoms, with each atom depending on the original file. How to express this dependency? The problem is that it is never allowed to "unpack" a reactive variable. Similar to monads.. So rv-force should be private! The solution is indeed to 1. only unpack at the toplevel/output, which counts as _outside_ of the reactive network in the same way that file alteration events are, and 2. represent a collection as a finite function / dictionary. Entry: Storage Date: Fri Apr 16 17:17:18 EDT 2010 An advantage of the current representation (functional dictionaries) is that they can be serialized to disk.
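A minimal sketch of what that could look like, assuming the dictionary values are plain readable data (no closures):

(define (save-dict! path dict)
  (with-output-to-file path
    (lambda () (write (hash-map dict cons)))
    #:exists 'replace))

(define (load-dict path)
  (make-immutable-hash (with-input-from-file path read)))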
This opens up the possibility to reduce the memory image and use the file system for bulk node data, keeping only the connectivity in the image. However, is this data storage really a problem? Good question. How much space do these data structures take? Apparently it's currently dwarfed by the code image. Entry: Next Date: Fri Apr 16 17:42:07 EDT 2010 Anyways.. What's next? Fanout. OK, so let's abstract the fold method which is necessary to build the index. Maybe it's best to create the accessor and fold at the same time? Entry: Separating pure functions from reactive computations Date: Sat Apr 17 13:10:15 EDT 2010 From actual usage it is really ok to just have `rv-app' or `rv-apply' and build abstractions on top of those. Having to abstract the pure part into a single value also helps with separating the side effects (reactive value nodes) from the pure computations that connect them. For one, this allows the pure parts to be tested and reused independently. Conclusion: at this point it seems useful to use the reactive value abstraction to explicitly structure caching and laziness in a functional program. The obligation to distinguish strict and reactive computations isn't necessarily bad; however, it is best to separate functionality from evaluation strategy and write reactive modules in terms of strict, pure functional modules that contain all the real work. Entry: Next Date: Sat Apr 17 13:13:37 EDT 2010 - abstract external link structure (higher order syntax). - implement tex + default xhtml rendering in the reactive approach. - connect it all and replace the current graph-based implementation Entry: No representation Date: Sat Apr 17 13:30:19 EDT 2010 Seems there's a conflict. 1. Things like latex rendering definitely need to be cached. Can't have this be re-computed on every access. However, it should be possible to invalidate the cache also by other means. 2. Other data might be represented only as computations. Why not simply index the raw ramblings text files and have all formatting computed on the fly? Maybe it's not intuitive enough yet. Maybe what is needed is a way to "point and shoot" nodes that should be cached, and have the rest be lifted automatically. That would probably require a type system though.. Entry: Overdesigned Date: Sun Apr 18 12:50:09 EDT 2010 Anyways... This is getting a bit overdesigned, as is usual for a labor of love. The patterns are interesting though. Caching and representation are, in practice, either not important at all (and code complexity can be minimized, focussing on solving the main problem) or a big problem, where code structure needs to incorporate representation as a cross-cutting concern that is difficult to isolate and can easily dominate the structure of the overall solution. Important conclusions: * Concrete intermediate data structures (as opposed to functional representation or thunks) are an optimization. It's probably best to keep the design functional and representations abstract. * Data indexing can easily be the most complicated step if the representation is not well thought-out, leading to a necessity to keep an intermediate representation to get decent performance. It might help to tailor the basic representation such that this step can be performed at "edit time" to provide for a finer incremental update footprint. I.e.
for the ramblings format: having a single flat file as the basic representation is easy for editing in a text editor, but a pain to navigate, as the index has to be completely rebuilt on every edit. The reactive dependencies make this pretty clear. Entry: Caching for ramblings.ss Date: Mon Apr 19 09:49:15 EDT 2010 There seem to be only 2 important parts to cache: - the initial parse (indexing). - the tex rendering In diagram form, only 3 kinds of nodes. [ TXT ] -> [ ARTICLES ] -> out -> [ PNG ] -> out TXT: The file node, invalidated by the FAM daemon ARTICLES: Indexed (segmented) file. Can serve raw text to output: we don't cache normal xhtml results. PNG: Tex rendering takes some time and produces multiple pages, so it's also cached. This caching strategy should be abstracted in a single module: rv-ramblings.ss It looks like implementing this, once it is made explicit like this, is quite straightforward. However, in order to get the whole thing to work, the representation (what _is_ an article?) needs to be fleshed out. For simple text it's straightforward (an xhtml document); for rendered tex it is more than that (a key-value store). Caching xhtml output does simplify the architecture; that way the pipeline becomes: [ TXT ] -> [ ARTICLES ] -> [ query -> RESPONSE ] Entry: Full circle Date: Tue Apr 20 16:07:32 EDT 2010 So, with the caching structure fleshed out as 2 levels: txt -> articles -> (key -> response) The only remaining problem is to make explicit what a query is. Currently that's a bit problematic. It might be simplest to stick to a very raw representation of a query, i.e. a dictionary of name.value pairs, where names are symbols and values are strings. To obtain an assoc list like this use: (url-query (request-uri req)) The `req' can be obtained from dispatch-rules, which will deconstruct a url and pass the req as the first parameter. So there's no problem here: a symbolic key (the entry ID) and a dictionary of optional parameters is all that's necessary for a query. The problem is memoization though: we need to extract keys to construct the final (req -> response) memoization, where response is an rv. Entry: Finishing the RV formatter Date: Sun Oct 3 11:49:47 CEST 2010 I broke it off abruptly half a year ago, so I have a bit of trouble knowing what was actually done. It seems that the formatting didn't work yet, so let's build that first. Can the formatting be shared between old and new code? The file render-ramblings.ss only uses node query, not storage, so that should work. * abstract a node as a finite function. * abstract toc access (prev/next) as a function Entry: Lights go back on Date: Sun Oct 3 14:07:21 CEST 2010 So I think I start to see the idea again: everything is a function operating on raw data, and in the toplevel you define the caching structure by wrapping values into dataflow nodes. Entry: Weird bug Date: Sun Oct 3 14:29:07 CEST 2010 Somehow the fam events don't propagate correctly when running inside the webserver. Running the module in the snot sandbox works fine. Maybe it runs multiple instances? rv-invalidate is executed, but something goes wrong.. maybe it's struct types? this is weird.. something behind the scenes that kills state or so.. i tried to use a different value for the #rv-invalid tag but that doesn't work either.. it really is not set in the value that is accessed by the fam, and set in the other one, so there have to be two instances. let's protect all RVs with a single semaphore: the one that guards the whole network update.
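A minimal sketch of that single guard (with-rv-lock and the node names are assumed; rv-force and rv-invalidate are the existing operations): every thread that touches the reactive network goes through the same lock, so the FAM handler and the servlet threads cannot interleave updates.

(define rv-semaphore (make-semaphore 1))

(define (with-rv-lock thunk)
  (call-with-semaphore rv-semaphore thunk))

;; e.g. (with-rv-lock (lambda () (rv-invalidate file-node)))
;;      (with-rv-lock (lambda () (rv-force article-node)))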
Ok, that fixes the problem. Now can I understand why please?? Entry: Hashing intermediates Date: Sun Oct 3 17:14:37 CEST 2010 So what is cached? * Segmentation: text file -> dictionary What needs to be cached? * latex + dvipng This means that there needs to be a hash table of rvs to implement the cache, and possibly some thread that periodically clears the cache. Entry: Done except.. Date: Mon Oct 4 13:47:01 CEST 2010 - topics should be trivial, copy from other and remove any stateful code. maybe addition: allow directory change notifications for reaload of topics? - navigation (prev / next) it's simple to add this to the index generation, but i'd like to also create nav for combinations of different sources. maybe it's still best to put it in the index then. Entry: Racket update Date: Fri Mar 18 12:23:56 EDT 2011 facade.ss: - make-response/full + response/full Then also the contract seems to have changed: Servlet (@ /ramblings.ss/topics) exception: self-contract violation: expected , given: (html (pre " " " " " " " " " " " " " " "[" (a ((href "about\n")) "about") "] " "About the http://zwizwa.be web logs." "\n" " " " " " " " " " " " " " " ...)) contract on start from /home/tom/pub/darcs/sweb/web-root/./htdocs/./ramblings.ss blaming /home/tom/pub/darcs/sweb/web-root/./htdocs/./ramblings.ss contract: (-> request? can-be-response?) at: /home/tom/pub/darcs/sweb/web-root/./htdocs/./ramblings.ss Entry: Flat namespace Date: Tue Aug 9 13:09:32 CEST 2011 I want to do the following: - get a sorted list of all article IDs by date. should work with cache: don't update this list if the underlying files didn't change. this probably can be done by storing the ID list in the contents node, next to the index proper. - query article using ID only. this needs an ID -> section map and requires all IDs to be unique. To get at the list I had to: - Change `format-index' in entry-index.ss to collect (Index . ID) pairs and store them in the index node under IndexIDs. - Create a function get-index-ids: ;; Get all Index->ID maps. (Index is parsed from Date in parse-article) (define (get-index-ids [sections *sections*]) (apply append (for/list (((section get-article) (in-dict sections))) (dict-ref (rv-force (get-article 'contents)) 'IndexIDs)))) Now, how to use: - Don't put it on the front page. Use a separate link, i.e. "recent" to get a chronological page with entries like: 20110809-130932 [sweb] Flat namespace - Use a redirect for section mismatch. Now, to get at the article titles without forcing body compile, it's necessary to fish out that information in the first parse, because a query for 'Entry on the rv value will trigger a full body compile, including latex etc. So, I'm moving the index-gathering step to make-index. Entry: dotfiles Date: Fri Dec 14 15:48:41 EST 2012 I'd like to hide some files from the index. Let's use a dotfile approach, such that .xxx.txt still maps to xxx but is not included in the index. Hmm... doesn't look like a good idea. Let's just add a banlist. Entry: How to do style sheet syntax in s-expressions? Date: Fri May 24 12:18:20 EDT 2013 pre { padding: 0 3px 2px; font-family: Menlo, Monaco, Consolas, "Courier New", monospace; font-size: 12px; color: #333333; -webkit-border-radius: 3px; -moz-border-radius: 3px; border-radius: 3px; } Entry: Testing raw xhtml Date: Fri May 24 12:53:03 EDT 2013 Type: xhtml

Testing raw XHTML

Another paragraph with a bullet list

  • with
  • bullets
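Coming back to the style sheet question above: one possible s-expression encoding of that pre rule, with a tiny printer back to CSS text (a sketch of the idea only; nothing in sweb implements this):

(define pre-style
  '(pre (padding     "0 3px 2px")
        (font-family "Menlo, Monaco, Consolas, \"Courier New\", monospace")
        (font-size   "12px")
        (color       "#333333")
        (-webkit-border-radius "3px")
        (-moz-border-radius    "3px")
        (border-radius         "3px")))

(define (style->css style)
  (string-append
   (symbol->string (car style)) " {\n"
   (apply string-append
          (map (lambda (prop) (format "  ~a: ~a;\n" (car prop) (cadr prop)))
               (cdr style)))
   "}\n"))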
Entry: Replies Date: Sun Aug 11 11:03:16 EDT 2013 I like this email-oriented blog: http://www.acooke.org/cute/WhyandHowW0.html Entry: sweb is dead Date: Fri Jan 3 01:37:28 EST 2014 Basically, I would like to add features, but I don't really like the architecture. The reactive thingy is cute, but it is not necessary, and requires too much infrastructure. This is the age of static blog generators and git deployment ;) Requirements: - static generator - latex -> mathml http://math.etsu.edu/LaTeXMathML/ - better indexing - rss / atom - comments Entry: zwizwa.sty not found Date: Wed Apr 30 19:22:52 EDT 2014 problem was sh -> dash ( after upgrade? ) the TEXINPUTS variable didn't propagate -> nope Something is messed up in conjunction with runit.. Currently starting manually from screen Entry: How does the txt file list update? Date: Wed Apr 29 09:55:30 EDT 2020 I'm dumping a new set of txt links in web-root/txt, but I don't see anything showing up. I completely forgot how this all works... First: where are the logs? The runit log is here: /var/log/sweb/current But there is another log. root@tomweb:/var/log/sweb# ls -l /proc/122/fd ... lr-x------ 1 sweb sweb 64 Apr 29 16:01 11 -> /home/tom/pub/darcs/sweb/web-root/banlist l-wx------ 1 sweb sweb 64 Apr 29 16:01 12 -> /home/tom/pub/darcs/sweb/web-root/log ... Looks like I have two trees: root@tomweb:/home/tom/pub/darcs/sweb/web-root# readlink -f . /home/tom/pub/darcs/sweb/web-root root@tomweb:/home/tom/pub/darcs/sweb/web-root# readlink -f /var/www/zwizwa.be/darcs/sweb/ /var/www/zwizwa.be/darcs/sweb And it's starting the one in /home/tom So let's change that with a symlink.