Mon May 11 18:39:18 CEST 2009

modifying pdf2djvu

Looks like pdf2djvu [1] is a good candidate for extending with an erosion
operation.  This way a .djvu file can be created for viewing, which is
also faster to view than PDF.  pdf2djvu uses the poppler [2] rendering
engine (which in turn is based on xpdf / splash).

Now poppler can also use Cairo [3], so I wonder whether its
capabilities could do the filtering.  This seems unlikely..  Looks like
postprocessing is the only option.

So.. A bit more inspection of the code shows that image data gets
written out to disk (!) to pass data to the c44 wavelet compressor.
This seems to be an interesting point to hook into.

Apparently I have old code installed:

tom@zzz:/usr/local/bin$ ls -al | grep 2003-08-28
-rwxr-xr-x  1 root staff    13176 2003-08-28 14:05 bzz
-rwxr-xr-x  1 root staff    35504 2003-08-28 14:05 c44
-rwxr-xr-x  1 root staff    32888 2003-08-28 14:05 cjb2
-rwxr-xr-x  1 root staff    41384 2003-08-28 14:05 cpaldjvu
-rwxr-xr-x  1 root staff    50408 2003-08-28 14:05 csepdjvu
-rwxr-xr-x  1 root staff    28304 2003-08-28 14:05 ddjvu
-rwxr-xr-x  1 root staff    19816 2003-08-28 14:05 djvm
-rwxr-xr-x  1 root staff    12784 2003-08-28 14:05 djvmcvt
-rwxr-xr-x  1 root staff     9368 2003-08-28 14:05 djvudump
-rwxr-xr-x  1 root staff    17680 2003-08-28 14:05 djvuextract
-rwxr-xr-x  1 root staff    47536 2003-08-28 14:05 djvumake
-rwxr-xr-x  1 root staff    29968 2003-08-28 14:05 djvups
-rwxr-xr-x  1 root staff   110544 2003-08-28 14:05 djvused
-rwxr-xr-x  1 root staff    25264 2003-08-28 14:05 djvuserve
-rwxr-xr-x  1 root staff    15192 2003-08-28 14:05 djvutxt

pdf2djvu -vvv -o foo.djvu ~/library/pool/lazy_specialization.pdf

- page #9 -> #9:
  - muted render
  - image size: 2479x3508
  - verbose render
  - create sep_file
  - rle data >> sep_file
  - text layer >> sep_file
  - !csepdjvu
  - !djvuextract
  FGbz=/tmp/pdf2djvu.5Ha0nY --> "/tmp/pdf2djvu.5Ha0nY" (661 bytes)
  BG44=/tmp/pdf2djvu.QSKcx8 --> "/tmp/pdf2djvu.QSKcx8" (115 bytes)
  Sjbz=/tmp/pdf2djvu.1HENeO --> "/tmp/pdf2djvu.1HENeO" (1173 bytes)
  - annotations >> sed_file
  - !djvused >> sed_file
  - !djvumake
  - !djvused < sed_file
  - 2318 bytes out

It's actually csepdjvu that's called, with input data that starts with
"R6 2479 3508 216":
/usr/bin/csepdjvu -d 300 /tmp/pdf2djvu.keGpt7 /tmp/pdf2djvu.0bEZjG/p0057.djvu

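So, replacing /usr/bin/csepdjvu with this wrapper (the original binary
moved to /usr/bin/csepdjvu.real):

#!/bin/sh
# Uncomment to log every invocation to /tmp/csepdjvu:
# echo "$0" "$@" >>"/tmp/$(basename "$0")"
exec "$0.real" "$@"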

Now.. The input format is Color RLE, which is awkward to work with.
From the csepdjvu man page:

   Color RLE format

       The Color RLE format is a simple run-length encoding scheme for
       color images with a limited number of distinct colors.  The
       data always begin with a text header composed of the two
       characters "R6", the number of columns, the number of rows, and
       the number of color palette entries.  All numbers are expressed
       in decimal ASCII.  These four items are separated by blank
       characters (space, tab, carriage return, or linefeed) or by
       comment lines introduced by character "#".  The last number is
       followed by exactly one character which usually is a linefeed
       character.

       The header is followed by the color palette containing three
       bytes per color entry.  The bytes represent the red, green, and
       blue components of the color.

       The palette is followed by a collection of four bytes integers
       (most significant bit first) representing runs of pixels with
       an identical color.  The twelve upper bits of this integer
       indicate the index of the run color in the palette entry.  The
       twenty lower bits of the integer indicate the run length.
       Color indices greater than 0xff0 are reserved.  Color index
       0xfff is used for transparent runs.  Each row is represented by
       a sequence of runs whose lengths add up to the image width.
       Rows are encoded starting with the top row and progressing
       toward the bottom row.
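
To check my reading of the description above, here is a rough decoder
sketch.  The names (decode_color_rle, _read_token) are mine, nothing
from pdf2djvu or DjVuLibre, and it assumes the whole file fits in
memory and that a comment runs to the end of its line.

```python
import struct

TRANSPARENT = 0xFFF  # palette index used for transparent runs


def _read_token(data, pos):
    """Skip blanks and '#' comment lines, return (token, end position)."""
    n = len(data)
    while pos < n:
        if data[pos] in b" \t\r\n":
            pos += 1
        elif data[pos] == ord("#"):
            # Assumption: a comment runs up to the next linefeed.
            while pos < n and data[pos] != ord("\n"):
                pos += 1
        else:
            break
    start = pos
    while pos < n and data[pos] not in b" \t\r\n#":
        pos += 1
    return data[start:pos], pos


def decode_color_rle(data):
    """Decode an "R6" stream into (width, height, palette, rows).

    rows is a list of lists of palette indices, top row first.
    """
    magic, pos = _read_token(data, 0)
    if magic != b"R6":
        raise ValueError("not a Color RLE stream")
    numbers = []
    for _ in range(3):
        token, pos = _read_token(data, pos)
        numbers.append(int(token))
    width, height, ncolors = numbers
    pos += 1  # exactly one character follows the last number
    palette = [tuple(data[pos + 3 * i:pos + 3 * i + 3])
               for i in range(ncolors)]
    pos += 3 * ncolors
    rows = []
    for _ in range(height):
        row = []
        while len(row) < width:
            (word,) = struct.unpack_from(">I", data, pos)
            pos += 4
            index = word >> 20        # upper 12 bits: palette index
            length = word & 0xFFFFF   # lower 20 bits: run length
            row.extend([index] * length)
        if len(row) != width:
            raise ValueError("runs do not add up to the image width")
        rows.append(row)
    return width, height, palette, rows
```

Expanding every run into a flat list is wasteful for a 2479x3508 page,
but it keeps the sketch short and makes the rows easy to filter.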

csepdjvu does support PPM input, so how to convince pdf2djvu to
produce this format?

The RLE data is generated in quantizer.cc.

Apparently Debian's netpbm package doesn't support this format, so I'm
installing the one from here: [4].

Hmm.. netpbm doesn't support converting _from_ this format and
csepdjvu needs either "Color RLE format" or the "Bitonal RLE format"
as input for the foreground image.
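
If a filtered image has to go back into csepdjvu as foreground, it
needs to be re-encoded.  A minimal encoder sketch, going by the format
description only; encode_color_rle is a made-up name, and the input is
assumed to be rows of palette indices plus a list of (r, g, b) tuples.

```python
import struct
from itertools import groupby

MAX_RUN = 0xFFFFF    # a run length has to fit in 20 bits
MAX_INDEX = 0xFFF    # a palette index has to fit in 12 bits


def encode_color_rle(rows, palette):
    """Encode rows of palette indices as an "R6" Color RLE stream."""
    width = len(rows[0]) if rows else 0
    if width > MAX_RUN:
        # Longer runs would have to be split; not handled in this sketch.
        raise ValueError("image too wide for a single run")
    out = bytearray(b"R6 %d %d %d\n" % (width, len(rows), len(palette)))
    for r, g, b in palette:
        out += bytes((r, g, b))
    for row in rows:
        if len(row) != width:
            raise ValueError("all rows must have the same width")
        # One 4-byte big-endian word per run of identical indices.
        for index, group in groupby(row):
            if not 0 <= index <= MAX_INDEX:
                raise ValueError("palette index out of range")
            length = sum(1 for _ in group)
            out += struct.pack(">I", (index << 20) | length)
    return bytes(out)
```

Runs never cross a row boundary here, so each row's runs add up to the
image width as the format requires.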

So it looks like the solution is to plug in a preprocessing step right
before the RLE conversion.
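
What such a preprocessing step could look like, as a sketch only: a
grayscale erosion (minimum over a square neighbourhood), which
thickens dark strokes on a light background.  This assumes the page is
available as rows of intensity values before quantization; the erode
function and its radius parameter are hypothetical, not part of
pdf2djvu.

```python
def erode(rows, radius=1):
    """Return a copy of rows where each pixel is the minimum of its
    (2*radius+1) x (2*radius+1) neighbourhood, clipped at the borders."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    out = []
    for y in range(height):
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        out_row = []
        for x in range(width):
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            out_row.append(min(rows[yy][xx]
                               for yy in range(y0, y1)
                               for xx in range(x0, x1)))
        out.append(out_row)
    return out
```

A single dark pixel grows into a 3x3 dark block with radius=1.  For a
color page the same filter would have to run per channel, or on
whatever representation the renderer hands over.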

So, not trivial.

[1] http://code.google.com/p/pdf2djvu/
[2] http://poppler.freedesktop.org/
[3] http://en.wikipedia.org/wiki/Cairo_(graphics)
[4] http://netpbm.sourceforge.net/