Mon May 11 18:39:18 CEST 2009
modifying pdf2djvu
Looks like [1] is a good candidate for extending with an erosion
operation. This way a .djvu file can be created for viewing, which is
also faster than pdf. pdf2djvu uses the poppler[2] rendering engine
(which in turn is based on xpdf / splash).
Now poppler uses Cairo[3] so I wonder if it's not possible to use its
capabilities to do the filtering. This seems unlikely.. Looks like
postprocessing is the only option.
So.. A bit more inspection of the code shows that image data gets
written out to disk (!) to pass data to the c44 wavelet compressor.
This seems to be an interesting point to hook into.
Apparently I have old code installed:
tom@zzz:/usr/local/bin$ ls -al | grep 2003-08-28
-rwxr-xr-x 1 root staff 13176 2003-08-28 14:05 bzz
-rwxr-xr-x 1 root staff 35504 2003-08-28 14:05 c44
-rwxr-xr-x 1 root staff 32888 2003-08-28 14:05 cjb2
-rwxr-xr-x 1 root staff 41384 2003-08-28 14:05 cpaldjvu
-rwxr-xr-x 1 root staff 50408 2003-08-28 14:05 csepdjvu
-rwxr-xr-x 1 root staff 28304 2003-08-28 14:05 ddjvu
-rwxr-xr-x 1 root staff 19816 2003-08-28 14:05 djvm
-rwxr-xr-x 1 root staff 12784 2003-08-28 14:05 djvmcvt
-rwxr-xr-x 1 root staff 9368 2003-08-28 14:05 djvudump
-rwxr-xr-x 1 root staff 17680 2003-08-28 14:05 djvuextract
-rwxr-xr-x 1 root staff 47536 2003-08-28 14:05 djvumake
-rwxr-xr-x 1 root staff 29968 2003-08-28 14:05 djvups
-rwxr-xr-x 1 root staff 110544 2003-08-28 14:05 djvused
-rwxr-xr-x 1 root staff 25264 2003-08-28 14:05 djvuserve
-rwxr-xr-x 1 root staff 15192 2003-08-28 14:05 djvutxt
Inspecting:
pdf2djvu -vvv -o foo.djvu ~/library/pool/lazy_specialization.pdf
- page #9 -> #9:
- muted render
- image size: 2479x3508
- verbose render
- create sep_file
- rle data >> sep_file
- text layer >> sep_file
- !csepdjvu
- !djvuextract
FGbz=/tmp/pdf2djvu.5Ha0nY --> "/tmp/pdf2djvu.5Ha0nY" (661 bytes)
BG44=/tmp/pdf2djvu.QSKcx8 --> "/tmp/pdf2djvu.QSKcx8" (115 bytes)
Sjbz=/tmp/pdf2djvu.1HENeO --> "/tmp/pdf2djvu.1HENeO" (1173 bytes)
- annotations >> sed_file
- !djvused >> sed_file
- !djvumake
- !djvused < sed_file
- 2318 bytes out
Aha!
It's actually csepdjvu that's called with "R6 2479 3508 216" input data.
/usr/bin/csepdjvu -d 300 /tmp/pdf2djvu.keGpt7 /tmp/pdf2djvu.0bEZjG/p0057.djvu
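So, replacing /usr/bin/csepdjvu with this wrapper (the original binary
renamed to csepdjvu.real):
#!/bin/sh
# Uncomment to log every invocation:
# echo "$0" "$@" >>"/tmp/$(basename "$0")"
# Run the optional preprocessor on the same arguments, then the real binary.
[ -z "$CSEPDJVU_PREPROC" ] || $CSEPDJVU_PREPROC "$@"
exec "$0.real" "$@"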
Now.. The input format is color RLE which is difficult to use.
Color RLE format
The Color RLE format is a simple run-length encoding scheme for
color images with a limited number of distinct colors. The
data always begin with a text header composed of the two
characters "R6", the number of columns, the number of rows, and
the number of color palette entries. All numbers are expressed
in decimal ASCII. These four items are separated by blank
characters (space, tab, carriage return, or linefeed) or by
comment lines introduced by character "#". The last number is
followed by exactly one character which usually is a linefeed
character.
The header is followed by the color palette containing three
bytes per color entry. The bytes represent the red, green, and
blue components of the color.
The palette is followed by a collection of four bytes integers
(most significant bit first) representing runs of pixels with
an identical color. The twelve upper bits of this integer
indicate the index of the run color in the palette entry. The
twenty lower bits of the integer indicate the run length.
Color indices greater than 0xff0 are reserved. Color index
0xfff is used for transparent runs. Each row is represented by
a sequence of runs whose lengths add up to the image width.
Rows are encoded starting with the top row and progressing
toward the bottom row.
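To make the description above concrete, here is a minimal Python sketch of a
decoder, written only from the format text quoted above. The function names
(decode_color_rle, _read_token) are made up for illustration and are not part
of pdf2djvu or DjVuLibre. It assumes the whole stream fits in memory and
stops after the last row, so any trailing data in the sep file is ignored.

```python
import struct

TRANSPARENT = 0xFFF  # palette index reserved for transparent runs


def _read_token(data, pos):
    """Return (token, new_pos), skipping blanks and '#' comment lines."""
    n = len(data)
    while pos < n:
        c = data[pos:pos + 1]
        if c in b" \t\r\n":
            pos += 1
        elif c == b"#":
            # A comment runs to the end of the line.
            while pos < n and data[pos:pos + 1] not in b"\r\n":
                pos += 1
        else:
            break
    start = pos
    while pos < n and data[pos:pos + 1] not in b" \t\r\n#":
        pos += 1
    return data[start:pos], pos


def decode_color_rle(data):
    """Decode a Color RLE byte string into (width, height, palette, rows).

    rows is a list of lists of palette indices; TRANSPARENT marks
    transparent pixels.
    """
    magic, pos = _read_token(data, 0)
    if magic != b"R6":
        raise ValueError("not a Color RLE stream")
    fields = []
    for _ in range(3):
        token, pos = _read_token(data, pos)
        fields.append(int(token))
    width, height, ncolors = fields
    pos += 1  # exactly one character follows the last header number

    # Palette: three bytes (red, green, blue) per entry.
    palette = [tuple(data[pos + 3 * i:pos + 3 * i + 3]) for i in range(ncolors)]
    pos += 3 * ncolors

    rows = []
    for _ in range(height):
        row = []
        while len(row) < width:
            (word,) = struct.unpack_from(">I", data, pos)
            pos += 4
            index = word >> 20        # upper 12 bits: palette index
            length = word & 0xFFFFF   # lower 20 bits: run length
            row.extend([index] * length)
        if len(row) != width:
            raise ValueError("run lengths do not add up to the image width")
        rows.append(row)
    return width, height, palette, rows
```

For a full page like the 2479x3508 one above, a list of Python ints per pixel
is slow and memory hungry; this is only meant to show the layout.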
csepdjvu does support PPM input, so how to convince pdf2djvu to
produce this format?
The format is generated in quantizer.cc
Apparently Debian's netpbm package doesn't support this format, so I'm
installing the one from here: [4].
Hmm.. netpbm doesn't support converting _from_ this format and
csepdjvu needs either "Color RLE format" or the "Bitonal RLE format"
as input for the foreground image.
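Since csepdjvu wants RLE for the foreground, an external filter would have to
write Color RLE back out after touching the pixels. A minimal sketch of such
an encoder, again going only by the format description quoted earlier;
encode_color_rle is a hypothetical name, and rows is assumed to be a list of
lists of palette indices with 0xFFF for transparent pixels.

```python
import struct

MAX_RUN = 0xFFFFF  # a run length has to fit in the lower 20 bits


def encode_color_rle(width, height, palette, rows):
    """Encode rows of palette indices as a Color RLE byte string."""
    if len(rows) != height:
        raise ValueError("row count does not match the height")
    # Text header: magic, columns, rows, palette size, then one linefeed.
    out = bytearray(b"R6 %d %d %d\n" % (width, height, len(palette)))
    for red, green, blue in palette:
        out += bytes((red, green, blue))
    for row in rows:
        if len(row) != width:
            raise ValueError("row length does not match the width")
        start = 0
        while start < width:
            index = row[start]
            end = start
            # Extend the run while the colour stays the same.
            while end < width and row[end] == index and end - start < MAX_RUN:
                end += 1
            # Upper 12 bits: palette index, lower 20 bits: run length.
            out += struct.pack(">I", (index << 20) | (end - start))
            start = end
    return bytes(out)
```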
So it looks like the solution is to plug in a preprocessing step right
before the rle conversion.
So, not trivial.
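For reference, the erosion itself is the easy part. Below is a rough Python
sketch of a binary erosion with a square structuring element. It is not taken
from pdf2djvu: the name erode is hypothetical, the image is assumed to be rows
of palette indices where 0xFFF means transparent, and pixels outside the image
are assumed to count as background. The real hook would sit in the C++ code
before the RLE conversion and operate on whatever pixel buffer is there.

```python
TRANSPARENT = 0xFFF  # assumed marker for background (transparent) pixels


def erode(rows, radius=1):
    """Erode the foreground with a (2*radius+1) square structuring element.

    A pixel keeps its palette index only if every pixel in its
    neighbourhood is foreground; otherwise it becomes transparent.
    """
    height = len(rows)
    width = len(rows[0]) if rows else 0

    def is_foreground(y, x):
        # Anything outside the image counts as background.
        return 0 <= y < height and 0 <= x < width and rows[y][x] != TRANSPARENT

    offsets = range(-radius, radius + 1)
    eroded = []
    for y in range(height):
        new_row = []
        for x in range(width):
            keep = all(is_foreground(y + dy, x + dx)
                       for dy in offsets for dx in offsets)
            new_row.append(rows[y][x] if keep else TRANSPARENT)
        eroded.append(new_row)
    return eroded
```

With radius=1 a 3x3 blob shrinks to its centre pixel, so thin strokes vanish
completely. Whether that is the wanted effect depends on which layer the
filter is applied to.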
[1] http://code.google.com/p/pdf2djvu/
[2] http://poppler.freedesktop.org/
[3] http://en.wikipedia.org/wiki/Cairo_(graphics)
[4] http://netpbm.sourceforge.net/