
s11n data files

Or: What does s11n do with my data?

As has been repeated many times over, libs11n is internally data-format agnostic. What does this mean? It means that it doesn't really care what format your data is in. The library does expect some conventions to be followed, the most notable being that data must be structurable in a DOM-like model, but it doesn't inherently care what data store is used for object persistence. The core library works only at the level of DOM-like trees of abstract data, and knows nothing about file i/o.
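To make "DOM-like model" concrete, here is a minimal sketch of the shape of data the core library traffics in. The demo_node type below is a hypothetical illustration for this page, not libs11n's actual node type:

#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM-like data node: a type name, a set
// of key/value properties, and any number of child nodes. libs11n's
// real node type differs in detail but follows this general shape.
struct demo_node
{
    std::string name;                          // e.g. a class name
    std::map<std::string, std::string> props;  // simple values, as strings
    std::vector<demo_node> children;           // nested child objects
};

int main()
{
    // A tiny tree: one parent object holding one child object.
    demo_node root;
    root.name = "my_type";
    root.props["answer"] = "42";

    demo_node child;
    child.name = "child_type";
    child.props["label"] = "hello";
    root.children.push_back( child );

    return 0;
}

Any Serializer can read and write a tree of this shape without knowing anything about the client-side objects it was built from.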

The exact data formats are read and written by so-called Serializers, which are described in more detail on the Serializers page. On this page we will take a quick look at how those formats compare for a given data set.

Keep in mind that clients are not required to use libs11n's built-in i/o layer: they may provide their own arbitrary i/o layer and still take advantage of the core serialization interfaces.
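As a sketch of what such a custom i/o layer boils down to, here is a made-up writer over the hypothetical demo_node type shown above; the output format is invented purely for illustration:

#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Same hypothetical node type as in the earlier sketch.
struct demo_node
{
    std::string name;
    std::map<std::string, std::string> props;
    std::vector<demo_node> children;
};

// A made-up custom writer: recursively emits a node tree in an ad-hoc
// indented format. A client-supplied i/o layer is essentially a walk
// like this over the node tree which the core hands it (plus the
// reverse walk for reading).
void write_node( std::ostream & os, demo_node const & n, int depth = 0 )
{
    std::string const indent( depth * 2, ' ' );
    os << indent << n.name << " {\n";
    typedef std::map<std::string, std::string>::const_iterator PIt;
    for( PIt it = n.props.begin(); it != n.props.end(); ++it )
    {
        os << indent << "  " << it->first << " = " << it->second << "\n";
    }
    for( std::size_t i = 0; i < n.children.size(); ++i )
    {
        write_node( os, n.children[i], depth + 1 );
    }
    os << indent << "}\n";
}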

We're going to be a bit crude here, and simply show a lightly edited copy of a shell session...
First, a script which we use to mass-convert a given input file:
#!/bin/bash
S="compact expat funtxt funxml parens simplexml wesnoth"
inf=${1?"Usage: $0 input_filename"}

lsfilter()
{
   # Print only the size and filename columns of 'ls -l' output.
   ls "$@" | awk '{ print $5, $NF; }'
}
echo "Input file: "
lsfilter -l $inf
for s in $S; do
    echo -n $s...
    of=${inf%%.*}.$s
    time ~/bin/s11nconvert -f $inf -s $s -o $of
    lsfilter -l $of
    echo
done

Now some data... a file containing 54400 object nodes (much larger than the average data file):

stephan@owl:~/> ls -lS *.s11n
-rw-r--r-- 1 stephan users 4894806 2004-09-29 23:41 biggie.s11n
The data format of the input file is largely irrelevant, except that it will impact the overall runtime (some serializers read or write more slowly than others).
Run our "test":
stephan@owl:~/> ./stest.sh biggie.s11n
Input file:
4894806 biggie.s11n
compact...
real 0m3.360s
user 0m3.036s
sys 0m0.062s
2493010 biggie.compact
expat...
real 0m4.491s
user 0m3.958s
sys 0m0.214s
4720430 biggie.expat
funtxt...
real 0m3.897s
user 0m3.222s
sys 0m0.078s
4141438 biggie.funtxt
funxml...
real 0m3.781s
user 0m3.503s
sys 0m0.081s
4894806 biggie.funxml
parens...
real 0m3.160s
user 0m2.954s
sys 0m0.060s
2691751 biggie.parens
simplexml...
real 0m3.801s
user 0m3.471s
sys 0m0.078s
3658867 biggie.simplexml
wesnoth...
real 0m3.331s
user 0m3.100s
sys 0m0.083s
3902750 biggie.wesnoth


The actual load times, not including the startup time of s11nconvert, boil down to loading between 30k and 50k object nodes per second, depending on the data format, the layout of the objects, etc. (at those rates, the 54400 nodes above take roughly 1.1 to 1.8 seconds to load). The sample data included deeply nested containers of objects, each containing several properties (mostly numeric data with some strings).

Note that the above files don't use any sort of compression. If we enable compression in s11nconvert (via the -z and -bz flags), we can significantly reduce file sizes (assuming your copy of libzfstream was built with zlib/bz2lib support). Here are the same data files with and without compression (compressed via s11nconvert, not the gzip and bzip2 tools, though the results should be the same or very similar); sizes are in bytes:

Serializer    Uncompressed    gzip (.gz)    bzip2 (.bz2)
compact            2493010        178074           14010
expat              4720430        204032           36695
funtxt             4141438        199008           31513
funxml             4894806        205446           37072
parens             2691751        176088           23992
simplexml          3658867        184820           28020
wesnoth            3902750        196264           31438

Yes, those bz2 file sizes are real! That compressor beats most others hands down, but it is also notably slower than zlib. In fact, for large data sets, using zlib compression can actually speed up the overall read and write times by a small amount, because far less data passes through the i/o layer! bz2lib, however, is dog slow (but damned good).

Client code can set the compression level framework-wide with any of the following:
zfstream::compression_policy( zfstream::GZipCompression );
zfstream::compression_policy( zfstream::BZipCompression );
zfstream::compression_policy( zfstream::NoCompression );

That policy is respected by the s11n::io implementation.
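For instance, a client that wants bzip2-compressed output framework-wide might do something like the following. This is an unverified sketch: the compression_policy() calls are quoted from above, but the header path and the exact s11nlite::save() signature should be checked against your libs11n version:

#include <string>
#include <s11n.net/s11n/s11nlite.hpp> // header path may vary by version

// Sketch: select bzip2 compression for all subsequent stream output,
// then save an object as usual via the s11nlite convenience API.
template <typename SerializableT>
bool save_compressed( SerializableT const & obj, std::string const & file )
{
    zfstream::compression_policy( zfstream::BZipCompression );
    return s11nlite::save( obj, file ); // output is now bz2-compressed
}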