Tools

Tools for the command line and json

1. wgdb - general database management

wgdb is a simple command-line utility for creating and freeing a database, dumping and importing memory images, running tests and more.

Usage:

wgdb [shmname] <command> [command arguments]

The shared memory name identifies the database to use and is an arbitrary numeric value. If it is omitted, the default ("1000") will be used.
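
For example, assuming a second database named "2000" exists (the name is an arbitrary illustrative value), the following commands print information about the default database and about that database, respectively:

wgdb info
wgdb 2000 info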

Commands commonly available:

help (or "-h") - display this text.
version (or "-v") - display libwgdb version.
free - free shared memory.
export [-f] <filename> - write memory dump to disk
      (-f: force dump even if unable to get lock)
import [-l] <filename> - read memory dump from disk. Overwrites existing
      memory contents (-l: enable logging after import).
exportcsv <filename> - export data to a CSV file.
importcsv <filename> - import data from a CSV file.
replay <filename> - replay a journal file.
info - print information about the memory database.
add <value1> .. - store data row (only int or str recognized)
select <number of rows> [start from] - print db contents.
query <col> "<cond>" <value> .. - basic query.
del <col> "<cond>" <value> .. - like query. Matching rows are deleted from database.
createindex <column> - create ttree index.
createhash <columns> - create hash index (JSON support).
dropindex <index id> - delete an index.
listindex - list all indexes in database.
server [-l] [size] - provide persistent shared memory for other processes (Windows).
       (-l: enable logging in the database).
create [-l] [size [mode]] - create empty db of given size (non-Windows).
       (-l: enable logging in the database,
       mode: segment permissions (octal)).
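
As a quick (non-Windows) illustration of the commands above, the following sequence creates a database named "1009" of about 1 MB, stores a row, prints the database contents and finally frees the shared memory (the name and size are arbitrary illustrative values):

wgdb 1009 create 1000000
wgdb 1009 add 42 hello
wgdb 1009 select 10
wgdb 1009 free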

1.1. Importing and exporting data

Data may be exported to and imported from text files. This provides a way to exchange data between WhiteDB instances on different platforms, or with other data sources. The simplest format, which is always available, is CSV (comma-separated values).

Since there is no straightforward mapping between most WhiteDB types and the CSV format, and since CSV is not standardized, only limited support for data types is available. The following data types are recognized when importing from CSV:

  • NULL - empty string
  • int - plain integer
  • double - floating point number in fixed decimal notation
  • date - ISO8601 date
  • time - ISO8601 time, with fractions of a second
  • string - input data that does not match the above types

The field separator is , (comma). The decimal point separator is . (dot).
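
For example, a file with rows like the following (the values are purely illustrative) could be loaded with wgdb importcsv data.csv; the empty leading field of the second row is read as NULL and the trailing text fields are read as strings:

1,2.5,2013-10-24,23:17:36.68,hello world
,42,another string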

WhiteDB may also provide RDF support (if libraptor is available) and JSON support (under ongoing development, undocumented).

1.2. Memory dumps and journal logs

In the default configuration, WhiteDB runs in shared memory only. There are two methods of providing data persistence: memory dumps, which are snapshots of the current database contents, and journal logs, which contain incremental updates to the database contents. Logs are only available if the database is configured with the ./configure --enable-logging option and need to be explicitly enabled.

Managing memory dumps

A memory dump is the memory image of a WhiteDB database, saved into a file. It stores everything except concurrency control and journal file metadata. Images are not compatible between systems of different endianness or word size (the most common case would probably be 32-bit vs 64-bit systems). Also, different versions of the database library may use different formats for the memory image, so WhiteDB automatically refuses to import an image made with a different library version.

Type wgdb -v to list the compatibility information of the database library. It will display something like this:

libwgdb version: 0.7.0
byte order: little endian
compile-time features:
64-bit encoded data: yes
queued locks: yes
chained nodes in T-tree: yes
record backlinking: yes
child databases: no
index templates: yes

Memory dumps are suitable for creating snapshots of the database for backup purposes. Type

wgdb export imagename.bin

to create a backup of the current database (since the shared memory name was omitted, the wgdb utility will use "1000" by default).

wgdb import imagename.bin

This command restores the image from disk, completely overwriting the current memory contents. Note that if there is an existing memory segment, it must be large enough to fit the image; otherwise the import will fail with an error message and the shared memory segment will not be modified.

If the wgdb tool is unable to access the shared memory database (for example, due to a programming error that has left the database permanently locked), a "rescue" dump can be created with

wgdb export -f rescue.bin

Note, however, that care must be taken in such cases, as it is unknown what kind of errors the image may contain.

Managing journal logs

Journal logs provide a way of keeping a continuous backup of the database. All changes to the shared memory are logged incrementally, and the journal logs may be played back to repeat those changes, bringing the database to the same state it was in when the log file ended.

To enable journal logging, use

./configure --enable-logging --with-logdir=./logs

during the building of the database. Replace ./logs with wherever you’d like the library to store the journal files. For security reasons, this location and the names of the journal files cannot be changed at runtime. If you do not specify the log directory, /tmp (or \windows\temp) will be used by default, which may work for testing but is probably undesirable in the long term.

Journaling must be enabled at the time of database creation. To do this from the command line, supply the -l switch either to the wgdb create command or to the wgdb import command when importing an image. This causes a journal file to be created. It is placed in logdir/wgdb.journal.shmname, where shmname is the database name, for example "1000". The journal is then written incrementally until one of three things happens:

  1. A dump file is created, either by wgdb export or by calling the wg_dump() function.
  2. A dump file is imported (again, by command line or API).
  3. A journal file is replayed.

In each of these cases, the current journal file is backed up by appending .0 to its name (or .1 if .0 already exists, and so forth), and a fresh journal file is started.

The first case can be considered normal usage, as it creates a snapshot of the database. Since this snapshot contains everything that the journal has logged up to this point, the journal is no longer necessary for recovery and a fresh one will be used.

A dump file is imported either as part of a recovery process or simply to start working with a new image. In either case, the previous journal has become irrelevant to the current database contents, making it necessary to start a new one.

Finally, a journal replay is itself not logged in a journal. Therefore, the database contents after the replay and the journal file that was in use during the replay have become inconsistent, similarly to what happens when importing a memory dump. Generally, journal backups caused by these recovery actions should be cleaned up or moved away. The user is expected to handle this manually, case by case.

Journal log example

Assuming we use the ./logs directory to store the journal files and the database has been compiled with journal support, let’s start by creating a database with journaling enabled and adding some data:

wgdb 1011 create -l
wgdb 1011 add 1 2 3

The contents of the ./logs directory will now be:

-rw-rw-rw- 1 user group 27 Dec  7 22:00 wgdb.journal.1011

And the contents of the database, by typing wgdb 1011 select 10:

[1,2,3]

Let’s create a memory dump of this database. Make sure you don’t have example.bin already, as it will be overwritten.

wgdb 1011 export example.bin

Now we have these log files:

-rw-rw-rw- 1 user group  4 Dec  7 22:04 wgdb.journal.1011
-rw-rw-rw- 1 user group 27 Dec  7 22:00 wgdb.journal.1011.0

Note that the original file received the suffix .0 and a new one was created in its place. Let’s add more data:

wgdb 1011 add Hello world

The database now contains:

[1,2,3]
["Hello","world"]

Assume next that something has destroyed our database (try wgdb 1011 free to simulate this). We still have a recent dump called example.bin and some journals:

-rw-rw-rw- 1 user group 47 Dec  7 22:06 wgdb.journal.1011
-rw-rw-rw- 1 user group 27 Dec  7 22:00 wgdb.journal.1011.0

The newest journal is what we’re interested in. However, keep in mind that importing the dump restarts the journal. Our precious log file would not be destroyed (it would merely be renamed with a numeric suffix), but to avoid confusion, let’s move it somewhere safe first:

mv -i ./logs/wgdb.journal.1011 ./logs/recover.me.1011

Import the dump (note that the -l switch is used to re-enable journaling so that our updates will be logged again after we’ve completed the recovery):

wgdb 1011 import -l example.bin

Try to list the database contents (wgdb 1011 select 10):

[1,2,3]

Finally, recover the log:

wgdb 1011 replay ./logs/recover.me.1011

And the database contents after this will be

[1,2,3]
["Hello","world"]

We managed to restore the latest state of the database by first importing the dump file, then replaying the journal file that was written after the dump was created. If we now continued to modify the database, the state could again be recovered by importing example.bin and then replaying recover.me.1011 and the latest journal file wgdb.journal.1011, in that order.
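
Spelled out as commands, that hypothetical second recovery would follow the same move-the-journal-aside pattern as before (the name recover.me2.1011 is just an illustrative choice for the second archived journal):

mv -i ./logs/wgdb.journal.1011 ./logs/recover.me2.1011
wgdb 1011 import -l example.bin
wgdb 1011 replay ./logs/recover.me.1011
wgdb 1011 replay ./logs/recover.me2.1011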

The recovery process creates a number of intermediate journals, which we may now clean up with rm ./logs/wgdb.journal.1011.?. CAUTION: this command is safe to use here only because we moved the log file we cared about out of the way first. In general, it is better to make sure that the deleted log files do not contain anything that is not backed up elsewhere. The easiest way is to verify that the database is healthy and create a fresh memory dump.

An automated backup procedure

A reasonable procedure that helps keep track of image dumps and journals could be implemented with a script along the following lines (see the sketch below):

  1. dump the memory segment into a file such as backup.YYYYMMDD.shmname.bin
  2. check that the dump was successful
  3. move logdir/wgdb.journal.shmname.0 to journal.YYYYMMDD.shmname

This way, logdir always contains only the current journal. Assuming that YYYYMMDD represents the current date, memory dumps and journals can be archived and accessed by date. If recovery is needed, the database can be restored from the latest image and the current journal in logdir/wgdb.journal.shmname (which should first be archived separately). Any journal backups in logdir should then be removed after the restore is successful, to ensure that step 3 archives the correct journal file next time.
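
A minimal sketch of such a script could look like this (the variable values and the backup directory are illustrative, and it assumes that wgdb exits with a non-zero status on failure; adjust to your setup):

#!/bin/sh
# Sketch only: SHMNAME, LOGDIR and BACKUPDIR are assumptions, adjust to your setup.
SHMNAME=1000
LOGDIR=./logs
BACKUPDIR=./backups
DATE=$(date +%Y%m%d)

# Steps 1 and 2: dump the memory segment into a dated image file and
# check that it succeeded (assumes wgdb exits non-zero on failure).
if ! wgdb $SHMNAME export $BACKUPDIR/backup.$DATE.$SHMNAME.bin; then
    echo "dump failed, leaving the journal in place" >&2
    exit 1
fi

# Step 3: the dump rotated the journal to a .0 suffix; archive it by date.
mv $LOGDIR/wgdb.journal.$SHMNAME.0 $BACKUPDIR/journal.$DATE.$SHMNAME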

2. dserve - simple REST queries with json

Note: dserve is a work in progress and is not yet part of the WhiteDB distribution.

This is a simple REST service tool that takes a query described with cgi parameters and prints json or csv output.

dserve is useful both as a ready-made tool and as an example/template for making your own tools for WhiteDB data handling. dserve is not compiled by default. Find it in the Examples folder as a single source file, dserve.c; modify it as needed and compile it yourself with:

gcc dserve.c -o dserve -O2 -lwgdb

Copy the resulting executable into the cgi-bin folder of the Apache server.

You can use dserve as a cgi program taking a few GET parameters, like this:

http://myserver.com/cgi-bin/dserve?op=search&from=0&count=5

or as a command-line program: simply pass the urlencoded query string as a single argument, for example:

dserve 'op=search&from=2&count=3'

This will print rows 2..4 of the database, like this:

content-length: 110
content-type: application/json

[
[1,1.1,"a simple string",[10,[],"point to me"],"2013-10-24","23:17:36.68"],
[12,"an:uri",-2000],
[23,-12]
]

Observe that the fourth field contains a pointer to another record, printed as a sublist. The same example with a small addition:

dserve 'op=search&from=2&count=3&showid=yes'

The added showid=yes parameter prepends the automatic record id (an encoded offset) to each record as the first element:

content-length: 134
content-type: application/json

[
[23368,1,1.1,"a simple string",[23304,10,[],"point to me"],"2013-10-24","23:17:36.68"],
[23408,12,"an:uri",-2000],
[23432,23,-12]
]

The third example asks for the first three records where field 0 is less than 20:

dserve 'op=search&field=0&value=20&compare=lessthan&count=3'

content-length: 178
content-type: application/json

[
[10,[],"point to me"],
[0,0.1,"a simple string",[10,[],"point to me"],"2013-10-24","23:17:36.68"],
[1,1.1,"a simple string",[10,[],"point to me"],"2013-10-24","23:17:36.68"]
]

dserve does not provide a way to add or update records in the database. We may add such a cgi utility in the future, but in any case it is a good idea to write your own utilities that fit your needs exactly.

Most of the dserve parameters are optional, i.e. they have sensible defaults. Limits, error messages etc. are all configurable by changing the macro definitions at the beginning of the C source.

2.1. Query by field values

You have to indicate the field numbers and the values to compare the contents of these fields against. The full set of cgi parameters is as follows:

  • db : numeric database name. Default 1000.
  • op : either search or recids. Here we assume search; recids is covered later.
  • field : field number (0,1,2, etc) to check against the compared value.
  • value : value to check the field contents against. Examples: value=32, value=sometext.
  • type : datatype of the value: null, int, double, str, char or record. Guessed from the value by default.
  • compare : equal, not_equal, lessthan, greater, ltequal or gtequal. Default equal.
  • from : skip initial matching records, start from the result nr given here. Default 0.
  • count : max number of matching records output. Default 10000.
  • depth : max depth of nested trees of records. Default 100, hard limit 10000. Set to 0 for hiding all sublists. Modify the macro MAX_DEPTH_HARD to change the hard limit.
  • format : either json or csv. Default json.
  • escape : how special characters in json strings are escaped: either no (replace nothing), url (urlencode %, " and all non-printable and non-ascii characters, i.e. below 32 and above 126, to be completely ascii-safe) or json (not ascii-safe: replace only the minimal set indicated in the RFC). Default json. NB! This parameter has no effect for csv, where only " gets replaced with "".
  • showid : yes or no. If yes, print the record offset (automatic id) as the first element of each record, shifting all the other fields one position to the right. Default no.

All the input values are assumed to be urlencoded.
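
For example, the following query combines several of these parameters: it asks database 1000 for up to five records where field 0 is less than 20, formatted as csv with the record id prepended to each row (the values are illustrative):

dserve 'op=search&db=1000&field=0&value=20&compare=lessthan&count=5&format=csv&showid=yes'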

The json output is a list of matching records. Each record is also presented as a list. The list elements are integers, doubles, strings or records (again represented as lists) pointed to from the field. Null is printed as an empty list [].

All the other WhiteDB datatypes, like dates, times, URI-s, strings with a language attribute etc, are converted to standard strings in a fairly intuitive manner. Blobs are url-encoded. For csv output and for json branches that exceed the depth limit, the full internal record is represented by its offset (encoded pointer, i.e. automatic id).

Notice that for a graph database the json output can be a complex tree of records. Cycles are possible, but can be inhibited by the depth parameter.
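
For instance, to hide all nested sublists one could set depth to zero (the count value is illustrative):

dserve 'op=search&count=5&depth=0'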

Errors are reported by printing an error string as a single element of the output list, both for json and csv, like this:

content-length: 48
content-type: application/json

["unrecognized op: use op=search or op=recids"]

On Linux dserve should be able to free locks and detach the database even in case of hard errors like segfaults.

Importantly, you can give several sets of field/value/compare/type parameters to perform a complex and-query. Example:

dserve 'op=search&field=1&value=100&type=int&compare=lessthan&field=2&value=10.3&type=double&compare=greater'

If you use several fields in the query, you have to fill in the otherwise optional type and compare parameters for all the fields indicated. It is OK to put the same field into several comparisons.

You can also have no fields in the query at all. In this case the full database will be traversed and printed according to the parameters given.

2.2. Query by record id-s

The second way to query is to simply indicate a list of record id-s (offsets: encoded record pointers) like this:

dserve 'op=recids&recids=12000,10236,22458'

You can learn the record id-s by using the showid=yes parameter described before.
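
For example, re-running the earlier showid query and then fetching two of the listed records directly (the id values 23368 and 23432 come from that example's output and will differ in your database):

dserve 'op=search&from=2&count=3&showid=yes'
dserve 'op=recids&recids=23368,23432'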

The database selection parameter db and the generic formatting parameters depth, format, escape and showid function exactly as described above.

2.3. Good to know

Things to note and useful code examples in dserve:

  • does not require additional libraries except wgdb
  • uses readlocks, does not use writelocks
  • cgi parameter parsing
  • various printing routines for WhiteDB values
  • text is printed into an automatically growing string buffer
  • error and timeout handling with signals: when a signal arrives, free the readlock, detach database and output ["internal error"] or ["timeout"]. Available on Linux only.