Tensorlog with Tensorflow: More Tricks
The comments below assume you've created a compiler object `tlog`.
It's much faster to load a database in serialized form. To serialize the database after you've loaded it, use `tlog.serialize_db(directory_name)`. Here `directory_name` is a string which names a directory (and files in it will be opened with Python's `open` routine). The convention is that serialized database directories end in `.db`.
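For instance, a minimal sketch, with a hypothetical directory name `mydata.db`:

```python
# Write the currently-loaded database out as a serialized directory
tlog.serialize_db('mydata.db')
```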
When you create a compiler, you can specify something like `db="foo.db|foo.cfacts"`, and Tensorlog will use the serialized version `foo.db` if it exists, and otherwise parse the data in `foo.cfacts` and cache it into `foo.db` for later.
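As a sketch, assuming the `simple.Compiler` interface and hypothetical filenames `foo.cfacts` and `foo.ppr` (your constructor arguments may differ):

```python
from tensorlog import simple

# Use foo.db if it exists; otherwise parse foo.cfacts and cache it as foo.db
tlog = simple.Compiler(db='foo.db|foo.cfacts', prog='foo.ppr')
```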
TensorLog's interactions with the data are a little complicated, because `tlog.db` is just one version of the database, defined as a bunch of sparse matrices. You can load these in serialized form, or write them back out. Some of the database relations are marked as parameters, so Tensorflow will update them during training; however, for various reasons, Tensorflow's updates are done on local copies of these relations, not on the relations themselves. So if you want to save something you've learned as a serialized database, the proper incantations are:
```python
tlog.set_all_db_params_to_learned_values(session)
tlog.serialize_db(directory_name)
```
If it's necessary to save things to a stream instead, it's a little more complicated. Instead of a directory, you need to set up two streams, one to hold schema information and one to hold the actual data. (The schema stream `schema_fp` will be accessed with the method `schema_fp.write(line_of_text)`, and the data stream will be passed along to `scipy.io.savemat`.) You can open the streams and incant:
```python
tlog.db.serializeDataTo(data_stream)
tlog.db.schema.serializeTo(schema_fp)
```
The `tlog.db.serializeDataTo` method takes a keyword argument: `filter="fixed"` will save just the non-trainable relations, and `filter="params"` just the trainable ones. The default is to save everything.
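Putting this together, a sketch with hypothetical filenames (`mydata.schema` for the schema text, and `mydata.mat` for the matrix data that `scipy.io.savemat` will write):

```python
# Save the trainable relations and the schema to two separate streams
with open('mydata.mat', 'wb') as data_stream, open('mydata.schema', 'w') as schema_fp:
    tlog.db.serializeDataTo(data_stream, filter='params')
    tlog.db.schema.serializeTo(schema_fp)
```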
Tensorlog in native mode uses all sparse matrices. Tensorflow can only handle sparse matrices in limited ways, so some of the data is stored in dense form, which uses more storage: roughly speaking, the inputs and outputs of inference functions are dense, and the matrices that make up the relations stay sparse. (Note that it's ok to have a sparse matrix be a learnable parameter.) Typically the inputs will be one-hot vectors that each indicate a particular database entity, and the outputs will be distributions over database entities.
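As a sketch of what those inputs look like, assuming the database's `onehot` helper and a hypothetical entity name `william`:

```python
# A sparse one-hot row vector selecting the entity 'william'
x = tlog.db.onehot('william')
x_dense = x.todense()  # the dense form that Tensorflow inference functions consume
```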
There are two tricks to make the dense storage more practical. The first trick is using minibatches to convert a set of examples (in sparse format) into dense format (for Tensorflow) a little at a time, rather than all at once. There's an example of this in `Tensorlog/datasets/wikimovies/demo.py`, but the basic idea is below:
```python
train_data = tlog.load_big_dataset('mydata.exam')
for i in range(epochs):
    for mode, (x, y) in tlog.minibatches(train_data, batch_size=128):
        train_batch_fd = {tlog.input_placeholder_name(mode): x,
                          tlog.target_output_placeholder_name(mode): y}
        session.run(train_step, feed_dict=train_batch_fd)
```
The second trick is to set up your program and database to use types, as described in the database docs. Then the input/output vectors will be distributions over database entities of some particular type, which means that they are shorter. To type a relation, you need to insert type declarations in the `.cfacts` file which defines it. If you type anything, you need to type everything. This includes the predicate that is associated with your training and test examples (`path` in the example from the last section), but not intermediate predicates that your rules might define.
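As a sketch, assuming the `# :- pred(type1,type2)` declaration syntax from the database docs and a made-up type name `node_t`, a typed `.cfacts` file might begin like this:

```
# :- path(node_t,node_t)
# :- edge(node_t,node_t)
edge	a	b
edge	b	c
```

The tab-separated fact lines are unchanged from an untyped file; only the declaration lines are added.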