Tensorlog with Tensorflow: More Tricks

The comments below assume you've created a compiler object tlog.

Faster loading

It's much faster to load a database in serialized form. To serialize the database after you've loaded it, use tlog.serialize_db(directory_name). Here directory_name is a string which names a directory (the files in it will be opened with Python's open routine). The convention is that serialized database directories end in .db.

When you create a compiler, you can specify something like db="foo.db|foo.cfacts", and Tensorlog will use the serialized version foo.db if it exists, and otherwise parse the data in foo.cfacts and cache it into foo.db for later.
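
For instance (a minimal sketch: the simple.Compiler entry point and the file names foo.cfacts, foo.ppr, and foo.db are assumptions here):

    from tensorlog import simple

    # use the serialized foo.db if it exists; otherwise parse foo.cfacts
    # and cache it into foo.db for later runs
    tlog = simple.Compiler(db="foo.db|foo.cfacts", prog="foo.ppr")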

Saving What Was Learned

Tensorlog's interactions with the data are a little complicated, because tlog.db is just one version of the database, defined as a bunch of sparse matrices. You can load these in serialized form, or write them back out. Some of the database relations are marked as parameters, so Tensorflow will update them during training; however, for various reasons, Tensorflow's updates are done on local copies of these relations, not on the relations themselves. So if you want to save something you've learned as a serialized database, the proper incantations are:

  # copy Tensorflow's locally-updated parameter values back into tlog.db
  tlog.set_all_db_params_to_learned_values(session)
  # then write the whole database out to the named directory
  tlog.serialize_db(directory_name)

If it's necessary to save things to a stream instead, it's a little more complicated. Instead of a directory, you need to set up two streams, one to hold schema information, and one to hold the actual data. (The schema stream schema_fp will be accessed by the method schema_fp.write(line_of_text) and the data stream will be passed along to scipy.io.savemat.) You can open the streams and incant

  # the fact matrices go to the data stream (via scipy.io.savemat)
  tlog.db.serializeDataTo(data_stream)
  # the schema is written to schema_fp as lines of text
  tlog.db.schema.serializeTo(schema_fp)

tlog.db.serializeDataTo also takes a filter keyword argument: filter="fixed" saves just the non-trainable relations, and filter="params" saves just the trainable ones. The default is to save everything.
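
Putting these together, a sketch of saving just the learned, trainable relations to streams might look like the following (the file names are hypothetical; the data stream is opened in binary mode since scipy.io.savemat writes a binary format):

    tlog.set_all_db_params_to_learned_values(session)
    with open("learned.schema", "w") as schema_fp, \
         open("learned.data", "wb") as data_stream:
      tlog.db.serializeDataTo(data_stream, filter="params")
      tlog.db.schema.serializeTo(schema_fp)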

Typed relations and minibatches

In native mode, Tensorlog uses sparse matrices throughout. Tensorflow can only handle sparse matrices in limited ways, so some of the data is stored in dense form, which uses more storage: roughly speaking, the inputs and outputs of inference functions are dense, and the matrices that make up the relations stay sparse. (Note that it's fine for a sparse matrix to be a learnable parameter.) Typically the inputs will be one-hot vectors that indicate particular database entities, and the outputs will be distributions over database entities.
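
To make the storage difference concrete, here is an illustrative sketch in plain scipy/numpy (not part of the Tensorlog API) of a one-hot input row in both forms:

    import numpy as np
    import scipy.sparse as sp

    # a one-hot row for entity number 17 in a 10,000-entity domain: the
    # sparse form stores a single nonzero, the dense form 10,000 floats
    x_sparse = sp.csr_matrix(([1.0], ([0], [17])), shape=(1, 10000), dtype='float32')
    x_dense = np.asarray(x_sparse.todense())  # shape (1, 10000)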

There are two tricks to make the dense storage more practical. The first trick is using minibatches to convert a set of examples (in sparse format) into dense format (for Tensorflow) a little at a time, rather than all at once. There's an example of this in Tensorlog/datasets/wikimovies/demo.py, but the basic idea is below:

    # load the dataset in sparse form; minibatches() converts each
    # batch to dense form just before it is fed to Tensorflow
    train_data = tlog.load_big_dataset('mydata.exam')
    for i in range(epochs):
      for mode, (x, y) in tlog.minibatches(train_data, batch_size=128):
        # bind the dense batch to the placeholders of this mode's inference function
        train_batch_fd = {tlog.input_placeholder_name(mode): x,
                          tlog.target_output_placeholder_name(mode): y}
        session.run(train_step, feed_dict=train_batch_fd)

The second trick is to set up your program and database to use types, as described in the database docs. The input/output vectors will then be distributions over database entities of one particular type, which makes them shorter. To type relations, you need to insert type declarations in the .cfacts file that defines them. If you type anything, you need to type everything: this includes the predicate associated with your training and test examples (path in the example from the last section), but not the intermediate predicates that your rules might define. A hypothetical typed .cfacts file is sketched below.
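
For illustration, a small typed .cfacts file might look like this sketch. The predicate and type names are hypothetical, and the # :- declaration form is the one described in the database docs (fact columns are tab-separated):

    # :- path(city_t,city_t)
    # :- edge(city_t,city_t)
    edge	boston	newyork
    edge	newyork	philly

Here path is the training/test predicate (declared even though its facts come from examples and rules), and edge is an ordinary database relation.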

Using Tensorlog inference functions inside a Tensorflow expression

Using Tensorflow expressions inside a Tensorlog inference function
