krivard edited this page Jan 16, 2014 · 1 revision

Building a dataset

Writing a logic program

You might start with a dataset, or with a set of queries (at least partially labeled with correct responses), or both.

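For instance (a hypothetical starting point, not from the original page), you might have a corpus of web pages plus a list of classification queries, some of which already have known answers:

```
dataset: a corpus of web pages and the words each page contains
queries: predict(p1,Class)   (known answer: sports)
         predict(p2,Class)   (known answer: politics)
         predict(p3,Class)   (answer unknown)
```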

Figure out what relationships your dataset encodes, and/or what relationships between facts you want to use to answer the queries.

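Continuing the hypothetical web-page example, the dataset might encode two relationships:

```
hasWord(Page,Word)   -- Page contains Word
isLabel(Class)       -- Class is one of the known categories
```

and the relationship you want to exploit when answering queries is that a page's class can be predicted from the words it contains.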

Express the list of queries as Prolog goals, one per line, in a .queries file.

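A hypothetical .queries file for the web-page example, with one goal per line:

```prolog
predict(p1,Class)
predict(p2,Class)
predict(p3,Class)
```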

Express the knowledge base as a set of facts files. Predicates with one argument, or with three or more, must go in a .facts file; 2-argument predicates can be stored more efficiently as a graph or sparse graph.

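Hypothetical KB files for the web-page example, assuming the tab-separated layout ProPPR uses for fact files (functor first, then arguments); the commented filenames below are annotations, not part of the files, and you should check the file-format documentation for your release:

```
# isLabel.facts -- 1-argument predicate, so it must be a .facts file
isLabel	sports
isLabel	politics

# hasWord.graph -- 2-argument predicate, so it can be stored as a graph
hasWord	p1	football
hasWord	p1	election
hasWord	p2	senate
```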

Express the logical rules you wish the computer to employ as a set of Prolog rules in a .rules file.

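A hypothetical .rules file for the web-page example. The feature annotation after `#` is only a sketch of ProPPR's rule-feature syntax, which has changed across releases; consult the rules-file documentation for the exact form your version expects:

```prolog
% guess a class for a page, scored by word/class features
predict(Page,Class) :- isLabel(Class),classify(Page,Class).
classify(Page,Class) :- hasWord(Page,Word)  # w(Word,Class).
```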

Generating Training Data

For each training example (query), ProPPR needs at least one positive and at least one negative example. Generating positive examples is easy if you already have the correct answers. Negative examples are more challenging, since you have to find solutions to the query that are reachable through the logic program but are nonetheless incorrect. If you simply put "reasonable" negative examples in the training file, they are not guaranteed to ever appear in the output, and ProPPR won't actually be able to use them in training.

Take the queries for which you know the correct answers, and run them on the logic program using QueryAnswerer. Then process the solutions file to determine whether ProPPR could obtain the correct answer to each query. For now, we're not concerned with the ranking of the correct answer, only whether the pool of logical solutions to a query included the gold/positive solution. If the correct answer was reachable for only a few queries, you will want to rewrite your rules file or add more facts to the KB.
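The recall check above can be sketched in a few lines of Python. The solutions-file layout assumed here (a `# <query>` header line followed by `rank<TAB>score<TAB>solution` lines) is a guess at the format, not a guarantee; adapt the parser to whatever your ProPPR version actually emits.

```python
def parse_solutions(lines):
    """Yield (query, [solution, ...]) pairs, solutions in rank order.

    Assumed format: a '# <query>' header line, then one
    'rank<TAB>score<TAB>solution' line per ranked solution.
    """
    query, sols = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            if query is not None:
                yield query, sols
            query, sols = line[1:].strip(), []
        else:
            rank, score, solution = line.split("\t")
            sols.append(solution)
    if query is not None:
        yield query, sols

def recall(lines, gold):
    """Fraction of gold-labeled queries whose correct answer was reachable."""
    hits = total = 0
    for query, sols in parse_solutions(lines):
        if query in gold:
            total += 1
            hits += gold[query] in sols
    return hits / total if total else 0.0
```

If `recall` comes back low, that is the signal to rewrite the rules or extend the KB before moving on.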

Once you're satisfied with the recall of your logic program, drop the queries whose correct answers could not be reached, as well as those that produced only one solution (remember, we need at least two: one positive and one negative). Collect negative examples for each query by taking every incorrect solution whose rank is a perfect square (the 1st, 4th, 9th, 16th, 25th, etc., until you run out of incorrect answers for the query). This method preferentially samples high-ranked negative examples without ignoring low-ranked ones, and was introduced to our lab by Ni Lao.
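The perfect-square sampling scheme is simple enough to state as code. A minimal sketch (the function name is ours):

```python
def sample_negatives(incorrect):
    """Keep the incorrect solutions at ranks 1, 4, 9, 16, 25, ...

    incorrect: incorrect solutions in rank order, best-ranked first
    (rank 1 is index 0). Stops once the next perfect square exceeds
    the number of available incorrect solutions.
    """
    sampled = []
    k = 1
    while k * k <= len(incorrect):
        sampled.append(incorrect[k * k - 1])  # rank k*k
        k += 1
    return sampled
```

Because the gaps between consecutive squares grow, the sample is dense near the top of the ranking and sparse near the bottom, which is exactly the "preferentially high-ranked, but not exclusively" behavior described above.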

Finally, construct a training file that includes, on each line, a query, its correct answer(s) (marked +), and the sampled incorrect answers reachable through the logic program (marked -).
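A sketch of assembling one such line, assuming a tab-separated layout with `+`/`-` prefixes as described above; double-check the training-file format expected by your ProPPR version before relying on it:

```python
def training_line(query, positives, negatives):
    """Build one tab-separated training line: query, +answers, -answers."""
    fields = [query]
    fields += ["+" + p for p in positives]  # correct answers
    fields += ["-" + n for n in negatives]  # sampled incorrect answers
    return "\t".join(fields)
```

For the running web-page example, `training_line("predict(p1,Class)", ["predict(p1,sports)"], ["predict(p1,politics)"])` would produce one line of the training file.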