v0.8.0

@lukehsiao lukehsiao released this 08 Apr 00:40

0.8.0 - 2020-04-07

A summary of the changes in this release is below. Check the Changelog for more details.

Rather than maintaining a separate learning engine, Fonduer now uses Emmental, a deep learning framework for multi-task learning. Moving to a more general learning framework lets Fonduer support more applications as well as multi-task learning.

Example usage:
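# With Emmental, learning takes the following steps:
# 1. Create a task for each relation and an EmmentalModel to learn those tasks.
# 2. Wrap candidates in an EmmentalDataLoader for training.
# 3. Train and run inference (prediction).
#
# ATTRIBUTE, train_cands, test_cands, F_train, F_test, and train_marginals
# are assumed to come from the earlier stages of the pipeline.

import emmental
import fonduer
import numpy as np
from emmental.data import EmmentalDataLoader
from emmental.learner import EmmentalLearner
from emmental.model import EmmentalModel
from emmental.modules.embedding_module import EmbeddingModule
from fonduer.learning.dataset import FonduerDataset
from fonduer.learning.task import create_task
from fonduer.learning.utils import collect_word_counter

# Collect the word counter from the candidates; the LSTM model uses it.
word_counter = collect_word_counter(train_cands)

# Initialize Emmental. To customize Emmental, see:
# https://emmental.readthedocs.io/en/latest/user/config.html
emmental.init(fonduer.Meta.log_path)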

#######################################################################
# 1. Create a task for each relation and an EmmentalModel to learn those tasks.
#######################################################################

# Generate the special tokens that the LSTM model uses to locate mentions.
# Each mention in a sentence is wrapped in a pair of special tokens so that
# the LSTM can learn where the mentions are. Example (indices match the
# 0-indexed tokens generated below):
# Original sentence: Then Barack married Michelle.
# ->  Then ~~[[0 Barack 0]]~~ married ~~[[1 Michelle 1]]~~.
arity = 2
special_tokens = []
for i in range(arity):
    special_tokens += [f"~~[[{i}", f"{i}]]~~"]

# Generate word embedding module for LSTM.
emb_layer = EmbeddingModule(
    word_counter=word_counter, word_dim=300, specials=special_tokens
)

# Create a task for each relation.
tasks = create_task(
    ATTRIBUTE,
    2,  # arity of the relation
    F_train[0].shape[1],  # number of features
    2,  # number of classes
    emb_layer,
    mode="mtl",
    model="LogisticRegression",
)

# Create Emmental model to learn the tasks.
model = EmmentalModel(name=f"{ATTRIBUTE}_task")

# Add tasks into model
for task in tasks:
    model.add_task(task)

#######################################################################
# 2. Wrap candidates into EmmentalDataLoader for training.
#######################################################################

# Use only the samples that carry a label signal: filter out samples whose
# marginals are (nearly) identical across classes.
diffs = train_marginals.max(axis=1) - train_marginals.min(axis=1)
train_idxs = np.where(diffs > 1e-6)[0]

# Create a dataloader with weakly supervised samples to train the model.
train_dataloader = EmmentalDataLoader(
    task_to_label_dict={ATTRIBUTE: "labels"},
    dataset=FonduerDataset(
        ATTRIBUTE,
        train_cands[0],
        F_train[0],
        emb_layer.word2id,
        train_marginals,
        train_idxs,
    ),
    split="train",
    batch_size=100,
    shuffle=True,
)


# Create the test dataloader used for prediction.
test_dataloader = EmmentalDataLoader(
    task_to_label_dict={ATTRIBUTE: "labels"},
    dataset=FonduerDataset(
        ATTRIBUTE, test_cands[0], F_test[0], emb_layer.word2id, 2
    ),
    split="test",
    batch_size=100,
    shuffle=False,
)


#######################################################################
# 3. Training and inference (prediction).
#######################################################################

# Learn the tasks.
emmental_learner = EmmentalLearner()
emmental_learner.learn(model, [train_dataloader])

# Predict with the learned model.
test_preds = model.predict(test_dataloader, return_preds=True)
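
The marginal filter from step 2 can be tried in isolation. The following is a minimal sketch, assuming train_marginals is a NumPy array of shape (n_samples, n_classes) holding per-class probabilities. The toy values are made up for illustration and do not come from Fonduer.

```python
import numpy as np

# Hypothetical marginals for four candidates and two classes (toy values).
train_marginals = np.array(
    [
        [0.5, 0.5],  # uninformative: both classes equally likely
        [0.9, 0.1],
        [0.2, 0.8],
        [0.5, 0.5],  # uninformative
    ]
)

# Spread between the most and least likely class for each candidate.
diffs = train_marginals.max(axis=1) - train_marginals.min(axis=1)

# Keep only candidates whose marginals are not (nearly) uniform.
train_idxs = np.where(diffs > 1e-6)[0]

print(train_idxs.tolist())  # prints [1, 2]
```

Candidates 0 and 3 are dropped because their marginals give no preference for either class, so they would contribute no training signal.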

Changed

  • @senwu: Switch to Emmental as the default learning engine.
  • @HiromuHota: Change ABSTAIN to -1 to be compatible with Snorkel 0.9.X. Accordingly, user-defined labels should now be 0-indexed (previously 1-indexed). (#310) (#320)
  • @HiromuHota: Use executemany_mode="batch" instead of deprecated use_batch_mode=True. (#358)
  • @HiromuHota: Use tqdm.notebook.tqdm instead of deprecated tqdm.tqdm_notebook. (#360)
  • @HiromuHota: To support ImageMagick7, expand the version range of Wand. (#373)
  • @HiromuHota: Comply with PEP 561 for type-checking code that uses Fonduer.
  • @HiromuHota: Make UDF.apply of all child classes unaware of the database backend, meaning PostgreSQL is not required if UDF.apply is directly used instead of UDFRunner.apply. (#316) (#368)
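
As a rough illustration of the new label convention, the sketch below assumes a binary task with ABSTAIN set to -1 and 0-indexed class labels. The names FALSE, TRUE, and lf_contains_keyword, as well as the plain-string input, are hypothetical stand-ins chosen for this example. Real labeling functions operate on Fonduer candidate objects.

```python
ABSTAIN = -1  # abstain value as of this release
FALSE = 0  # class labels are now 0-indexed
TRUE = 1


def lf_contains_keyword(text: str) -> int:
    """Toy labeling function: vote based on keywords in a plain string."""
    lowered = text.lower()
    if "married" in lowered:
        return TRUE
    if "divorced" in lowered:
        return FALSE
    return ABSTAIN  # no evidence either way


votes = [
    lf_contains_keyword(t)
    for t in ["Then Barack married Michelle.", "They divorced.", "Hello."]
]
print(votes)  # prints [1, 0, -1]
```

Labeling functions written for earlier releases with 1-indexed labels would need their return values shifted to match this convention.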

Fixed

  • @senwu: Fix mention extraction to return mention classes instead of data model classes.