
Use Text as Node Features#

Many real-world graphs have text content as node features, e.g., the title and description of a product, or the questions and comments from users. To leverage this text content, GraphStorm supports language models (LMs), i.e., HuggingFace BERT models, to embed text content and use these embeddings in graph model training and inference.

There are two modes of using LMs in GraphStorm:

  • Embed text content with pre-trained LMs and use the embeddings as input node features, without fine-tuning the LMs. Training in this mode is fast and memory consumption is lower. However, in some cases pre-trained LMs may not fit the graph data well and fail to improve performance.

  • Co-train both the LMs and graph machine learning (GML) models in the same training loop. This fine-tunes the LMs to fit the graph data. In many cases this mode improves performance, but co-training the LMs consumes much more memory, particularly GPU memory, and takes much longer to complete the training loops.

To use LMs in GraphStorm, users can follow the same procedure as in the Use Your Own Data tutorial with some minor changes.

  • Step 1. Prepare raw data to include texts as node data;

  • Step 2. Use GraphStorm graph construction tools to tokenize texts and set tokens as node features;

  • Step 3. Configure GraphStorm to use LMs to embed tokenized texts as input node features; and

  • Step 4. If needed, configure GraphStorm to co-train the LM and GNN models.

Note

All commands below are designed to run in a GraphStorm Docker container. Please refer to the GraphStorm Docker environment setup to prepare your environment.

If you set up the GraphStorm environment with pip packages, please replace all occurrences of “2222” in the --ssh-port argument with 22, and clone the GraphStorm toolkits.

Prepare Raw Data#

This tutorial will use the same ACM data as the Use Your Own Data tutorial to demonstrate how to prepare text as node features.

First, go to the /graphstorm/examples/ folder.

cd /graphstorm/examples

Then run the following command to create the ACM data in the required raw_w_text format.

python3 /graphstorm/examples/acm_data.py --output-path /tmp/acm_raw --output-type raw_w_text

Once successful, the command will create a set of folders and files under the /tmp/acm_raw/ folder, similar to the outputs in the Use Your Own Data tutorial. However, the config.json file contains a few extra lines that list the text feature columns and specify how they should be processed during graph construction.

The following snippet shows the information for author nodes. It indicates that the “text” column contains text features, and it requires GraphStorm’s graph construction tool to use a HuggingFace BERT model named bert-base-uncased to tokenize these text features during construction.

"nodes": [
    {
        "node_type": "author",
        "format": {
            "name": "parquet"
        },
        "files": [
            "/tmp/acm_raw/nodes/author.parquet"
        ],
        "node_id_col": "node_id",
        "features": [
            {
                "feature_col": "feat",
                "feature_name": "feat"
            },
            {
                "feature_col": "text",
                "feature_name": "text",
                "transform": {
                    "name": "tokenize_hf",
                    "bert_model": "bert-base-uncased",
                    "max_seq_length": 16
                }
            }
        ]
    }
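
If you are preparing your own graph data rather than using the generated ACM files, the author.parquet file referenced above only needs columns that match the configuration, i.e., node_id, feat, and text. Below is a minimal sketch using pandas; the column values and the feature dimension are hypothetical placeholders, and the real acm_data.py script produces much richer contents.

import pandas as pd

# Hypothetical minimal author table; the column names must match config.json.
authors = pd.DataFrame({
    "node_id": ["author_0", "author_1"],
    "feat": [[0.1] * 256, [0.2] * 256],        # numeric feature vectors (placeholder size)
    "text": ["Jane Doe, database systems",      # raw text to be tokenized later
             "John Smith, graph learning"],
})
authors.to_parquet("/tmp/acm_raw/nodes/author.parquet")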

Construct Graph#

We then use the graph construction tool to process the ACM raw data into a graph for GraphStorm model training with the following command.

python3 -m graphstorm.gconstruct.construct_graph \
           --conf-file /tmp/acm_raw/config.json \
           --output-dir /tmp/acm_nc \
           --num-parts 1 \
           --graph-name acm

The outputs of this command are the same as those described in the Outputs of Graph Construction, except that the paper, author, and subject nodes each have three additional features, named input_ids, attention_mask, and token_type_ids, which are generated by the BERT tokenizer.
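
For reference, the tokenize_hf transform roughly corresponds to running a HuggingFace tokenizer over the text column, along the lines of the sketch below (a simplified illustration, not GraphStorm’s actual construction code).

from transformers import AutoTokenizer

# Simplified illustration of what the tokenize_hf transform produces.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(
    "Heterogeneous graph neural networks for citation data",
    max_length=16,            # matches max_seq_length in config.json
    truncation=True,
    padding="max_length",
    return_tensors="np",
)
# These three arrays become the node features input_ids, attention_mask,
# and token_type_ids after graph construction.
print(tokens["input_ids"], tokens["attention_mask"], tokens["token_type_ids"])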

GraphStorm Language Model Configuration#

Users can set up language models in GraphStorm’s configuration YAML file. Below is an example of such a configuration for the ACM data. The full configuration YAML file, acm_lm_nc.yaml, is located under GraphStorm’s examples/use_your_own_data folder.

lm_model:
  node_lm_models:
    -
      lm_type: bert
      model_name: "bert-base-uncased"
      gradient_checkpoint: true
      node_types:
        - paper
        - author
        - subject

The current version of GraphStorm supports pre-trained BERT models from the HuggingFace repository on nodes only. Users can choose any HuggingFace BERT model, but the value of model_name MUST be the same as the one specified in the raw data JSON file’s bert_model field. In this example, it is the bert-base-uncased model.

The node_types field lists the types of nodes that have tokenized text features. In this ACM example, all three node types have tokenized text features, and all of them are listed in the configuration YAML file.

Launch GraphStorm Training without Fine-tuning BERT Models#

With the above GraphStorm configuration YAML file, we can launch GraphStorm model training with the same commands as in Step 3: Launch training script on your own graphs.

First, we create the ip_list.txt file for the standalone mode.

touch /tmp/ip_list.txt
echo 127.0.0.1 > /tmp/ip_list.txt

Then, the launch command is almost the same except that in this case the configuration file is acm_lm_nc.yaml, which contains the language model configurations.

python3 -m graphstorm.run.gs_node_classification \
        --workspace /tmp \
        --part-config /tmp/acm_nc/acm.json \
        --ip-config /tmp/ip_list.txt \
        --num-trainers 4 \
        --num-servers 1 \
        --num-samplers 0 \
        --ssh-port 2222 \
        --cf /tmp/acm_lm_nc.yaml \
        --save-model-path /tmp/acm_nc/models \
        --node-feat-name paper:feat author:feat subject:feat

During training, GraphStorm first uses the specified BERT model to compute text embeddings for the specified node types. The text embeddings and the other node features are then concatenated together as the input node features for GNN model training.
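
Conceptually, this step is similar to the sketch below (an illustration only, not GraphStorm’s internal code; the feature dimension 256 is a placeholder).

import torch
from transformers import AutoModel

# Illustration: embed tokenized text with a frozen BERT model and
# concatenate the embeddings with the numeric "feat" feature.
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

num_nodes, seq_len = 4, 16
input_ids = torch.randint(0, bert.config.vocab_size, (num_nodes, seq_len))
attention_mask = torch.ones(num_nodes, seq_len, dtype=torch.long)
feat = torch.randn(num_nodes, 256)                # stand-in for the "feat" column

with torch.no_grad():                              # BERT is not fine-tuned in this mode
    text_emb = bert(input_ids=input_ids,
                    attention_mask=attention_mask).last_hidden_state[:, 0]

# The concatenated tensor plays the role of the GNN's input node feature.
gnn_input = torch.cat([text_emb, feat], dim=-1)    # shape: [4, 768 + 256]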

Launch GraphStorm Training for both BERT and GNN Models#

To co-train the BERT and GNN models, we need to add one more argument, --lm-train-nodes, to either the launch command or the configuration YAML file. The command below sets this argument in the launch command.

python3 -m graphstorm.run.gs_node_classification \
        --workspace /tmp \
        --part-config /tmp/acm_nc/acm.json \
        --ip-config /tmp/ip_list.txt \
        --num-trainers 4 \
        --num-servers 1 \
        --num-samplers 0 \
        --ssh-port 2222 \
        --cf /tmp/acm_lm_nc.yaml \
        --save-model-path /tmp/acm_nc/models \
        --node-feat-name paper:feat author:feat subject:feat \
        --lm-train-nodes 10

The --lm-train-nodes argument determines how many nodes will be used in each mini-batch per GPU to fine-tune the BERT models. Because BERT models are normally large, training them consumes a lot of memory. Using all nodes to co-train the BERT and GNN models could cause GPU out-of-memory (OOM) errors. Setting a smaller value for --lm-train-nodes reduces the overall GPU memory consumption.
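
The intuition can be sketched as follows (a conceptual sketch only, not GraphStorm’s actual implementation): only a small sampled subset of the mini-batch nodes back-propagates through BERT, while the remaining nodes are embedded without gradients, so their activations do not have to be kept for the backward pass.

import torch
from transformers import AutoModel

# Conceptual sketch of the memory trade-off behind --lm-train-nodes.
bert = AutoModel.from_pretrained("bert-base-uncased")

input_ids = torch.randint(0, bert.config.vocab_size, (128, 16))  # 128 nodes in a mini-batch
mask = torch.ones_like(input_ids)
train_idx = torch.randperm(128)[:10]              # e.g., --lm-train-nodes 10

with torch.no_grad():                              # cheap pass, no activations stored
    emb = bert(input_ids=input_ids, attention_mask=mask).last_hidden_state[:, 0]

# Re-embed only the sampled nodes with gradients enabled for fine-tuning.
emb = emb.clone()
emb[train_idx] = bert(input_ids=input_ids[train_idx],
                      attention_mask=mask[train_idx]).last_hidden_state[:, 0]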

Note

Co-training the BERT and GNN models takes longer than training without co-training.

Only Use BERT Models#

GraphStorm also allows users to use only BERT models to perform graph tasks. We can add another argument, --lm-encoder-only, to control whether to use only BERT models or not.

If users want to fine-tune the BERT models only, just add the --lm-train-nodes argument as in the command below:

python3 -m graphstorm.run.gs_node_classification \
        --workspace /tmp \
        --part-config /tmp/acm_nc/acm.json \
        --ip-config /tmp/ip_list.txt \
        --num-trainers 4 \
        --num-servers 1 \
        --num-samplers 0 \
        --ssh-port 2222 \
        --cf /tmp/acm_lm_nc.yaml \
        --save-model-path /tmp/acm_nc/models \
        --node-feat-name paper:feat author:feat subject:feat \
        --lm-encoder-only \
        --lm-train-nodes 10

Note

The current version of GraphStorm requires that ALL node types have text features when users want to perform the graph-aware LM fine-tuning described above.