

Jian Zhang (James) edited this page May 17, 2023 · 4 revisions

Graph Partition

For users who are already familiar with DGL and know how to construct a DGL graph, GraphStorm provides two graph partition tools. They partition a DGL graph into the input format that the GraphStorm launch tool requires for training and inference. The arguments of each tool are listed below, one tool after the other.

  • --dataset: (Required) the graph dataset name defined for the saved DGL graph file.

  • --filepath: (Required) the file path of the saved DGL graph file.

  • --target_ntype: the node type to make predictions on, required for node classification/regression tasks. This must be the node type that has labels. GraphStorm currently supports only one prediction node type.

  • --ntype_task: the task to perform on the target node type. Only classification and regression are supported so far. Default is classification.

  • --nlabel_field: the field that stores labels on the prediction node type, required if target_ntype is set. The format is nodetype:labelname, e.g., “paper:label”.

  • --target_etype: the canonical edge type to make predictions on, required for edge classification/regression tasks. This must be the edge type that has labels. GraphStorm currently supports only one prediction edge type. The format is src_ntype,etype,dst_ntype, e.g., “author,write,paper”.

  • --etype_task: the task to perform on the target edge type. Only classification and regression are supported so far. Default is classification.

  • --elabel_field: the field that stores labels on the prediction edge type, required if target_etype is set. The format is src_ntype,etype,dst_ntype:labelname, e.g., “author,write,paper:label”.

  • --generate_new_node_split: a boolean value, required if you need the partition script to split nodes into training/validation/test sets. If you set this argument to true, you must set the target_ntype argument too.

  • --generate_new_edge_split: a boolean value, required if you need the partition script to split edges into training/validation/test sets. If you set this argument to true, you must set the target_etype argument too.

  • --train_pct: a float value (> 0 and < 1) with default value 0.8. If you want the partition script to split nodes/edges into training/validation/test sets, set this value to control the fraction of nodes/edges used for training.

  • --val_pct: a float value (> 0 and < 1) with default value 0.1. Set this value to control the fraction of nodes/edges used for validation.

Note

The sum of train_pct and val_pct should be less than 1. The fraction of test nodes/edges is 1 - (train_pct + val_pct).

  • --undirected: if this argument is given, reverse edges are added to the given graph.

  • --retain_original_features: a boolean value that controls whether to use the original features generated by the dataset, e.g., embeddings of paper abstracts. If set to true, the original features are kept; otherwise the tokenized text is used so that BERT models can generate embeddings.

  • --num_parts: (Required) an integer that specifies the number of partitions the DGL graph is split into. Remember this number, because it must also be set in the model training step.

  • --output: (Required) the folder path where the partitioned DGL graph will be saved.
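
The value formats described above (nodetype:labelname, src_ntype,etype,dst_ntype:labelname, and the train/validation/test split) can be checked before launching a partition job. The following is a minimal sketch in plain Python, not the partition tool's own code: the helper names parse_canonical_etype, parse_nlabel_field, parse_elabel_field and split_fractions are hypothetical, and rejecting a sum of exactly 1 is an assumption made here based on the note above.

```python
def parse_canonical_etype(text):
    """Split 'src_ntype,etype,dst_ntype' into a 3-tuple (hypothetical helper)."""
    parts = tuple(part.strip() for part in text.split(","))
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected src_ntype,etype,dst_ntype, got {text!r}")
    return parts


def parse_nlabel_field(field):
    """Split 'nodetype:labelname' into (nodetype, labelname)."""
    ntype, sep, label = field.partition(":")
    if not (sep and ntype and label):
        raise ValueError(f"expected nodetype:labelname, got {field!r}")
    return ntype, label


def parse_elabel_field(field):
    """Split 'src_ntype,etype,dst_ntype:labelname' into (canonical etype, labelname)."""
    etype_text, sep, label = field.rpartition(":")
    if not (sep and label):
        raise ValueError(f"expected src_ntype,etype,dst_ntype:labelname, got {field!r}")
    return parse_canonical_etype(etype_text), label


def split_fractions(train_pct=0.8, val_pct=0.1):
    """Return (train, val, test) fractions, with test = 1 - (train_pct + val_pct)."""
    if not (0.0 < train_pct < 1.0 and 0.0 < val_pct < 1.0):
        raise ValueError("train_pct and val_pct must each be > 0 and < 1")
    if train_pct + val_pct >= 1.0:
        # Assumption of this sketch: a sum of 1 or more leaves no test set.
        raise ValueError("train_pct + val_pct should be less than 1")
    return train_pct, val_pct, 1.0 - (train_pct + val_pct)


print(parse_nlabel_field("paper:label"))                # ('paper', 'label')
print(parse_elabel_field("author,write,paper:label"))   # (('author', 'write', 'paper'), 'label')
train, val, test = split_fractions()                    # documented defaults 0.8 and 0.1
print(round(test, 6))                                   # 0.1
```

With the documented defaults, 80% of the labeled nodes/edges go to training, 10% to validation, and the remaining 10% to test.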

  • --dataset: (Required) the graph name defined for the saved DGL graph file.

  • --filepath: (Required) the file path of the saved DGL graph file.

  • --target_etypes: (Required) the canonical edge type to make predictions on. GraphStorm supports only one prediction edge type. The format is src_ntype,etype,dst_ntype, e.g., “author,write,paper”.

  • --train_pct: a float value (> 0 and < 1) with default value 0.8. If you want the partition script to split nodes/edges into training/validation/test sets, set this value to control the fraction of nodes/edges used for training.

  • --val_pct: a float value (> 0 and < 1) with default value 0.1. Set this value to control the fraction of nodes/edges used for validation.

Note

The sum of train_pct and val_pct should be less than 1. The fraction of test nodes/edges is 1 - (train_pct + val_pct).

  • --undirected: if this argument is given, reverse edges are added to the given graph.

  • --train_graph_only: a boolean value that controls whether only the training graph is partitioned. Default is true.

  • --retain_original_features: a boolean value that controls whether to use the original features generated by the dataset, e.g., embeddings of paper abstracts. If set to true, the original features are kept; otherwise the tokenized text is used so that BERT models can generate embeddings.

  • --retain_etypes: the list of canonical edge types to retain before partitioning the graph. This might help to remove noisy edges. Format example: --retain_etypes query,clicks,asin query,adds,asin query,purchases,asin asin,rev-clicks,query.

  • --num_parts: (Required) an integer that specifies the number of partitions the DGL graph is split into. Remember this number, because it must also be set in the model training step.

  • --output: (Required) the folder path where the partitioned DGL graph will be saved.
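
To illustrate how a space-separated list such as the --retain_etypes example above breaks down into canonical edge types, here is a small argparse sketch that mirrors some of the arguments listed. It is an illustration only, not the tool's real parser: the dataset name, paths, and the canonical_etype helper are hypothetical, and the boolean-valued options (train_graph_only, retain_original_features) are left out because the text above does not say how their values are spelled on the command line.

```python
import argparse


def canonical_etype(text):
    """Convert 'src_ntype,etype,dst_ntype' into a 3-tuple (hypothetical helper)."""
    parts = tuple(part.strip() for part in text.split(","))
    if len(parts) != 3 or not all(parts):
        raise argparse.ArgumentTypeError(
            f"expected src_ntype,etype,dst_ntype, got {text!r}")
    return parts


def build_parser():
    # Illustrative mirror of the documented arguments; the real tool may differ.
    parser = argparse.ArgumentParser(description="sketch of the documented arguments")
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--filepath", required=True)
    parser.add_argument("--target_etypes", required=True, type=canonical_etype)
    parser.add_argument("--train_pct", type=float, default=0.8)
    parser.add_argument("--val_pct", type=float, default=0.1)
    parser.add_argument("--undirected", action="store_true")
    # nargs="+" collects every space-separated value that follows the flag.
    parser.add_argument("--retain_etypes", nargs="+", type=canonical_etype, default=[])
    parser.add_argument("--num_parts", required=True, type=int)
    parser.add_argument("--output", required=True)
    return parser


# Hypothetical dataset name and paths, used only for this example.
args = build_parser().parse_args([
    "--dataset", "my_graph",
    "--filepath", "/tmp/my_graph",
    "--target_etypes", "query,clicks,asin",
    "--retain_etypes", "query,clicks,asin", "query,adds,asin",
    "query,purchases,asin", "asin,rev-clicks,query",
    "--num_parts", "4",
    "--output", "/tmp/my_graph_parts",
])

for etype in args.retain_etypes:
    print(etype)
```

Each retained edge type is read as one src_ntype,etype,dst_ntype triple, so the format example above names four edge types. A hyphen inside a name, as in rev-clicks, is not mistaken for a new option because the value does not start with a hyphen.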