diff --git a/docs/source/advanced/advanced-usages.rst b/docs/source/advanced/advanced-usages.rst
deleted file mode 100644
index bccb4d86d6..0000000000
--- a/docs/source/advanced/advanced-usages.rst
+++ /dev/null
@@ -1,47 +0,0 @@
-.. _advanced_usages:
-
-GraphStorm Advanced Usages
-===========================
-
-Multiple Target Node Types Training
--------------------------------------
-
-When training on a hetergenious graph, we often need to train a model by minimizing the objective function on more than one node type. GraphStorm provides supports to achieve this goal.
-
-- Train on multiple node types: The users only need to edit the ``target_ntype`` in model config YAML file to minimize the objective function defined on mutiple target node types. For example, by setting ``target_ntype`` as following, we can jointly optimize the objective function defined on "movie" and "user" node types.
-
-  .. code-block:: yaml
-
-    target_ntype:
-    - movie
-    - user
-
-  During evuation, the users can set a single node type for evaluation. For example, by setting ``eval_target_ntype: movie``, we will only perform evaluation on "movie" node type.
-
-- Evaluate on single node type: During evuation, the users can set a single node type for evaluation. For example, by setting ``eval_target_ntype: movie``, we will only perform evaluation on "movie" node type. Our current implementation only support evaluating on a single node type.
-
-- Per target node type decoder: The users may also want to use a different decoder on each node type, where the output dimension for each decoder maybe different. We can achieve this by setting ``num_classes`` in model config YAML file. For example, by setting ``num_classes`` as following, GraphStorm will create a decoder with output dimension as 3 for movie node type, and a decoder with output dimension as 7 for user node type.
-
-  .. code-block:: yaml
-
-    num_classes:
-      movie: 3
-      user: 7
-
-- Reweighting on loss function: The users may also want to use a customized loss function reweighting on each node type, which can be achieved by setting ``multilabel``, ``multilabel_weights``, and ``imbalance_class_weights``. Examples are illustrated as following. Our current implementation does not support different node types with different ``multilabel`` setting.
-
-  .. code-block:: yaml
-
-    multilabel:
-      movie: true
-      user: true
-    multilabel_weights:
-      movie: 0.1,0.2,0.3
-      user: 0.1,0.2,0.3,0.4,0.5,0.0
-
-    multilabel:
-      movie: false
-      user: false
-    imbalance_class_weights:
-      movie: 0.1,0.2,0.3
-      user: 0.1,0.2,0.3,0.4,0.5,0.0
diff --git a/docs/source/advanced/imbalanced-labels.rst b/docs/source/advanced/imbalanced-labels.rst
new file mode 100644
index 0000000000..f30c125938
--- /dev/null
+++ b/docs/source/advanced/imbalanced-labels.rst
@@ -0,0 +1,73 @@
+.. _imbalanced_labels:
+
+Deal with Imbalanced Labels in Classification/Regression
+==========================================================
+
+In some cases, the number of labels of different classes could be imbalanced, i.e., some classes
+have either too many or too few data points. For example, most fraud detection tasks only have a
+small number of fraudulent activities (positive labels) versus a huge number of legitimate activities
+(negative labels). Even in regression tasks, it is possible to encounter many dominant values that
+can cause imbalanced labels. If not handled properly, imbalanced labels can severely impact
+classification/regression model performance. For example, when too many negative labels are fed
+into a model, the model may learn to classify all unseen samples as negative. GraphStorm
+provides several ways to tackle the class imbalance problem.
+
+For classification tasks, users can configure two arguments in the command line interfaces (CLIs):
+``imbalance_class_weights`` and ``class_loss_func``.
+
+The ``imbalance_class_weights`` argument allows users to give a scale weight to each class, forcing models
+to learn more from the classes with higher scale weights. For example, if there are 10 positive labels versus
+90 negative labels, you can set ``imbalance_class_weights`` to ``0.1, 0.9``, meaning class 0 (usually
+negative labels) has weight ``0.1``, and class 1 (usually positive labels) has weight ``0.9``.
+This places more importance on correctly classifying positive samples and less on negative ones. Below
+is an example of how to set ``imbalance_class_weights`` in a YAML configuration file.
+
+  .. code-block:: yaml
+
+    imbalance_class_weights: 0.1,0.9
+
+You can also set ``focal`` as the value of the ``class_loss_func`` configuration, which will use the
+`focal loss function <https://arxiv.org/abs/1708.02002>`_ in binary classification tasks. The focal loss
+function is designed for imbalanced classes. Its formula is :math:`loss(p_t) = -\alpha_t(1-p_t)^{\gamma}\log(p_t)`,
+where :math:`p_t=p` if :math:`y=1`, and :math:`p_t = 1-p` otherwise. Here :math:`p` is the predicted probability
+in a binary classification. This function has two hyperparameters, :math:`\alpha` and :math:`\gamma`,
+corresponding to the ``alpha`` and ``gamma`` configurations in GraphStorm. Larger values of ``gamma`` put more
+weight on hard cases, helping the model detect more positive samples when the positive-to-negative ratio is small.
+There is no clear guideline for values of ``alpha``. You can use its default value (``0.25``) first, and then
+search for optimal values. Below is an example of how to set the focal loss function in a YAML configuration file.
+
+  .. code-block:: yaml
+
+    class_loss_func: focal
+
+    gamma: 10.0
+    alpha: 0.5
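+
+To get an intuition for ``gamma``, consider a quick, illustrative calculation (the probabilities below are
+made up for exposition): with :math:`\gamma=2`, an easy, well-classified positive sample with :math:`p_t=0.9`
+receives a modulating factor of :math:`(1-0.9)^2=0.01`, while a hard sample with :math:`p_t=0.1` receives
+:math:`(1-0.1)^2=0.81`:
+
+  .. math::
+
+    \frac{(1-0.1)^{2}}{(1-0.9)^{2}} = \frac{0.81}{0.01} = 81
+
+Relative to plain cross entropy, the focal term down-weights the easy sample 100-fold but the hard sample
+only by about 19%, steering training toward the samples the model still gets wrong. Larger ``gamma`` makes
+this effect stronger.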
+
+Apart from focal loss and class weights, you can also output the classification results as probabilities of the
+positive and negative classes by setting the ``return_proba`` configuration to ``true``. By default, GraphStorm outputs
+classification results as argmax values, e.g., either 0s or 1s in binary tasks, which is equivalent to using
+``0.5`` as the threshold to separate negative from positive samples. With probabilities as outputs, you can apply
+different thresholds and thereby achieve the desired outcomes. For example, if you need higher recall to catch
+more suspicious positive samples, a smaller threshold, e.g., ``0.25``, will classify more samples as positive. You may
+also use methods like the `ROC curve` or the `Precision-Recall curve` to determine the optimal threshold. Below is an
+example of how to set ``return_proba`` in a YAML configuration file.
+
+  .. code-block:: yaml
+
+    return_proba: true
+
+For regression tasks whose labels contain some dominant values, e.g., 0s, GraphStorm provides the
+`shrinkage loss function <https://openaccess.thecvf.com/content_ECCV_2018/papers/Xiankai_Lu_Deep_Regression_Tracking_ECCV_2018_paper.pdf>`_,
+which can be enabled by setting ``shrinkage`` as the value of the ``regression_loss_func`` configuration. Its formula is
+:math:`loss = l^2/(1 + \exp\left(\alpha \cdot (\gamma - l)\right))`, where :math:`l` is the absolute difference
+between predictions and labels. The shrinkage loss function also has the :math:`\alpha` and :math:`\gamma` hyperparameters,
+and you can use the same ``alpha`` and ``gamma`` configurations as for the focal loss function to modify their values. The
+shrinkage loss penalizes the importance of easy samples (when :math:`l < 0.5`) and keeps the loss of hard samples unchanged.
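+
+As a quick, illustrative calculation with the default shrinkage hyperparameters (:math:`\alpha=10`,
+:math:`\gamma=0.2`; the error values are made up for exposition): an easy sample with :math:`l = 0.1` is
+scaled by :math:`1/(1+\exp(10 \cdot (0.2-0.1))) \approx 0.27`, while a hard sample with :math:`l = 1.0` is
+scaled by :math:`1/(1+\exp(10 \cdot (0.2-1.0))) \approx 1.0`:
+
+  .. math::
+
+    loss(0.1) \approx 0.27 \cdot 0.1^{2} = 0.0027, \qquad loss(1.0) \approx 1.0 \cdot 1.0^{2} = 1.0
+
+So the squared error of easy samples is shrunk to roughly a quarter, while hard samples keep almost their
+full squared error.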
+
+Below is an example of how to set the shrinkage loss function in a YAML configuration file.
+
+  .. code-block:: yaml
+
+    regression_loss_func: shrinkage
+
+    gamma: 0.2
+    alpha: 5
diff --git a/docs/source/advanced/multi-target-ntypes.rst b/docs/source/advanced/multi-target-ntypes.rst
new file mode 100644
index 0000000000..9681e8cad7
--- /dev/null
+++ b/docs/source/advanced/multi-target-ntypes.rst
@@ -0,0 +1,196 @@
+.. _multi_target_ntypes:
+
+Multiple Target Node Types Training
+===================================
+
+When training on a heterogeneous graph, we often need to train a model by minimizing the objective
+function on more than one node type. GraphStorm provides support for achieving this goal. The recommended
+method is to leverage GraphStorm's multi-task learning capability, i.e., using multiple node tasks, each
+trained on one target node type.
+
+A more detailed guide to multi-task learning can be found in
+:ref:`Multi-task Learning in GraphStorm`. This guide provides two examples of how
+to conduct classification training with two target node types on the `MovieLens 100k
+<https://grouplens.org/datasets/movielens/100k/>`_ data, where both the **movie** ("item" in the
+original data) and **user** node types have classification labels associated.
+
+Using multi-task learning for multiple target node types training (Recommended)
+--------------------------------------------------------------------------------
+
+Preparing the training data
+............................
+
+During the graph construction step, you can define two classification tasks on the two node types as
+shown in the JSON example below.
+
+.. code-block:: json
+
+    {
+        "version": "gconstruct-v0.1",
+        "nodes": [
+            {
+                "node_type": "movie",
+                ......
+                ],
+                "labels": [
+                    {
+                        "label_col": "label_movie",
+                        "task_type": "classification",
+                        "split_pct": [0.8, 0.1, 0.1],
+                        "mask_field_names": ["train_mask_movie",
+                                             "val_mask_movie",
+                                             "test_mask_movie"]
+                    },
+                ]
+            },
+            {
+                "node_type": "user",
+                ......
+                ],
+                "labels": [
+                    {
+                        "label_col": "label_user",
+                        "task_type": "classification",
+                        "split_pct": [0.2, 0.2, 0.6],
+                        "mask_field_names": ["train_mask_user",
+                                             "val_mask_user",
+                                             "test_mask_user"]
+                    },
+                ]
+            },
+        ],
+        ......
+    }
+
+The above configuration defines two classification tasks, one for the **movie** nodes and one for the
+**user** nodes. Each node type has its own ``label_col`` and its own train/validation/test mask fields.
+You can then follow the instructions in :ref:`Run graph construction` to use the GraphStorm
+construction tool for creating partitioned graph data.
+
+Define multi-task configurations for model training
+.....................................................
+
+Now, you can specify two training tasks by providing the ``multi_task_learning`` configurations in
+the training configuration YAML file, like the example below.
+
+.. code-block:: yaml
+
+    ---
+    version: 1.0
+    gsf:
+      basic:
+        ...
+      multi_task_learning:
+        - node_classification:
+            target_ntype: "movie"
+            label_field: "label_movie"
+            mask_fields:
+              - "train_mask_movie"
+              - "val_mask_movie"
+              - "test_mask_movie"
+            num_classes: 10
+            task_weight: 0.5
+        - node_classification:
+            target_ntype: "user"
+            label_field: "label_user"
+            mask_fields:
+              - "train_mask_user"
+              - "val_mask_user"
+              - "test_mask_user"
+            task_weight: 1.0
+      ...
+
+The above configuration defines one classification task for the **movie** node type and another one
+for the **user** node type. The two node classification tasks use their own label names, i.e.,
+``label_movie`` and ``label_user``, and their own train/validation/test mask fields. It also assigns
+different task weights, prioritizing classification on **user** nodes (``task_weight: 1.0``) over
+classification on **movie** nodes (``task_weight: 0.5``).
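+
+Note that ``num_classes`` describes each classification task individually, so each ``node_classification``
+entry normally carries its own value. As an illustrative sketch (the value ``5`` below is an assumption and
+must match the number of distinct values in your ``label_user`` column), the **user** task would look like:
+
+.. code-block:: yaml
+
+      multi_task_learning:
+        - node_classification:
+            target_ntype: "user"
+            label_field: "label_user"
+            mask_fields:
+              - "train_mask_user"
+              - "val_mask_user"
+              - "test_mask_user"
+            num_classes: 5  # assumed; use the actual number of classes in label_user
+            task_weight: 1.0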
+
+Run multi-task model training
+..............................
+
+You can use the `graphstorm.run.gs_multi_task_learning` command to run multi-task learning tasks,
+like the following example.
+
+.. code-block:: bash
+
+    python -m graphstorm.run.gs_multi_task_learning \
+           --workspace <PATH_TO_WORKSPACE> \
+           --num-trainers 1 \
+           --num-servers 1 \
+           --part-config <PATH_TO_PARTITIONED_GRAPH> \
+           --cf <PATH_TO_CONFIG_YAML>
+
+Run multi-task model inference
+...............................
+
+For inference, you can use the same command line `graphstorm.run.gs_multi_task_learning` with an
+additional argument ``--inference``, as the following:
+
+.. code-block:: bash
+
+    python -m graphstorm.run.gs_multi_task_learning \
+           --inference \
+           --workspace <PATH_TO_WORKSPACE> \
+           --num-trainers 1 \
+           --num-servers 1 \
+           --part-config <PATH_TO_PARTITIONED_GRAPH> \
+           --cf <PATH_TO_CONFIG_YAML> \
+           --save-prediction-path <PATH_TO_OUTPUT>
+
+The prediction results of each prediction task will be saved into different sub-directories under
+<PATH_TO_OUTPUT>. The sub-directories are prefixed with `<task_type>_<ntype/etype>_<label_name>`.
+
+Using multi-target node type training (Not Recommended)
+--------------------------------------------------------
+
+You can also use GraphStorm's multi-target node types configuration, but this method is less
+flexible than the multi-task learning method.
+
+- Train on multiple node types: The users only need to edit the ``target_ntype`` in the model config
+  YAML file to minimize the objective function defined on multiple target node types. For example,
+  by setting ``target_ntype`` as follows, we can jointly optimize the objective function defined
+  on the "movie" and "user" node types.
+
+  .. code-block:: yaml
+
+    target_ntype:
+    - movie
+    - user
+
+- During evaluation, the users need to choose a single node type. For example, by setting
+  ``eval_target_ntype: movie``, GraphStorm will only perform evaluation on the "movie" node type.
+  GraphStorm only supports evaluating on a single node type.
+
+- Per target node type decoder: The users may also want to use a different decoder on each node type,
+  where the output dimension for each decoder may be different. We can achieve this by setting ``num_classes``
+  in the model config YAML file. For example, by setting ``num_classes`` as follows, GraphStorm will
+  create a decoder with an output dimension of 3 for the movie node type, and a decoder with an output
+  dimension of 7 for the user node type.
+
+  .. code-block:: yaml
+
+    num_classes:
+      movie: 3
+      user: 7
+
+- Reweighting on loss function: The users may also want to use a customized loss function with reweighting
+  on each node type, which can be achieved by setting ``multilabel``, ``multilabel_weights``, and
+  ``imbalance_class_weights``. Examples are illustrated below. Our current implementation does
+  not support different node types with different ``multilabel`` settings.
+
+  .. code-block:: yaml
+
+    multilabel:
+      movie: true
+      user: true
+    multilabel_weights:
+      movie: 0.1,0.2,0.3
+      user: 0.1,0.2,0.3,0.4,0.5,0.0
+
+    multilabel:
+      movie: false
+      user: false
+    imbalance_class_weights:
+      movie: 0.1,0.2,0.3
+      user: 0.1,0.2,0.3,0.4,0.5,0.0
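+
+Putting the options above together, a complete multi-target sketch could look like the following
+(assuming the 3-class movie labels and 7-class user labels from the ``num_classes`` example, and
+single-label classification, so each weight list carries exactly one weight per class):
+
+.. code-block:: yaml
+
+    target_ntype:
+    - movie
+    - user
+    num_classes:
+      movie: 3
+      user: 7
+    multilabel:
+      movie: false
+      user: false
+    imbalance_class_weights:
+      movie: 0.1,0.2,0.3                  # 3 weights for 3 movie classes
+      user: 0.1,0.1,0.1,0.2,0.2,0.2,0.1   # 7 weights for 7 user classes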
diff --git a/docs/source/advanced/multi-task-learning.rst b/docs/source/advanced/multi-task-learning.rst
index a58f0337cb..e19050d1a3 100644
--- a/docs/source/advanced/multi-task-learning.rst
+++ b/docs/source/advanced/multi-task-learning.rst
@@ -277,15 +277,15 @@ You can define an edge feature reconstruction task as the following example:
         eval_metric:
         - "mse"
 
-In the configuration, `target_etype` defines the target edge type to which the reconstruct edge feature
-learning will be applied. `reconstruct_efeat_name`` defines the name of the feature to be
+In the configuration, ``target_etype`` defines the target edge type to which the edge feature
+reconstruction learning will be applied. ``reconstruct_efeat_name`` defines the name of the feature to be
 reconstructed. The other configs are same as edge regression tasks.
 
 Run Model Training
 ~~~~~~~~~~~~~~~~~~~
 
-GraphStorm introduces a new command line `graphstorm.run.gs_multi_task_learning` with an additional
-argument `--inference` to run multi-task learning tasks. You can use the following command to start a multi-task training job:
+GraphStorm introduces a new command line `graphstorm.run.gs_multi_task_learning` to run multi-task
+learning tasks. You can use the following command to start a multi-task training job:
 
 .. code-block:: bash
 
@@ -298,7 +298,8 @@ argument `--inference` to run multi-task learning tasks. You can use the followi
 Run Model Inference
 ~~~~~~~~~~~~~~~~~~~~
 
-You can use the same command line `graphstorm.run.gs_multi_task_learning` to run inference as following:
+You can use the same command line `graphstorm.run.gs_multi_task_learning` with an additional
+argument ``--inference`` to run inference, as follows:
 
 .. code-block:: bash
 
@@ -312,7 +313,8 @@ You can use the same command line `graphstorm.run.gs_multi_task_learning` to run
          --save-prediction-path <PATH_TO_OUTPUT>
 
 The prediction results of each prediction tasks (node classification, node regression,
-edge classification and edge regression) will be saved into different sub-directories under PATH_TO_OUTPUT. The sub-directories are prefixed with the `__`.
+edge classification and edge regression) will be saved into different sub-directories under PATH_TO_OUTPUT.
+The sub-directories are prefixed with `<task_type>_<ntype/etype>_<label_name>`.
 
 Run Model Training on SageMaker
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/docs/source/cli/model-training-inference/configuration-run.rst b/docs/source/cli/model-training-inference/configuration-run.rst
index 7c8143d66f..36105c8f24 100644
--- a/docs/source/cli/model-training-inference/configuration-run.rst
+++ b/docs/source/cli/model-training-inference/configuration-run.rst
@@ -397,14 +397,14 @@ General Configurations
     - For link prediction tasks, the default value is ``mrr``.
 
 - **gamma**: Set the value of the hyperparameter denoted by the symbol gamma. Gamma is used in the following cases: i/ focal loss for binary classification ii/ DistMult score function for link prediction, iii/ TransE score function for link prediction, iv/ RotatE score function for link prediction, v/ shrinkage loss for regression.
 
-  - Yaml: ``gamma: 10.0``
-  - Argument: ``--gamma 10.0``
-  - Default value: None
+  - Yaml: ``gamma: 2.0``
+  - Argument: ``--gamma 2.0``
+  - Default value: ``2.0`` in the focal loss function; ``0.2`` in the shrinkage loss function; ``12.0`` in the ``DistMult``, ``RotatE``, and ``TransE`` link prediction decoders.
 
 - **alpha**: Set the value of the hyperparameter denoted by the symbol alpha. Alpha is used in the following cases: i/ focal loss for binary classification and ii/ shrinkage loss for regression.
 
-  - Yaml: ``alpha: 10.0``
-  - Argument: ``--alpha 10.0``
-  - Default value: None
+  - Yaml: ``alpha: 0.25``
+  - Argument: ``--alpha 0.25``
+  - Default value: ``0.25`` in the focal loss function; ``10.0`` in the shrinkage loss function.
 
 Classification and Regression Task
 ```````````````````````````````````
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 4bf679cf23..da167f782f 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -35,7 +35,7 @@ Welcome to the GraphStorm Documentation and Tutorials
 
 .. toctree::
    :maxdepth: 2
-   :caption: Advanced Topics
+   :caption: Practical & Advanced Guides
    :hidden:
    :glob:
 
@@ -44,11 +44,12 @@ Welcome to the GraphStorm Documentation and Tutorials
    advanced/link-prediction
    advanced/advanced-wholegraph
    advanced/multi-task-learning
-   advanced/advanced-usages
    advanced/using-graphbolt
+   advanced/multi-target-ntypes
+   advanced/imbalanced-labels
    advanced/gsprocessing-emr-ec2
 
-GraphStorm is a graph machine learning (GML) framework designed for enterprise use cases. It simplifies the development, training and deployment of GML models on industry-scale graphs (measured in billons of nodes and edges) by providing scalable training and inference pipelines of GML models. GraphStorm comes with a collection of built-in GML models, allowing users to train a GML model with a single command, eliminating the need to write any code. Moreover, GraphStorm provides a wide range of configurations to customiz model implementations and training pipelines, enhancing model performance. In addition, GraphStorm offers a programming interface that enables users to train custom GML models in a distributed manner. Users can bring their own model implementations and leverage the GraphStorm training pipeline for scalability.
+GraphStorm is a graph machine learning (GML) framework designed for enterprise use cases. It simplifies the development, training and deployment of GML models on industry-scale graphs (measured in billions of nodes and edges) by providing scalable training and inference pipelines of GML models. GraphStorm comes with a collection of built-in GML models, allowing users to train a GML model with a single command, eliminating the need to write any code. Moreover, GraphStorm provides a wide range of configurations to customize model implementations and training pipelines, enhancing model performance. In addition, GraphStorm offers a programming interface that enables users to train custom GML models in a distributed manner. Users can bring their own model implementations and leverage the GraphStorm training pipeline for scalability.
 
 Getting Started
 ----------------
@@ -83,16 +84,18 @@ The released GraphStorm APIs list the major components that can help users to de
 
 To help users use these APIs, GraphStorm also released a set of Jupyter notebooks at :ref:`GraphStorm API Programming Example Notebooks`. By running these notebooks, users can explore some APIs, learn how to use APIs to reproduce CLIs pipelines, and then customize GraphStorm components for specific requirements.
 
-Users can find the comprehensive descriptions of these GraphStorm APIs in the :ref:`API Reference` documentations. For unrelease APIs, we encourage users to read their source code. If users want to have more APIs formally released, please raise issues at the `GraphStorm GitHub Repository <https://github.com/awslabs/graphstorm>`_.
+Users can find the comprehensive descriptions of these GraphStorm APIs in the :ref:`API Reference` documentations. For unreleased APIs, we encourage users to read their source code. If users want to have more APIs formally released, please raise issues at the `GraphStorm GitHub Repository <https://github.com/awslabs/graphstorm>`_.
 
-Advanced Topics
-----------------
+Practical and Advanced Guides
+------------------------------
 
 - For users who want to use their own GML models in GraphStorm, follow the :ref:`Use Your Own GNN Models` tutorial to learn the programming interfaces and the steps of how to modify users' own models.
 - For users who want to leverage language models on nodes with text features, follow the :ref:`Use Language Model in GraphStorm` tutorial to learn how to leverage BERT models to use text as node features in GraphStorm.
 - There are various usages of GraphStorm to both speed up training process and help to boost model performance for link prediction tasks. Users can find these usages in the :ref:`Link Prediction Learning in GraphStorm` page.
 - GraphStorm team has been working with NVIDIA team to integrate the NVIDIA's WholeGraph library into GraphStorm for speed-up of feature copy. Users can follow the :ref:`Use WholeGraph in GraphStorm` tutorial to know more details.
-- In v0.3, GraphStorm releases an experimental feature to support multi-task learning on the same graph, allowing users to define multiple training targets on different nodes and edges within a single training loop. Users can check the :ref:`Multi-task Learning in GraphStorm` tutorial to know more details.
+- Since v0.3, GraphStorm has supported multi-task learning on the same graph, allowing users to define multiple training targets on different nodes and edges within a single training loop. Users can check the :ref:`Multi-task Learning in GraphStorm` tutorial for more details.
+- Since v0.4, GraphStorm supports GraphBolt stochastic training. GraphBolt is a new data loading module for DGL that enables faster and more efficient graph sampling, potentially leading to significant efficiency benefits. For details on using GraphBolt in GraphStorm, follow the :ref:`Using GraphBolt to speed up training and inference` guide.
+- For frequently asked questions, there are several dedicated guides. The :ref:`Multiple Target Node Types Training` document explains how to train on multiple target node types. The :ref:`Deal with Imbalanced Labels in Classification/Regression` guide lists several built-in features that can help tackle the challenge of imbalanced labels. If users want to use their own AWS EMR clusters for graph processing, the :ref:`Running distributed graph processing on customized EMR-on-EC2 clusters` guide provides more details.
 
 Contribution
 -------------