diff --git a/docs/source/advanced/advanced-usages.rst b/docs/source/advanced/advanced-usages.rst
deleted file mode 100644
index bccb4d86d6..0000000000
--- a/docs/source/advanced/advanced-usages.rst
+++ /dev/null
@@ -1,47 +0,0 @@
-.. _advanced_usages:
-
-GraphStorm Advanced Usages
-===========================
-
-Multiple Target Node Types Training
--------------------------------------
-
-When training on a hetergenious graph, we often need to train a model by minimizing the objective function on more than one node type. GraphStorm provides supports to achieve this goal.
-
-- Train on multiple node types: The users only need to edit the ``target_ntype`` in model config YAML file to minimize the objective function defined on mutiple target node types. For example, by setting ``target_ntype`` as following, we can jointly optimize the objective function defined on "movie" and "user" node types.
-
-  .. code-block:: yaml
-
-    target_ntype:
-    - movie
-    - user
-
-  During evuation, the users can set a single node type for evaluation. For example, by setting ``eval_target_ntype: movie``, we will only perform evaluation on "movie" node type.
-
-- Evaluate on single node type: During evuation, the users can set a single node type for evaluation. For example, by setting ``eval_target_ntype: movie``, we will only perform evaluation on "movie" node type. Our current implementation only support evaluating on a single node type.
-
-- Per target node type decoder: The users may also want to use a different decoder on each node type, where the output dimension for each decoder maybe different. We can achieve this by setting ``num_classes`` in model config YAML file. For example, by setting ``num_classes`` as following, GraphStorm will create a decoder with output dimension as 3 for movie node type, and a decoder with output dimension as 7 for user node type.
-
-  .. code-block:: yaml
-
-    num_classes:
-      movie: 3
-      user: 7
-
-- Reweighting on loss function: The users may also want to use a customized loss function reweighting on each node type, which can be achieved by setting ``multilabel``, ``multilabel_weights``, and ``imbalance_class_weights``. Examples are illustrated as following. Our current implementation does not support different node types with different ``multilabel`` setting.
-
-  .. code-block:: yaml
-
-    multilabel:
-      movie: true
-      user: true
-    multilabel_weights:
-      movie: 0.1,0.2,0.3
-      user: 0.1,0.2,0.3,0.4,0.5,0.0
-
-    multilabel:
-      movie: false
-      user: false
-    imbalance_class_weights:
-      movie: 0.1,0.2,0.3
-      user: 0.1,0.2,0.3,0.4,0.5,0.0
diff --git a/docs/source/advanced/imbalanced-labels.rst b/docs/source/advanced/imbalanced-labels.rst
new file mode 100644
index 0000000000..f30c125938
--- /dev/null
+++ b/docs/source/advanced/imbalanced-labels.rst
@@ -0,0 +1,73 @@
+.. _imbalanced_labels:
+
+Deal with Imbalanced Labels in Classification/Regression
+==========================================================
+
+In some cases, the number of labels of different classes could be imbalanced, i.e., some classes
+have either too many or too few data points. For example, most fraud detection tasks only have a
+small number of fraudulent activities (positive labels) versus a huge number of legitimate activities
+(negative labels). Even in regression tasks, it is possible to encounter many dominant values that
+can cause imbalanced labels. If not handled properly, imbalanced labels can severely impact
+classification/regression model performance. For example, when too many negative labels are fed
+into a model, the model may learn to classify all unseen samples as negative. GraphStorm
+provides several ways to tackle the class imbalance problem.
+
+For classification tasks, users can configure two arguments in the command line interfaces (CLIs):
+``imbalance_class_weights`` and ``class_loss_func``.
+
+The ``imbalance_class_weights`` argument allows users to give a scale weight to each class, forcing models
+to learn more from the classes with higher scale weights. For example, if there are 10 positive labels versus
+90 negative labels, you can set ``imbalance_class_weights`` to ``0.1, 0.9``, meaning class 0 (usually
+negative labels) has weight ``0.1``, and class 1 (usually positive labels) has weight ``0.9``.
+This places more importance on correctly classifying positive samples and less on negative ones. Below
+is an example of how to set ``imbalance_class_weights`` in a YAML configuration file.
+
+  .. code-block:: yaml
+
+    imbalance_class_weights: 0.1,0.9
+
+You can also set ``focal`` as the value of the ``class_loss_func`` configuration, which will use the
+`focal loss function <https://arxiv.org/abs/1708.02002>`_ in binary classification tasks. The focal loss
+function is designed for imbalanced classes. Its formula is :math:`loss(p_t) = -\alpha_t(1-p_t)^{\gamma}\log(p_t)`,
+where :math:`p_t=p` if :math:`y=1`, and :math:`p_t = 1-p` otherwise. Here :math:`p` is the predicted probability
+in a binary classification. This function has two hyperparameters, :math:`\alpha` and :math:`\gamma`,
+corresponding to the ``alpha`` and ``gamma`` configurations in GraphStorm. Larger values of ``gamma`` put more
+weight on hard cases, helping the model detect more positive samples when the positive-to-negative ratio is small.
+There is no clear guideline for values of ``alpha``. You can use its default value (``0.25``) first, and then
+search for optimal values. Below is an example of how to set the focal loss function in a YAML configuration file.
+
+  .. code-block:: yaml
+
+    class_loss_func: focal
+
+    gamma: 10.0
+    alpha: 0.5
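+
+To get an intuition for ``gamma``, consider a quick, illustrative calculation (the probabilities below are
+made up for exposition): with :math:`\gamma=2`, an easy, well-classified positive sample with :math:`p_t=0.9`
+receives a modulating factor of :math:`(1-0.9)^2=0.01`, while a hard sample with :math:`p_t=0.1` receives
+:math:`(1-0.1)^2=0.81`:
+
+  .. math::
+
+    \frac{(1-0.1)^{2}}{(1-0.9)^{2}} = \frac{0.81}{0.01} = 81
+
+Relative to plain cross entropy, the focal term down-weights the easy sample 100-fold but the hard sample
+only by about 19%, steering training toward the samples the model still gets wrong. Larger ``gamma`` makes
+this effect stronger.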
+
+Apart from focal loss and class weights, you can also output the classification results as probabilities of the
+positive and negative classes by setting the ``return_proba`` configuration to ``true``. By default, GraphStorm outputs
+classification results as argmax values, e.g., either 0s or 1s in binary tasks, which is equivalent to using
+``0.5`` as the threshold to separate negative from positive samples. With probabilities as outputs, you can apply
+different thresholds and thereby achieve the desired outcomes. For example, if you need higher recall to catch
+more suspicious positive samples, a smaller threshold, e.g., ``0.25``, will classify more samples as positive. You may
+also use methods like the `ROC curve` or the `Precision-Recall curve` to determine the optimal threshold. Below is an
+example of how to set ``return_proba`` in a YAML configuration file.
+
+  .. code-block:: yaml
+
+    return_proba: true
+
+For regression tasks whose labels contain some dominant values, e.g., 0s, GraphStorm provides the
+`shrinkage loss function <https://openaccess.thecvf.com/content_ECCV_2018/papers/Xiankai_Lu_Deep_Regression_Tracking_ECCV_2018_paper.pdf>`_,
+which can be enabled by setting ``shrinkage`` as the value of the ``regression_loss_func`` configuration. Its formula is
+:math:`loss = l^2/(1 + \exp\left(\alpha \cdot (\gamma - l)\right))`, where :math:`l` is the absolute difference
+between predictions and labels. The shrinkage loss function also has the :math:`\alpha` and :math:`\gamma` hyperparameters,
+and you can use the same ``alpha`` and ``gamma`` configurations as for the focal loss function to modify their values. The
+shrinkage loss penalizes the importance of easy samples (when :math:`l < 0.5`) and keeps the loss of hard samples unchanged.
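+
+As a quick, illustrative calculation with the default shrinkage hyperparameters (:math:`\alpha=10`,
+:math:`\gamma=0.2`; the error values are made up for exposition): an easy sample with :math:`l = 0.1` is
+scaled by :math:`1/(1+\exp(10 \cdot (0.2-0.1))) \approx 0.27`, while a hard sample with :math:`l = 1.0` is
+scaled by :math:`1/(1+\exp(10 \cdot (0.2-1.0))) \approx 1.0`:
+
+  .. math::
+
+    loss(0.1) \approx 0.27 \cdot 0.1^{2} = 0.0027, \qquad loss(1.0) \approx 1.0 \cdot 1.0^{2} = 1.0
+
+So the squared error of easy samples is shrunk to roughly a quarter, while hard samples keep almost their
+full squared error.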
+
+Below is an example of how to set the shrinkage loss function in a YAML configuration file.
+
+  .. code-block:: yaml
+
+    regression_loss_func: shrinkage
+
+    gamma: 0.2
+    alpha: 5
diff --git a/docs/source/advanced/multi-target-ntypes.rst b/docs/source/advanced/multi-target-ntypes.rst
new file mode 100644
index 0000000000..9681e8cad7
--- /dev/null
+++ b/docs/source/advanced/multi-target-ntypes.rst
@@ -0,0 +1,196 @@
+.. _multi_target_ntypes:
+
+Multiple Target Node Types Training
+===================================
+
+When training on a heterogeneous graph, we often need to train a model by minimizing the objective
+function on more than one node type. GraphStorm provides support for achieving this goal. The recommended
+method is to leverage GraphStorm's multi-task learning capability, i.e., using multiple node tasks, each
+trained on one target node type.
+
+A more detailed guide to multi-task learning can be found in
+:ref:`Multi-task Learning in GraphStorm`. This guide provides two examples of how
+to conduct classification training with two target node types on the `MovieLens 100k
+<https://grouplens.org/datasets/movielens/100k/>`_ data, where both the **movie** ("item" in the
+original data) and **user** node types have classification labels associated.
+
+Using multi-task learning for multiple target node types training (Recommended)
+--------------------------------------------------------------------------------
+
+Preparing the training data
+............................
+
+During the graph construction step, you can define two classification tasks on the two node types as
+shown in the JSON example below.
+
+.. code-block:: json
+
+    {
+        "version": "gconstruct-v0.1",
+        "nodes": [
+            {
+                "node_type": "movie",
+                ......
+                ],
+                "labels": [
+                    {
+                        "label_col": "label_movie",
+                        "task_type": "classification",
+                        "split_pct": [0.8, 0.1, 0.1],
+                        "mask_field_names": ["train_mask_movie",
+                                             "val_mask_movie",
+                                             "test_mask_movie"]
+                    },
+                ]
+            },
+            {
+                "node_type": "user",
+                ......
+                ],
+                "labels": [
+                    {
+                        "label_col": "label_user",
+                        "task_type": "classification",
+                        "split_pct": [0.2, 0.2, 0.6],
+                        "mask_field_names": ["train_mask_user",
+                                             "val_mask_user",
+                                             "test_mask_user"]
+                    },
+                ]
+            },
+        ],
+        ......
+    }
+
+The above configuration defines two classification tasks, one for the **movie** nodes and one for the
+**user** nodes. Each node type has its own ``label_col`` and its own train/validation/test mask fields.
+You can then follow the instructions in :ref:`Run graph construction` to use the GraphStorm
+construction tool for creating partitioned graph data.
+
+Define multi-task configurations for model training
+.....................................................
+
+Now, you can specify two training tasks by providing the ``multi_task_learning`` configurations in
+the training configuration YAML file, like the example below.
+
+.. code-block:: yaml
+
+    ---
+    version: 1.0
+    gsf:
+      basic:
+        ...
+      multi_task_learning:
+        - node_classification:
+            target_ntype: "movie"
+            label_field: "label_movie"
+            mask_fields:
+              - "train_mask_movie"
+              - "val_mask_movie"
+              - "test_mask_movie"
+            num_classes: 10
+            task_weight: 0.5
+        - node_classification:
+            target_ntype: "user"
+            label_field: "label_user"
+            mask_fields:
+              - "train_mask_user"
+              - "val_mask_user"
+              - "test_mask_user"
+            task_weight: 1.0
+      ...
+
+The above configuration defines one classification task for the **movie** node type and another one
+for the **user** node type. The two node classification tasks use their own label names, i.e.,
+``label_movie`` and ``label_user``, and their own train/validation/test mask fields. It also assigns
+different task weights, prioritizing classification on **user** nodes (``task_weight: 1.0``) over
+classification on **movie** nodes (``task_weight: 0.5``).
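+
+Note that ``num_classes`` describes each classification task individually, so each ``node_classification``
+entry normally carries its own value. As an illustrative sketch (the value ``5`` below is an assumption and
+must match the number of distinct values in your ``label_user`` column), the **user** task would look like:
+
+.. code-block:: yaml
+
+      multi_task_learning:
+        - node_classification:
+            target_ntype: "user"
+            label_field: "label_user"
+            mask_fields:
+              - "train_mask_user"
+              - "val_mask_user"
+              - "test_mask_user"
+            num_classes: 5  # assumed; use the actual number of classes in label_user
+            task_weight: 1.0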
+
+Run multi-task model training
+..............................
+
+You can use the `graphstorm.run.gs_multi_task_learning` command to run multi-task learning tasks,
+like the following example.
+
+.. code-block:: bash
+
+    python -m graphstorm.run.gs_multi_task_learning \
+           --workspace <PATH_TO_WORKSPACE> \
+           --num-trainers 1 \
+           --num-servers 1 \
+           --part-config <PATH_TO_PARTITIONED_GRAPH> \
+           --cf <PATH_TO_CONFIG_YAML>
+
+Run multi-task model inference
+...............................
+
+For inference, you can use the same command line `graphstorm.run.gs_multi_task_learning` with an
+additional argument ``--inference``, as the following:
+
+.. code-block:: bash
+
+    python -m graphstorm.run.gs_multi_task_learning \
+           --inference \
+           --workspace <PATH_TO_WORKSPACE> \
+           --num-trainers 1 \
+           --num-servers 1 \
+           --part-config <PATH_TO_PARTITIONED_GRAPH> \
+           --cf <PATH_TO_CONFIG_YAML> \
+           --save-prediction-path <PATH_TO_OUTPUT>
+
+The prediction results of each prediction task will be saved into different sub-directories under
+<PATH_TO_OUTPUT>. The sub-directories are prefixed with `<task_type>_<ntype/etype>_<label_name>`.
+
+Using multi-target node type training (Not Recommended)
+--------------------------------------------------------
+
+You can also use GraphStorm's multi-target node types configuration, but this method is less
+flexible than the multi-task learning method.
+
+- Train on multiple node types: The users only need to edit the ``target_ntype`` in the model config
+  YAML file to minimize the objective function defined on multiple target node types. For example,
+  by setting ``target_ntype`` as follows, we can jointly optimize the objective function defined
+  on the "movie" and "user" node types.
+
+  .. code-block:: yaml
+
+    target_ntype:
+    - movie
+    - user
+
+- During evaluation, the users need to choose a single node type. For example, by setting
+  ``eval_target_ntype: movie``, GraphStorm will only perform evaluation on the "movie" node type.
+  GraphStorm only supports evaluating on a single node type.
+
+- Per target node type decoder: The users may also want to use a different decoder on each node type,
+  where the output dimension for each decoder may be different. We can achieve this by setting ``num_classes``
+  in the model config YAML file. For example, by setting ``num_classes`` as follows, GraphStorm will
+  create a decoder with an output dimension of 3 for the movie node type, and a decoder with an output
+  dimension of 7 for the user node type.
+
+  .. code-block:: yaml
+
+    num_classes:
+      movie: 3
+      user: 7
+
+- Reweighting on loss function: The users may also want to use a customized loss function with reweighting
+  on each node type, which can be achieved by setting ``multilabel``, ``multilabel_weights``, and
+  ``imbalance_class_weights``. Examples are illustrated below. Our current implementation does
+  not support different node types with different ``multilabel`` settings.
+
+  .. code-block:: yaml
+
+    multilabel:
+      movie: true
+      user: true
+    multilabel_weights:
+      movie: 0.1,0.2,0.3
+      user: 0.1,0.2,0.3,0.4,0.5,0.0
+
+    multilabel:
+      movie: false
+      user: false
+    imbalance_class_weights:
+      movie: 0.1,0.2,0.3
+      user: 0.1,0.2,0.3,0.4,0.5,0.0
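+
+Putting the options above together, a complete multi-target sketch could look like the following
+(assuming the 3-class movie labels and 7-class user labels from the ``num_classes`` example, and
+single-label classification, so each weight list carries exactly one weight per class):
+
+.. code-block:: yaml
+
+    target_ntype:
+    - movie
+    - user
+    num_classes:
+      movie: 3
+      user: 7
+    multilabel:
+      movie: false
+      user: false
+    imbalance_class_weights:
+      movie: 0.1,0.2,0.3                  # 3 weights for 3 movie classes
+      user: 0.1,0.1,0.1,0.2,0.2,0.2,0.1   # 7 weights for 7 user classes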
diff --git a/docs/source/advanced/multi-task-learning.rst b/docs/source/advanced/multi-task-learning.rst
index a58f0337cb..e19050d1a3 100644
--- a/docs/source/advanced/multi-task-learning.rst
+++ b/docs/source/advanced/multi-task-learning.rst
@@ -277,15 +277,15 @@ You can define an edge feature reconstruction task as the following example:
         eval_metric:
         - "mse"
 
-In the configuration, `target_etype` defines the target edge type to which the reconstruct edge feature
-learning will be applied. `reconstruct_efeat_name`` defines the name of the feature to be
+In the configuration, ``target_etype`` defines the target edge type to which the edge feature
+reconstruction learning will be applied. ``reconstruct_efeat_name`` defines the name of the feature to be
 reconstructed. The other configs are same as edge regression tasks.
 
 Run Model Training
 ~~~~~~~~~~~~~~~~~~~
 
-GraphStorm introduces a new command line `graphstorm.run.gs_multi_task_learning` with an additional
-argument `--inference` to run multi-task learning tasks. You can use the following command to start a multi-task training job:
+GraphStorm introduces a new command line `graphstorm.run.gs_multi_task_learning` to run multi-task
+learning tasks. You can use the following command to start a multi-task training job:
 
 .. code-block:: bash
 
@@ -298,7 +298,8 @@ argument `--inference` to run multi-task learning tasks. You can use the followi
 Run Model Inference
 ~~~~~~~~~~~~~~~~~~~~
 
-You can use the same command line `graphstorm.run.gs_multi_task_learning` to run inference as following:
+You can use the same command line `graphstorm.run.gs_multi_task_learning` with an additional
+argument ``--inference`` to run inference, as follows:
 
 .. code-block:: bash
 
@@ -312,7 +313,8 @@ You can use the same command line `graphstorm.run.gs_multi_task_learning` to run
          --save-prediction-path <PATH_TO_OUTPUT>
 
 The prediction results of each prediction tasks (node classification, node regression,
-edge classification and edge regression) will be saved into different sub-directories under PATH_TO_OUTPUT. The sub-directories are prefixed with the `__`.
+edge classification and edge regression) will be saved into different sub-directories under PATH_TO_OUTPUT.
+The sub-directories are prefixed with `<task_type>_<ntype/etype>_<label_name>`.
 
 Run Model Training on SageMaker
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/docs/source/cli/model-training-inference/configuration-run.rst b/docs/source/cli/model-training-inference/configuration-run.rst
index 7c8143d66f..36105c8f24 100644
--- a/docs/source/cli/model-training-inference/configuration-run.rst
+++ b/docs/source/cli/model-training-inference/configuration-run.rst
@@ -397,14 +397,14 @@ General Configurations
     - For link prediction tasks, the default value is ``mrr``.
 
 - **gamma**: Set the value of the hyperparameter denoted by the symbol gamma. Gamma is used in the following cases: i/ focal loss for binary classification ii/ DistMult score function for link prediction, iii/ TransE score function for link prediction, iv/ RotatE score function for link prediction, v/ shrinkage loss for regression.
 
-  - Yaml: ``gamma: 10.0``
-  - Argument: ``--gamma 10.0``
-  - Default value: None
+  - Yaml: ``gamma: 2.0``
+  - Argument: ``--gamma 2.0``
+  - Default value: ``2.0`` in the focal loss function; ``0.2`` in the shrinkage loss function; ``12.0`` in the ``DistMult``, ``RotatE``, and ``TransE`` link prediction decoders.
 
 - **alpha**: Set the value of the hyperparameter denoted by the symbol alpha. Alpha is used in the following cases: i/ focal loss for binary classification and ii/ shrinkage loss for regression.
 
-  - Yaml: ``alpha: 10.0``
-  - Argument: ``--alpha 10.0``
-  - Default value: None
+  - Yaml: ``alpha: 0.25``
+  - Argument: ``--alpha 0.25``
+  - Default value: ``0.25`` in the focal loss function; ``10.0`` in the shrinkage loss function.
 
 Classification and Regression Task
 ```````````````````````````````````
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 4bf679cf23..da167f782f 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -35,7 +35,7 @@ Welcome to the GraphStorm Documentation and Tutorials
 
 .. toctree::
    :maxdepth: 2
-   :caption: Advanced Topics
+   :caption: Practical & Advanced Guides
    :hidden:
    :glob:
 
@@ -44,11 +44,12 @@ Welcome to the GraphStorm Documentation and Tutorials
    advanced/link-prediction
    advanced/advanced-wholegraph
    advanced/multi-task-learning
-   advanced/advanced-usages
    advanced/using-graphbolt
+   advanced/multi-target-ntypes
+   advanced/imbalanced-labels
    advanced/gsprocessing-emr-ec2
 
-GraphStorm is a graph machine learning (GML) framework designed for enterprise use cases. It simplifies the development, training and deployment of GML models on industry-scale graphs (measured in billons of nodes and edges) by providing scalable training and inference pipelines of GML models. GraphStorm comes with a collection of built-in GML models, allowing users to train a GML model with a single command, eliminating the need to write any code. Moreover, GraphStorm provides a wide range of configurations to customiz model implementations and training pipelines, enhancing model performance. In addition, GraphStorm offers a programming interface that enables users to train custom GML models in a distributed manner. Users can bring their own model implementations and leverage the GraphStorm training pipeline for scalability.
+GraphStorm is a graph machine learning (GML) framework designed for enterprise use cases. It simplifies the development, training and deployment of GML models on industry-scale graphs (measured in billions of nodes and edges) by providing scalable training and inference pipelines of GML models. GraphStorm comes with a collection of built-in GML models, allowing users to train a GML model with a single command, eliminating the need to write any code. Moreover, GraphStorm provides a wide range of configurations to customize model implementations and training pipelines, enhancing model performance. In addition, GraphStorm offers a programming interface that enables users to train custom GML models in a distributed manner. Users can bring their own model implementations and leverage the GraphStorm training pipeline for scalability.
 
 Getting Started
 ----------------
@@ -83,16 +84,18 @@ The released GraphStorm APIs list the major components that can help users to de
 
 To help users use these APIs, GraphStorm also released a set of Jupyter notebooks at :ref:`GraphStorm API Programming Example Notebooks`. By running these notebooks, users can explore some APIs, learn how to use APIs to reproduce CLIs pipelines, and then customize GraphStorm components for specific requirements.
 
-Users can find the comprehensive descriptions of these GraphStorm APIs in the :ref:`API Reference` documentations. For unrelease APIs, we encourage users to read their source code. If users want to have more APIs formally released, please raise issues at the `GraphStorm GitHub Repository <https://github.com/awslabs/graphstorm>`_.
+Users can find the comprehensive descriptions of these GraphStorm APIs in the :ref:`API Reference` documentations. For unreleased APIs, we encourage users to read their source code. If users want to have more APIs formally released, please raise issues at the `GraphStorm GitHub Repository <https://github.com/awslabs/graphstorm>`_.
 
-Advanced Topics
-----------------
+Practical and Advanced Guides
+------------------------------
 
 - For users who want to use their own GML models in GraphStorm, follow the :ref:`Use Your Own GNN Models` tutorial to learn the programming interfaces and the steps of how to modify users' own models.
 - For users who want to leverage language models on nodes with text features, follow the :ref:`Use Language Model in GraphStorm` tutorial to learn how to leverage BERT models to use text as node features in GraphStorm.
 - There are various usages of GraphStorm to both speed up training process and help to boost model performance for link prediction tasks. Users can find these usages in the :ref:`Link Prediction Learning in GraphStorm` page.
 - GraphStorm team has been working with NVIDIA team to integrate the NVIDIA's WholeGraph library into GraphStorm for speed-up of feature copy. Users can follow the :ref:`Use WholeGraph in GraphStorm` tutorial to know more details.
-- In v0.3, GraphStorm releases an experimental feature to support multi-task learning on the same graph, allowing users to define multiple training targets on different nodes and edges within a single training loop. Users can check the :ref:`Multi-task Learning in GraphStorm` tutorial to know more details.
+- Since v0.3, GraphStorm has supported multi-task learning on the same graph, allowing users to define multiple training targets on different nodes and edges within a single training loop. Users can check the :ref:`Multi-task Learning in GraphStorm` tutorial for more details.
+- Since v0.4, GraphStorm supports GraphBolt stochastic training. GraphBolt is a new data loading module for DGL that enables faster and more efficient graph sampling, potentially leading to significant efficiency benefits. For details on using GraphBolt in GraphStorm, follow the :ref:`Using GraphBolt to speed up training and inference` guide.
+- For frequently asked questions, there are several dedicated guides. The :ref:`Multiple Target Node Types Training` document explains how to train on multiple target node types. The :ref:`Deal with Imbalanced Labels in Classification/Regression` guide lists several built-in features that can help tackle the challenge of imbalanced labels. If users want to use their own AWS EMR clusters for graph processing, the :ref:`Running distributed graph processing on customized EMR-on-EC2 clusters` guide provides more details.
 
 Contribution
 -------------