Frequency Guidance Matters: Skeletal Action Recognition by Frequency-Aware Mixed Transformer [ACM MM 2024]
Wenhan Wu, Ce Zheng, Zihao Yang, Srijan Das, Chen Chen, Aidong Lu
Recently, transformers have demonstrated great potential for modeling long-term dependencies from skeleton sequences and have therefore gained ever-increasing attention in skeleton action recognition. However, existing transformer-based approaches rely heavily on the naive attention mechanism to capture spatiotemporal features, which falls short in learning discriminative representations for actions that exhibit similar motion patterns. To address this challenge, we introduce the Frequency-aware Mixed Transformer (FreqMixFormer), specifically designed for recognizing similar skeletal actions with subtle discriminative motions. First, we introduce a frequency-aware attention module to unweave skeleton frequency representations by embedding joint features into frequency attention maps, aiming to distinguish discriminative movements based on their frequency coefficients. Subsequently, we develop a mixed transformer architecture that incorporates spatial features with frequency features to model comprehensive frequency-spatial patterns. Additionally, a temporal transformer is proposed to extract global correlations across frames. Extensive experiments show that FreqMixFormer outperforms state-of-the-art methods on three popular skeleton action recognition datasets: NTU RGB+D, NTU RGB+D 120, and NW-UCLA.
The overall design of our Frequency-aware Mixed Transformer. Our FreqMixFormer model overcomes the limitations of traditional transformer-based methods, which cannot effectively recognize confusing actions such as reading and writing due to their straightforward processing of skeleton sequences. As highlighted with the colored boxes, FreqMixFormer introduces the frequency domain and extracts high-frequency features, which often indicate subtle and dynamic movements (red), and low-frequency features, which are associated with slow and steady movements (blue). These features are then fused with spatial features. Our results demonstrate that the integrated frequency-spatial features significantly improve the model's ability to discern discriminative joint correlations.
We propose a Frequency-aware Attention Block (FAB) to investigate frequency features within skeletal sequences. A frequency operator is specifically designed to improve the learning of frequency coefficients, thereby enhancing the ability to capture discriminative correlations among joints.
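For intuition, the frequency operator can be pictured as a learnable reweighting of DCT coefficients: project the features into the frequency domain, scale each frequency band, and project back. The sketch below is a minimal illustration under assumed names and an assumed temporal axis, not the actual FAB implementation.

```python
# Illustrative sketch only: a learnable reweighting of DCT coefficients.
# The axis choice (time) and module layout are assumptions, not the paper's code.
import torch
import torch.nn as nn


def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis of shape (n, n)."""
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    t = torch.arange(n, dtype=torch.float32).unsqueeze(0)
    basis = torch.cos(torch.pi * (2 * t + 1) * k / (2 * n)) * (2.0 / n) ** 0.5
    basis[0] /= 2.0 ** 0.5  # first row scaled for orthonormality
    return basis


class FrequencyOperator(nn.Module):
    """Reweight frequency coefficients so subtle (high-frequency) motion
    components can be emphasized or suppressed before attention."""

    def __init__(self, num_frames: int):
        super().__init__()
        self.register_buffer("dct", dct_matrix(num_frames))
        self.gain = nn.Parameter(torch.ones(num_frames))  # one gain per band

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels)
        freq = torch.einsum("kt,btc->bkc", self.dct, x)     # DCT over time
        freq = freq * self.gain.view(1, -1, 1)              # learnable scaling
        return torch.einsum("kt,bkc->btc", self.dct, freq)  # inverse DCT
```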
Building on the FAB, we introduce the Frequency-aware Mixed Transformer (FreqMixFormer) to extract frequency-spatial joint correlations. The model also incorporates a temporal transformer designed to capture temporal features across frames.
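A compact sketch of the mixing idea, assuming a learnable weighted sum as the fusion rule and joint pooling before temporal attention (both are simplifications for illustration; the actual architecture is described in the paper):

```python
# Illustrative sketch: fuse spatial and frequency attention outputs, then
# model cross-frame correlations. The fusion rule and pooling are assumptions.
import torch
import torch.nn as nn


class FreqSpatialMixer(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable fusion weight
        self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x_spatial: torch.Tensor, x_freq: torch.Tensor) -> torch.Tensor:
        # x_spatial, x_freq: (batch, frames, joints, dim)
        B, T, J, D = x_spatial.shape
        s_in = x_spatial.reshape(B * T, J, D)
        f_in = x_freq.reshape(B * T, J, D)
        s, _ = self.spatial_attn(s_in, s_in, s_in)      # joint-to-joint attention
        f, _ = self.freq_attn(f_in, f_in, f_in)         # attention on frequency features
        mixed = self.alpha * s + (1 - self.alpha) * f   # frequency-spatial fusion
        frames = mixed.reshape(B, T, J, D).mean(dim=2)  # pool joints -> frame tokens
        return self.temporal(frames)                    # global cross-frame correlations
```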
- Code for FreqMixFormerV2 will be released soon.
- The FreqMixFormerV2 paper will be available on arXiv soon.
- FreqMixFormerV2 (a lightweight version of FreqMixFormer) has been accepted to the 19th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2025). 2024/12/13
- Code updated. 2024/11/10
- Created the GitHub repository and project website. 2024/7/18
- Preprint is available on arXiv. 2024/7/18
NTU RGB+D 60 and 120
- Request dataset: https://rose1.ntu.edu.sg/dataset/actionRecognition
- Download the skeleton-only datasets:
  i. `nturgbd_skeletons_s001_to_s017.zip` (NTU RGB+D 60)
  ii. `nturgbd_skeletons_s018_to_s032.zip` (NTU RGB+D 120)
  iii. Extract the above files to ./data/nturgbd_raw
NW-UCLA
- Download the dataset from here
- Move `all_sqe` to ./data/NW-UCLA
Put the downloaded data into the following directory structure:

- NW-UCLA/
  - all_sqe
    - ... # raw data of NW-UCLA
- ntu/
- ntu120/
- nturgbd_raw/
  - nturgb+d_skeletons/ # from `nturgbd_skeletons_s001_to_s017.zip`
    - ...
  - nturgb+d_skeletons120/ # from `nturgbd_skeletons_s018_to_s032.zip`
    - ...
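After extraction, a quick file count can catch incomplete downloads. The expected totals follow the official release (56,880 skeleton files for setups 1-17 and 57,600 for setups 18-32); this snippet is a convenience, not part of the repository:

```python
# Optional sanity check for the raw NTU skeleton files.
import glob

n60 = len(glob.glob("./data/nturgbd_raw/nturgb+d_skeletons/*.skeleton"))
n120 = len(glob.glob("./data/nturgbd_raw/nturgb+d_skeletons120/*.skeleton"))
print(f"setups 1-17:  {n60} files (expected 56880)")
print(f"setups 18-32: {n120} files (expected 57600)")
```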
- Generate the NTU RGB+D 60 or NTU RGB+D 120 dataset:
cd ./data/ntu # or cd ./data/ntu120
# Get the skeleton data of each performer
python get_raw_skes_data.py
# Remove bad skeletons
python get_raw_denoised_data.py
# Center each skeleton sequence on its first frame
python seq_transformation.py
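Conceptually, the last script translates every frame by a fixed offset so the whole sequence is centered on the first frame's body center. A simplified NumPy sketch (not the repository code; the center-joint index is an assumption for illustration):

```python
# Simplified illustration of the centering step in seq_transformation.py.
# The center-joint index is an assumption; NTU defines 25 joints per body.
import numpy as np


def center_on_first_frame(skeleton: np.ndarray, center_joint: int = 1) -> np.ndarray:
    """skeleton: (frames, joints, 3) xyz coordinates of one performer."""
    origin = skeleton[0, center_joint]  # body center in the first frame
    return skeleton - origin            # same translation applied to all frames
```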
- Find the training commands in train.sh
- Find the testing commands in testing.sh
- Find the ensemble commands in ensemble.sh (a conceptual sketch of score fusion follows below)
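For reference, multi-stream results in skeleton action recognition are usually fused by a weighted sum of per-class scores. A hedged sketch of that pattern (file names, score format, and weights are placeholders; ensemble.sh contains the repository's actual setup):

```python
# Generic multi-stream score fusion (illustrative; see ensemble.sh for the
# repository's actual commands). File names and weights are placeholders.
import pickle
import numpy as np


def ensemble(score_files, weights):
    """Weighted sum of per-stream class scores, then argmax per sample."""
    fused = None
    for path, w in zip(score_files, weights):
        with open(path, "rb") as f:
            scores = pickle.load(f)  # assumed: dict sample_id -> class-score vector
        arr = np.stack([scores[k] for k in sorted(scores)])
        fused = w * arr if fused is None else fused + w * arr
    return fused.argmax(axis=1)  # final predicted class per sample


# e.g., fuse joint- and bone-stream scores with equal weights:
# preds = ensemble(["joint_score.pkl", "bone_score.pkl"], [1.0, 1.0])
```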
If you find this code useful for your research, please consider citing the following paper:
@inproceedings{wu2024frequencyguidancemattersskeletal,
  author    = {Wenhan Wu and Ce Zheng and Zihao Yang and Chen Chen and Srijan Das and Aidong Lu},
  title     = {Frequency Guidance Matters: Skeletal Action Recognition by Frequency-Aware Mixed Transformer},
  booktitle = {ACM Multimedia 2024},
  year      = {2024}
}
The code is mainly based on Skeleton-MixFormer. The data processing is borrowed from CTR-GCN. Thanks for their amazing work!
For any questions, feel free to create a new issue or contact:
Wenhan Wu: [email protected]