Abstract
Convolutional Neural Networks (CNNs) have made impressive progress in image recognition, but they cannot model long-range temporal structures because their convolution operations are unidirectional in the temporal dimension. Here, we propose a novel bidirectional two-stream network (BDTSNet) that uses two attention mechanisms to comprehensively model spatiotemporal features in both directions of an action sequence. Specifically, we first design a simplified attention mechanism that selects keyframes as the network input, which prevents highly redundant frames from misleading recognition. Each branch of the BDTSNet then comprises two networks that perform convolution along the forward and backward action sequences, simulating how a bidirectional Long Short-Term Memory (BLSTM) network processes time-series data. This structural design incorporates the semantic information from reversed sequences to represent the described actions more accurately. Finally, a second attention mechanism assigns different importance weights when fusing the spatiotemporal features from all network branches, and the fused feature characterizes the action. Extensive experiments show that the proposed approach efficiently recognizes complex actions, achieving 81.6% and 98.1% recognition accuracy on two widely used action recognition datasets, HMDB51 and UCF101, respectively, while reducing the number of input RGB images by three-quarters.
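The three stages named in the abstract (keyframe selection, forward and backward processing, attention-weighted fusion) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the frame-difference score in `select_keyframes`, the stand-in `encode` function, and the `query` vector are all hypothetical choices made only to show the data flow, and the 25% keep ratio simply mirrors the reported three-quarters reduction in input frames.

```python
import numpy as np

def select_keyframes(frames, keep_ratio=0.25):
    """Keep the frames that differ most from their predecessor.
    The frame-difference score is an assumed stand-in for the paper's attention.
    Expects at least two frames."""
    diffs = np.abs(np.diff(frames, axis=0)).reshape(len(frames) - 1, -1).mean(axis=1)
    scores = np.concatenate([[diffs.max()], diffs])  # give the first frame a high score
    k = max(1, int(round(len(frames) * keep_ratio)))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, back in temporal order
    return frames[keep]

def encode(sequence):
    """Hypothetical order-sensitive encoder standing in for a convolutional branch."""
    weights = np.linspace(0.5, 1.5, len(sequence))  # later frames count more
    per_frame = sequence.reshape(len(sequence), -1, sequence.shape[-1]).mean(axis=1)
    return weights @ per_frame / weights.sum()

def attention_fuse(branch_features, query):
    """Softmax-weight each branch feature by its dot product with a query vector."""
    feats = np.stack(branch_features)   # (branches, dim)
    logits = feats @ query
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ feats, weights

rng = np.random.default_rng(0)
video = rng.random((16, 8, 8, 3))        # 16 toy RGB frames
keyframes = select_keyframes(video)      # a quarter of the frames survive
forward, backward = keyframes, keyframes[::-1]
query = rng.random(3)                    # stand-in for a learned parameter
fused, weights = attention_fuse([encode(forward), encode(backward)], query)
```

Because `encode` weights later frames more heavily, the forward and backward passes yield different features, which is the property the bidirectional design relies on; in a trained model the fusion weights would be learned rather than derived from a random query.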
| Field | Value |
|---|---|
| Original language | Undefined/Unknown |
| Pages (from-to) | 117558 |
| Number of pages | 1 |
| Journal | Signal Processing: Image Communication |
| Volume | 146 |
| Publication status | Published - 2026 |
Keywords
- Action recognition
- Bidirectional network
- Spatiotemporal features
- Keyframes selection
- Feature representation