‘BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation’

“We address the task of supervised action segmentation which aims to partition a video into non-overlapping segments, each representing a different action. Recent works apply transformers to perform temporal modeling at the frame-level, which suffer from high computational cost and cannot well capture action dependencies over long temporal horizons. To address these issues, we propose an efficient BI-level Temporal modeling (BIT) framework that learns explicit action tokens to represent action segments, in parallel performs temporal modeling on frame and action levels, while maintaining a low computational cost.”
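To make the cost argument concrete, here is a minimal sketch that counts attention scores for frame-level modeling versus routing information through a small set of action tokens. It is not the authors' architecture: the abstract does not say how the two levels exchange information, so cross-attention is assumed here, and the sizes `T`, `K`, `d` and the helper `attend` are hypothetical names chosen for illustration.

```python
import math
import random


def attend(queries, keys):
    """Scaled dot-product attention where the keys also serve as values.

    Returns the attended outputs and the number of query-key scores computed.
    """
    d = len(keys[0])
    outputs, num_scores = [], 0
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        num_scores += len(scores)
        # Numerically stable softmax over the scores for this query.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        outputs.append([sum(w * k[j] for w, k in zip(weights, keys)) for j in range(d)])
    return outputs, num_scores


random.seed(0)
T, K, d = 200, 5, 8  # hypothetical: number of frames, action tokens, feature size
frames = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]

# Assumed bi-level exchange: tokens read from frames, tokens attend to each
# other, then frames read back from the updated tokens.
tokens_from_frames, n_tokens_read = attend(tokens, frames)
tokens_mixed, n_tokens_self = attend(tokens_from_frames, tokens_from_frames)
frames_updated, n_frames_read = attend(frames, tokens_mixed)
bi_level_scores = n_tokens_read + n_tokens_self + n_frames_read

# Baseline: every frame attends to every other frame.
_, frame_level_scores = attend(frames, frames)

print("bi-level scores:   ", bi_level_scores)
print("frame-level scores:", frame_level_scores)
```

Under these assumptions the bi-level path computes 2·T·K + K² scores, which grows linearly in the number of frames, while frame-level self-attention computes T² scores. The gap widens as videos get longer, provided the number of action tokens stays small.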

Find the paper and the full list of authors on arXiv.
