Comparative Evaluation of Deep Learning Architectures for Human Action Recognition
Abstract
Human action recognition, or video-based activity classification, remains a challenging task due to complex spatiotemporal dynamics, viewpoint variation, and motion blur. This study presents a comparative evaluation of three state-of-the-art deep learning architectures for video action recognition: Temporal Segment Networks (TSN), 3D Convolutional Networks (C3D), and Two-Stream Inflated 3D ConvNets (I3D). The models were trained and evaluated on the UCF101 dataset, which comprises 13,320 videos spanning 101 action categories. Extensive preprocessing was performed to extract and resize video frames and compute optical flow fields. Experimental results show that C3D achieved the highest test accuracy at 83.2%, followed by I3D at 74.5% and TSN at 38.1%. Despite their computational cost, 3D convolutional architectures provided superior spatiotemporal feature learning. The findings highlight the trade-off between model complexity and training efficiency, and suggest that with greater computational resources, both I3D and TSN could approach current state-of-the-art benchmarks.