Endangered Species Recognition from Camera Trap Images with Vision Transformers

Authors

  • Yunke Wang

DOI:

https://doi.org/10.61173/9dt9q857

Keywords:

Endangered species, Vision Transformer, CNN, Deep learning, Grad-CAM, Test-time augmentation.

Abstract

Automatic identification of endangered species from camera trap images is increasingly important for wildlife conservation, yet the task remains challenging: lighting conditions vary, animals are often partially occluded, and subtle inter-species differences demand fine-grained visual discrimination. This study proposes a dual-branch deep learning ensemble that integrates Swin Transformer and ConvNeXt architectures to capture complementary global and local features for the identification of 10 rare species. Using 2,000 research-grade images from iNaturalist, the model reached a Top-1 accuracy of 90.83%, 3.5% higher than EfficientNet-B0, ViT-B16, and Swin-T. Training combined progressive unfreezing with layer-wise learning rate decay, achieving stable multi-scale feature adaptation on limited data while suppressing overfitting. Grad-CAM visualizations confirm that the model consistently attends to anatomically discriminative regions, such as rosette patterns and stripe configurations, thereby reducing inter-species confusion. Test-time augmentation further enhances robustness against occlusion and illumination variability. The final system supports practical edge deployment, running at 15 FPS on an NVIDIA Jetson Nano with INT8 quantization. This work demonstrates that hybrid Transformer–CNN architectures are effective and deployable for real-world conservation monitoring.
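The layer-wise learning rate decay mentioned above assigns smaller learning rates to earlier layers of a pretrained backbone so that low-level features change slowly while the new classification head adapts quickly. A minimal sketch of the schedule is below; the decay factor, group count, and function name are illustrative assumptions, not the paper's exact hyperparameters.

```python
def layerwise_lrs(base_lr, num_groups, decay=0.8):
    """Return one learning rate per layer group.

    Group 0 is the earliest (input-side) group; the last group is the
    classification head, which keeps the full base learning rate.
    Group i receives base_lr * decay ** (num_groups - 1 - i).
    """
    return [base_lr * decay ** (num_groups - 1 - i) for i in range(num_groups)]

# Example: 4 parameter groups, halving the rate at each step toward the input.
lrs = layerwise_lrs(base_lr=1e-3, num_groups=4, decay=0.5)
# Earliest layers train slowest; the head uses the full base_lr.
```

In a typical PyTorch setup these per-group rates would be passed as separate parameter groups to the optimizer, which is how such schedules are usually combined with progressive unfreezing.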

Published

2026-02-28

Section

Articles