Embedded Systems Course Final Project
View the Project on GitHub Alanhsiu/ECM202A_2025Fall_Project_14
| Name | Role | Email | GitHub |
|---|---|---|---|
| Cheng-Hsiu (Alan) Hsieh | Project Lead / ML Engineer | alanhsiu@ucla.edu | @Alanhsiu |
| Daniel Lee | Hardware Integration | daniellee1106@ucla.edu | @Daniel-Lee-1106 |
| Ting-Yu Yeh | Hardware Integration | tingyu0225@ucla.edu | @TingYu0225 |
This project implements an Adaptive Multimodal Deep Network (ADMN) for robust gesture recognition using RGB-D data in real-world scenarios with varying data quality. Traditional multimodal systems allocate fixed computational resources regardless of input quality, leading to inefficiency when one modality is corrupted. Our system intelligently allocates computational resources across RGB and Depth modalities based on input quality assessment, achieving 100% accuracy with a 12-layer adaptive budget while using only half the layers of a fixed 24-layer baseline. We successfully deployed the model on a Raspberry Pi 5 for real-time edge inference, demonstrating practical applicability for embedded gesture recognition systems.
Gesture recognition systems are increasingly deployed in real-world environments where input data quality varies significantly. RGB cameras may struggle in low-light conditions, while depth sensors can be occluded or produce noisy measurements. Traditional multimodal fusion approaches allocate fixed computational resources to each modality regardless of their quality, leading to:
Our objective is to build an adaptive gesture recognition system that:
Current approaches to multimodal gesture recognition include:
The key gap is the lack of quality-aware dynamic layer allocation across modalities that can respond to real-time input conditions while maintaining a fixed computational budget.
Our approach introduces several novel elements:
We expect this to succeed because:
If successful, this project could:
The main challenges we addressed:
| Metric | Target | Achieved |
|---|---|---|
| Overall Accuracy | ≥95% | ✅ 100.00% (12L) |
| Accuracy under Corruption | ≥90% per type | ✅ 100% clean, 100% depth-occ, 97.5% low-light |
| Adaptive Allocation | Learn corruption-aware patterns | ✅ 11:1 RGB on occlusion, 1:11 Depth on low-light |
| Edge Latency | <1 second per frame | ✅ 727 ms (12L), 521 ms (8L) |
| Layer Reduction vs Baseline | ≥30% fewer layers | ✅ 50% reduction (12 vs 24 layers) |
[Li16] proposed RGB-D fusion networks using dual-stream CNNs with late fusion. While effective on clean RGB-D inputs, the method does not consider scenarios where one modality becomes unreliable. Our approach addresses this with dynamic allocation.
[Li18] proposed a cross-modal attentional framework for RGB-D object detection. However, their fusion relies on feature concatenation, which implies static feature weighting that cannot adapt to runtime input quality variations.
[Teerapittayanon16] developed BranchyNet for early exit in CNNs, reducing computation for "easy" samples. This inspired our layer-wise allocation, which we extend to the multimodal setting.
[Meng21] proposed AdaFuse for adaptive temporal fusion in video understanding, dynamically selecting which feature channels to compute or reuse. Our work extends this by allocating computational depth within each modality.
[Dosovitskiy21] introduced Vision Transformer (ViT), achieving strong results on image classification. We use ViT backbones for both RGB and Depth streams.
[He22] developed Masked Autoencoders (MAE) for self-supervised ViT pretraining. We leverage MAE-pretrained weights for better initialization.
[Fan21] proposed LayerDrop for efficient transformer training by randomly dropping layers. We use this during Stage 1 training for regularization.
[Hospedales21] surveyed meta-learning approaches that could enable runtime adaptation. Our QoI module can be viewed as a learned quality assessor.
[Wu25 - ADMN Paper] introduced the Adaptive Deep Multimodal Network framework for layer-wise allocation based on input noise levels. This is our primary reference and inspiration.
[Jang17] introduced Gumbel-Softmax for differentiable sampling from categorical distributions. We use this for our layer allocation decisions.
[Maddison17] proposed the Concrete distribution (introduced concurrently with Gumbel-Softmax [Jang17]) to facilitate differentiable sampling from categorical distributions.
[Bengio13] analyzed the Straight-Through Estimator for training networks with discrete components. This is critical for gradient flow through our allocation module.
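As a toy illustration of the straight-through trick (a minimal sketch, not the project's code): the forward pass uses the hard, discrete values, while the backward pass treats the operation as the identity on the soft values.

```python
import torch

# Straight-through estimator in miniature: binarize in the forward pass,
# but route gradients through the soft values as if the op were identity.
soft = torch.tensor([0.3, 0.8, 0.6], requires_grad=True)
hard = (soft > 0.5).float()        # discrete decision; blocks gradients
out = hard - soft.detach() + soft  # numerically equals `hard`
out.sum().backward()

# out is [0., 1., 1.], yet soft.grad is [1., 1., 1.]:
# the discrete step is invisible to backprop.
```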


Our system follows a three-stage pipeline:
```
┌─────────────────────────────────────────────────┐
│ 1. Data Collection & Preprocessing              │
│    (Collect Clean & Corrupted RGB-D Data)       │
└────────────────────────┬────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│ 2. Model Training (Two-Stage)                   │
│    Stage 1: Baseline Classifier                 │
│    Stage 2: Adaptive Controller                 │
└────────────────────────┬────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│ 3. Edge Deployment & Real-Time Inference        │
│    (Raspberry Pi 5 with Intel RealSense)        │
└─────────────────────────────────────────────────┘
```
```
┌─────────────────────┐     ┌─────────────────────┐
│      RGB Input      │     │     Depth Input     │
│      224×224×3      │     │      224×224×3      │  (depth expanded to 3ch)
└──────────┬──────────┘     └──────────┬──────────┘
           │                           │
           ▼                           ▼
┌─────────────────────┐     ┌─────────────────────┐
│  ViT (12L) Backbone │     │  ViT (12L) Backbone │
│    (RGB Features)   │     │   (Depth Features)  │
└──────────┬──────────┘     └──────────┬──────────┘
           │                           │
           └─────────────┬─────────────┘
                         ▼
              ┌─────────────────────┐
              │  Fusion Transformer │
              │ (Multimodal Fusion) │
              └──────────┬──────────┘
                         ▼
              ┌─────────────────────┐
              │      Classifier     │
              │     (4 Classes)     │
              └─────────────────────┘
```
```
 ┌─────────────┐     ┌─────────────┐
 │  RGB Input  │     │ Depth Input │
 └──────┬──────┘     └──────┬──────┘
        │                   │
        └─────────┬─────────┘
                  ▼
   ┌──────────────────────────┐
   │      ADMN Controller     │
   ├──────────────────────────┤
   │ 1. QoI Module            │
   │    (Quality Perception)  │
   │    Lightweight CNN       │
   ├──────────────────────────┤
   │ 2. Layer Allocator       │
   │    (Decision Making)     │
   │    Gumbel-Softmax + STE  │
   │    - Total budget: L     │
   │    - Output: L_rgb, L_d  │
   └────────────┬─────────────┘
                ▼
 [Allocation Mask: RGB L₁ : Depth L₂]
    (L₁ + L₂ = L, e.g., 11:1 or 1:11)
                │
        ┌───────┴───────┐
        ▼               ▼
 ┌─────────────┐ ┌─────────────┐
 │  ViT (L₁)   │ │  ViT (L₂)   │
 │     RGB     │ │    Depth    │
 │   (Frozen)  │ │   (Frozen)  │
 └──────┬──────┘ └──────┬──────┘
        │               │
        └───────┬───────┘
                ▼
        ┌──────────────┐
        │ Fusion & CLS │
        │   (Frozen)   │
        └──────┬───────┘
               ▼
           [Output]
```
| Property | Value |
|---|---|
| Gesture Classes | 4 (standing, left_hand, right_hand, both_hands) |
| Corruption Types | 3 (clean, depth_occluded, low_light) |
| Total Samples | 600 (200 per corruption type, 50 per class per type) |
| RGB Resolution | 224×224×3 |
| Depth Resolution | 224×224×1 (expanded to 3 channels for ViT) |
| Train/Val Split | 80/20 stratified by corruption Γ class |
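The split arithmetic above can be sketched as follows (illustrative only; the sample IDs are placeholders, not the project's file naming):

```python
import random

# 3 corruption types x 4 classes x 50 samples = 600 total;
# an 80/20 split stratified per (type, class) stratum gives 40/10 each.
random.seed(0)
corruptions = ["clean", "depth_occluded", "low_light"]
classes = ["standing", "left_hand", "right_hand", "both_hands"]

train, val = [], []
for c in corruptions:
    for g in classes:
        stratum = [f"{c}/{g}/{i}" for i in range(50)]  # placeholder sample IDs
        random.shuffle(stratum)
        train += stratum[:40]
        val += stratum[40:]

print(len(train), len(val))  # 480 120
```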
```python
from torchvision import transforms

# RGB Transform
rgb_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# Depth Transform (expand the single channel to 3 so the ViT patch embed fits)
depth_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.repeat(3, 1, 1)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
```
Augmentation Strategy (no horizontal flips to preserve left/right semantics):
A lightweight CNN that extracts quality-relevant features from both modalities:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QoIModule(nn.Module):
    def __init__(self, in_channels=6):  # RGB(3) + Depth(3)
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 32, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(64, 128, 3, stride=2, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(128, 64)

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)  # [B, 6, 224, 224]
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)  # [B, 64] quality features
```
```python
class LayerAllocator(nn.Module):
    def __init__(self, total_layers=12, num_vit_layers=12):
        super().__init__()
        self.total_layers = total_layers
        self.num_vit_layers = num_vit_layers
        self.allocator = nn.Sequential(
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, 2 * num_vit_layers)  # one RGB/Depth logit pair per layer
        )

    def forward(self, qoi_features, temperature=1.0):
        logits = self.allocator(qoi_features)             # [B, 24]
        logits = logits.view(-1, 2, self.num_vit_layers)  # [B, 2, 12]
        # Gumbel-Softmax for differentiable sampling
        soft_allocation = F.gumbel_softmax(logits, tau=temperature, hard=False, dim=1)
        # Straight-Through Estimator for a discrete forward pass
        hard_allocation = (soft_allocation == soft_allocation.max(dim=1, keepdim=True)[0]).float()
        allocation = hard_allocation - soft_allocation.detach() + soft_allocation
        return allocation  # [B, 2, 12] binary masks
```
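A quick standalone check of the mask's shape properties (using random logits in place of QoI features): `gumbel_softmax` over the modality dimension assigns every layer slot to exactly one of the two backbones, so the per-sample allocation always sums to the layer budget.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 2, 12)  # [B, modality (RGB/Depth), layer slot]
soft = F.gumbel_softmax(logits, tau=1.0, hard=False, dim=1)
hard = (soft == soft.max(dim=1, keepdim=True)[0]).float()
mask = hard - soft.detach() + soft  # straight-through, as in LayerAllocator

# Every layer slot picks exactly one modality, so the per-slot sum is 1
# and each sample's total allocation equals the 12-slot budget.
per_slot = mask.detach().sum(dim=1)  # [B, 12], all ones
budget = hard.sum(dim=(1, 2))        # [B], all 12.0
```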
Stage 1: Standard cross-entropy

L₁ = CrossEntropy(ŷ, y)

Stage 2: Classification + allocation supervision

L₂ = α · L_cls + β · L_alloc

where:
- L_cls = CrossEntropy(ŷ, y)
- L_alloc = MSE(actual_ratio, target_ratio)

target_ratio by corruption:
- clean: [0.5, 0.5]
- depth_occluded: [0.9, 0.1] → favor RGB
- low_light: [0.1, 0.9] → favor Depth
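For concreteness, plugging hypothetical numbers into L_alloc (pure arithmetic; the 11:1 split is an illustrative example, not a logged run):

```python
# Suppose the controller gives 11 of 12 layers to RGB on a depth-occluded
# sample; compare the realized ratio to the [0.9, 0.1] supervision target.
L_total, L_rgb, L_depth = 12, 11, 1

actual_ratio = [L_rgb / L_total, L_depth / L_total]  # [0.9167, 0.0833]
target_ratio = [0.9, 0.1]

# Mean squared error over the two ratio entries
l_alloc = sum((a - t) ** 2 for a, t in zip(actual_ratio, target_ratio)) / 2
print(f"{l_alloc:.6f}")  # near zero: an 11:1 split is close to the target
```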
| Component | Version/Details |
|---|---|
| Python | 3.8+ |
| PyTorch | 2.0+ |
| timm | 0.9.x (for ViT models) |
| OpenCV | 4.x (image processing) |
| TensorBoard | Logging and visualization |
Training Environment:
Deployment Environment:
| Total Layers | GFLOPs | Avg Latency (ms) | Accuracy |
|---|---|---|---|
| 4 | 2.11 | 294 | 37.50% |
| 6 | 3.04 | 377 | 80.00% |
| 8 | 3.97 | 521 | 98.33% |
| 12 | 5.84 | 727 | 100.00% |
| 24 (baseline) | 11.43 | 1201 | 100.00% |
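A quick linear read of the table (arithmetic on the reported numbers, not a new measurement) suggests latency grows roughly linearly in the layer budget, with a fixed overhead for preprocessing, fusion, and classification:

```python
# Slope/intercept from the table's endpoints: per-layer cost plus overhead.
layers = [4, 6, 8, 12, 24]
latency_ms = [294, 377, 521, 727, 1201]

per_layer = (latency_ms[-1] - latency_ms[0]) / (layers[-1] - layers[0])
overhead = latency_ms[0] - layers[0] * per_layer
print(f"~{per_layer:.1f} ms/layer, ~{overhead:.0f} ms overhead")
# e.g. predicts the 12-layer budget at overhead + 12*per_layer, close to
# the measured 727 ms
```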
| Decision | Rationale |
|---|---|
| Two-Stage Training | Separating feature learning from allocation learning simplifies optimization and prevents controller from degrading backbone |
| Frozen Backbones in Stage 2 | Ensures pre-trained features are preserved; only controller adapts |
| MAE Pretraining | Self-supervised pretraining provides better initialization than ImageNet for RGB-D |
| Gumbel-Softmax + STE | Enables end-to-end training through discrete allocation decisions |
| Per-Layer Allocation | Finer-grained control than per-modality; allows partial use of each backbone |
| No Horizontal Flips | Left/right gestures would be mislabeled with flips |
| Corruption-Type Supervision | Provides clear signal for allocation learning without explicit quality labels |
| Model | Total Layers | Best Val Acc | Test Acc | Notes |
|---|---|---|---|---|
| Stage 1 (Upper Bound) | 24 (12+12) | 100.00% | 100.00% | Fixed allocation |
| Stage 2 Adaptive | 12 | 100.00% | 100.00% | Quality-aware |
| Stage 2 Adaptive | 8 | 98.33% | 98.33% | Budget-efficient |
| Stage 2 Adaptive | 6 | 80.00% | 80.00% | Mid-budget |
| Stage 2 Adaptive | 4 | 37.50% | 37.50% | Too constrained |
The controller learned strong corruption-aware allocation patterns:
| Corruption Type | RGB Layers | Depth Layers | Strategy |
|---|---|---|---|
| Clean | 6.1 / 12 | 5.9 / 12 | Nearly balanced |
| Depth Occluded | 7.5 / 12 | 4.5 / 12 | Favor RGB 🔴 |
| Low Light | 2.0 / 12 | 10.0 / 12 | Favor Depth 🔵 |
| Allocation (RGB / Depth) | Test Accuracy | Clean | Depth Occ | Low Light |
|---|---|---|---|---|
| 12 / 0 (RGB only) | 92.50% | 95.0% | 100.0% | 82.5% |
| 0 / 12 (Depth only) | 85.83% | 97.5% | 60.0% | 100.0% |
| 6 / 6 (Uniform) | 73.33% | 87.5% | 72.5% | 60.0% |
| Dynamic (Ours) | 100.00% | 100.0% | 100.0% | 100.0% |
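Since the three corruption splits are equally sized (200 samples each), the overall test accuracy in the table is just the mean of the three per-corruption accuracies; a quick consistency check on the reported numbers:

```python
# Per-corruption accuracies from the table: [clean, depth_occ, low_light]
baselines = {
    "rgb_only":   [95.0, 100.0, 82.5],
    "depth_only": [97.5, 60.0, 100.0],
    "uniform":    [87.5, 72.5, 60.0],
    "dynamic":    [100.0, 100.0, 100.0],
}
overall = {name: round(sum(acc) / 3, 2) for name, acc in baselines.items()}
print(overall)
# {'rgb_only': 92.5, 'depth_only': 85.83, 'uniform': 73.33, 'dynamic': 100.0}
```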






| Class | Clean | Depth Occluded | Low Light |
|---|---|---|---|
| Standing | 100% | 100% | 100% |
| Left Hand | 100% | 100% | 95% |
| Right Hand | 100% | 100% | 95% |
| Both Hands | 100% | 100% | 100% |
We tried out different concurrency structures (multithreading vs. multiprocessing) to address the latency challenge we faced.


The chart shows that multithreading achieves lower latency than multiprocessing in all cases.

- **Expand to More Gestures**: Scale to larger gesture vocabularies (20+ classes)
- **Voice Recognition Integration**: Incorporate on-device voice-triggered commands
- **Migration to Smaller Hardware**: Deploy on MCUs or ultra-low-cost embedded platforms
- **Additional Corruptions**: Test robustness to motion blur, depth noise, partial occlusions
- **Model Compression**: Apply quantization (INT8) and pruning for faster edge inference
- **Online Adaptation**: Enable the controller to adapt during deployment without retraining
- **Multi-Task Learning**: Extend to simultaneous gesture recognition and pose estimation
We successfully implemented an Adaptive Multimodal Deep Network for RGB-D gesture recognition that:
The key insight is that quality-aware dynamic allocation can match fixed-allocation performance while significantly reducing computation, enabling efficient edge deployment for multimodal systems.
[Bengio13] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," arXiv preprint arXiv:1308.3432, 2013.
[Panda21] R. Panda, C. Chen, Q. Fan et al., "AdaMML: Adaptive multi-modal learning for efficient video recognition," arXiv:2105.05165, 2021.
[Dosovitskiy21] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," ICLR, 2021.
[Fan21] A. Fan et al., "Reducing transformer depth on demand with structured dropout," ICLR, 2020.
[He22] K. He et al., "Masked autoencoders are scalable vision learners," IEEE CVPR, 2022.
[Hospedales21] T. Hospedales et al., "Meta-learning in neural networks: A survey," IEEE TPAMI, 2021.
[Jang17] E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-Softmax," ICLR, 2017.
[Li18] G. Li et al., "Cross-modal attentional context learning for RGB-D object detection," arXiv:1810.12829, 2018.
[Shazeer17] N. Shazeer et al., "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," ICLR, 2017.
[Teerapittayanon16] S. Teerapittayanon et al., "BranchyNet: Fast inference via early exiting from deep neural networks," ICPR, 2016.
[Li16] Y. Li et al., "Large-scale gesture recognition with a fusion of RGB-D data based on the C3D model," ICPR, 2016.
[Meng21] Y. Meng et al., "AdaFuse: Adaptive temporal fusion network for efficient action recognition," arXiv:2102.05775, 2021.
[Maddison17] C. Maddison et al., "The concrete distribution: A continuous relaxation of discrete random variables," arXiv:1611.00712, 2017.
[Wu25] J. Wu et al., "A layer-wise adaptive multimodal network for dynamic input noise and compute resources," arXiv:2502.07862, 2025.
```
data/
├── clean/
│   ├── standing/
│   │   ├── color_image_0.png
│   │   └── depth_image_0.png
│   ├── left_hand/
│   ├── right_hand/
│   └── both_hands/
├── depth_occluded/
│   └── [same structure]
└── low_light/
    └── [same structure]
```
Full dataset (`data_new/`): Google Drive folder containing `clean`, `depth_occluded`, and `low_light`. Place them under `data_new/` to match the training commands below.

| Library | Version | Purpose |
|---|---|---|
| PyTorch | 2.0+ | Deep learning framework |
| timm | 0.9.x | Vision Transformer models |
| OpenCV | 4.x | Image I/O and processing |
| NumPy | 1.24+ | Numerical operations |
| Matplotlib | 3.x | Visualization |
| TensorBoard | 2.x | Training logging |
| pyrealsense2 | 2.50 | RealSense camera interface |
| Module | Description |
|---|---|
| `software/models/gesture_classifier.py` | Stage 1 baseline RGB-D classifier |
| `software/models/adaptive_controller.py` | Stage 2 ADMN controller |
| `software/GTDM_Lowlight/models/timm_vit.py` | ViT backbone |
| `software/GTDM_Lowlight/models/vit_dev.py` | Custom ViT with layer selection |
| `data/gesture_dataset.py` | PyTorch Dataset class |
| `data/common_loaders.py` | Data loading utilities |
| `software/scripts/train_stage1.py` | Stage 1 training script |
| `software/scripts/train_stage2.py` | Stage 2 training script |
| `software/scripts/inference_stage1.py` | Stage 1 inference script |
| `software/scripts/inference_stage2.py` | Stage 2 inference script |
| `software/utils/visualize_baselines.py` | Results visualization |
```bash
# Clone and setup
git clone https://github.com/Alanhsiu/ECM202A_2025Fall_Project_14.git
cd ECM202A_2025Fall_Project_14
pip install -r requirements.txt

# Train Stage 1
python software/scripts/train_stage1.py --data_dir data --output_dir checkpoints/stage1

# Train Stage 2
python software/scripts/train_stage2.py \
    --stage1_checkpoint checkpoints/stage1/best_model.pth \
    --total_layers 12 \
    --output_dir checkpoints/stage2

# Run Stage 1 inference
python software/scripts/inference_stage1.py \
    --checkpoint checkpoints/stage1/best_model.pth

# Run Stage 2 inference
python software/scripts/inference_stage2.py \
    --checkpoint checkpoints/stage2/best_controller_12layers.pth
```
1) Download the full dataset from the Google Drive link above and place it under data_new/ (keep clean/depth_occluded/low_light subfolders).
2) Run training/evaluation:
- `bash software/run.sh` (Stage 1 → Stage 2).
- `bash software/run_baselines.sh` (dynamic/naive/reduced budgets).
- Outputs are written to `checkpoints/`, `logs/`, and `results/baselines/`, matching the reported results.

For questions or collaborations: