# Handshape Sequence Classifier (MediaPipe + PyTorch, macOS MPS-ready)
Live ASL handshape letter demo powered by MediaPipe Hands landmarks and a bidirectional GRU sequence model.
Record short clips per letter, resample to a fixed length, train, evaluate, and run a real-time webcam demo that can react to detected letter sequences (e.g., **W → E → B** opens a URL).
## Features
* **Data capture UI:** 3-second centered countdown + top progress bar; fingertip dot feedback.
* **Robust normalization:** wrist-anchored, left/right mirroring, rotation to +Y, scale by max pairwise distance.
* **Fixed-length preprocessing:** linear resampling to *N* frames (default **32**).
* **Sequence model:** BiGRU (128 hidden × 2) → MLP head; light augmentation during training.
* **Live inference:** EMA smoothing + thresholding; emits letters only on change; detects special sequences (**WEB**) and opens a browser.
---
## Quick Start
```bash
# 0) (optional) Create & activate a virtual env
python -m venv .venv && source .venv/bin/activate
# 1) Install deps
pip install numpy opencv-python mediapipe torch scikit-learn
# 2) Make directories for the letters you'll collect
./make_seq_dirs.sh A B J Z
# 3) Capture short clips per letter (train/val)
python capture_sequence.py --label A --split train
python capture_sequence.py --label A --split val
# ...repeat for B, J, Z
# 4) Preprocess → fixed-length dataset (32 frames)
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32
# 5) Train the BiGRU
python train_seq.py --landmarks landmarks_seq32 --epochs 40 --batch 64 --lr 1e-3 \
--out asl_seq32_gru_ABJZ.pt
# 6) Evaluate on the validation set (confusion matrix + report)
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_ABJZ.pt
# 7) Live webcam demo (press 'q' to quit)
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.8 --smooth 0.7
```
> **WEB trigger:** In the live demo, if the emitted letters form **W → E → B**, the app prints a message and opens `--url` (default: Google).
> Example: `--url https://www.gallaudet.edu`
---
## Repository Layout
```
handshapes-multiclass/
├─ make_seq_dirs.sh # creates sequences/train|val/<LETTER>/
├─ capture_sequence.py # webcam capture → clip_XXX.npz (X: (T,63), tip: (T,2))
├─ prep_sequence_resampled.py # resample clips to fixed N frames → landmarks_seq32/
├─ train_seq.py # train BiGRU; saves best checkpoint (.pt + stats)
├─ eval_val.py # evaluate on val set; prints metrics
├─ infer_seq_webcam.py # live demo; emits letters; detects "WEB" → opens URL
├─ what_to_do.txt # quick, step-by-step playbook
└─ sequences/ # created by you (after running make_seq_dirs.sh)
├─ train/<LETTER>/clip_XXX.npz
└─ val/<LETTER>/clip_XXX.npz
```
**Clip file format (`clip_XXX.npz`)**
* `X`: `(T, 63)` — per-frame normalized landmarks (21 points × (x, y, z))
* `tip`: `(T, 2)` — normalized index fingertip positions (for sanity checks)
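A quick way to sanity-check a recorded clip is to load it with NumPy (a minimal sketch; the file name below is illustrative):
```python
import numpy as np

# Inspect one captured clip (the file name is illustrative)
clip = np.load("sequences/train/A/clip_000.npz")
X, tip = clip["X"], clip["tip"]
print(X.shape)    # (T, 63): per-frame normalized landmarks; T varies before resampling
print(tip.shape)  # (T, 2): normalized index fingertip positions
```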
**Prepared dataset (`landmarks_seq32/`)**
* `train_X.npy`, `train_y.npy`, `val_X.npy`, `val_y.npy`
* `class_names.json` (e.g., `["A","B","J","Z"]`)
* `meta.json` (e.g., `{"frames":32,"input_dim":63}`)
**Checkpoint (`*.pt`)**
* `model` (state_dict), `classes`, `frames`, `X_mean`, `X_std`
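The checkpoint contents can be inspected with `torch.load` (a minimal sketch using the keys listed above):
```python
import torch

# Load the checkpoint on CPU and inspect the documented keys
# (weights_only=False in case the checkpoint stores non-tensor stats)
ckpt = torch.load("asl_seq32_gru_ABJZ.pt", map_location="cpu", weights_only=False)
print(ckpt["classes"])      # e.g. ["A", "B", "J", "Z"]
print(ckpt["frames"])       # fixed sequence length used at training time (e.g. 32)
print(type(ckpt["model"]))  # state_dict for the BiGRU
# ckpt["X_mean"], ckpt["X_std"]: global normalization stats applied at inference
```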
---
## Normalization (consistent across capture & inference)
1. Translate so **wrist** (landmark 0) is at the origin.
2. If detected **left** hand, mirror `x *= -1`.
3. Rotate so the **middle-finger MCP** (landmark 9) points along **+Y**.
4. Scale all coords by the **max pairwise distance** among 2D landmarks.
5. Flatten to **63 features** per frame.
This ensures the letter's handshape, not camera pose or handedness, drives classification.
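A minimal per-frame sketch of these steps (the function name and handedness flag are illustrative; the actual implementation lives in the capture and inference scripts):
```python
import numpy as np

def normalize_frame(pts, is_left):
    """pts: (21, 3) MediaPipe landmarks; returns a flat (63,) feature vector."""
    pts = pts - pts[0]                      # 1) wrist (landmark 0) to origin
    if is_left:                             # 2) mirror left hands
        pts[:, 0] *= -1.0
    v = pts[9, :2]                          # 3) rotate middle-finger MCP onto +Y
    angle = np.arctan2(v[0], v[1])          #    angle of v measured from the +Y axis
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    pts[:, :2] = pts[:, :2] @ R.T
    d = np.linalg.norm(pts[None, :, :2] - pts[:, None, :2], axis=-1)
    pts = pts / (d.max() + 1e-8)            # 4) scale by max pairwise 2D distance
    return pts.reshape(-1)                  # 5) flatten to 63 features
```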
---
## Training Details
* **Model:** BiGRU (input=63, hidden=128, bidirectional) → `[Linear(256→128), ReLU, Dropout(0.2), Linear(128→num_classes)]`
* **Optimizer:** AdamW (`lr=1e-3`, `weight_decay=1e-4`)
* **Scheduler:** CosineAnnealingLR (`T_max = epochs`)
* **Augmentation:** small 2D rotate (±7°), scale (±10%), Gaussian noise (σ=0.01)
* **Normalization:** global `X_mean`/`X_std` computed over **train** (time+batch), applied to both train & val and saved into the checkpoint.
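A minimal PyTorch sketch matching that description (the class name and the use of the last time step for classification are assumptions, not necessarily how `train_seq.py` defines it):
```python
import torch
import torch.nn as nn

class SeqGRU(nn.Module):  # class name is illustrative
    def __init__(self, input_dim=63, hidden=128, num_classes=4):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):              # x: (batch, T, 63)
        out, _ = self.gru(x)           # (batch, T, 2*hidden)
        return self.head(out[:, -1])   # classify from the last time step (assumption)

model = SeqGRU()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=40)
```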
---
## Live Inference Behavior
* Maintains a rolling buffer of **T = frames** (from the checkpoint).
* Applies the saved `X_mean`/`X_std`.
* **EMA smoothing** over softmax probs with time constant `--smooth` (seconds).
* Emits a letter only if:
* top prob ≥ `--threshold` (e.g., 0.8), **and**
* the letter **changed** from the previous emission (prevents repeats).
* Tracks a short history of emitted letters to detect **W → E → B**; on match:
* prints “Detected WEB! …”
* calls `webbrowser.open(--url)`
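A minimal sketch of that emission logic (mapping `--smooth` in seconds to an EMA weight via `exp(-dt/tau)` is an assumption; `webbrowser` is from the standard library):
```python
import math
import time
import webbrowser

import numpy as np

THRESHOLD = 0.8                 # --threshold
TAU = 0.7                       # --smooth (seconds)
URL = "https://www.google.com"  # --url (default: Google)

_ema, _last_letter, _history = None, None, []
_last_time = time.time()

def on_probs(probs, classes):
    """Feed one frame of softmax probabilities; returns the emitted letter or None."""
    global _ema, _last_letter, _last_time
    now = time.time()
    dt, _last_time = now - _last_time, now
    alpha = math.exp(-dt / TAU)   # assumed mapping of the time constant to an EMA weight
    _ema = probs if _ema is None else alpha * _ema + (1.0 - alpha) * probs
    i = int(np.argmax(_ema))
    if _ema[i] >= THRESHOLD and classes[i] != _last_letter:
        _last_letter = classes[i]
        _history.append(_last_letter)
        if "".join(_history[-3:]) == "WEB":   # special sequence W -> E -> B
            print("Detected WEB!")            # message text is illustrative
            webbrowser.open(URL)
        return _last_letter
    return None
```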
**Common flags**
```bash
# Camera & size
--camera 0 --width 640 --height 480
# Confidence vs. latency tradeoffs
--threshold 0.85 # higher → fewer false positives
--smooth 1.0 # higher → steadier output but more lag
# Action on sequence
--url https://example.com
```
---
## Tips for High Accuracy
* Record **balanced** train/val counts per class (e.g., 100 train / 20 val).
* Keep the hand **centered**, well lit, and mostly **single-hand** (model expects 1 hand).
* Maintain consistent **distance** and **orientation** during capture.
* If you add new letters later, just record them, re-run preprocessing, and retrain — classes are **auto-discovered** from `sequences/train/*`.
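A minimal sketch of that auto-discovery, assuming classes are simply the sorted subdirectory names:
```python
from pathlib import Path

# Classes are assumed to be the sorted subdirectory names under sequences/train/
classes = sorted(p.name for p in Path("sequences/train").iterdir() if p.is_dir())
print(classes)  # e.g. ["A", "B", "J", "Z"]
```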
---
## macOS (M-series) Notes
* The scripts use **Metal (MPS)** when `torch.backends.mps.is_available()` returns `True`; otherwise they fall back to CPU.
* If the webcam feed looks low FPS, try reducing `--width/--height` or raising `--threshold` / `--smooth`.
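A minimal device-selection sketch along those lines:
```python
import torch

# Prefer Apple Metal (MPS) when available, otherwise fall back to CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")
x = torch.zeros(1, 32, 63, device=device)  # (batch, frames, features) tensor on that device
```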
---
## Troubleshooting
* **“Could not open camera”** → try `--camera 1` (or check macOS camera permission).
* **No detections / “No hand” on screen** → improve lighting, ensure a single clear hand, check MediaPipe install.
* **Model emits wrong letters** → increase `--threshold`, collect more data, or raise `--smooth`.
* **Sequence-length (`T`) mismatch during inference** → ensure the `--frames` value used at preprocessing matches the checkpoint's `frames` (saved in the checkpoint and applied automatically).
---
## Commands Reference
### Create class folders
```bash
./make_seq_dirs.sh A B J Z
```
### Capture clips
```bash
python capture_sequence.py --label A --split train --seconds 0.8 --count 100
python capture_sequence.py --label A --split val --seconds 0.8 --count 20
```
### Prepare dataset (resample to 32 frames)
```bash
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32
```
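Under the hood this maps each variable-length clip to a fixed `(frames, 63)` array; a minimal sketch, assuming per-coordinate linear interpolation:
```python
import numpy as np

def resample_clip(X, frames=32):
    """Linearly resample a (T, 63) clip to a fixed (frames, 63) array."""
    src = np.linspace(0.0, 1.0, X.shape[0])
    dst = np.linspace(0.0, 1.0, frames)
    return np.stack([np.interp(dst, src, X[:, j]) for j in range(X.shape[1])], axis=1)
```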
### Train
```bash
python train_seq.py --landmarks landmarks_seq32 --epochs 40 --batch 64 --lr 1e-3 \
--out asl_seq32_gru_ABJZ.pt
```
### Evaluate
```bash
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_ABJZ.pt
```
### Live demo (open URL on “WEB”)
```bash
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.8 --smooth 0.7 \
--url https://www.gallaudet.edu
```
---
## License
MIT
---
## Acknowledgments
* **MediaPipe Hands** for robust, fast hand landmark detection.
* **PyTorch** for flexible sequence modeling on CPU/MPS.
---