Handshape Sequence Classifier (MediaPipe + PyTorch, macOS MPS-ready)

Live ASL handshape letter demo powered by MediaPipe Hands landmarks and a bidirectional GRU sequence model. Record short clips per letter, resample to a fixed length, train, evaluate, and run a real-time webcam demo that can react to detected letter sequences (e.g., W → E → B opens a URL).

Features

  • Data capture UI: 3-second centered countdown + top progress bar; fingertip dot feedback.
  • Robust normalization: wrist-anchored, left/right mirroring, rotation to +Y, scale by max pairwise distance.
  • Fixed-length preprocessing: linear resampling to N frames (default 32).
  • Sequence model: BiGRU (128 hidden × 2) → MLP head; light augmentation during training.
  • Live inference: EMA smoothing + thresholding; emits letters only on change; detects special sequences (WEB) and opens a browser.

Quick Start

# 0) (optional) Create & activate a virtual env
python -m venv .venv && source .venv/bin/activate

# 1) Install deps
pip install numpy opencv-python mediapipe torch scikit-learn

# 2) Make directories for the letters you'll collect
./make_seq_dirs.sh A B J Z

# 3) Capture short clips per letter (train/val)
python capture_sequence.py --label A --split train
python capture_sequence.py --label A --split val
# ...repeat for B, J, Z

# 4) Preprocess → fixed-length dataset (32 frames)
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32

# 5) Train the BiGRU
python train_seq.py --landmarks landmarks_seq32 --epochs 40 --batch 64 --lr 1e-3 \
  --out asl_seq32_gru_ABJZ.pt

# 6) Evaluate on the validation set (confusion matrix + report)
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_ABJZ.pt

# 7) Live webcam demo (press 'q' to quit)
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.8 --smooth 0.7

WEB trigger: In the live demo, if the emitted letters form W → E → B, the app prints a message and opens --url (default: Google). Example: --url https://www.gallaudet.edu


Repository Layout

handshapes-multiclass/
├─ make_seq_dirs.sh                 # creates sequences/train|val/<LETTER>/
├─ capture_sequence.py              # webcam capture → clip_XXX.npz (X: (T,63), tip: (T,2))
├─ prep_sequence_resampled.py       # resample clips to fixed N frames → landmarks_seq32/
├─ train_seq.py                     # train BiGRU; saves best checkpoint (.pt + stats)
├─ eval_val.py                      # evaluate on val set; prints metrics
├─ infer_seq_webcam.py              # live demo; emits letters; detects "WEB" → opens URL
├─ what_to_do.txt                   # quick, step-by-step playbook
└─ sequences/                       # created by you (after running make_seq_dirs.sh)
   ├─ train/<LETTER>/clip_XXX.npz
   └─ val/<LETTER>/clip_XXX.npz

Clip file format (clip_XXX.npz)

  • X: (T, 63) — per-frame normalized landmarks (21 points × (x, y, z))
  • tip: (T, 2) — normalized index fingertip positions (for sanity checks)
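
A quick way to sanity-check a captured clip is to load it with NumPy; a minimal sketch assuming the keys above (the path shown is an illustrative example):

import numpy as np

clip = np.load("sequences/train/A/clip_000.npz")   # hypothetical example path
X, tip = clip["X"], clip["tip"]
print(X.shape, tip.shape)   # (T, 63) landmarks and (T, 2) fingertip track; T varies per clip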

Prepared dataset (landmarks_seq32/)

  • train_X.npy, train_y.npy, val_X.npy, val_y.npy
  • class_names.json (e.g., ["A","B","J","Z"])
  • meta.json (e.g., {"frames":32,"input_dim":63})
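
These arrays come from linearly resampling each variable-length clip to the fixed frame count recorded in meta.json. A minimal sketch of that interpolation (illustrative, not the exact prep_sequence_resampled.py code):

import numpy as np

def resample_clip(X: np.ndarray, n_frames: int = 32) -> np.ndarray:
    """Linearly resample a (T, 63) clip to (n_frames, 63)."""
    T = X.shape[0]
    src = np.linspace(0.0, 1.0, T)          # original time axis
    dst = np.linspace(0.0, 1.0, n_frames)   # target time axis
    # Interpolate each of the 63 features independently over time
    return np.stack([np.interp(dst, src, X[:, f]) for f in range(X.shape[1])], axis=1)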

Checkpoint (*.pt)

  • model (state_dict), classes, frames, X_mean, X_std
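
A hedged sketch of inspecting a checkpoint, assuming the keys listed above:

import torch

ckpt = torch.load("asl_seq32_gru_ABJZ.pt", map_location="cpu")
print(ckpt["classes"], ckpt["frames"])          # e.g. ["A", "B", "J", "Z"] and 32
state_dict = ckpt["model"]                      # weights to load into the BiGRU
X_mean, X_std = ckpt["X_mean"], ckpt["X_std"]   # train-set statistics reused at inference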

Normalization (consistent across capture & inference)

  1. Translate so wrist (landmark 0) is at the origin.
  2. If a left hand is detected, mirror it by negating x (x *= -1).
  3. Rotate so the middle-finger MCP (landmark 9) points along +Y.
  4. Scale all coords by the max pairwise distance among 2D landmarks.
  5. Flatten to 63 features per frame.

This ensures the handshape itself, not camera pose or hand position, drives classification.
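
A minimal sketch of these five steps on a single frame of 21 MediaPipe (x, y, z) landmarks (illustrative only; the capture and inference scripts are the reference implementation):

import numpy as np

def normalize_frame(pts: np.ndarray, is_left: bool) -> np.ndarray:
    """pts: (21, 3) MediaPipe landmarks -> (63,) normalized feature vector."""
    pts = np.asarray(pts, dtype=np.float64)
    pts = pts - pts[0]                      # 1) wrist (landmark 0) to origin
    if is_left:
        pts[:, 0] *= -1.0                   # 2) mirror left hands to a canonical orientation
    v = pts[9, :2]                          # 3) rotate middle-finger MCP (landmark 9) onto +Y
    ang = np.arctan2(v[0], v[1])            #    angle measured from the +Y axis
    c, s = np.cos(ang), np.sin(ang)
    R = np.array([[c, -s], [s, c]])
    pts[:, :2] = pts[:, :2] @ R.T
    # 4) scale by the max pairwise 2D distance
    d = np.linalg.norm(pts[None, :, :2] - pts[:, None, :2], axis=-1)
    pts = pts / (d.max() + 1e-8)
    return pts.reshape(-1)                  # 5) flatten to 63 features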


Training Details

  • Model: BiGRU (input=63, hidden=128, bidirectional) → [Linear(256→128), ReLU, Dropout(0.2), Linear(128→num_classes)] (sketched after this list)
  • Optimizer: AdamW (lr=1e-3, weight_decay=1e-4)
  • Scheduler: CosineAnnealingLR (T_max = epochs)
  • Augmentation: small 2D rotate (±7°), scale (±10%), Gaussian noise (σ=0.01)
  • Normalization: global X_mean/X_std computed over train (time+batch), applied to both train & val and saved into the checkpoint.
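
A hedged PyTorch sketch of that architecture (names and minor details may differ from train_seq.py):

import torch
import torch.nn as nn

class SeqClassifier(nn.Module):
    def __init__(self, input_dim: int = 63, hidden: int = 128, num_classes: int = 4):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 128),     # 256 -> 128
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, 63) -> logits: (batch, num_classes)
        _, h = self.gru(x)                  # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=1)  # concat forward/backward final states -> (batch, 256)
        return self.head(h)

Per the list above, training pairs a model like this with AdamW (lr=1e-3, weight_decay=1e-4) and a CosineAnnealingLR schedule.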

Live Inference Behavior

  • Maintains a rolling buffer of T = frames (from the checkpoint).
  • Applies the saved X_mean/X_std.
  • EMA smoothing over softmax probs with time constant --smooth (seconds); see the sketch after this list.
  • Emits a letter only if:
    • top prob ≥ --threshold (e.g., 0.8), and
    • the letter changed from the previous emission (prevents repeats).
  • Tracks a short history of emitted letters to detect W → E → B; on match:
    • prints “Detected WEB! …”
    • calls webbrowser.open(--url)
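
The smoothing-and-emission logic boils down to an exponential moving average over per-frame probabilities plus a change-gated threshold. A simplified, self-contained sketch (function and argument names are illustrative; the time-constant EMA shown is one common formulation, not necessarily the demo's exact code):

import webbrowser
import numpy as np

def run_emission_loop(probs_stream, classes, threshold=0.8, smooth=0.7, url="https://www.google.com"):
    """probs_stream yields (timestamp, softmax_probs) pairs; a simplified stand-in for the demo loop."""
    ema, last_t, last_emit, history = None, None, None, []
    for t, p in probs_stream:
        # Time-constant EMA: more weight on the new frame when more time has passed
        if ema is None:
            ema, last_t = p, t
        else:
            alpha = 1.0 - np.exp(-(t - last_t) / smooth)
            ema, last_t = alpha * p + (1.0 - alpha) * ema, t
        k = int(np.argmax(ema))
        letter, conf = classes[k], float(ema[k])
        # Emit only on confident *changes*, never repeats
        if conf >= threshold and letter != last_emit:
            last_emit = letter
            history = (history + [letter])[-3:]
            print("emit:", letter)
            if history == ["W", "E", "B"]:
                print("Detected WEB!")
                webbrowser.open(url)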

Common flags

# Camera & size
--camera 0 --width 640 --height 480

# Confidence vs. latency tradeoffs
--threshold 0.85           # higher → fewer false positives
--smooth 1.0               # higher → steadier output but more lag

# Action on sequence
--url https://example.com
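
If you script around the demo, these flags map to a standard argparse setup; a sketch using the defaults shown in this README (the real defaults in infer_seq_webcam.py may differ):

import argparse

parser = argparse.ArgumentParser(description="Live handshape sequence demo")
parser.add_argument("--model", required=True, help="checkpoint .pt from train_seq.py")
parser.add_argument("--camera", type=int, default=0)
parser.add_argument("--width", type=int, default=640)
parser.add_argument("--height", type=int, default=480)
parser.add_argument("--threshold", type=float, default=0.8, help="min top prob to emit a letter")
parser.add_argument("--smooth", type=float, default=0.7, help="EMA time constant in seconds")
parser.add_argument("--url", default="https://www.google.com", help="opened when W, E, B is detected")
args = parser.parse_args()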

Tips for High Accuracy

  • Record balanced train/val counts per class (e.g., 100 train / 20 val).
  • Keep the hand centered, well lit, and alone in the frame (the model expects a single hand).
  • Maintain consistent distance and orientation during capture.
  • If you add new letters later, just record them, re-run preprocessing, and retrain — classes are auto-discovered from sequences/train/*.

macOS (M-series) Notes

  • PyTorch automatically uses Metal (MPS) when available (torch.backends.mps.is_available()); otherwise it falls back to CPU. See the snippet after this list.
  • If the webcam feed looks low FPS, try reducing --width/--height or raising --threshold / --smooth.
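
A minimal device-selection snippet matching that behavior:

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print("using device:", device)   # the model and input tensors are then moved with .to(device)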

Troubleshooting

  • “Could not open camera” → try --camera 1 (or check macOS camera permission).
  • No detections / “No hand” on screen → improve lighting, ensure a single clear hand, check MediaPipe install.
  • Model emits wrong letters → increase --threshold, collect more data, or raise --smooth.
  • Frame-count (T) mismatch during inference → ensure --frames at preprocessing matches the checkpoint's frames (saved & auto-used).

Commands Reference

Create class folders

./make_seq_dirs.sh A B J Z

Capture clips

python capture_sequence.py --label A --split train --seconds 0.8 --count 100
python capture_sequence.py --label A --split val   --seconds 0.8 --count 20

Prepare dataset (resample to 32 frames)

python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32

Train

python train_seq.py --landmarks landmarks_seq32 --epochs 40 --batch 64 --lr 1e-3 \
  --out asl_seq32_gru_ABJZ.pt

Evaluate

python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_ABJZ.pt

Live demo (open URL on “WEB”)

python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.8 --smooth 0.7 \
  --url https://www.gallaudet.edu

License

MIT


Acknowledgments

  • MediaPipe Hands for robust, fast hand landmark detection.
  • PyTorch for flexible sequence modeling on CPU/MPS.