Initial commit: handshapes multiclass project

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 22:27:20 -05:00
commit 816e34cb17
22 changed files with 2820 additions and 0 deletions

# Handshape Sequence Classifier (MediaPipe + PyTorch, macOS MPS-ready)
Live ASL handshape letter demo powered by MediaPipe Hands landmarks and a bidirectional GRU sequence model.
Record short clips per letter, resample to a fixed length, train, evaluate, and run a real-time webcam demo that can react to detected letter sequences (e.g., **W → E → B** opens a URL).
## Features
* **Data capture UI:** 3-second centered countdown + top progress bar; fingertip dot feedback.
* **Robust normalization:** wrist-anchored, left/right mirroring, rotation to +Y, scale by max pairwise distance.
* **Fixed-length preprocessing:** linear resampling to *N* frames (default **32**).
* **Sequence model:** BiGRU (128 hidden × 2) → MLP head; light augmentation during training.
* **Live inference:** EMA smoothing + thresholding; emits letters only on change; detects special sequences (**WEB**) and opens a browser.
---
## Quick Start
```bash
# 0) (optional) Create & activate a virtual env
python -m venv .venv && source .venv/bin/activate
# 1) Install deps
pip install numpy opencv-python mediapipe torch scikit-learn
# 2) Make directories for the letters you'll collect
./make_seq_dirs.sh A B J Z
# 3) Capture short clips per letter (train/val)
python capture_sequence.py --label A --split train
python capture_sequence.py --label A --split val
# ...repeat for B, J, Z
# 4) Preprocess → fixed-length dataset (32 frames)
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32
# 5) Train the BiGRU
python train_seq.py --landmarks landmarks_seq32 --epochs 40 --batch 64 --lr 1e-3 \
--out asl_seq32_gru_ABJZ.pt
# 6) Evaluate on the validation set (confusion matrix + report)
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_ABJZ.pt
# 7) Live webcam demo (press 'q' to quit)
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.8 --smooth 0.7
```
> **WEB trigger:** In the live demo, if the emitted letters form **W → E → B**, the app prints a message and opens `--url` (default: Google).
> Example: `--url https://www.gallaudet.edu`
---
## Repository Layout
```
handshapes-multiclass/
├─ make_seq_dirs.sh # creates sequences/train|val/<LETTER>/
├─ capture_sequence.py # webcam capture → clip_XXX.npz (X: (T,63), tip: (T,2))
├─ prep_sequence_resampled.py # resample clips to fixed N frames → landmarks_seq32/
├─ train_seq.py # train BiGRU; saves best checkpoint (.pt + stats)
├─ eval_val.py # evaluate on val set; prints metrics
├─ infer_seq_webcam.py # live demo; emits letters; detects "WEB" → opens URL
├─ what_to_do.txt # quick, step-by-step playbook
└─ sequences/ # created by you (after running make_seq_dirs.sh)
├─ train/<LETTER>/clip_XXX.npz
└─ val/<LETTER>/clip_XXX.npz
```
**Clip file format (`clip_XXX.npz`)**
* `X`: `(T, 63)` — per-frame normalized landmarks (21 points × (x, y, z))
* `tip`: `(T, 2)` — normalized index fingertip positions (for sanity checks)
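A captured clip can be inspected directly with NumPy (a minimal sketch; the path is illustrative):
```python
import numpy as np

# Load one captured clip (the exact filename is illustrative).
clip = np.load("sequences/train/A/clip_000.npz")
X, tip = clip["X"], clip["tip"]

print(X.shape)    # (T, 63) — T varies per clip until resampling
print(tip.shape)  # (T, 2)  — index fingertip track for sanity checks
```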
**Prepared dataset (`landmarks_seq32/`)**
* `train_X.npy`, `train_y.npy`, `val_X.npy`, `val_y.npy`
* `class_names.json` (e.g., `["A","B","J","Z"]`)
* `meta.json` (e.g., `{"frames":32,"input_dim":63}`)
**Checkpoint (`*.pt`)**
* `model` (state_dict), `classes`, `frames`, `X_mean`, `X_std`
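A checkpoint in this layout can be inspected like so (a sketch, assuming exactly the keys listed above):
```python
import torch

ckpt = torch.load("asl_seq32_gru_ABJZ.pt", map_location="cpu")
print(ckpt["classes"])  # e.g., ["A", "B", "J", "Z"]
print(ckpt["frames"])   # fixed sequence length, e.g., 32
# ckpt["model"] holds the state_dict; ckpt["X_mean"] / ckpt["X_std"] are the
# training normalization stats that must also be applied at inference time.
```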
---
## Normalization (consistent across capture & inference)
1. Translate so **wrist** (landmark 0) is at the origin.
2. If detected **left** hand, mirror `x *= -1`.
3. Rotate so the **middle-finger MCP** (landmark 9) points along **+Y**.
4. Scale all coords by the **max pairwise distance** among 2D landmarks.
5. Flatten to **63 features** per frame.
This ensures that the letter's handshape, not camera pose, drives classification.
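A minimal NumPy sketch of these five steps (the function name and exact conventions are illustrative, not the project's code):
```python
import numpy as np

def normalize_frame(pts, is_left_hand=False):
    """pts: (21, 3) MediaPipe landmarks for one frame -> (63,) feature vector."""
    pts = pts - pts[0]                    # 1) wrist (landmark 0) to the origin
    if is_left_hand:
        pts[:, 0] *= -1.0                 # 2) mirror left hands to one canonical side
    x, y = pts[9, 0], pts[9, 1]           # 3) rotate middle-finger MCP onto +Y
    ang = np.arctan2(x, y)                #    signed angle from the +Y axis
    c, s = np.cos(ang), np.sin(ang)
    R = np.array([[c, -s], [s, c]])
    pts[:, :2] = pts[:, :2] @ R.T
    d = np.linalg.norm(pts[None, :, :2] - pts[:, None, :2], axis=-1)
    pts = pts / max(d.max(), 1e-6)        # 4) scale by max pairwise 2D distance
    return pts.reshape(-1)                # 5) flatten to 63 features
```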
---
## Training Details
* **Model:** BiGRU (input=63, hidden=128, bidirectional) → `[Linear(256→128), ReLU, Dropout(0.2), Linear(128→num_classes)]`
* **Optimizer:** AdamW (`lr=1e-3`, `weight_decay=1e-4`)
* **Scheduler:** CosineAnnealingLR (`T_max = epochs`)
* **Augmentation:** small 2D rotate (±7°), scale (±10%), Gaussian noise (σ=0.01)
* **Normalization:** global `X_mean`/`X_std` computed over **train** (time+batch), applied to both train & val and saved into the checkpoint.
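A PyTorch sketch of that architecture and training setup (the class name, and the choice to pool by concatenating the two final hidden states, are assumptions; see `train_seq.py` for the real code):
```python
import torch
import torch.nn as nn

class SeqGRUClassifier(nn.Module):
    def __init__(self, input_dim=63, hidden=128, num_classes=4):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):                    # x: (B, T, 63)
        _, h = self.gru(x)                   # h: (2, B, hidden), one state per direction
        return self.head(torch.cat([h[0], h[1]], dim=1))  # (B, 256) -> logits

model = SeqGRUClassifier()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=40)  # T_max = epochs
```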
---
## Live Inference Behavior
* Maintains a rolling buffer of **T = frames** (from the checkpoint).
* Applies the saved `X_mean`/`X_std`.
* **EMA smoothing** over softmax probs with time constant `--smooth` (seconds).
* Emits a letter only if:
* top prob ≥ `--threshold` (e.g., 0.8), **and**
* the letter **changed** from the previous emission (prevents repeats).
* Tracks a short history of emitted letters to detect **W → E → B**; on match:
* prints “Detected WEB! …”
* calls `webbrowser.open(--url)`
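Put together, the emission logic amounts to roughly the following (a sketch; the EMA formula, the `step()` signature, and the variable names are assumptions, not the actual code in `infer_seq_webcam.py`):
```python
import math
import webbrowser

import numpy as np

TARGET = ["W", "E", "B"]   # special sequence that triggers the browser
ema, last_emit, history = None, None, []

def step(probs, classes, dt, smooth=0.7, threshold=0.8, url="https://www.google.com"):
    """probs: softmax over classes for the current window; dt: seconds since last update."""
    global ema, last_emit, history
    alpha = 1.0 - math.exp(-dt / max(smooth, 1e-6))      # EMA weight from time constant
    ema = probs if ema is None else (1.0 - alpha) * ema + alpha * probs
    i = int(np.argmax(ema))
    if ema[i] >= threshold and classes[i] != last_emit:  # confident AND changed
        last_emit = classes[i]
        history = (history + [last_emit])[-len(TARGET):]
        if history == TARGET:
            print("Detected WEB! Opening URL...")
            webbrowser.open(url)
```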
**Common flags**
```bash
# Camera & size
--camera 0 --width 640 --height 480
# Confidence vs. latency tradeoffs
--threshold 0.85 # higher → fewer false positives
--smooth 1.0 # higher → steadier output but more lag
# Action on sequence
--url https://example.com
```
---
## Tips for High Accuracy
* Record **balanced** train/val counts per class (e.g., 100 train / 20 val).
* Keep the hand **centered**, well lit, and mostly **single-hand** (model expects 1 hand).
* Maintain consistent **distance** and **orientation** during capture.
* If you add new letters later, just record them, re-run preprocessing, and retrain — classes are **auto-discovered** from `sequences/train/*`.
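Class auto-discovery is just the sorted folder names under `sequences/train/` (a sketch of the idea, not the exact code):
```python
from pathlib import Path

# Any letter folder under sequences/train/ becomes a class.
class_names = sorted(p.name for p in Path("sequences/train").iterdir() if p.is_dir())
print(class_names)  # e.g., ['A', 'B', 'J', 'Z']
```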
---
## macOS (M-series) Notes
* PyTorch will automatically use **Metal (MPS)** if available (`torch.backends.mps.is_available()`); otherwise CPU.
* If the webcam feed runs at a low frame rate, try reducing `--width`/`--height`, or raise `--threshold` / `--smooth` to keep the output steady.
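Device selection follows the usual PyTorch pattern (sketch):
```python
import torch

# Prefer Apple's Metal backend when available; otherwise fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(device)
# Then move the model and each batch with .to(device) before training/inference.
```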
---
## Troubleshooting
* **“Could not open camera”** → try `--camera 1` (or check macOS camera permission).
* **No detections / “No hand” on screen** → improve lighting, ensure a single clear hand, check MediaPipe install.
* **Model emits wrong letters** → increase `--threshold`, collect more data, or raise `--smooth`.
* **Mismatched T during inference** → ensure `--frames` at preprocessing matches the checkpoint's `frames` (saved in the checkpoint and auto-used at inference).
---
## Commands Reference
### Create class folders
```bash
./make_seq_dirs.sh A B J Z
```
### Capture clips
```bash
python capture_sequence.py --label A --split train --seconds 0.8 --count 100
python capture_sequence.py --label A --split val --seconds 0.8 --count 20
```
### Prepare dataset (resample to 32 frames)
```bash
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32
```
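Under the hood this is plain linear interpolation along the time axis, per feature (a minimal sketch; the function name is illustrative):
```python
import numpy as np

def resample_clip(X, frames=32):
    """Linearly resample a (T, 63) clip to a fixed (frames, 63) array."""
    T, D = X.shape
    src = np.linspace(0.0, 1.0, T)       # original frame positions
    dst = np.linspace(0.0, 1.0, frames)  # target frame positions
    out = np.empty((frames, D), dtype=X.dtype)
    for d in range(D):
        out[:, d] = np.interp(dst, src, X[:, d])
    return out
```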
### Train
```bash
python train_seq.py --landmarks landmarks_seq32 --epochs 40 --batch 64 --lr 1e-3 \
--out asl_seq32_gru_ABJZ.pt
```
### Evaluate
```bash
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_ABJZ.pt
```
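The confusion matrix and per-class report can be produced with scikit-learn (a sketch of the kind of output `eval_val.py` prints; the toy label arrays are placeholders):
```python
from sklearn.metrics import classification_report, confusion_matrix

# y_true / y_pred: integer class indices over the validation set (placeholders here);
# class_names would come from landmarks_seq32/class_names.json.
y_true = [0, 1, 2, 3, 0, 1]
y_pred = [0, 1, 2, 2, 0, 1]
class_names = ["A", "B", "J", "Z"]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=class_names))
```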
### Live demo (open URL on “WEB”)
```bash
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.8 --smooth 0.7 \
--url https://www.gallaudet.edu
```
---
## License
MIT
---
## Acknowledgments
* **MediaPipe Hands** for robust, fast hand landmark detection.
* **PyTorch** for flexible sequence modeling on CPU/MPS.
---