# Handshape Sequence Classifier (MediaPipe + PyTorch, macOS MPS-ready)

Live ASL handshape letter demo powered by MediaPipe Hands landmarks and a bidirectional GRU sequence model.
Record short clips per letter, resample to a fixed length, train, evaluate, and run a real-time webcam demo that can react to detected letter sequences (e.g., **W → E → B** opens a URL).

## Features

* **Data capture UI:** 3-second centered countdown + top progress bar; fingertip dot feedback.
* **Robust normalization:** wrist-anchored, left/right mirroring, rotation to +Y, scale by max pairwise distance.
* **Fixed-length preprocessing:** linear resampling to *N* frames (default **32**).
* **Sequence model:** BiGRU (128 hidden × 2) → MLP head; light augmentation during training.
* **Live inference:** EMA smoothing + thresholding; emits letters only on change; detects special sequences (**WEB**) and opens a browser.

---
## Quick Start

```bash
# 0) (optional) Create & activate a virtual env
python -m venv .venv && source .venv/bin/activate

# 1) Install deps
pip install numpy opencv-python mediapipe torch scikit-learn

# 2) Make directories for the letters you’ll collect
./make_seq_dirs.sh A B J Z

# 3) Capture short clips per letter (train/val)
python capture_sequence.py --label A --split train
python capture_sequence.py --label A --split val
# ...repeat for B, J, Z

# 4) Preprocess → fixed-length dataset (32 frames)
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32

# 5) Train the BiGRU
python train_seq.py --landmarks landmarks_seq32 --epochs 40 --batch 64 --lr 1e-3 \
  --out asl_seq32_gru_ABJZ.pt

# 6) Evaluate on the validation set (confusion matrix + report)
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_ABJZ.pt

# 7) Live webcam demo (press 'q' to quit)
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.8 --smooth 0.7
```

> **WEB trigger:** In the live demo, if the emitted letters form **W → E → B**, the app prints a message and opens `--url` (default: Google).
> Example: `--url https://www.gallaudet.edu`
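
The trigger itself is just short-sequence matching over the emitted letters. A minimal sketch of the idea (illustrative only; the real logic lives in `infer_seq_webcam.py` and may differ in detail):

```python
# Illustrative sketch of the WEB trigger: keep the last three emitted letters
# and open the URL when they spell W → E → B.
import webbrowser
from collections import deque

TARGET = ("W", "E", "B")
history = deque(maxlen=3)

def on_letter(letter: str, url: str = "https://www.google.com") -> None:
    history.append(letter)
    if tuple(history) == TARGET:
        print("Detected WEB! Opening", url)
        webbrowser.open(url)
```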
---

## Repository Layout

```
handshapes-multiclass/
├─ make_seq_dirs.sh            # creates sequences/train|val/<LETTER>/
├─ capture_sequence.py         # webcam capture → clip_XXX.npz (X: (T,63), tip: (T,2))
├─ prep_sequence_resampled.py  # resample clips to fixed N frames → landmarks_seq32/
├─ train_seq.py                # train BiGRU; saves best checkpoint (.pt + stats)
├─ eval_val.py                 # evaluate on val set; prints metrics
├─ infer_seq_webcam.py         # live demo; emits letters; detects "WEB" → opens URL
├─ what_to_do.txt              # quick, step-by-step playbook
└─ sequences/                  # created by you (after running make_seq_dirs.sh)
   ├─ train/<LETTER>/clip_XXX.npz
   └─ val/<LETTER>/clip_XXX.npz
```

**Clip file format (`clip_XXX.npz`)**

* `X`: `(T, 63)` — per-frame normalized landmarks (21 points × (x, y, z))
* `tip`: `(T, 2)` — normalized index fingertip positions (for sanity checks)

**Prepared dataset (`landmarks_seq32/`)**

* `train_X.npy`, `train_y.npy`, `val_X.npy`, `val_y.npy`
* `class_names.json` (e.g., `["A","B","J","Z"]`)
* `meta.json` (e.g., `{"frames":32,"input_dim":63}`)

**Checkpoint (`*.pt`)**

* `model` (state_dict), `classes`, `frames`, `X_mean`, `X_std`
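
A quick way to sanity-check these files from Python (the paths below are examples using the default output names; substitute a real clip filename):

```python
import json
import numpy as np

# Inspect one captured clip (example path).
clip = np.load("sequences/train/A/clip_000.npz")
print(clip["X"].shape, clip["tip"].shape)        # e.g. (T, 63) and (T, 2)

# Inspect the prepared, fixed-length dataset.
X = np.load("landmarks_seq32/train_X.npy")
y = np.load("landmarks_seq32/train_y.npy")
with open("landmarks_seq32/class_names.json") as f:
    classes = json.load(f)
print(X.shape, y.shape, classes)                 # e.g. (N, 32, 63), (N,), ["A","B","J","Z"]
```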
---

## Normalization (consistent across capture & inference)

1. Translate so the **wrist** (landmark 0) is at the origin.
2. If a **left** hand is detected, mirror with `x *= -1`.
3. Rotate so the **middle-finger MCP** (landmark 9) points along **+Y**.
4. Scale all coords by the **max pairwise distance** among 2D landmarks.
5. Flatten to **63 features** per frame.

This ensures that the shape of the letter, not camera pose, drives classification.
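
A minimal sketch of these five steps, assuming a `(21, 3)` landmark array per frame (illustrative; the project's own normalization code may differ in detail):

```python
import numpy as np

def normalize_frame(landmarks: np.ndarray, is_left: bool) -> np.ndarray:
    """landmarks: (21, 3) MediaPipe hand landmarks (x, y, z) → 63-dim feature vector."""
    pts = landmarks.astype(np.float32)

    # 1) Translate so the wrist (landmark 0) sits at the origin.
    pts = pts - pts[0]

    # 2) Mirror left hands so both hands share one canonical orientation.
    if is_left:
        pts[:, 0] *= -1.0

    # 3) Rotate in 2D so the middle-finger MCP (landmark 9) points along +Y.
    v = pts[9, :2]
    angle = np.arctan2(v[0], v[1])                # angle away from the +Y axis
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]], dtype=np.float32)
    pts[:, :2] = pts[:, :2] @ rot.T

    # 4) Scale by the maximum pairwise 2D distance between landmarks.
    d = np.linalg.norm(pts[None, :, :2] - pts[:, None, :2], axis=-1)
    pts = pts / (d.max() + 1e-8)

    # 5) Flatten to 63 features (21 points × x, y, z).
    return pts.reshape(-1)
```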

---

## Training Details

* **Model:** BiGRU (input=63, hidden=128, bidirectional) → `[Linear(256→128), ReLU, Dropout(0.2), Linear(128→num_classes)]`
* **Optimizer:** AdamW (`lr=1e-3`, `weight_decay=1e-4`)
* **Scheduler:** CosineAnnealingLR (`T_max = epochs`)
* **Augmentation:** small 2D rotation (±7°), scale (±10%), Gaussian noise (σ=0.01)
* **Normalization:** global `X_mean`/`X_std` computed over the **train** split (time+batch), applied to both train & val and saved into the checkpoint.
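
A hedged sketch of this setup in PyTorch (class and variable names here are illustrative, not necessarily those used in `train_seq.py`):

```python
import torch
import torch.nn as nn

class SeqClassifier(nn.Module):
    """BiGRU (63 → 128, bidirectional) followed by the MLP head described above."""
    def __init__(self, input_dim: int = 63, hidden: int = 128, num_classes: int = 4):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 128),   # 256 → 128
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, 63); here we pool by taking the last time step of the
        # BiGRU output (an assumption, the actual pooling may differ).
        out, _ = self.gru(x)
        return self.head(out[:, -1])

model = SeqClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=40)
```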
---

## Live Inference Behavior

* Maintains a rolling buffer of **T = frames** (from the checkpoint).
* Applies the saved `X_mean`/`X_std`.
* **EMA smoothing** over softmax probs with time constant `--smooth` (seconds).
* Emits a letter only if:

  * top prob ≥ `--threshold` (e.g., 0.8), **and**
  * the letter **changed** from the previous emission (prevents repeats).
* Tracks a short history of emitted letters to detect **W → E → B**; on match:

  * prints “Detected WEB! …”
  * calls `webbrowser.open(--url)`
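
A minimal sketch of the smoothing-and-emission rule above, assuming per-window softmax probabilities are already available (`EmaEmitter` is an illustrative name, not the class in `infer_seq_webcam.py`):

```python
import numpy as np

class EmaEmitter:
    """EMA-smooth class probabilities and emit a letter only on a confident change."""
    def __init__(self, classes, threshold=0.8, smooth_sec=0.7):
        self.classes = classes
        self.threshold = threshold
        self.smooth_sec = smooth_sec
        self.probs = None          # smoothed probability vector
        self.last_emitted = None

    def update(self, frame_probs: np.ndarray, dt: float):
        """frame_probs: softmax over classes for the current window; dt: seconds since last update."""
        # EMA coefficient derived from a time constant in seconds.
        alpha = 1.0 - np.exp(-dt / max(self.smooth_sec, 1e-6))
        if self.probs is None:
            self.probs = frame_probs.copy()
        else:
            self.probs = (1 - alpha) * self.probs + alpha * frame_probs

        top = int(np.argmax(self.probs))
        letter = self.classes[top]
        # Emit only when confident AND the letter changed since the last emission.
        if self.probs[top] >= self.threshold and letter != self.last_emitted:
            self.last_emitted = letter
            return letter
        return None
```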

**Common flags**

```bash
# Camera & size
--camera 0 --width 640 --height 480

# Confidence vs. latency tradeoffs
--threshold 0.85   # higher → fewer false positives
--smooth 1.0       # higher → steadier output but more lag

# Action on sequence
--url https://example.com
```

---
## Tips for High Accuracy

* Record **balanced** train/val counts per class (e.g., 100 train / 20 val).
* Keep the hand **centered**, well lit, and mostly **single-hand** (the model expects one hand).
* Maintain a consistent **distance** and **orientation** during capture.
* If you add new letters later, just record them, re-run preprocessing, and retrain — classes are **auto-discovered** from `sequences/train/*`.

---
## macOS (M-series) Notes

* PyTorch will automatically use **Metal (MPS)** if available (`torch.backends.mps.is_available()`); otherwise it falls back to CPU.
* If the webcam feed runs at a low FPS, try reducing `--width/--height` or raising `--threshold` / `--smooth`.
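
The usual device-selection pattern, shown here as a standalone check (illustrative; the project's scripts handle this internally):

```python
import torch

# Prefer Apple-silicon Metal (MPS) when available, otherwise fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

# Dummy (batch, frames, features) tensor just to confirm the device works.
x = torch.randn(1, 32, 63, device=device)
print(x.device)
```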
---

## Troubleshooting

* **“Could not open camera”** → try `--camera 1` (or check macOS camera permissions).
* **No detections / “No hand” on screen** → improve lighting, ensure a single clear hand, check the MediaPipe install.
* **Model emits wrong letters** → increase `--threshold`, collect more data, or raise `--smooth`.
* **Sequence length (T) mismatch during inference** → ensure `--frames` at preprocessing matches the checkpoint’s `frames` (saved & auto-used).
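
For the last case, you can check what the checkpoint expects; the keys follow the layout listed under **Checkpoint (`*.pt`)** above (the exact loading code in the scripts may differ):

```python
import torch

# Inspect the saved checkpoint's metadata.
ckpt = torch.load("asl_seq32_gru_ABJZ.pt", map_location="cpu")
print("classes:", ckpt["classes"])
print("frames :", ckpt["frames"])   # must match --frames used at preprocessing
```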
---

## Commands Reference

### Create class folders

```bash
./make_seq_dirs.sh A B J Z
```

### Capture clips

```bash
python capture_sequence.py --label A --split train --seconds 0.8 --count 100
python capture_sequence.py --label A --split val --seconds 0.8 --count 20
```
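
Capture is built on MediaPipe Hands landmarks. A minimal, hedged sketch of extracting one frame's 21 landmarks with the legacy `mp.solutions.hands` API (the actual `capture_sequence.py` adds the countdown UI, clip buffering, and normalization):

```python
import cv2
import mediapipe as mp
import numpy as np

cap = cv2.VideoCapture(0)
with mp.solutions.hands.Hands(max_num_hands=1,
                              min_detection_confidence=0.5,
                              min_tracking_confidence=0.5) as hands:
    ok, frame = cap.read()
    if ok:
        # MediaPipe expects RGB input; OpenCV delivers BGR.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark
            pts = np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32)   # (21, 3)
            handedness = results.multi_handedness[0].classification[0].label  # "Left"/"Right"
            print(pts.shape, handedness)
cap.release()
```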

### Prepare dataset (resample to 32 frames)

```bash
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32
```
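
Resampling maps a variable-length clip `(T, 63)` onto a fixed number of frames. A minimal sketch of the linear resampling mentioned in Features (illustrative, not the exact code in `prep_sequence_resampled.py`):

```python
import numpy as np

def resample_clip(X: np.ndarray, n_frames: int = 32) -> np.ndarray:
    """Linearly resample a (T, 63) clip to (n_frames, 63) along the time axis."""
    src = np.linspace(0.0, 1.0, X.shape[0])
    dst = np.linspace(0.0, 1.0, n_frames)
    # Interpolate each of the 63 feature columns independently.
    return np.stack([np.interp(dst, src, X[:, j]) for j in range(X.shape[1])], axis=1)
```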

### Train

```bash
python train_seq.py --landmarks landmarks_seq32 --epochs 40 --batch 64 --lr 1e-3 \
  --out asl_seq32_gru_ABJZ.pt
```
### Evaluate

```bash
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_ABJZ.pt
```
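
The confusion matrix and report come from scikit-learn; the core calls look like this (the `y_true`/`y_pred` arrays below are hypothetical stand-ins for the script's validation labels and predictions):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

classes = ["A", "B", "J", "Z"]
y_true = np.array([0, 1, 2, 3, 0, 1])   # hypothetical validation labels
y_pred = np.array([0, 1, 2, 3, 0, 2])   # hypothetical model predictions

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=classes))
```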

### Live demo (open URL on “WEB”)

```bash
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.8 --smooth 0.7 \
  --url https://www.gallaudet.edu
```

---

## License

MIT

---

## Acknowledgments

* **MediaPipe Hands** for robust, fast hand landmark detection.
* **PyTorch** for flexible sequence modeling on CPU/MPS.

---