# Handshapes Multiclass (Holistic) — README
A small end-to-end pipeline that records MediaPipe **Holistic** landmarks, builds fixed-length sequences, trains a **bidirectional GRU** classifier, evaluates it, and runs a **live webcam demo** that recognizes classes such as words (“Mother”, “Father”, “Go”) or letters.
---
## Quick Start
```bash
# 0) Create class folders
./make_seq_dirs.sh Mother Father Go
# 1) Capture clips (per class; adjust counts as you like)
python capture_sequence.py --label Mother --split train --seconds 0.8 --count 100
python capture_sequence.py --label Mother --split val --seconds 0.8 --count 20
python capture_sequence.py --label Father --split train --seconds 0.8 --count 100
python capture_sequence.py --label Father --split val --seconds 0.8 --count 20
python capture_sequence.py --label Go --split train --seconds 0.8 --count 100
python capture_sequence.py --label Go --split val --seconds 0.8 --count 20
# 2) Build fixed-length dataset (32 frames/clip)
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32
# 3) Train, evaluate, and run live inference
python train_seq.py --landmarks landmarks_seq32 --out asl_seq32_gru_mother_father_go.pt
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_mother_father_go.pt
python infer_seq_webcam.py --model asl_seq32_gru_mother_father_go.pt --threshold 0.35 --smooth 0.1
```
Folder layout after capture:
```
sequences/
  train/
    Mother/  clip_001.npz ...
    Father/  clip_001.npz ...
    Go/      clip_001.npz ...
  val/
    Mother/  ...
    Father/  ...
    Go/      ...
```
---
## Feature Representation (per frame)
From MediaPipe **Holistic**:
* **Right hand** 21×(x,y,z) → 63
* **Left hand** 21×(x,y,z) → 63
* **Face** 468×(x,y,z) → 1,404
* **Pose** 33×(x,y,z,visibility) → 132
* **Face-relative hand extras**: wrist (x,y) + index tip (x,y) for each hand, expressed in the face-normalized frame → 8
**Total** = **1,670 dims** per frame.
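
As a sanity check on the dimension count, here is a minimal sketch of how such a per-frame vector could be assembled; the helper name and signature are illustrative, not the actual code in `capture_sequence.py`. Missing parts are zero-filled so the length stays at 1,670, as noted in the Notes section.

```python
import numpy as np

# Illustrative helper (not the actual capture_sequence.py code): concatenate
# the normalized parts into one (1670,) vector, zero-filling anything missing.
def build_frame_features(right_hand, left_hand, face, pose, face_extras):
    parts = [
        (right_hand, 63),    # 21 landmarks x (x, y, z)
        (left_hand, 63),     # 21 landmarks x (x, y, z)
        (face, 1404),        # 468 landmarks x (x, y, z)
        (pose, 132),         # 33 landmarks x (x, y, z, visibility)
        (face_extras, 8),    # wrist + index-tip (x, y) per hand, face frame
    ]
    vec = np.concatenate([
        np.asarray(p, dtype=np.float32).ravel() if p is not None
        else np.zeros(dim, dtype=np.float32)
        for p, dim in parts
    ])
    assert vec.shape == (1670,)
    return vec

# Example: a frame where the left hand was not detected
vec = build_frame_features(np.zeros((21, 3)), None, np.zeros((468, 3)),
                           np.zeros((33, 4)), np.zeros(8))
print(vec.shape)   # (1670,)
```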
### Normalization (high level)
* Hands: translate to wrist, mirror left → right, rotate so middle-finger MCP points +Y, scale by max pairwise distance.
* Face: center at eye midpoint, scale by inter-ocular distance, rotate to align eyeline horizontally.
* Pose: center at shoulder midpoint, scale by shoulder width, rotate shoulders horizontal.
* Extras: per-hand wrist/tip projected into the face frame so the model retains *where* the hand is relative to the face (critical for signs like **Mother** vs **Father**).
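
For the hand branch, a rough sketch of the normalization steps above might look like this, assuming a `(21, 3)` array of MediaPipe hand landmarks; the exact axis conventions and landmark indices in `capture_sequence.py` may differ.

```python
import numpy as np

# Rough sketch of the hand normalization described above (not the actual code).
def normalize_hand(landmarks, is_left=False):
    pts = np.asarray(landmarks, dtype=np.float32).copy()
    if is_left:
        pts[:, 0] *= -1.0                  # mirror the left hand onto the right
    pts -= pts[0]                          # translate so the wrist is at the origin
    v = pts[9, :2]                         # middle-finger MCP direction (image plane)
    angle = np.arctan2(v[0], v[1])         # rotation that sends v to +Y
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]], dtype=np.float32)
    pts[:, :2] = pts[:, :2] @ rot.T
    # scale by the maximum pairwise distance so hand size is factored out
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return pts / (d.max() + 1e-6)          # (21, 3), flattened later to 63 dims
```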
---
## How the Pipeline Works
### 1) `make_seq_dirs.sh`
Creates the directory scaffolding under `sequences/` for any labels you pass (letters or words).
* **Usage:** `./make_seq_dirs.sh Mother Father Go`
* **Why:** Keeps data organized as `train/` and `val/` per class.
---
### 2) `capture_sequence.py`
Records short clips from your webcam and saves per-frame **feature vectors** into compressed `.npz` files.
**Key behaviors**
* Uses **MediaPipe Holistic** to extract right/left hands, full face mesh, and pose.
* Computes normalized features + face-relative extras.
* Writes each clip as `sequences/<split>/<label>/clip_XXX.npz` with an array `X` of shape `(T, 1670)`.
**Common flags**
* `--label` (string): class name (e.g., `Mother`, `Go`).
* `--split`: `train` or `val`.
* `--seconds` (float): clip length; 0.8s pairs well with 32 frames.
* `--count` (int): how many clips to record in one run.
* `--camera`, `--width`, `--height`: webcam settings.
* `--holistic-complexity` (`0|1|2`): higher is more accurate but slower.
* UI niceties: 3-second countdown; on-screen progress bar; optional fingertip markers.
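
A minimal sketch of the clip-writing step is shown below; dummy vectors stand in for the real per-frame features, while the array key `X` and the path layout follow this README.

```python
import os
import numpy as np

# Minimal sketch of clip writing (dummy features stand in for the real
# per-frame vectors; the actual capture loop lives in capture_sequence.py).
frames = [np.zeros(1670, dtype=np.float32) for _ in range(24)]  # ~0.8 s at 30 fps
X = np.stack(frames)                          # shape (T, 1670)

out_dir = "sequences/train/Mother"
os.makedirs(out_dir, exist_ok=True)
np.savez_compressed(os.path.join(out_dir, "clip_001.npz"), X=X)

# Reading a clip back later:
clip = np.load(os.path.join(out_dir, "clip_001.npz"))
print(clip["X"].shape)                        # (24, 1670)
```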
---
### 3) `prep_sequence_resampled.py`
Aggregates all `clip_*.npz` files into a fixed-length dataset.
**What it does**
* Loads each clip's `X` `(T, 1670)` and **linearly resamples** it to exactly `N` frames (default `32`), resulting in `(N, 1670)`.
* Stacks clips into:
* `train_X.npy` `(Nclips, Nframes, F)`
* `train_y.npy` `(Nclips,)`
* `val_X.npy`, `val_y.npy`
* `class_names.json` (sorted list of class names)
* `meta.json` with `{ "frames": N, "input_dim": F }`
**Flags**
* `--in` root of `sequences/`
* `--out` dataset folder (e.g., `landmarks_seq32`)
* `--frames` number of frames per clip after resampling (e.g., `16`, `32`, `64`)
> Tip: Reducing `--frames` (e.g., 16) lowers first-prediction latency in the live demo, at the cost of some stability/accuracy.
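
A minimal sketch of the linear resampling, assuming a clip array `X` of shape `(T, 1670)`; the script may differ in the exact interpolation details.

```python
import numpy as np

# Sketch of linear resampling to a fixed number of frames: interpolate each
# of the feature columns along the (normalized) time axis.
def resample_clip(X, n_frames=32):
    T, F = X.shape
    src = np.linspace(0.0, 1.0, T)
    dst = np.linspace(0.0, 1.0, n_frames)
    return np.stack([np.interp(dst, src, X[:, j]) for j in range(F)], axis=1)

clip = np.random.rand(19, 1670).astype(np.float32)   # variable-length clip
fixed = resample_clip(clip, n_frames=32)
print(fixed.shape)                                    # (32, 1670)
```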
---
### 4) `train_seq.py`
Trains a **bidirectional GRU** classifier on the resampled sequences.
**What it does**
* Loads `train_*.npy` / `val_*.npy`, `class_names.json`, and `meta.json`.
* Computes **feature-wise mean/std** on the train set; normalizes train/val.
* Model: `GRU(input_dim → 128 hidden, bidirectional) → ReLU → Dropout → Linear(num_classes)`.
* Tracks best **val accuracy**; saves a checkpoint containing:
* `model` weights
* `classes`, `frames`
* `X_mean`, `X_std` (for inference normalization)
**Flags**
* `--epochs`, `--batch`, `--lr`: typical training hyperparams.
* `--out`: model file (e.g., `asl_seq32_gru_mother_father_go.pt`)
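
A sketch of the architecture described above; the layer sizes follow this README, but details in `train_seq.py` (pooling over time, dropout rate, number of layers) may differ.

```python
import torch
import torch.nn as nn

# Sketch of the BiGRU classifier head described in this README (illustrative).
class SeqGRUClassifier(nn.Module):
    def __init__(self, input_dim=1670, hidden=128, num_classes=3, dropout=0.3):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(2 * hidden, num_classes),   # 2x for the two directions
        )

    def forward(self, x):                          # x: (batch, frames, input_dim)
        _, h = self.gru(x)                         # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)        # concat forward/backward states
        return self.head(h)                        # (batch, num_classes)

model = SeqGRUClassifier()
logits = model(torch.randn(4, 32, 1670))
print(logits.shape)                                # torch.Size([4, 3])
```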
---
### 5) `eval_val.py`
Evaluates your saved model on the validation set.
**What it does**
* Loads `val_X.npy`, `val_y.npy`, `class_names.json`, `meta.json`, and the `*.pt` checkpoint.
* Normalizes `val_X` using the **training** mean/std stored in the checkpoint.
* Prints **confusion matrix** and a full **classification report** (precision/recall/F1/accuracy).
**Usage**
```bash
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_mother_father_go.pt
```
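
For reference, a rough sketch of what the evaluation amounts to, assuming the checkpoint keys named in the training section (`model`, `classes`, `X_mean`, `X_std`) and reusing the `SeqGRUClassifier` sketch above; `eval_val.py` is the authoritative version.

```python
import numpy as np
import torch
from sklearn.metrics import confusion_matrix, classification_report

# Assumes checkpoint keys as described in the training section and the
# SeqGRUClassifier sketch defined earlier in this README.
ckpt = torch.load("asl_seq32_gru_mother_father_go.pt", map_location="cpu")
X = np.load("landmarks_seq32/val_X.npy")
y = np.load("landmarks_seq32/val_y.npy")

mean = np.asarray(ckpt["X_mean"], dtype=np.float32)   # *training* statistics
std = np.asarray(ckpt["X_std"], dtype=np.float32)
X = (X - mean) / (std + 1e-6)

model = SeqGRUClassifier(input_dim=X.shape[-1], num_classes=len(ckpt["classes"]))
model.load_state_dict(ckpt["model"])
model.eval()
with torch.no_grad():
    pred = model(torch.from_numpy(X).float()).argmax(dim=1).numpy()

print(confusion_matrix(y, pred))
print(classification_report(y, pred, target_names=ckpt["classes"]))
```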
---
### 6) `infer_seq_webcam.py`
Live webcam demo that streams landmarks, builds a rolling buffer, and classifies in real time.
**Key behaviors**
* Maintains a **rolling window** of `T` frames (from the model's `frames` value; default 32).
* No prediction until the buffer is full → expect a short warm-up.
* Applies the same normalization using the model's stored `X_mean`/`X_std`.
* Optional **EMA smoothing** over probabilities for stability.
* Example **action hook** included: spell “W → E → B” to open a URL.
**Common flags**
* `--threshold` (e.g., `0.35`): minimum top-class probability to “emit” a label.
* `--smooth` (seconds): temporal EMA (0 disables). Lower = more responsive; higher = steadier.
* `--holistic-complexity`, `--det-thresh`: detector accuracy/sensitivity tradeoffs.
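
A compact sketch of the rolling window plus time-based EMA described above; variable names and the exact smoothing formulation are illustrative, and `probs_fn` stands in for the model forward pass plus softmax.

```python
import time
from collections import deque
import numpy as np

# Illustrative rolling-buffer + EMA logic (not lifted from infer_seq_webcam.py).
T = 32                                   # from the checkpoint's "frames"
buffer = deque(maxlen=T)                 # most recent T per-frame feature vectors
smooth_seconds = 0.1                     # --smooth
threshold = 0.35                         # --threshold
ema_probs, last_t = None, None

def on_new_frame(features, probs_fn):
    """Push one (1670,) feature vector; return a class index once confident."""
    global ema_probs, last_t
    buffer.append(features)
    if len(buffer) < T:
        return None                      # warm-up: no prediction until full
    probs = probs_fn(np.stack(buffer))   # model softmax over the window
    now = time.monotonic()
    if ema_probs is None or smooth_seconds <= 0:
        ema_probs = probs
    else:
        alpha = 1.0 - np.exp(-(now - last_t) / smooth_seconds)   # time-based EMA
        ema_probs = alpha * probs + (1.0 - alpha) * ema_probs
    last_t = now
    k = int(np.argmax(ema_probs))
    return k if ema_probs[k] >= threshold else None
```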
---
## Parameters & Practical Tips
* **Threshold vs Smooth**
* Lower `--threshold` (e.g., `0.3–0.4`) → more sensitive, but may produce more false positives.
* `--smooth` `0.1–0.3s` → responsive; `0.5–0.8s` → steadier but laggier.
* **Frames (`--frames` in prep)**
* `16–24` frames: snappier first detection.
* `32` frames: balanced.
* `64` frames: more context, slower to first prediction.
* **Data balance & variety**
* Similar clip counts per class help training.
* Vary lighting, small head angles, distance, and speed of motion.
* For location-based signs (e.g., Mother vs Father), the **face-relative extras** help the model disambiguate.
---
## File-by-File Summary
| File | Purpose | Inputs → Outputs |
| ---------------------------- | -------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------- |
| `make_seq_dirs.sh` | Creates `train/` and `val/` subfolders for each label you pass in. | Labels → `sequences/train/<label>/`, `sequences/val/<label>/` |
| `capture_sequence.py` | Captures webcam frames, extracts Holistic landmarks, normalizes, builds per-frame 1,670-D features, and saves each clip. | Webcam → `sequences/<split>/<label>/clip_XXX.npz` (X: `(T,1670)`) |
| `prep_sequence_resampled.py` | Resamples variable-length clips to fixed length; aggregates into train/val arrays and writes metadata. | `sequences/` → `landmarks_seq32/{train_X,train_y,val_X,val_y}.npy`, `class_names.json`, `meta.json` |
| `train_seq.py` | Trains a BiGRU multiclass classifier with normalization and simple augmentation. | `landmarks_seq32` → `asl_seq32_gru_*.pt` (includes model, classes, frames, mean/std) |
| `eval_val.py` | Evaluates the saved model on the validation split; prints metrics. | Model + `landmarks_seq32` → console metrics |
| `infer_seq_webcam.py` | Streams webcam landmarks, builds rolling sequences, classifies in real time; optional action (e.g., open URL on sequence). | Webcam + `asl_seq32_gru_*.pt` → on-screen predictions/actions |
| `what_to_do.txt` | Step-by-step command cheat-sheet reflecting the current multi-word workflow. | — |
---
## Troubleshooting
* **“No classes found in sequences/train/”**
Ensure class folders exist: `sequences/train/<Label>/` and `sequences/val/<Label>/`, and that they contain `clip_*.npz`.
* **No live prediction initially**
Expected; the model needs the first **T** frames to fill the buffer.
* **Lag or low FPS**
Try `--holistic-complexity 0`, reduce camera resolution, or use a smaller `--frames` and retrain.
* **Overconfident but wrong**
Raise `--threshold`, increase `--smooth`, or record more varied data per class (especially negatives or near-misses).
---
## Add/Remove Classes
* To **add** a class (e.g., `Go`): create dirs, capture clips, rerun **prep**, retrain, re-eval.
* To **remove/replace** a class: delete its folders or rename, **then** rerun **prep** and retrain.
---
## Dependencies
* Python 3.x, `numpy`, `opencv-python`, `mediapipe`, `torch`, `scikit-learn` (for evaluation).
* macOS with Apple Silicon can use MPS acceleration automatically (already handled in the code).
---
## Notes
* Labels are **arbitrary strings** (not restricted to A–Z).
* Features are **zero-filled** for missing parts in a frame (e.g., if a hand isn't detected) to keep dimensions stable.
* The face is used as a global anchor for geometry; keeping the face visible improves robustness.
---