# Handshapes Multiclass (Holistic) — README
A small end-to-end pipeline that records MediaPipe **Holistic** landmarks, builds fixed-length sequences, trains a **bidirectional GRU** classifier, evaluates it, and runs a **live webcam demo** that recognizes classes such as words (“Mother”, “Father”, “Go”) or letters.

---

## Quick Start

```bash
# 0) Create class folders
./make_seq_dirs.sh Mother Father Go

# 1) Capture clips (per class; adjust counts as you like)
python capture_sequence.py --label Mother --split train --seconds 0.8 --count 100
python capture_sequence.py --label Mother --split val --seconds 0.8 --count 20
python capture_sequence.py --label Father --split train --seconds 0.8 --count 100
python capture_sequence.py --label Father --split val --seconds 0.8 --count 20
python capture_sequence.py --label Go --split train --seconds 0.8 --count 100
python capture_sequence.py --label Go --split val --seconds 0.8 --count 20

# 2) Build fixed-length dataset (32 frames/clip)
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32

# 3) Train, evaluate, and run live inference
python train_seq.py --landmarks landmarks_seq32 --out asl_seq32_gru_mother_father_go.pt
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_mother_father_go.pt
python infer_seq_webcam.py --model asl_seq32_gru_mother_father_go.pt --threshold 0.35 --smooth 0.1
```

Folder layout after capture:

```
sequences/
  train/
    Mother/  clip_001.npz ...
    Father/  clip_001.npz ...
    Go/      clip_001.npz ...
  val/
    Mother/  ...
    Father/  ...
    Go/      ...
```

---

## Feature Representation (per frame)

From MediaPipe **Holistic**:

* **Right hand** 21×(x,y,z) → 63
* **Left hand** 21×(x,y,z) → 63
* **Face** 468×(x,y,z) → 1,404
* **Pose** 33×(x,y,z,visibility) → 132
* **Face-relative hand extras**: wrist (x,y) + index tip (x,y) for each hand, expressed in the face-normalized frame → 8

**Total** = **1,670 dims** per frame (63 + 63 + 1,404 + 132 + 8).
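
To make the layout concrete, here is a minimal sketch of assembling one frame vector; the `SLOTS` mapping and `frame_features` helper are illustrative names, not the actual code in `capture_sequence.py` (which also zero-fills missing parts, per the Notes section):

```python
import numpy as np

# Illustrative slot sizes from the list above (63 + 63 + 1404 + 132 + 8 = 1670).
SLOTS = {"right_hand": 63, "left_hand": 63, "face": 1404, "pose": 132, "extras": 8}

def frame_features(parts: dict) -> np.ndarray:
    """Concatenate per-part features; zero-fill any part that wasn't detected."""
    chunks = []
    for name, size in SLOTS.items():
        vec = parts.get(name)
        chunks.append(np.zeros(size, np.float32) if vec is None else np.asarray(vec, np.float32))
    out = np.concatenate(chunks)
    assert out.shape == (1670,)
    return out
```
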

### Normalization (high level)

* Hands: translate to the wrist, mirror left → right, rotate so the middle-finger MCP points +Y, and scale by the max pairwise distance (see the sketch below).
* Face: center at the eye midpoint, scale by inter-ocular distance, rotate to align the eyeline horizontally.
* Pose: center at the shoulder midpoint, scale by shoulder width, rotate the shoulders horizontal.
* Extras: per-hand wrist/tip projected into the face frame so the model retains *where* the hand is relative to the face (critical for signs like **Mother** vs **Father**).
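
A sketch of the hand branch under these conventions, assuming MediaPipe's 21-point hand order (wrist = 0, middle-finger MCP = 9) and a `(21, 3)` float array; the exact rotation and mirroring details in the real scripts may differ:

```python
import numpy as np

def normalize_hand(lm: np.ndarray, is_left: bool) -> np.ndarray:
    """lm: (21, 3) landmarks. Wrist-center, mirror left hands, rotate, scale."""
    pts = lm - lm[0]                      # translate so the wrist is the origin
    if is_left:
        pts[:, 0] *= -1.0                 # mirror left -> right
    # Rotate in the x/y plane so the middle-finger MCP (index 9) points along +Y.
    angle = np.arctan2(pts[9, 0], pts[9, 1])
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    pts[:, :2] = pts[:, :2] @ rot.T
    # Scale by the maximum pairwise distance for size invariance.
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return pts / (dists.max() + 1e-8)
```
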
---

## How the Pipeline Works

### 1) `make_seq_dirs.sh`

Creates the directory scaffolding under `sequences/` for any labels you pass (letters or words).

* **Usage:** `./make_seq_dirs.sh Mother Father Go`
* **Why:** Keeps data organized as `train/` and `val/` per class.

---

### 2) `capture_sequence.py`

Records short clips from your webcam and saves per-frame **feature vectors** into compressed `.npz` files.

**Key behaviors**

* Uses **MediaPipe Holistic** to extract right/left hands, the full face mesh, and pose.
* Computes normalized features plus the face-relative extras.
* Writes each clip as `sequences/<split>/<label>/clip_XXX.npz` with an array `X` of shape `(T, 1670)`.
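
Since clips are plain compressed NumPy archives, they are easy to write and inspect by hand; a small round-trip sketch (the path and clip length are just examples):

```python
import os

import numpy as np

# Save one clip: T frames of 1670-D features (random data for illustration).
clip = np.random.rand(24, 1670).astype(np.float32)
os.makedirs("sequences/train/Mother", exist_ok=True)
np.savez_compressed("sequences/train/Mother/clip_001.npz", X=clip)

# Load it back.
with np.load("sequences/train/Mother/clip_001.npz") as f:
    print(f["X"].shape)  # (24, 1670)
```
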
**Common flags**

* `--label` (string): class name (e.g., `Mother`, `Go`).
* `--split`: `train` or `val`.
* `--seconds` (float): clip length; 0.8 s pairs well with 32 frames.
* `--count` (int): how many clips to record in one run.
* `--camera`, `--width`, `--height`: webcam settings.
* `--holistic-complexity` (`0|1|2`): higher is more accurate but slower.
* UI niceties: 3-second countdown, on-screen progress bar, optional fingertip markers.

---

### 3) `prep_sequence_resampled.py`

Aggregates all `clip_*.npz` files into a fixed-length dataset.

**What it does**

* Loads each clip’s `X` `(T, 1670)` and **linearly resamples** it to exactly `N` frames (default `32`), resulting in `(N, 1670)`.
* Stacks clips into:
  * `train_X.npy` `(Nclips, Nframes, F)`
  * `train_y.npy` `(Nclips,)`
  * `val_X.npy`, `val_y.npy`
  * `class_names.json` (sorted list of class names)
  * `meta.json` with `{ "frames": N, "input_dim": F }`

**Flags**

* `--in`: root of `sequences/`
* `--out`: dataset folder (e.g., `landmarks_seq32`)
* `--frames`: number of frames per clip after resampling (e.g., `16`, `32`, `64`)

> Tip: Reducing `--frames` (e.g., to 16) lowers first-prediction latency in the live demo, at the cost of some stability/accuracy.
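
The linear resampling can be as simple as per-feature interpolation onto `N` evenly spaced time points; a minimal sketch (the real script may vectorize this differently):

```python
import numpy as np

def resample_clip(X: np.ndarray, n_frames: int = 32) -> np.ndarray:
    """Linearly resample a (T, F) clip to exactly (n_frames, F)."""
    T, F = X.shape
    src = np.linspace(0.0, 1.0, T)         # original time points
    dst = np.linspace(0.0, 1.0, n_frames)  # target time points
    out = np.empty((n_frames, F), dtype=X.dtype)
    for j in range(F):                     # interpolate each feature independently
        out[:, j] = np.interp(dst, src, X[:, j])
    return out
```
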
---

### 4) `train_seq.py`

Trains a **bidirectional GRU** classifier on the resampled sequences.

**What it does**

* Loads `train_*.npy` / `val_*.npy`, `class_names.json`, and `meta.json`.
* Computes **feature-wise mean/std** on the train set; normalizes both train and val with them.
* Model: `GRU(input_dim → 128 hidden, bidirectional) → ReLU → Dropout → Linear(num_classes)` (a sketch follows below).
* Tracks the best **val accuracy**; saves a checkpoint containing:
  * `model` weights
  * `classes`, `frames`
  * `X_mean`, `X_std` (for inference normalization)
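
A sketch of that architecture in PyTorch; the layer sizes follow the description above, while the dropout rate (0.3) and reading the last time step (rather than, say, mean-pooling) are assumptions:

```python
import torch
import torch.nn as nn

class SeqClassifier(nn.Module):
    def __init__(self, input_dim: int, num_classes: int, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(0.3),                     # dropout rate is illustrative
            nn.Linear(2 * hidden, num_classes),  # 2x for the two directions
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, input_dim) -> logits: (batch, num_classes)
        out, _ = self.gru(x)
        return self.head(out[:, -1, :])          # features at the last time step

# e.g., SeqClassifier(input_dim=1670, num_classes=3)(torch.randn(8, 32, 1670))
```
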

**Flags**

* `--epochs`, `--batch`, `--lr`: typical training hyperparameters.
* `--out`: model file (e.g., `asl_seq32_gru_mother_father_go.pt`)

---

### 5) `eval_val.py`

Evaluates your saved model on the validation set.

**What it does**

* Loads `val_X.npy`, `val_y.npy`, `class_names.json`, `meta.json`, and the `*.pt` checkpoint.
* Normalizes `val_X` using the **training** mean/std stored in the checkpoint.
* Prints a **confusion matrix** and a full **classification report** (precision/recall/F1/accuracy); see the sketch after the usage example.

**Usage**

```bash
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_mother_father_go.pt
```
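
The metrics themselves are standard `scikit-learn` calls; a minimal sketch with dummy predictions (real code would use the model's argmax over the normalized `val_X`):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# y_true / y_pred: integer class indices; names come from class_names.json.
y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 1, 2, 0, 0])
class_names = ["Father", "Go", "Mother"]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=class_names))
```
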
---

### 6) `infer_seq_webcam.py`

Live webcam demo that streams landmarks, builds a rolling buffer, and classifies in real time.

**Key behaviors**

* Maintains a **rolling window** of `T` frames (taken from the model’s stored `frames` value; default 32).
* No prediction until the buffer is full, so expect a short warm-up.
* Applies the same normalization using the model’s stored `X_mean`/`X_std`.
* Optional **EMA smoothing** over probabilities for stability (see the sketch below).
* An example **action hook** is included: spell “W → E → B” to open a URL.
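
A sketch of EMA smoothing over the per-frame class probabilities; mapping the `--smooth` seconds value to a per-frame blend factor via the frame interval is an assumption about how the flag is implemented:

```python
import numpy as np

def ema_update(smoothed, probs, dt, tau):
    """Blend new per-class probabilities into the running estimate.

    dt: seconds since the last frame; tau: the --smooth time constant (0 disables).
    """
    if tau <= 0 or smoothed is None:
        return np.asarray(probs, dtype=np.float32)
    alpha = min(1.0, dt / tau)  # larger tau -> slower, steadier updates
    return (1.0 - alpha) * smoothed + alpha * np.asarray(probs, dtype=np.float32)

# A label is emitted only when the smoothed top-class probability clears --threshold.
```
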
**Common flags**

* `--threshold` (e.g., `0.35`): minimum top-class probability required to “emit” a label.
* `--smooth` (seconds): temporal EMA (0 disables). Lower = more responsive; higher = steadier.
* `--holistic-complexity`, `--det-thresh`: detector accuracy/sensitivity tradeoffs.

---

## Parameters & Practical Tips

* **Threshold vs. smooth**
  * Lower `--threshold` (e.g., `0.3–0.4`) → more sensitive, but may produce more false positives.
  * `--smooth` ≈ `0.1–0.3` s → responsive; `0.5–0.8` s → steadier but laggier.
* **Frames (`--frames` in prep)**
  * `16–24` frames: snappier first detection.
  * `32` frames: balanced.
  * `64` frames: more context, but slower to first prediction.
* **Data balance & variety**
  * Similar clip counts per class help training.
  * Vary lighting, small head angles, distance, and speed of motion.
  * For location-based signs (e.g., Mother vs Father), the **face-relative extras** help the model disambiguate.

---


## File-by-File Summary

| File | Purpose | Inputs → Outputs |
| --- | --- | --- |
| `make_seq_dirs.sh` | Creates `train/` and `val/` subfolders for each label you pass in. | Labels → `sequences/train/<label>/`, `sequences/val/<label>/` |
| `capture_sequence.py` | Captures webcam frames, extracts Holistic landmarks, normalizes, builds per-frame 1,670-D features, and saves each clip. | Webcam → `sequences/<split>/<label>/clip_XXX.npz` (`X`: `(T, 1670)`) |
| `prep_sequence_resampled.py` | Resamples variable-length clips to a fixed length; aggregates into train/val arrays and writes metadata. | `sequences/` → `landmarks_seq32/{train_X,train_y,val_X,val_y}.npy`, `class_names.json`, `meta.json` |
| `train_seq.py` | Trains a BiGRU multiclass classifier with normalization and simple augmentation. | `landmarks_seq32` → `asl_seq32_gru_*.pt` (includes model, classes, frames, mean/std) |
| `eval_val.py` | Evaluates the saved model on the validation split; prints metrics. | Model + `landmarks_seq32` → console metrics |
| `infer_seq_webcam.py` | Streams webcam landmarks, builds rolling sequences, classifies in real time; optional action (e.g., open a URL on a spelled sequence). | Webcam + `asl_seq32_gru_*.pt` → on-screen predictions/actions |
| `what_to_do.txt` | Step-by-step command cheat sheet reflecting the current multi-word workflow. | — |

---

## Troubleshooting

* **“No classes found in sequences/train/”**
  Ensure the class folders exist (`sequences/train/<Label>/` and `sequences/val/<Label>/`) and that they contain `clip_*.npz` files.

* **No live prediction initially**
  Expected; the model needs the first **T** frames to fill the buffer.

* **Lag or low FPS**
  Try `--holistic-complexity 0`, reduce the camera resolution, or use a smaller `--frames` and retrain.

* **Overconfident but wrong**
  Raise `--threshold`, increase `--smooth`, or record more varied data per class (especially negatives or near-misses).

---


## Add/Remove Classes

* To **add** a class (e.g., `Go`): create its dirs, capture clips, rerun **prep**, retrain, and re-evaluate.
* To **remove/replace** a class: delete or rename its folders, **then** rerun **prep** and retrain.

---

## Dependencies

* Python 3.x, `numpy`, `opencv-python`, `mediapipe`, `torch`, and `scikit-learn` (for evaluation).
* macOS with Apple Silicon can use MPS acceleration automatically (already handled in the code).

---

## Notes

* Labels are **arbitrary strings** (not restricted to A–Z).
* Features are **zero-filled** for missing parts in a frame (e.g., if a hand isn’t detected) to keep dimensions stable.
* The face is used as a global anchor for geometry; keeping the face visible improves robustness.