Handshapes Multiclass (Holistic) — README

A small end-to-end pipeline that records MediaPipe Holistic landmarks, builds fixed-length sequences, trains a bidirectional GRU classifier, evaluates it, and runs a live webcam demo that recognizes classes such as words (“Mother”, “Father”, “Go”) or letters.


Quick Start

```bash
# 0) Create class folders
./make_seq_dirs.sh Mother Father Go

# 1) Capture clips (per class; adjust counts as you like)
python capture_sequence.py --label Mother --split train --seconds 0.8 --count 100
python capture_sequence.py --label Mother --split val   --seconds 0.8 --count 20
python capture_sequence.py --label Father --split train --seconds 0.8 --count 100
python capture_sequence.py --label Father --split val   --seconds 0.8 --count 20
python capture_sequence.py --label Go     --split train --seconds 0.8 --count 100
python capture_sequence.py --label Go     --split val   --seconds 0.8 --count 20

# 2) Build fixed-length dataset (32 frames/clip)
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32

# 3) Train, evaluate, and run live inference
python train_seq.py --landmarks landmarks_seq32 --out asl_seq32_gru_mother_father_go.pt
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_mother_father_go.pt
python infer_seq_webcam.py --model asl_seq32_gru_mother_father_go.pt --threshold 0.35 --smooth 0.1
```

Folder layout after capture:

```text
sequences/
  train/
    Mother/ clip_001.npz ...
    Father/ clip_001.npz ...
    Go/     clip_001.npz ...
  val/
    Mother/ ...
    Father/ ...
    Go/     ...
```

Feature Representation (per frame)

From MediaPipe Holistic:

  • Right hand 21×(x,y,z) → 63
  • Left hand 21×(x,y,z) → 63
  • Face 468×(x,y,z) → 1,404
  • Pose 33×(x,y,z,visibility) → 132
  • Face-relative hand extras: wrist (x,y) + index tip (x,y) for each hand, expressed in the face-normalized frame → 8

Total: 63 + 63 + 1,404 + 132 + 8 = 1,670 dims per frame (a sketch of this layout follows below).
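A minimal bookkeeping sketch of that layout; the names and the helper are illustrative, not the repo's exact code:

```python
import numpy as np

RIGHT_HAND = 21 * 3    # 63
LEFT_HAND  = 21 * 3    # 63
FACE       = 468 * 3   # 1404
POSE       = 33 * 4    # 132 (x, y, z, visibility)
EXTRAS     = 8         # face-relative wrist + index-tip (x, y) per hand

FRAME_DIM = RIGHT_HAND + LEFT_HAND + FACE + POSE + EXTRAS
assert FRAME_DIM == 1670

def frame_vector(right, left, face, pose, extras):
    """Concatenate the per-part blocks into one (1670,) frame vector.
    Missing parts should be passed as zero arrays of the matching size."""
    return np.concatenate([right, left, face, pose, extras]).astype(np.float32)
```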

Normalization (high level)

  • Hands: translate to wrist, mirror left → right, rotate so middle-finger MCP points +Y, scale by max pairwise distance.
  • Face: center at eye midpoint, scale by inter-ocular distance, rotate to align eyeline horizontally.
  • Pose: center at shoulder midpoint, scale by shoulder width, rotate shoulders horizontal.
  • Extras: per-hand wrist/tip projected into the face frame so the model retains where the hand is relative to the face (critical for signs like Mother vs Father).
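As a rough illustration of the hand branch above, assuming MediaPipe's landmark ordering (wrist = index 0, middle-finger MCP = index 9); the repo's actual implementation may differ in details:

```python
import numpy as np

def normalize_hand(pts, mirror=False):
    """Sketch: translate to wrist, optionally mirror, rotate middle-MCP to +Y, scale."""
    pts = np.asarray(pts, dtype=np.float32).copy()   # (21, 3) landmarks
    if mirror:
        pts *= np.array([-1.0, 1.0, 1.0], dtype=np.float32)  # map left hand into the right-hand frame
    pts -= pts[0]                                     # wrist at the origin
    v = pts[9, :2]                                    # wrist -> middle-finger MCP, image plane
    angle = np.arctan2(v[0], v[1])                    # rotate so this direction points along +Y
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]], dtype=np.float32)
    pts[:, :2] = pts[:, :2] @ R.T
    scale = np.max(np.linalg.norm(pts[None, :, :] - pts[:, None, :], axis=-1))
    return pts / (scale + 1e-6)                       # scale by max pairwise distance
```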

How the Pipeline Works

1) make_seq_dirs.sh

Creates the directory scaffolding under sequences/ for any labels you pass (letters or words).

  • Usage: ./make_seq_dirs.sh Mother Father Go
  • Why: Keeps data organized as train/ and val/ per class.

2) capture_sequence.py

Records short clips from your webcam and saves per-frame feature vectors into compressed .npz files.

Key behaviors

  • Uses MediaPipe Holistic to extract right/left hands, full face mesh, and pose.
  • Computes normalized features + face-relative extras.
  • Writes each clip as sequences/<split>/<label>/clip_XXX.npz with an array X of shape (T, 1670).

Common flags

  • --label (string): class name (e.g., Mother, Go).
  • --split: train or val.
  • --seconds (float): clip length; 0.8s pairs well with 32 frames.
  • --count (int): how many clips to record in one run.
  • --camera, --width, --height: webcam settings.
  • --holistic-complexity (0|1|2): higher is more accurate but slower.
  • UI niceties: 3-second countdown; on-screen progress bar; optional fingertip markers.
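To sanity-check a recorded clip, the .npz files can be inspected directly (the path below is just an example):

```python
import numpy as np

clip = np.load("sequences/train/Mother/clip_001.npz")
X = clip["X"]
print(X.shape)   # (T, 1670); T depends on --seconds and your camera's FPS
```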

3) prep_sequence_resampled.py

Aggregates all clip_*.npz files into a fixed-length dataset.

What it does

  • Loads each clip's X (T, 1670) and linearly resamples it to exactly N frames (default 32), resulting in (N, 1670).

  • Stacks clips into:

    • train_X.npy (Nclips, Nframes, F)
    • train_y.npy (Nclips,)
    • val_X.npy, val_y.npy
    • class_names.json (sorted list of class names)
    • meta.json with { "frames": N, "input_dim": F }
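The resampling step can be sketched as follows; this illustrates linear time-resampling and is not necessarily the script's exact code:

```python
import numpy as np

def resample_clip(X, n_frames=32):
    """Linearly resample a (T, F) clip to (n_frames, F) along the time axis."""
    T, F = X.shape
    if T == 1:
        return np.repeat(X, n_frames, axis=0)
    src = np.linspace(0.0, 1.0, T)
    dst = np.linspace(0.0, 1.0, n_frames)
    out = np.empty((n_frames, F), dtype=X.dtype)
    for f in range(F):                      # interpolate each feature channel independently
        out[:, f] = np.interp(dst, src, X[:, f])
    return out
```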

Flags

  • --in: root of the captured sequences/ tree
  • --out: output dataset folder (e.g., landmarks_seq32)
  • --frames: number of frames per clip after resampling (e.g., 16, 32, 64)

Tip: Reducing --frames (e.g., 16) lowers first-prediction latency in the live demo, at the cost of some stability/accuracy.


4) train_seq.py

Trains a bidirectional GRU classifier on the resampled sequences.

What it does

  • Loads train_*.npy / val_*.npy, class_names.json, and meta.json.

  • Computes feature-wise mean/std on the train set; normalizes train/val.

  • Model: GRU(input_dim → 128 hidden, bidirectional) → ReLU → Dropout → Linear(num_classes).

  • Tracks best val accuracy; saves a checkpoint containing:

    • model weights
    • classes, frames
    • X_mean, X_std (for inference normalization)
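A minimal model definition consistent with the description above might look like this (the class name, dropout rate, and pooling over time are assumptions, not the repo's exact code):

```python
import torch
import torch.nn as nn

class BiGRUClassifier(nn.Module):
    """Illustrative: GRU(input_dim -> 128 hidden, bidirectional) -> ReLU -> Dropout -> Linear."""
    def __init__(self, input_dim, num_classes, hidden=128, dropout=0.3):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(2 * hidden, num_classes),
        )

    def forward(self, x):             # x: (batch, frames, input_dim)
        out, _ = self.gru(x)          # (batch, frames, 2 * hidden)
        return self.head(out[:, -1])  # last time step; mean-pooling over time would also be reasonable

# Checkpoint contents as described above (field names are illustrative):
# torch.save({"model": model.state_dict(), "classes": classes, "frames": frames,
#             "X_mean": X_mean, "X_std": X_std}, out_path)
```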

Flags

  • --epochs, --batch, --lr: typical training hyperparams.
  • --out: model file (e.g., asl_seq32_gru_mother_father_go.pt)

5) eval_val.py

Evaluates your saved model on the validation set.

What it does

  • Loads val_X.npy, val_y.npy, class_names.json, meta.json, and the *.pt checkpoint.
  • Normalizes val_X using the training mean/std stored in the checkpoint.
  • Prints confusion matrix and a full classification report (precision/recall/F1/accuracy).
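A minimal version of the metrics step, using placeholder labels just to show the scikit-learn calls involved (eval_val.py uses the real validation arrays and the class order from class_names.json):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

class_names = ["Father", "Go", "Mother"]      # placeholder class order
y_true = np.array([0, 1, 2, 1, 0])            # placeholder ground-truth indices
y_pred = np.array([0, 1, 2, 0, 0])            # placeholder model predictions

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=class_names))
```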

Usage

```bash
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_mother_father_go.pt
```

6) infer_seq_webcam.py

Live webcam demo that streams landmarks, builds a rolling buffer, and classifies in real time.

Key behaviors

  • Maintains a rolling window of T frames (from the model's frames value; default 32).
  • No prediction until the buffer is full → expect a short warm-up.
  • Applies the same normalization using the model's stored X_mean/X_std.
  • Optional EMA smoothing over probabilities for stability.
  • Example action hook included: spell “W → E → B” to open a URL.
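The buffering and smoothing logic can be sketched roughly as below; how --smooth (in seconds) maps to the EMA weight is an assumption, and the script may use a different formula:

```python
from collections import deque
import numpy as np

FRAMES = 32                     # taken from the checkpoint's "frames" value
buffer = deque(maxlen=FRAMES)   # rolling window of per-frame feature vectors
ema_probs = None

def step(frame_features, probs_fn, dt, smooth=0.1):
    """Push one frame; return smoothed class probabilities once the buffer is full.

    probs_fn: callable mapping a (FRAMES, 1670) window to class probabilities.
    dt: seconds since the last frame; smooth: EMA time constant in seconds (0 disables).
    """
    global ema_probs
    buffer.append(frame_features)
    if len(buffer) < FRAMES:
        return None                              # warm-up: wait until the window is full
    probs = probs_fn(np.stack(buffer))           # model forward pass on the current window
    if smooth <= 0 or ema_probs is None:
        ema_probs = probs
    else:
        alpha = 1.0 - np.exp(-dt / smooth)       # time-based EMA weight (an assumption)
        ema_probs = alpha * probs + (1.0 - alpha) * ema_probs
    return ema_probs
```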

Common flags

  • --threshold (e.g., 0.35): minimum top-class probability to “emit” a label.
  • --smooth (seconds): temporal EMA (0 disables). Lower = more responsive; higher = steadier.
  • --holistic-complexity, --det-thresh: detector accuracy/sensitivity tradeoffs.

Parameters & Practical Tips

  • Threshold vs Smooth

    • Lower --threshold (e.g., 0.3–0.4) → more sensitive, but may produce more false positives.
    • --smooth 0.1–0.3 s → responsive; 0.5–0.8 s → steadier but laggier.
  • Frames (--frames in prep)

    • 16–24 frames: snappier first detection.
    • 32 frames: balanced.
    • 64 frames: more context, slower to first prediction.
  • Data balance & variety

    • Similar clip counts per class help training.
    • Vary lighting, small head angles, distance, and speed of motion.
    • For location-based signs (e.g., Mother vs Father), the face-relative extras help the model disambiguate.

File-by-File Summary

| File | Purpose | Inputs → Outputs |
| --- | --- | --- |
| make_seq_dirs.sh | Creates train/ and val/ subfolders for each label you pass in. | Labels → `sequences/train/<label>/`, `sequences/val/<label>/` |
| capture_sequence.py | Captures webcam frames, extracts Holistic landmarks, normalizes, builds per-frame 1,670-D features, and saves each clip. | Webcam → `sequences/<split>/<label>/clip_XXX.npz` (X: (T, 1670)) |
| prep_sequence_resampled.py | Resamples variable-length clips to fixed length; aggregates into train/val arrays and writes metadata. | `sequences/` → `landmarks_seq32/{train_X,train_y,val_X,val_y}.npy`, class_names.json, meta.json |
| train_seq.py | Trains a BiGRU multiclass classifier with normalization and simple augmentation. | `landmarks_seq32` → `asl_seq32_gru_*.pt` (includes model, classes, frames, mean/std) |
| eval_val.py | Evaluates the saved model on the validation split; prints metrics. | Model + `landmarks_seq32` → console metrics |
| infer_seq_webcam.py | Streams webcam landmarks, builds rolling sequences, classifies in real time; optional action hook (e.g., open a URL on a spelled sequence). | Webcam + `asl_seq32_gru_*.pt` → on-screen predictions/actions |
| what_to_do.txt | Step-by-step command cheat-sheet reflecting the current multi-word workflow. | (reference only) |

Troubleshooting

  • “No classes found in sequences/train/”: ensure the class folders exist (sequences/train/<Label>/ and sequences/val/<Label>/) and that they contain clip_*.npz files.

  • No live prediction initially: expected; the model needs the first T frames to fill the buffer.

  • Lag or low FPS: try --holistic-complexity 0, reduce the camera resolution, or use a smaller --frames and retrain.

  • Overconfident but wrong: raise --threshold, increase --smooth, or record more varied data per class (especially negatives or near-misses).


Add/Remove Classes

  • To add a class (e.g., Go): create dirs, capture clips, rerun prep, retrain, re-eval.
  • To remove/replace a class: delete its folders or rename, then rerun prep and retrain.

Dependencies

  • Python 3.x, numpy, opencv-python, mediapipe, torch, scikit-learn (for evaluation).
  • macOS with Apple Silicon can use MPS acceleration automatically (already handled in the code).
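Device selection on Apple Silicon typically follows this pattern:

```python
import torch

# Prefer Apple-Silicon MPS, then CUDA, then CPU.
device = torch.device(
    "mps" if torch.backends.mps.is_available()
    else "cuda" if torch.cuda.is_available()
    else "cpu"
)
```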

Notes

  • Labels are arbitrary strings (not restricted to A–Z).
  • Features are zero-filled for missing parts in a frame (e.g., if a hand isn't detected) to keep dimensions stable; see the sketch after this list.
  • The face is used as a global anchor for geometry; keeping the face visible improves robustness.
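A minimal sketch of that zero-fill convention for a missing hand (constant and function names are illustrative):

```python
import numpy as np

HAND_DIM = 63  # 21 landmarks x (x, y, z)

def hand_block(landmarks):
    """Return a flat (63,) hand block, or zeros when the hand was not detected."""
    if landmarks is None:
        return np.zeros(HAND_DIM, dtype=np.float32)   # zero-fill keeps the frame width fixed
    return np.asarray(landmarks, dtype=np.float32).reshape(-1)
```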