Handshapes Multiclass (Holistic) — README

A small end-to-end pipeline that records MediaPipe Holistic landmarks, builds fixed-length sequences, trains a bidirectional GRU classifier, evaluates it, and runs a live webcam demo that recognizes classes such as words (“Mother”, “Father”, “Go”) or letters.


Quick Start

```bash
# 0) Create class folders
./make_seq_dirs.sh Mother Father Go

# 1) Capture clips (per class; adjust counts as you like)
python capture_sequence.py --label Mother --split train --seconds 0.8 --count 100
python capture_sequence.py --label Mother --split val   --seconds 0.8 --count 20
python capture_sequence.py --label Father --split train --seconds 0.8 --count 100
python capture_sequence.py --label Father --split val   --seconds 0.8 --count 20
python capture_sequence.py --label Go     --split train --seconds 0.8 --count 100
python capture_sequence.py --label Go     --split val   --seconds 0.8 --count 20

# 2) Build fixed-length dataset (32 frames/clip)
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32

# 3) Train, evaluate, and run live inference
python train_seq.py --landmarks landmarks_seq32 --out asl_seq32_gru_mother_father_go.pt
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_mother_father_go.pt
python infer_seq_webcam.py --model asl_seq32_gru_mother_father_go.pt --threshold 0.35 --smooth 0.1
```

Folder layout after capture:

```text
sequences/
  train/
    Mother/ clip_001.npz ...
    Father/ clip_001.npz ...
    Go/     clip_001.npz ...
  val/
    Mother/ ...
    Father/ ...
    Go/     ...
```

Feature Representation (per frame)

From MediaPipe Holistic:

  • Right hand 21×(x,y,z) → 63
  • Left hand 21×(x,y,z) → 63
  • Face 468×(x,y,z) → 1,404
  • Pose 33×(x,y,z,visibility) → 132
  • Face-relative hand extras: wrist (x,y) + index tip (x,y) for each hand, expressed in the face-normalized frame → 8

Total: 63 + 63 + 1,404 + 132 + 8 = 1,670 dims per frame (a sketch of this layout follows below).
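A minimal bookkeeping sketch of that layout; the names and the helper are illustrative, not the repo's exact code:

```python
import numpy as np

RIGHT_HAND = 21 * 3    # 63
LEFT_HAND  = 21 * 3    # 63
FACE       = 468 * 3   # 1404
POSE       = 33 * 4    # 132 (x, y, z, visibility)
EXTRAS     = 8         # face-relative wrist + index-tip (x, y) per hand

FRAME_DIM = RIGHT_HAND + LEFT_HAND + FACE + POSE + EXTRAS
assert FRAME_DIM == 1670

def frame_vector(right, left, face, pose, extras):
    """Concatenate the per-part blocks into one (1670,) frame vector.
    Missing parts should be passed as zero arrays of the matching size."""
    return np.concatenate([right, left, face, pose, extras]).astype(np.float32)
```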

Normalization (high level)

  • Hands: translate to wrist, mirror left → right, rotate so middle-finger MCP points +Y, scale by max pairwise distance.
  • Face: center at eye midpoint, scale by inter-ocular distance, rotate to align eyeline horizontally.
  • Pose: center at shoulder midpoint, scale by shoulder width, rotate shoulders horizontal.
  • Extras: per-hand wrist/tip projected into the face frame so the model retains where the hand is relative to the face (critical for signs like Mother vs Father).
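As a rough illustration of the hand branch above, assuming MediaPipe's landmark ordering (wrist = index 0, middle-finger MCP = index 9); the repo's actual implementation may differ in details:

```python
import numpy as np

def normalize_hand(pts, mirror=False):
    """Sketch: translate to wrist, optionally mirror, rotate middle-MCP to +Y, scale."""
    pts = np.asarray(pts, dtype=np.float32).copy()   # (21, 3) landmarks
    if mirror:
        pts *= np.array([-1.0, 1.0, 1.0], dtype=np.float32)  # map left hand into the right-hand frame
    pts -= pts[0]                                     # wrist at the origin
    v = pts[9, :2]                                    # wrist -> middle-finger MCP, image plane
    angle = np.arctan2(v[0], v[1])                    # rotate so this direction points along +Y
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]], dtype=np.float32)
    pts[:, :2] = pts[:, :2] @ R.T
    scale = np.max(np.linalg.norm(pts[None, :, :] - pts[:, None, :], axis=-1))
    return pts / (scale + 1e-6)                       # scale by max pairwise distance
```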

How the Pipeline Works

1) make_seq_dirs.sh

Creates the directory scaffolding under sequences/ for any labels you pass (letters or words).

  • Usage: ./make_seq_dirs.sh Mother Father Go
  • Why: Keeps data organized as train/ and val/ per class.

2) capture_sequence.py

Records short clips from your webcam and saves per-frame feature vectors into compressed .npz files.

Key behaviors

  • Uses MediaPipe Holistic to extract right/left hands, full face mesh, and pose.
  • Computes normalized features + face-relative extras.
  • Writes each clip as sequences/<split>/<label>/clip_XXX.npz with an array X of shape (T, 1670).

Common flags

  • --label (string): class name (e.g., Mother, Go).
  • --split: train or val.
  • --seconds (float): clip length; 0.8s pairs well with 32 frames.
  • --count (int): how many clips to record in one run.
  • --camera, --width, --height: webcam settings.
  • --holistic-complexity (0|1|2): higher is more accurate but slower.
  • UI niceties: 3-second countdown; on-screen progress bar; optional fingertip markers.
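To sanity-check a recorded clip, the .npz files can be inspected directly (the path below is just an example):

```python
import numpy as np

clip = np.load("sequences/train/Mother/clip_001.npz")
X = clip["X"]
print(X.shape)   # (T, 1670); T depends on --seconds and your camera's FPS
```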

3) prep_sequence_resampled.py

Aggregates all clip_*.npz files into a fixed-length dataset.

What it does

  • Loads each clip's X (T, 1670) and linearly resamples it to exactly N frames (default 32), resulting in (N, 1670).

  • Stacks clips into:

    • train_X.npy (Nclips, Nframes, F)
    • train_y.npy (Nclips,)
    • val_X.npy, val_y.npy
    • class_names.json (sorted list of class names)
    • meta.json with { "frames": N, "input_dim": F }
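The resampling step can be sketched as follows; this illustrates linear time-resampling and is not necessarily the script's exact code:

```python
import numpy as np

def resample_clip(X, n_frames=32):
    """Linearly resample a (T, F) clip to (n_frames, F) along the time axis."""
    T, F = X.shape
    if T == 1:
        return np.repeat(X, n_frames, axis=0)
    src = np.linspace(0.0, 1.0, T)
    dst = np.linspace(0.0, 1.0, n_frames)
    out = np.empty((n_frames, F), dtype=X.dtype)
    for f in range(F):                      # interpolate each feature channel independently
        out[:, f] = np.interp(dst, src, X[:, f])
    return out
```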

Flags

  • --in: root of the captured sequences/ tree
  • --out: output dataset folder (e.g., landmarks_seq32)
  • --frames: number of frames per clip after resampling (e.g., 16, 32, 64)

Tip: Reducing --frames (e.g., 16) lowers first-prediction latency in the live demo, at the cost of some stability/accuracy.


4) train_seq.py

Trains a bidirectional GRU classifier on the resampled sequences.

What it does

  • Loads train_*.npy / val_*.npy, class_names.json, and meta.json.

  • Computes feature-wise mean/std on the train set; normalizes train/val.

  • Model: GRU(input_dim → 128 hidden, bidirectional) → ReLU → Dropout → Linear(num_classes).

  • Tracks best val accuracy; saves a checkpoint containing:

    • model weights
    • classes, frames
    • X_mean, X_std (for inference normalization)
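A minimal model definition consistent with the description above might look like this (the class name, dropout rate, and pooling over time are assumptions, not the repo's exact code):

```python
import torch
import torch.nn as nn

class BiGRUClassifier(nn.Module):
    """Illustrative: GRU(input_dim -> 128 hidden, bidirectional) -> ReLU -> Dropout -> Linear."""
    def __init__(self, input_dim, num_classes, hidden=128, dropout=0.3):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(2 * hidden, num_classes),
        )

    def forward(self, x):             # x: (batch, frames, input_dim)
        out, _ = self.gru(x)          # (batch, frames, 2 * hidden)
        return self.head(out[:, -1])  # last time step; mean-pooling over time would also be reasonable

# Checkpoint contents as described above (field names are illustrative):
# torch.save({"model": model.state_dict(), "classes": classes, "frames": frames,
#             "X_mean": X_mean, "X_std": X_std}, out_path)
```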

Flags

  • --epochs, --batch, --lr: typical training hyperparams.
  • --out: model file (e.g., asl_seq32_gru_mother_father_go.pt)

5) eval_val.py

Evaluates your saved model on the validation set.

What it does

  • Loads val_X.npy, val_y.npy, class_names.json, meta.json, and the *.pt checkpoint.
  • Normalizes val_X using the training mean/std stored in the checkpoint.
  • Prints confusion matrix and a full classification report (precision/recall/F1/accuracy).
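A minimal version of the metrics step, using placeholder labels just to show the scikit-learn calls involved (eval_val.py uses the real validation arrays and the class order from class_names.json):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

class_names = ["Father", "Go", "Mother"]      # placeholder class order
y_true = np.array([0, 1, 2, 1, 0])            # placeholder ground-truth indices
y_pred = np.array([0, 1, 2, 0, 0])            # placeholder model predictions

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=class_names))
```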

Usage

```bash
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_mother_father_go.pt
```

6) infer_seq_webcam.py

Live webcam demo that streams landmarks, builds a rolling buffer, and classifies in real time.

Key behaviors

  • Maintains a rolling window of T frames (from the model's frames value; default 32).
  • No prediction until the buffer is full → expect a short warm-up.
  • Applies the same normalization using the model's stored X_mean/X_std.
  • Optional EMA smoothing over probabilities for stability.
  • Example action hook included: spell “W → E → B” to open a URL.
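The buffering and smoothing logic can be sketched roughly as below; how --smooth (in seconds) maps to the EMA weight is an assumption, and the script may use a different formula:

```python
from collections import deque
import numpy as np

FRAMES = 32                     # taken from the checkpoint's "frames" value
buffer = deque(maxlen=FRAMES)   # rolling window of per-frame feature vectors
ema_probs = None

def step(frame_features, probs_fn, dt, smooth=0.1):
    """Push one frame; return smoothed class probabilities once the buffer is full.

    probs_fn: callable mapping a (FRAMES, 1670) window to class probabilities.
    dt: seconds since the last frame; smooth: EMA time constant in seconds (0 disables).
    """
    global ema_probs
    buffer.append(frame_features)
    if len(buffer) < FRAMES:
        return None                              # warm-up: wait until the window is full
    probs = probs_fn(np.stack(buffer))           # model forward pass on the current window
    if smooth <= 0 or ema_probs is None:
        ema_probs = probs
    else:
        alpha = 1.0 - np.exp(-dt / smooth)       # time-based EMA weight (an assumption)
        ema_probs = alpha * probs + (1.0 - alpha) * ema_probs
    return ema_probs
```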

Common flags

  • --threshold (e.g., 0.35): minimum top-class probability to “emit” a label.
  • --smooth (seconds): temporal EMA (0 disables). Lower = more responsive; higher = steadier.
  • --holistic-complexity, --det-thresh: detector accuracy/sensitivity tradeoffs.

Parameters & Practical Tips

  • Threshold vs Smooth

    • Lower --threshold (e.g., 0.3–0.4) → more sensitive, but may produce more false positives.
    • --smooth 0.1–0.3 s → responsive; 0.5–0.8 s → steadier but laggier.
  • Frames (--frames in prep)

    • 16–24 frames: snappier first detection.
    • 32 frames: balanced.
    • 64 frames: more context, slower to first prediction.
  • Data balance & variety

    • Similar clip counts per class help training.
    • Vary lighting, small head angles, distance, and speed of motion.
    • For location-based signs (e.g., Mother vs Father), the face-relative extras help the model disambiguate.

File-by-File Summary

| File | Purpose | Inputs → Outputs |
| --- | --- | --- |
| make_seq_dirs.sh | Creates train/ and val/ subfolders for each label you pass in. | Labels → `sequences/train/<label>/`, `sequences/val/<label>/` |
| capture_sequence.py | Captures webcam frames, extracts Holistic landmarks, normalizes, builds per-frame 1,670-D features, and saves each clip. | Webcam → `sequences/<split>/<label>/clip_XXX.npz` (X: (T, 1670)) |
| prep_sequence_resampled.py | Resamples variable-length clips to fixed length; aggregates into train/val arrays and writes metadata. | `sequences/` → `landmarks_seq32/{train_X,train_y,val_X,val_y}.npy`, class_names.json, meta.json |
| train_seq.py | Trains a BiGRU multiclass classifier with normalization and simple augmentation. | `landmarks_seq32` → `asl_seq32_gru_*.pt` (includes model, classes, frames, mean/std) |
| eval_val.py | Evaluates the saved model on the validation split; prints metrics. | Model + `landmarks_seq32` → console metrics |
| infer_seq_webcam.py | Streams webcam landmarks, builds rolling sequences, classifies in real time; optional action hook (e.g., open a URL on a spelled sequence). | Webcam + `asl_seq32_gru_*.pt` → on-screen predictions/actions |
| what_to_do.txt | Step-by-step command cheat-sheet reflecting the current multi-word workflow. | (reference only) |

Troubleshooting

  • “No classes found in sequences/train/”: ensure the class folders exist (sequences/train/<Label>/ and sequences/val/<Label>/) and that they contain clip_*.npz files.

  • No live prediction initially: expected; the model needs the first T frames to fill the buffer.

  • Lag or low FPS: try --holistic-complexity 0, reduce the camera resolution, or use a smaller --frames and retrain.

  • Overconfident but wrong: raise --threshold, increase --smooth, or record more varied data per class (especially negatives or near-misses).


Add/Remove Classes

  • To add a class (e.g., Go): create dirs, capture clips, rerun prep, retrain, re-eval.
  • To remove/replace a class: delete its folders or rename, then rerun prep and retrain.

Dependencies

  • Python 3.x, numpy, opencv-python, mediapipe, torch, scikit-learn (for evaluation).
  • macOS with Apple Silicon can use MPS acceleration automatically (already handled in the code).
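Device selection on Apple Silicon typically follows this pattern:

```python
import torch

# Prefer Apple-Silicon MPS, then CUDA, then CPU.
device = torch.device(
    "mps" if torch.backends.mps.is_available()
    else "cuda" if torch.cuda.is_available()
    else "cpu"
)
```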

Notes

  • Labels are arbitrary strings (not restricted to A–Z).
  • Features are zero-filled for missing parts in a frame (e.g., if a hand isn't detected) to keep dimensions stable; see the sketch after this list.
  • The face is used as a global anchor for geometry; keeping the face visible improves robustness.
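A minimal sketch of that zero-fill convention for a missing hand (constant and function names are illustrative):

```python
import numpy as np

HAND_DIM = 63  # 21 landmarks x (x, y, z)

def hand_block(landmarks):
    """Return a flat (63,) hand block, or zeros when the hand was not detected."""
    if landmarks is None:
        return np.zeros(HAND_DIM, dtype=np.float32)   # zero-fill keeps the frame width fixed
    return np.asarray(landmarks, dtype=np.float32).reshape(-1)
```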