# Handshapes Multiclass (Holistic) — README

A small end-to-end pipeline that records MediaPipe **Holistic** landmarks, builds fixed-length sequences, trains a **bidirectional GRU** classifier, evaluates it, and runs a **live webcam demo** that recognizes classes such as words (“Mother”, “Father”, “Go”) or letters.

---

## Quick Start

```bash
# 0) Create class folders
./make_seq_dirs.sh Mother Father Go

# 1) Capture clips (per class; adjust counts as you like)
python capture_sequence.py --label Mother --split train --seconds 0.8 --count 100
python capture_sequence.py --label Mother --split val --seconds 0.8 --count 20
python capture_sequence.py --label Father --split train --seconds 0.8 --count 100
python capture_sequence.py --label Father --split val --seconds 0.8 --count 20
python capture_sequence.py --label Go --split train --seconds 0.8 --count 100
python capture_sequence.py --label Go --split val --seconds 0.8 --count 20

# 2) Build fixed-length dataset (32 frames/clip)
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32

# 3) Train, evaluate, and run live inference
python train_seq.py --landmarks landmarks_seq32 --out asl_seq32_gru_mother_father_go.pt
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_mother_father_go.pt
python infer_seq_webcam.py --model asl_seq32_gru_mother_father_go.pt --threshold 0.35 --smooth 0.1
```

Folder layout after capture:

```
sequences/
  train/
    Mother/  clip_001.npz ...
    Father/  clip_001.npz ...
    Go/      clip_001.npz ...
  val/
    Mother/  ...
    Father/  ...
    Go/      ...
```

---

## Feature Representation (per frame)

From MediaPipe **Holistic**:

* **Right hand** 21×(x,y,z) → 63
* **Left hand** 21×(x,y,z) → 63
* **Face** 468×(x,y,z) → 1,404
* **Pose** 33×(x,y,z,visibility) → 132
* **Face-relative hand extras**: wrist (x,y) + index tip (x,y) for each hand, expressed in the face-normalized frame → 8

**Total** = **1,670 dims** per frame.

### Normalization (high level)

* Hands: translate to the wrist, mirror left → right, rotate so the middle-finger MCP points +Y, scale by the max pairwise distance (a sketch appears at the end of this README).
* Face: center at the eye midpoint, scale by inter-ocular distance, rotate to align the eyeline horizontally.
* Pose: center at the shoulder midpoint, scale by shoulder width, rotate the shoulders horizontal.
* Extras: per-hand wrist/tip projected into the face frame so the model retains *where* the hand is relative to the face (critical for signs like **Mother** vs **Father**).

---

## How the Pipeline Works

### 1) `make_seq_dirs.sh`

Creates the directory scaffolding under `sequences/` for any labels you pass (letters or words).

* **Usage:** `./make_seq_dirs.sh Mother Father Go`
* **Why:** Keeps data organized as `train/` and `val/` per class.

---

### 2) `capture_sequence.py`

Records short clips from your webcam and saves per-frame **feature vectors** into compressed `.npz` files.

**Key behaviors**

* Uses **MediaPipe Holistic** to extract right/left hands, full face mesh, and pose.
* Computes normalized features + face-relative extras.
* Writes each clip as `sequences/<split>/<label>/clip_XXX.npz`.
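---

For orientation, here is a minimal sketch of what a capture loop like `capture_sequence.py` might look like, using the legacy `mp.solutions.holistic` API. It is simplified: it flattens raw coordinates only (the first 1,662 dims) and omits the normalization and face-relative extras described above; `frame_features` and `capture_clip` are illustrative names, not the repo's actual code.

```python
import time

import cv2
import mediapipe as mp
import numpy as np


def frame_features(results):
    """Flatten one Holistic result into a single vector, using zeros for any
    part that was not detected."""
    def block(lms, count, per_point):
        if lms is None:
            return np.zeros(count * per_point, dtype=np.float32)
        vals = []
        for lm in lms.landmark:
            vals.extend([lm.x, lm.y, lm.z])
            if per_point == 4:
                vals.append(lm.visibility)  # pose landmarks carry visibility
        return np.asarray(vals, dtype=np.float32)

    return np.concatenate([
        block(results.right_hand_landmarks, 21, 3),  # 63
        block(results.left_hand_landmarks, 21, 3),   # 63
        block(results.face_landmarks, 468, 3),       # 1,404
        block(results.pose_landmarks, 33, 4),        # 132
    ])


def capture_clip(out_path, seconds=0.8, camera_index=0):
    """Grab webcam frames for `seconds`, featurize each one, save the clip."""
    cap = cv2.VideoCapture(camera_index)
    frames = []
    with mp.solutions.holistic.Holistic() as holistic:
        deadline = time.time() + seconds
        while time.time() < deadline:
            ok, frame = cap.read()
            if not ok:
                break
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # MediaPipe wants RGB
            frames.append(frame_features(holistic.process(rgb)))
    cap.release()
    if frames:
        np.savez_compressed(out_path, frames=np.stack(frames))
```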
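The hand-normalization steps listed earlier (translate to wrist, mirror, rotate, scale) can be sketched as below. This assumes the standard MediaPipe hand indexing (0 = wrist, 9 = middle-finger MCP); `normalize_hand` is a hypothetical helper, and the repo's actual function may differ in detail.

```python
import numpy as np


def normalize_hand(landmarks_21x3, is_left=False):
    """Canonicalize one hand's 21 (x, y, z) landmarks into a 63-dim block."""
    pts = np.asarray(landmarks_21x3, dtype=np.float32).copy()

    # 1) Translate so the wrist (landmark 0) sits at the origin.
    pts -= pts[0]

    # 2) Mirror left hands across x so both hands share one canonical frame.
    if is_left:
        pts[:, 0] *= -1.0

    # 3) Rotate in the image plane so the middle-finger MCP (landmark 9)
    #    points along +Y.
    dx, dy = pts[9, 0], pts[9, 1]
    angle = np.arctan2(dx, dy)                     # angle measured from +Y
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]], dtype=np.float32)
    pts[:, :2] = pts[:, :2] @ rot.T

    # 4) Scale by the maximum pairwise distance so hand size is ~1.
    diffs = pts[:, None, :] - pts[None, :, :]
    max_dist = np.sqrt((diffs ** 2).sum(-1)).max()
    pts /= (max_dist + 1e-8)

    return pts.reshape(-1)
```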
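Finally, step 2 of the Quick Start (`prep_sequence_resampled.py`) turns variable-length clips into fixed 32-frame sequences. One plausible approach, consistent with the script's name but not confirmed against its source, is linear interpolation along the time axis:

```python
import numpy as np


def resample_clip(frames, target_len=32):
    """Resample a (T, D) clip to (target_len, D) by interpolating each
    feature dimension over a normalized time axis."""
    frames = np.asarray(frames, dtype=np.float32)
    t_src = np.linspace(0.0, 1.0, num=len(frames))
    t_dst = np.linspace(0.0, 1.0, num=target_len)
    out = np.empty((target_len, frames.shape[1]), dtype=np.float32)
    for d in range(frames.shape[1]):
        out[:, d] = np.interp(t_dst, t_src, frames[:, d])
    return out
```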