Initial commit: handshapes multiclass project

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 22:27:20 -05:00
commit 816e34cb17
22 changed files with 2820 additions and 0 deletions


@@ -0,0 +1,216 @@
# Handshape Sequence Classifier (MediaPipe + PyTorch, macOS MPS-ready)
Live ASL handshape letter demo powered by MediaPipe Hands landmarks and a bidirectional GRU sequence model.
Record short clips per letter, resample to a fixed length, train, evaluate, and run a real-time webcam demo that can react to detected letter sequences (e.g., **W → E → B** opens a URL).
## Features
* **Data capture UI:** 3-second centered countdown + top progress bar; fingertip dot feedback.
* **Robust normalization:** wrist-anchored, left/right mirroring, rotation to +Y, scale by max pairwise distance.
* **Fixed-length preprocessing:** linear resampling to *N* frames (default **32**).
* **Sequence model:** BiGRU (128 hidden × 2) → MLP head; light augmentation during training.
* **Live inference:** EMA smoothing + thresholding; emits letters only on change; detects special sequences (**WEB**) and opens a browser.
---
## Quick Start
```bash
# 0) (optional) Create & activate a virtual env
python -m venv .venv && source .venv/bin/activate
# 1) Install deps
pip install numpy opencv-python mediapipe torch scikit-learn
# 2) Make directories for the letters you'll collect
./make_seq_dirs.sh A B J Z
# 3) Capture short clips per letter (train/val)
python capture_sequence.py --label A --split train
python capture_sequence.py --label A --split val
# ...repeat for B, J, Z
# 4) Preprocess → fixed-length dataset (32 frames)
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32
# 5) Train the BiGRU
python train_seq.py --landmarks landmarks_seq32 --epochs 40 --batch 64 --lr 1e-3 \
--out asl_seq32_gru_ABJZ.pt
# 6) Evaluate on the validation set (confusion matrix + report)
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_ABJZ.pt
# 7) Live webcam demo (press 'q' to quit)
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.8 --smooth 0.7
```
> **WEB trigger:** In the live demo, if the emitted letters form **W → E → B**, the app prints a message and opens `--url` (default: Google).
> Example: `--url https://www.gallaudet.edu`
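The sequence check itself is just a three-letter rolling history compared against `["W", "E", "B"]`. A minimal sketch of the logic used in `infer_seq_webcam.py` (the helper name `on_emit` is illustrative, not part of the script):
```python
import webbrowser

TARGET = ["W", "E", "B"]
history = []  # holds only letters emitted on change (no repeats, no "no hand")

def on_emit(letter, url="https://www.google.com"):
    history.append(letter)
    if len(history) > len(TARGET):
        history.pop(0)            # keep only the last three emissions
    if history == TARGET:
        webbrowser.open(url)      # fire once per occurrence...
        history.clear()           # ...then reset the history
```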
---
## Repository Layout
```
handshapes-multiclass/
├─ make_seq_dirs.sh # creates sequences/train|val/<LETTER>/
├─ capture_sequence.py # webcam capture → clip_XXX.npz (X: (T,63), tip: (T,2))
├─ prep_sequence_resampled.py # resample clips to fixed N frames → landmarks_seq32/
├─ train_seq.py # train BiGRU; saves best checkpoint (.pt + stats)
├─ eval_val.py # evaluate on val set; prints metrics
├─ infer_seq_webcam.py # live demo; emits letters; detects "WEB" → opens URL
├─ what_to_do.txt # quick, step-by-step playbook
└─ sequences/ # created by you (after running make_seq_dirs.sh)
├─ train/<LETTER>/clip_XXX.npz
└─ val/<LETTER>/clip_XXX.npz
```
**Clip file format (`clip_XXX.npz`)**
* `X`: `(T, 63)` — per-frame normalized landmarks (21 points × (x, y, z))
* `tip`: `(T, 2)` — normalized index fingertip positions (for sanity checks)
**Prepared dataset (`landmarks_seq32/`)**
* `train_X.npy`, `train_y.npy`, `val_X.npy`, `val_y.npy`
* `class_names.json` (e.g., `["A","B","J","Z"]`)
* `meta.json` (e.g., `{"frames":32,"input_dim":63}`)
**Checkpoint (`*.pt`)**
* `model` (state_dict), `classes`, `frames`, `X_mean`, `X_std`
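A quick way to sanity-check these artifacts from a Python shell (paths are examples; point them at whatever you actually recorded and trained):
```python
import json
import numpy as np
import torch

clip = np.load("sequences/train/A/clip_001.npz")          # any recorded clip
print(clip["X"].shape, clip["tip"].shape)                  # (T, 63), (T, 2); T varies per clip

X = np.load("landmarks_seq32/train_X.npy")                 # prepared dataset
y = np.load("landmarks_seq32/train_y.npy")
classes = json.load(open("landmarks_seq32/class_names.json"))
meta = json.load(open("landmarks_seq32/meta.json"))
print(X.shape, y.shape, classes, meta)                     # e.g. (N, 32, 63), (N,), ["A","B","J","Z"], {"frames": 32, "input_dim": 63}

ckpt = torch.load("asl_seq32_gru_ABJZ.pt", map_location="cpu", weights_only=False)
print(sorted(ckpt.keys()))                                 # ['X_mean', 'X_std', 'classes', 'frames', 'model']
```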
---
## Normalization (consistent across capture & inference)
1. Translate so **wrist** (landmark 0) is at the origin.
2. If detected **left** hand, mirror `x *= -1`.
3. Rotate so the **middle-finger MCP** (landmark 9) points along **+Y**.
4. Scale all coords by the **max pairwise distance** among 2D landmarks.
5. Flatten to **63 features** per frame.
This ensures the handshape itself, not camera pose, drives classification.
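A quick invariant check (a sketch; run from the repo root so `capture_sequence.py` is importable): after normalization the wrist sits exactly at the origin and the middle-finger MCP lies on the y-axis, no matter how the hand was posed.
```python
import numpy as np
from capture_sequence import normalize_frame  # same function used during capture

pts = np.random.rand(21, 3).astype(np.float32)   # stand-in for one frame of MediaPipe landmarks
out = normalize_frame(pts, handed="Right")       # (21, 3) normalized points

assert np.allclose(out[0, :2], 0.0)              # wrist (landmark 0) at the origin
assert abs(out[9, 0]) < 1e-4                     # middle MCP (landmark 9) has x ≈ 0, i.e. lies along the y-axis
```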
---
## Training Details
* **Model:** BiGRU (input=63, hidden=128, bidirectional) → `[Linear(256→128), ReLU, Dropout(0.2), Linear(128→num_classes)]`
* **Optimizer:** AdamW (`lr=1e-3`, `weight_decay=1e-4`)
* **Scheduler:** CosineAnnealingLR (`T_max = epochs`)
* **Augmentation:** small 2D rotate (±7°), scale (±10%), Gaussian noise (σ=0.01)
* **Normalization:** global `X_mean`/`X_std` computed over **train** (time+batch), applied to both train & val and saved into the checkpoint.
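`X_mean`/`X_std` are plain per-feature statistics over every frame of every training clip; a condensed sketch of how `train_seq.py` computes and applies them (the same arrays are saved into the checkpoint so inference can reuse them):
```python
import numpy as np

trX = np.load("landmarks_seq32/train_X.npy")                      # (N, T, 63)
flat = trX.reshape(-1, trX.shape[-1])                             # collapse clip + time axes
X_mean = flat.mean(axis=0, keepdims=True).astype(np.float32)      # (1, 63)
X_std = flat.std(axis=0, keepdims=True).astype(np.float32) + 1e-6
trXn = (trX - X_mean) / X_std                                     # broadcasts over (N, T, 63)
```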
---
## Live Inference Behavior
* Maintains a rolling buffer of **T = frames** (from the checkpoint).
* Applies the saved `X_mean`/`X_std`.
* **EMA smoothing** over softmax probs with time constant `--smooth` (seconds).
* Emits a letter only if:
* top prob ≥ `--threshold` (e.g., 0.8), **and**
* the letter **changed** from the previous emission (prevents repeats).
* Tracks a short history of emitted letters to detect **W → E → B**; on match:
* prints “Detected WEB! …”
* calls `webbrowser.open(--url)`
**Common flags**
```bash
# Camera & size
--camera 0 --width 640 --height 480
# Confidence vs. latency tradeoffs
--threshold 0.85 # higher → fewer false positives
--smooth 1.0 # higher → steadier output but more lag
# Action on sequence
--url https://example.com
```
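`--smooth` acts as a time constant rather than a fixed blend weight, so the smoothing behaves the same regardless of frame rate. A sketch of the update used in `infer_seq_webcam.py` (the helper name `ema_update` is illustrative):
```python
import math

def ema_update(ema_probs, probs, dt, smooth):
    """dt: seconds since the last frame; smooth: the --smooth time constant (0 disables)."""
    if smooth <= 0:
        return probs
    alpha = 1.0 - math.exp(-dt / smooth)   # ≈ dt / smooth when dt is small
    if ema_probs is None:
        return probs
    return (1.0 - alpha) * ema_probs + alpha * probs
```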
---
## Tips for High Accuracy
* Record **balanced** train/val counts per class (e.g., 100 train / 20 val).
* Keep the hand **centered** and well lit, and keep only **one hand** in frame (the model expects a single hand).
* Maintain consistent **distance** and **orientation** during capture.
* If you add new letters later, just record them, re-run preprocessing, and retrain — classes are **auto-discovered** from `sequences/train/*`.
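Class discovery is just a directory listing; a condensed equivalent of `load_classes()` in `prep_sequence_resampled.py`:
```python
from pathlib import Path

classes = sorted(
    p.name for p in Path("sequences/train").iterdir()
    if p.is_dir() and len(p.name) == 1 and "A" <= p.name <= "Z"
)
print(classes)   # e.g. ['A', 'B', 'J', 'Z']
```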
---
## macOS (M-series) Notes
* PyTorch will automatically use **Metal (MPS)** if available (`torch.backends.mps.is_available()`); otherwise CPU.
* If the webcam feed runs at low FPS, try reducing `--width`/`--height` or raising `--threshold` / `--smooth`.
---
## Troubleshooting
* **“Could not open camera”** → try `--camera 1` (or check macOS camera permission).
* **No detections / “No hand” on screen** → improve lighting, ensure a single clear hand, check MediaPipe install.
* **Model emits wrong letters** → increase `--threshold`, collect more data, or raise `--smooth`.
* **Sequence length (T) mismatch during inference** → ensure the `--frames` used at preprocessing matches the checkpoint's `frames` (saved and applied automatically).
---
## Commands Reference
### Create class folders
```bash
./make_seq_dirs.sh A B J Z
```
### Capture clips
```bash
python capture_sequence.py --label A --split train --seconds 0.8 --count 100
python capture_sequence.py --label A --split val --seconds 0.8 --count 20
```
### Prepare dataset (resample to 32 frames)
```bash
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32
```
### Train
```bash
python train_seq.py --landmarks landmarks_seq32 --epochs 40 --batch 64 --lr 1e-3 \
--out asl_seq32_gru_ABJZ.pt
```
### Evaluate
```bash
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_ABJZ.pt
```
### Live demo (open URL on “WEB”)
```bash
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.8 --smooth 0.7 \
--url https://www.gallaudet.edu
```
---
## License
MIT
---
## Acknowledgments
* **MediaPipe Hands** for robust, fast hand landmark detection.
* **PyTorch** for flexible sequence modeling on CPU/MPS.
---


@@ -0,0 +1,176 @@
#!/usr/bin/env python3
# capture_sequence.py
# Automatically record N short sequences for each label (default: 100 train / 20 val)
# Centered 3-second countdown before recording.
# Per-clip depleting progress bar (full → empty) across the top during capture.
import argparse, os, time, math, re
from pathlib import Path
import numpy as np, cv2, mediapipe as mp
def normalize_frame(pts, handed=None):
pts = pts.astype(np.float32).copy()
pts[:, :2] -= pts[0, :2]
if handed and handed.lower().startswith("left"):
pts[:, 0] *= -1.0
v = pts[9, :2]
ang = math.atan2(v[1], v[0])
c, s = math.cos(math.pi/2 - ang), math.sin(math.pi/2 - ang)
R = np.array([[c, -s], [s, c]], np.float32)
pts[:, :2] = pts[:, :2] @ R.T
xy = pts[:, :2]
d = np.max(np.linalg.norm(xy[None,:,:] - xy[:,None,:], axis=-1))
if d < 1e-6: d = 1.0
pts[:, :2] /= d; pts[:, 2] /= d
return pts
def next_idx(folder: Path, prefix="clip_"):
pat = re.compile(rf"^{re.escape(prefix)}(\d+)\.npz$")
mx = 0
if folder.exists():
for n in os.listdir(folder):
m = pat.match(n)
if m: mx = max(mx, int(m.group(1)))
return mx + 1
def countdown(cap, seconds=3):
"""Display a centered countdown before starting capture."""
for i in range(seconds, 0, -1):
start = time.time()
while time.time() - start < 1.0:
ok, frame = cap.read()
if not ok:
continue
h, w = frame.shape[:2]
# Main big number in center
text = str(i)
font_scale = 5
thickness = 10
(tw, th), _ = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, font_scale, thickness)
cv2.putText(frame, text,
((w - tw)//2, (h + th)//2),
cv2.FONT_HERSHEY_SIMPLEX, font_scale, (0,0,255), thickness, cv2.LINE_AA)
# Smaller message above
msg = "Starting in..."
font_scale_msg = 1.2
thickness_msg = 3
(mw, mh), _ = cv2.getTextSize(msg, cv2.FONT_HERSHEY_SIMPLEX, font_scale_msg, thickness_msg)
cv2.putText(frame, msg,
((w - mw)//2, (h//2) - th - 20),
cv2.FONT_HERSHEY_SIMPLEX, font_scale_msg, (0,255,255), thickness_msg, cv2.LINE_AA)
cv2.imshow("sequence capture", frame)
if cv2.waitKey(1) & 0xFF == ord('q'):
cap.release(); cv2.destroyAllWindows(); raise SystemExit("Aborted during countdown")
def draw_progress_bar(img, frac_remaining, bar_h=16, margin=12):
"""
Draw a top progress bar that starts full and depletes to empty.
frac_remaining: 1.0 at start → 0.0 at end.
"""
h, w = img.shape[:2]
x0, x1 = margin, w - margin
y0, y1 = margin, margin + bar_h
# Background bar
cv2.rectangle(img, (x0, y0), (x1, y1), (40, 40, 40), -1) # dark gray
cv2.rectangle(img, (x0, y0), (x1, y1), (90, 90, 90), 2) # border
# Foreground (remaining)
rem_w = int((x1 - x0) * max(0.0, min(1.0, frac_remaining)))
if rem_w > 0:
cv2.rectangle(img, (x0, y0), (x0 + rem_w, y1), (0, 200, 0), -1) # green
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--label", required=True, help="Letter label (A..Z)")
ap.add_argument("--split", required=True, choices=["train","val"])
ap.add_argument("--seconds", type=float, default=0.8, help="Clip length (s)")
ap.add_argument("--camera", type=int, default=0)
ap.add_argument("--width", type=int, default=640)
ap.add_argument("--height", type=int, default=480)
ap.add_argument("--count", type=int, default=None,
help="How many clips (default=100 train, 20 val)")
args = ap.parse_args()
if args.count is None:
args.count = 100 if args.split == "train" else 20
L = args.label.upper().strip()
if not (len(L) == 1 and "A" <= L <= "Z"):
raise SystemExit("Use --label A..Z")
out_dir = Path("sequences") / args.split / L
out_dir.mkdir(parents=True, exist_ok=True)
idx = next_idx(out_dir)
hands = mp.solutions.hands.Hands(
static_image_mode=False, max_num_hands=1, min_detection_confidence=0.5
)
cap = cv2.VideoCapture(args.camera)
if not cap.isOpened():
raise SystemExit(f"Could not open camera {args.camera}")
cap.set(cv2.CAP_PROP_FRAME_WIDTH, args.width)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, args.height)
print(f"Recording {args.count} clips for {L}/{args.split}, {args.seconds}s each.")
countdown(cap, 3)
for n in range(args.count):
seq_X, seq_tip = [], []
start_t = time.time()
end_t = start_t + args.seconds
while True:
now = time.time()
if now >= end_t:
break
ok, fr = cap.read()
if not ok:
break
rgb = cv2.cvtColor(fr, cv2.COLOR_BGR2RGB)
res = hands.process(rgb)
if res.multi_hand_landmarks:
ih = res.multi_hand_landmarks[0]
handed = None
if res.multi_handedness:
handed = res.multi_handedness[0].classification[0].label
pts = np.array([[lm.x, lm.y, lm.z] for lm in ih.landmark], np.float32)
pts = normalize_frame(pts, handed)
seq_X.append(pts.reshape(-1))
seq_tip.append(pts[8, :2])
                # draw fingertip marker (for feedback) using the raw MediaPipe
                # image coords; the normalized pts are wrist-relative and no
                # longer map to pixel positions
                cv2.circle(fr,
                           (int(fr.shape[1] * ih.landmark[8].x),
                            int(fr.shape[0] * ih.landmark[8].y)),
                           6, (0, 255, 0), -1)
# overlay progress + status
frac_remaining = (end_t - now) / max(1e-6, args.seconds) # 1 → 0
draw_progress_bar(fr, frac_remaining, bar_h=16, margin=12)
cv2.putText(fr, f"{L} {args.split} Clip {n+1}/{args.count}",
(20, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0,255,0), 2, cv2.LINE_AA)
cv2.imshow("sequence capture", fr)
if cv2.waitKey(1) & 0xFF == ord('q'):
cap.release(); cv2.destroyAllWindows(); return
if seq_X:
X = np.stack(seq_X, 0)
tip = np.stack(seq_tip, 0)
path = out_dir / f"clip_{idx:03d}.npz"
np.savez_compressed(path, X=X, tip=tip)
print(f"💾 saved {path} frames={X.shape[0]}")
idx += 1
else:
print("⚠️ No hand detected; skipped clip.")
print("✅ Done recording.")
cap.release(); cv2.destroyAllWindows()
if __name__ == "__main__":
main()


@@ -0,0 +1,60 @@
#!/usr/bin/env python3
# eval_seq_val.py
import os, json, argparse
import numpy as np
import torch, torch.nn as nn
from sklearn.metrics import classification_report, confusion_matrix
class SeqGRU(nn.Module):
def __init__(self, input_dim=63, hidden=128, num_classes=26):
super().__init__()
self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
self.head = nn.Sequential(
nn.Linear(hidden*2, 128),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(128, num_classes),
)
def forward(self, x):
h,_ = self.gru(x)
h_last = h[:, -1, :]
return self.head(h_last)
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--landmarks", default="landmarks_seq32")
ap.add_argument("--model", default="asl_seq32_gru_ABJZ.pt")
args = ap.parse_args()
vaX = np.load(os.path.join(args.landmarks,"val_X.npy")) # (N, T, 63)
vaY = np.load(os.path.join(args.landmarks,"val_y.npy"))
classes = json.load(open(os.path.join(args.landmarks,"class_names.json")))
meta = json.load(open(os.path.join(args.landmarks,"meta.json")))
T = int(meta.get("frames", 32))
state = torch.load(args.model, map_location="cpu", weights_only=False)
X_mean, X_std = state["X_mean"], state["X_std"]
if isinstance(X_mean, torch.Tensor): X_mean = X_mean.numpy()
if isinstance(X_std, torch.Tensor): X_std = X_std.numpy()
X_mean = X_mean.astype(np.float32)
X_std = (X_std.astype(np.float32) + 1e-6)
vaXn = (vaX - X_mean) / X_std
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
model = SeqGRU(63, 128, num_classes=len(classes))
model.load_state_dict(state["model"])
model.eval().to(device)
with torch.no_grad():
xb = torch.from_numpy(vaXn).float().to(device)
logits = model(xb)
pred = logits.argmax(1).cpu().numpy()
cm = confusion_matrix(vaY, pred)
print("Classes:", classes)
print("\nConfusion matrix (rows=true, cols=pred):\n", cm)
print("\nReport:\n", classification_report(vaY, pred, target_names=classes))
if __name__ == "__main__":
main()


@@ -0,0 +1,198 @@
#!/usr/bin/env python3
"""
infer_seq_webcam.py
Live webcam demo: detect a hand with MediaPipe, normalize landmarks,
classify with a trained sequence GRU model (multiclass).
Examples:
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.8 --smooth 0.7
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.85 --smooth 1.0 --url https://www.google.com
"""
import os, math, argparse, time, webbrowser
import numpy as np
import cv2
import torch
import mediapipe as mp
# --- Quiet logs ---
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
os.environ["GLOG_minloglevel"] = "2"
import absl.logging
absl.logging.set_verbosity(absl.logging.ERROR)
cv2.setLogLevel(0)
# ---------- geometry helpers ----------
def _angle(v): return math.atan2(v[1], v[0])
def _rot2d(t):
c, s = math.cos(t), math.sin(t)
return np.array([[c, -s], [s, c]], dtype=np.float32)
def normalize_landmarks(pts, handedness_label=None):
"""
pts: (21,3) MediaPipe normalized coords in [0..1]
Steps: translate wrist->origin, mirror left to right, rotate to +Y, scale by max pairwise distance.
Returns: (63,) float32
"""
pts = pts.astype(np.float32).copy()
pts[:, :2] -= pts[0, :2]
if handedness_label and handedness_label.lower().startswith("left"):
pts[:, 0] *= -1.0
v = pts[9, :2] # middle MCP
R = _rot2d(math.pi/2 - _angle(v))
pts[:, :2] = pts[:, :2] @ R.T
xy = pts[:, :2]
d = np.linalg.norm(xy[None,:,:] - xy[:,None,:], axis=-1).max()
d = 1.0 if d < 1e-6 else float(d)
pts[:, :2] /= d; pts[:, 2] /= d
return pts.reshape(-1)
# ---------- sequence model ----------
class SeqGRU(torch.nn.Module):
def __init__(self, input_dim=63, hidden=128, num_classes=26):
super().__init__()
self.gru = torch.nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
self.head = torch.nn.Sequential(
torch.nn.Linear(hidden*2, 128),
torch.nn.ReLU(),
torch.nn.Dropout(0.2),
torch.nn.Linear(128, num_classes),
)
def forward(self, x):
h, _ = self.gru(x) # (B,T,2H)
h_last = h[:, -1, :] # or h.mean(1)
return self.head(h_last)
# ---------- main ----------
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--model", required=True, help="Path to trained .pt model")
ap.add_argument("--camera", type=int, default=0)
ap.add_argument("--threshold", type=float, default=0.8)
ap.add_argument("--smooth", type=float, default=0.7,
help="EMA smoothing window in seconds (0 disables smoothing)")
ap.add_argument("--width", type=int, default=640)
ap.add_argument("--height", type=int, default=480)
ap.add_argument("--url", type=str, default="https://www.google.com",
help="URL to open when the sequence W→E→B is detected")
args = ap.parse_args()
if not os.path.exists(args.model):
raise SystemExit(f"❌ Model file not found: {args.model}")
# Load checkpoint (support numpy or tensor stats; support 'frames' if present)
state = torch.load(args.model, map_location="cpu", weights_only=False)
classes = state["classes"]
T = int(state.get("frames", 32))
X_mean, X_std = state["X_mean"], state["X_std"]
if isinstance(X_mean, torch.Tensor): X_mean = X_mean.cpu().numpy()
if isinstance(X_std, torch.Tensor): X_std = X_std.cpu().numpy()
X_mean = X_mean.astype(np.float32)
X_std = (X_std.astype(np.float32) + 1e-6)
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
model = SeqGRU(63, 128, num_classes=len(classes)).to(device)
model.load_state_dict(state["model"])
model.eval()
hands = mp.solutions.hands.Hands(
static_image_mode=False, max_num_hands=1, min_detection_confidence=0.5
)
cap = cv2.VideoCapture(args.camera)
if not cap.isOpened():
raise SystemExit(f"❌ Could not open camera index {args.camera}")
cap.set(cv2.CAP_PROP_FRAME_WIDTH, args.width)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, args.height)
print(f"✅ Loaded {args.model} frames={T} classes={classes}")
print("Press 'q' to quit.")
seq_buffer, ema_probs = [], None
last_ts = time.time()
last_emitted_letter = None
# Rolling history of emitted letters to detect the sequence "WEB"
detected_history = [] # only stores emitted letters (deduped by change)
while True:
ok, frame = cap.read()
if not ok: break
now = time.time()
dt = max(1e-6, now - last_ts)
last_ts = now
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
res = hands.process(rgb)
overlay_text = "No hand"
current_letter = None
if res.multi_hand_landmarks:
ih = res.multi_hand_landmarks[0]
handed = None
if res.multi_handedness:
handed = res.multi_handedness[0].classification[0].label
pts = np.array([[lm.x, lm.y, lm.z] for lm in ih.landmark], dtype=np.float32)
feat = normalize_landmarks(pts, handedness_label=handed)
seq_buffer.append(feat)
if len(seq_buffer) > T: seq_buffer.pop(0)
if len(seq_buffer) == T:
X = np.stack(seq_buffer, 0)
Xn = (X - X_mean) / X_std
xt = torch.from_numpy(Xn).float().unsqueeze(0).to(device)
with torch.no_grad():
logits = model(xt)
probs = torch.softmax(logits, dim=1)[0].cpu().numpy()
if args.smooth > 0:
alpha = 1.0 - math.exp(-dt / args.smooth)
if ema_probs is None: ema_probs = probs
else: ema_probs = (1.0 - alpha) * ema_probs + alpha * probs
use_probs = ema_probs
else:
use_probs = probs
top_idx = int(np.argmax(use_probs))
top_p = float(use_probs[top_idx])
top_cls = classes[top_idx]
if top_p >= args.threshold:
overlay_text = f"{top_cls} {top_p*100:.1f}%"
current_letter = top_cls
else:
seq_buffer, ema_probs = [], None
# Only emit when a *letter* changes (ignore no-hand and repeats)
if current_letter is not None and current_letter != last_emitted_letter:
print(f"Detected: {current_letter}")
last_emitted_letter = current_letter
# Update rolling history
detected_history.append(current_letter)
if len(detected_history) > 3:
detected_history.pop(0)
# Check for special sequence "WEB"
if detected_history == ["W", "E", "B"]:
print("🚀 Detected WEB! Time to open the web browser app.")
try:
webbrowser.open(args.url)
except Exception as e:
print(f"⚠️ Failed to open browser: {e}")
detected_history.clear() # fire once per occurrence
# On-screen overlay (still shows "No hand" when nothing is detected)
cv2.putText(frame, overlay_text, (20, 40),
cv2.FONT_HERSHEY_SIMPLEX, 1.1, (0,255,0), 2)
cv2.imshow("ASL sequence demo", frame)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cap.release()
cv2.destroyAllWindows()
if __name__ == "__main__":
main()


@@ -0,0 +1,19 @@
#!/usr/bin/env bash
# Create sequences/<train|val>/<LETTER>/ for the given letters.
# Example: ./make_seq_dirs.sh A B J Z
set -euo pipefail
if [ "$#" -lt 1 ]; then
echo "Usage: $0 LETTER [LETTER ...] e.g. $0 A B J Z"
exit 1
fi
ROOT="sequences"
for SPLIT in train val; do
for L in "$@"; do
mkdir -p "$ROOT/$SPLIT/$L"
done
done
echo "✅ Created $ROOT/train and $ROOT/val for: $*"


@@ -0,0 +1,71 @@
#!/usr/bin/env python3
# prep_sequence_resampled.py
# Build a fixed-length (N frames) multiclass dataset from sequences/<split>/<CLASS>/clip_*.npz
import argparse, os, glob, json
from pathlib import Path
import numpy as np
def resample_sequence(X, N=32):
# X: (T,63) -> (N,63) by linear interpolation along frame index
T = len(X)
if T == 0:
return np.zeros((N, X.shape[1]), np.float32)
if T == 1:
return np.repeat(X, N, axis=0)
src = np.linspace(0, T-1, num=T)
dst = np.linspace(0, T-1, num=N)
out = np.zeros((N, X.shape[1]), np.float32)
for d in range(X.shape[1]):
out[:, d] = np.interp(dst, src, X[:, d])
return out.astype(np.float32)
def load_classes(seq_root: Path):
# classes are subdirs in sequences/train/
classes = sorted([p.name for p in (seq_root/"train").iterdir() if p.is_dir()])
classes = [c for c in classes if len(c)==1 and "A"<=c<="Z"]
if not classes:
raise SystemExit("No letter classes found in sequences/train/")
return classes
def collect_split(seq_root: Path, split: str, classes, N):
Xs, ys = [], []
for ci, cls in enumerate(classes):
for f in sorted(glob.glob(str(seq_root/split/cls/"clip_*.npz"))):
d = np.load(f)
Xi = d["X"].astype(np.float32) # (T,63)
XiN = resample_sequence(Xi, N) # (N,63)
Xs.append(XiN); ys.append(ci)
if Xs:
X = np.stack(Xs, 0)
y = np.array(ys, np.int64)
else:
X = np.zeros((0, N, 63), np.float32); y = np.zeros((0,), np.int64)
return X, y
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--in", dest="in_dir", default="sequences", help="Root sequences/ with train/ and val/")
ap.add_argument("--out", default="landmarks_seq32", help="Output folder with npy files")
ap.add_argument("--frames", type=int, default=32, help="Frames per clip after resampling (default: 32)")
args = ap.parse_args()
seq_root = Path(args.in_dir)
outdir = Path(args.out)
outdir.mkdir(parents=True, exist_ok=True)
classes = load_classes(seq_root)
trX, trY = collect_split(seq_root, "train", classes, args.frames)
vaX, vaY = collect_split(seq_root, "val", classes, args.frames)
np.save(outdir/"train_X.npy", trX)
np.save(outdir/"train_y.npy", trY)
np.save(outdir/"val_X.npy", vaX)
np.save(outdir/"val_y.npy", vaY)
json.dump(classes, open(outdir/"class_names.json", "w"))
json.dump({"frames": args.frames, "input_dim": 63}, open(outdir/"meta.json","w"))
print(f"Saved dataset → {outdir}")
print(f" train {trX.shape}, val {vaX.shape}, classes={classes}")
if __name__ == "__main__":
main()


@@ -0,0 +1,136 @@
#!/usr/bin/env python3
# train_seq.py
import os, json, argparse
import numpy as np
import torch, torch.nn as nn
from torch.utils.data import Dataset, DataLoader
def get_device():
return torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
class SeqDataset(Dataset):
def __init__(self, X, y, augment=False):
self.X = X.astype(np.float32) # (Nclip, T, 63)
self.y = y.astype(np.int64)
self.augment = augment
def __len__(self): return len(self.y)
def _augment(self, seq): # seq: (T,63)
T = seq.shape[0]
pts = seq.reshape(T, 21, 3).copy()
# small 2D rotation (±7°) + scale (±10%) + Gaussian noise (σ=0.01)
ang = np.deg2rad(np.random.uniform(-7, 7))
c, s = np.cos(ang), np.sin(ang)
R = np.array([[c,-s],[s,c]], np.float32)
scale = np.random.uniform(0.9, 1.1)
pts[:, :, :2] = (pts[:, :, :2] @ R.T) * scale
pts += np.random.normal(0, 0.01, size=pts.shape).astype(np.float32)
return pts.reshape(T, 63)
def __getitem__(self, i):
xi = self.X[i]
if self.augment:
xi = self._augment(xi)
return torch.from_numpy(xi).float(), int(self.y[i])
class SeqGRU(nn.Module):
def __init__(self, input_dim=63, hidden=128, num_classes=26):
super().__init__()
self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
self.head = nn.Sequential(
nn.Linear(hidden*2, 128),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(128, num_classes),
)
def forward(self, x): # x: (B,T,63)
h,_ = self.gru(x) # (B,T,2H)
h_last = h[:, -1, :] # or mean over time: h.mean(1)
return self.head(h_last)
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--landmarks", default="landmarks_seq32", help="Folder from prep_sequence_resampled.py")
ap.add_argument("--epochs", type=int, default=40)
ap.add_argument("--batch", type=int, default=64)
ap.add_argument("--lr", type=float, default=1e-3)
ap.add_argument("--out", default="asl_seq32_gru.pt")
args = ap.parse_args()
# Load dataset
trX = np.load(os.path.join(args.landmarks,"train_X.npy")) # (N, T, 63)
trY = np.load(os.path.join(args.landmarks,"train_y.npy"))
vaX = np.load(os.path.join(args.landmarks,"val_X.npy"))
vaY = np.load(os.path.join(args.landmarks,"val_y.npy"))
classes = json.load(open(os.path.join(args.landmarks,"class_names.json")))
meta = json.load(open(os.path.join(args.landmarks,"meta.json")))
T = int(meta["frames"])
print(f"Loaded: train {trX.shape} val {vaX.shape} classes={classes}")
# Global mean/std over train (time+batch)
X_mean = trX.reshape(-1, trX.shape[-1]).mean(axis=0, keepdims=True).astype(np.float32) # (1,63)
X_std = trX.reshape(-1, trX.shape[-1]).std(axis=0, keepdims=True).astype(np.float32) + 1e-6
trXn = (trX - X_mean) / X_std
vaXn = (vaX - X_mean) / X_std
tr_ds = SeqDataset(trXn, trY, augment=True)
va_ds = SeqDataset(vaXn, vaY, augment=False)
tr_dl = DataLoader(tr_ds, batch_size=args.batch, shuffle=True)
va_dl = DataLoader(va_ds, batch_size=args.batch, shuffle=False)
device = get_device()
model = SeqGRU(input_dim=63, hidden=128, num_classes=len(classes)).to(device)
crit = nn.CrossEntropyLoss()
opt = torch.optim.AdamW(model.parameters(), lr=args.lr, weight_decay=1e-4)
sch = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=args.epochs)
best_acc, best_state = 0.0, None
for epoch in range(1, args.epochs+1):
# Train
model.train()
tot, correct, loss_sum = 0, 0, 0.0
for xb, yb in tr_dl:
xb, yb = xb.to(device), yb.to(device)
opt.zero_grad(set_to_none=True)
logits = model(xb)
loss = crit(logits, yb)
loss.backward()
opt.step()
loss_sum += loss.item() * yb.size(0)
correct += (logits.argmax(1)==yb).sum().item()
tot += yb.size(0)
tr_loss = loss_sum / max(1, tot)
tr_acc = correct / max(1, tot)
# Validate
model.eval()
vtot, vcorrect = 0, 0
with torch.no_grad():
for xb, yb in va_dl:
xb, yb = xb.to(device), yb.to(device)
logits = model(xb)
vcorrect += (logits.argmax(1)==yb).sum().item()
vtot += yb.size(0)
va_acc = vcorrect / max(1, vtot)
sch.step()
print(f"Epoch {epoch:02d}: train_loss={tr_loss:.4f} train_acc={tr_acc:.3f} val_acc={va_acc:.3f}")
if va_acc > best_acc:
best_acc = va_acc
best_state = {
"model": model.state_dict(),
"classes": classes,
"frames": T,
"X_mean": torch.from_numpy(X_mean), # tensors → future-proof
"X_std": torch.from_numpy(X_std),
}
torch.save(best_state, args.out)
print(f" ✅ Saved best → {args.out} (val_acc={best_acc:.3f})")
print("Done. Best val_acc:", best_acc)
if __name__ == "__main__":
main()


@@ -0,0 +1,24 @@
# 1) Create dirs
# ./make_seq_dirs.sh A B J Z
# 2) Capture clips (0.8s each by default)
python capture_sequence.py --label A --split train
python capture_sequence.py --label A --split val
python capture_sequence.py --label B --split train
python capture_sequence.py --label B --split val
python capture_sequence.py --label J --split train
python capture_sequence.py --label J --split val
python capture_sequence.py --label Z --split train
python capture_sequence.py --label Z --split val
# 3) Preprocess to 32 frames (auto-picks classes from sequences/train/*)
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32
# 4) Train GRU (multiclass on A/B/J/Z)
python train_seq.py --landmarks landmarks_seq32 --epochs 40 --batch 64 --lr 1e-3 --out asl_seq32_gru_ABJZ.pt
# 5) Live inference
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.6 --smooth 0.2
# If you later add more letters (e.g., C, D),
# just create those folders, record clips, re-run the prep step, then train again — the pipeline will include whatever letters exist under sequences/train/.