Initial commit: handshapes multiclass project
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
216
first_attempt_landmark_hands/README.md
Normal file
@@ -0,0 +1,216 @@
# Handshape Sequence Classifier (MediaPipe + PyTorch, macOS MPS-ready)

Live ASL handshape letter demo powered by MediaPipe Hands landmarks and a bidirectional GRU sequence model.
Record short clips per letter, resample to a fixed length, train, evaluate, and run a real-time webcam demo that can react to detected letter sequences (e.g., **W → E → B** opens a URL).

## Features

* **Data capture UI:** 3-second centered countdown + top progress bar; fingertip dot feedback.
* **Robust normalization:** wrist-anchored, left/right mirroring, rotation to +Y, scale by max pairwise distance.
* **Fixed-length preprocessing:** linear resampling to *N* frames (default **32**).
* **Sequence model:** BiGRU (128 hidden × 2) → MLP head; light augmentation during training.
* **Live inference:** EMA smoothing + thresholding; emits letters only on change; detects special sequences (**WEB**) and opens a browser.

---

## Quick Start

```bash
# 0) (optional) Create & activate a virtual env
python -m venv .venv && source .venv/bin/activate

# 1) Install deps
pip install numpy opencv-python mediapipe torch scikit-learn

# 2) Make directories for the letters you’ll collect
./make_seq_dirs.sh A B J Z

# 3) Capture short clips per letter (train/val)
python capture_sequence.py --label A --split train
python capture_sequence.py --label A --split val
# ...repeat for B, J, Z

# 4) Preprocess → fixed-length dataset (32 frames)
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32

# 5) Train the BiGRU
python train_seq.py --landmarks landmarks_seq32 --epochs 40 --batch 64 --lr 1e-3 \
  --out asl_seq32_gru_ABJZ.pt

# 6) Evaluate on the validation set (confusion matrix + report)
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_ABJZ.pt

# 7) Live webcam demo (press 'q' to quit)
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.8 --smooth 0.7
```

> **WEB trigger:** In the live demo, if the emitted letters form **W → E → B**, the app prints a message and opens `--url` (default: Google).
> Example: `--url https://www.gallaudet.edu`

---

## Repository Layout

```
handshapes-multiclass/
├─ make_seq_dirs.sh            # creates sequences/train|val/<LETTER>/
├─ capture_sequence.py         # webcam capture → clip_XXX.npz (X: (T,63), tip: (T,2))
├─ prep_sequence_resampled.py  # resample clips to fixed N frames → landmarks_seq32/
├─ train_seq.py                # train BiGRU; saves best checkpoint (.pt + stats)
├─ eval_val.py                 # evaluate on val set; prints metrics
├─ infer_seq_webcam.py         # live demo; emits letters; detects "WEB" → opens URL
├─ what_to_do.txt              # quick, step-by-step playbook
└─ sequences/                  # created by you (after running make_seq_dirs.sh)
   ├─ train/<LETTER>/clip_XXX.npz
   └─ val/<LETTER>/clip_XXX.npz
```

**Clip file format (`clip_XXX.npz`)**

* `X`: `(T, 63)` — per-frame normalized landmarks (21 points × (x, y, z))
* `tip`: `(T, 2)` — normalized index fingertip positions (for sanity checks)

**Prepared dataset (`landmarks_seq32/`)**

* `train_X.npy`, `train_y.npy`, `val_X.npy`, `val_y.npy`
* `class_names.json` (e.g., `["A","B","J","Z"]`)
* `meta.json` (e.g., `{"frames":32,"input_dim":63}`)

**Checkpoint (`*.pt`)**

* `model` (state_dict), `classes`, `frames`, `X_mean`, `X_std`
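If you want to sanity-check what each stage wrote to disk, a minimal inspection sketch looks like this (the clip path below is only an example; point it at any clip you actually recorded):

```python
import json
import numpy as np

# One recorded clip (hypothetical path; substitute a clip you captured)
clip = np.load("sequences/train/A/clip_001.npz")
print(clip["X"].shape, clip["tip"].shape)   # (T, 63), (T, 2)

# The prepared, fixed-length dataset
X = np.load("landmarks_seq32/train_X.npy")  # (N, 32, 63)
y = np.load("landmarks_seq32/train_y.npy")  # (N,)
classes = json.load(open("landmarks_seq32/class_names.json"))
meta = json.load(open("landmarks_seq32/meta.json"))
print(X.shape, y.shape, classes, meta)
```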
---

## Normalization (consistent across capture & inference)

1. Translate so the **wrist** (landmark 0) is at the origin.
2. If a **left** hand is detected, mirror `x *= -1`.
3. Rotate so the **middle-finger MCP** (landmark 9) points along **+Y**.
4. Scale all coords by the **max pairwise distance** among 2D landmarks.
5. Flatten to **63 features** per frame.

This keeps classification driven by the handshape itself rather than by camera pose or hand position.
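For reference, a condensed sketch of these five steps; it mirrors `normalize_frame` in `capture_sequence.py` and `normalize_landmarks` in `infer_seq_webcam.py` (the standalone function name here is just for illustration):

```python
import math
import numpy as np

def normalize(pts, handed=None):
    """pts: (21, 3) MediaPipe landmarks -> (63,) pose-invariant features."""
    pts = pts.astype(np.float32).copy()
    pts[:, :2] -= pts[0, :2]                        # 1) wrist to origin
    if handed and handed.lower().startswith("left"):
        pts[:, 0] *= -1.0                           # 2) mirror left hands
    ang = math.atan2(pts[9, 1], pts[9, 0])          # 3) rotate middle-finger MCP to +Y
    c, s = math.cos(math.pi/2 - ang), math.sin(math.pi/2 - ang)
    pts[:, :2] = pts[:, :2] @ np.array([[c, -s], [s, c]], np.float32).T
    d = np.linalg.norm(pts[None, :, :2] - pts[:, None, :2], axis=-1).max()  # 4) max pairwise 2D distance
    pts /= (d if d > 1e-6 else 1.0)                 # scale x, y, z by the same factor
    return pts.reshape(-1)                          # 5) flatten to 63 features
```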
---

## Training Details

* **Model:** BiGRU (input=63, hidden=128, bidirectional) → `[Linear(256→128), ReLU, Dropout(0.2), Linear(128→num_classes)]`
* **Optimizer:** AdamW (`lr=1e-3`, `weight_decay=1e-4`)
* **Scheduler:** CosineAnnealingLR (`T_max = epochs`)
* **Augmentation:** small 2D rotation (±7°), scale (±10%), Gaussian noise (σ=0.01)
* **Normalization:** global `X_mean`/`X_std` computed over **train** (time+batch), applied to both train & val and saved into the checkpoint.

---

## Live Inference Behavior

* Maintains a rolling buffer of **T = frames** (from the checkpoint).
* Applies the saved `X_mean`/`X_std`.
* **EMA smoothing** over softmax probs with time constant `--smooth` (seconds); see the sketch below.
* Emits a letter only if:
  * top prob ≥ `--threshold` (e.g., 0.8), **and**
  * the letter **changed** from the previous emission (prevents repeats).
* Tracks a short history of emitted letters to detect **W → E → B**; on match:
  * prints “Detected WEB! …”
  * calls `webbrowser.open(--url)`
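The smoothing update is a frame-rate-independent EMA; a small helper equivalent to what `infer_seq_webcam.py` does inline (the helper name is only for illustration, `dt` is the time since the previous frame):

```python
import math

def smooth_probs(ema, probs, dt, tau):
    """EMA over softmax probabilities with time constant tau (seconds).
    Larger tau gives steadier but laggier output; tau <= 0 disables smoothing."""
    if tau <= 0 or ema is None:
        return probs
    alpha = 1.0 - math.exp(-dt / tau)   # fraction of the new frame to blend in
    return (1.0 - alpha) * ema + alpha * probs
```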
**Common flags**

```bash
# Camera & size
--camera 0 --width 640 --height 480

# Confidence vs. latency tradeoffs
--threshold 0.85   # higher → fewer false positives
--smooth 1.0       # higher → steadier output but more lag

# Action on sequence
--url https://example.com
```

---

## Tips for High Accuracy

* Record **balanced** train/val counts per class (e.g., 100 train / 20 val).
* Keep the hand **centered**, well lit, and mostly **single-hand** (the model expects one hand).
* Maintain a consistent **distance** and **orientation** during capture.
* If you add new letters later, just record them, re-run preprocessing, and retrain — classes are **auto-discovered** from `sequences/train/*`.

---

## macOS (M-series) Notes

* PyTorch will automatically use **Metal (MPS)** if available (`torch.backends.mps.is_available()`); otherwise it falls back to CPU (see the snippet below).
* If the webcam feed runs at a low frame rate, try reducing `--width`/`--height` or raising `--threshold` / `--smooth`.
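Device selection is the same one-liner used by `get_device()` in `train_seq.py`; the scripts in this repo already do this for you, so this is only for your own experiments:

```python
import torch

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
print(device)  # "mps" on Apple Silicon with a recent PyTorch, "cpu" otherwise
```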
---

## Troubleshooting

* **“Could not open camera”** → try `--camera 1` (or check macOS camera permissions).
* **No detections / “No hand” on screen** → improve lighting, ensure a single clearly visible hand, check the MediaPipe install.
* **Model emits wrong letters** → increase `--threshold`, collect more data, or raise `--smooth`.
* **Sequence-length (T) mismatch during inference** → ensure `--frames` at preprocessing matches the checkpoint’s `frames` (saved & auto-used).

---

## Commands Reference

### Create class folders

```bash
./make_seq_dirs.sh A B J Z
```

### Capture clips

```bash
python capture_sequence.py --label A --split train --seconds 0.8 --count 100
python capture_sequence.py --label A --split val --seconds 0.8 --count 20
```

### Prepare dataset (resample to 32 frames)

```bash
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32
```

### Train

```bash
python train_seq.py --landmarks landmarks_seq32 --epochs 40 --batch 64 --lr 1e-3 \
  --out asl_seq32_gru_ABJZ.pt
```

### Evaluate

```bash
python eval_val.py --landmarks landmarks_seq32 --model asl_seq32_gru_ABJZ.pt
```

### Live demo (open URL on “WEB”)

```bash
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.8 --smooth 0.7 \
  --url https://www.gallaudet.edu
```

---

## License

MIT

---

## Acknowledgments

* **MediaPipe Hands** for robust, fast hand landmark detection.
* **PyTorch** for flexible sequence modeling on CPU/MPS.

---
176
first_attempt_landmark_hands/capture_sequence.py
Executable file
@@ -0,0 +1,176 @@
#!/usr/bin/env python3
# capture_sequence.py
# Automatically record N short sequences for each label (default: 100 train / 20 val)
# Centered 3-second countdown before recording.
# Per-clip depleting progress bar (full → empty) across the top during capture.

import argparse, os, time, math, re
from pathlib import Path
import numpy as np, cv2, mediapipe as mp


def normalize_frame(pts, handed=None):
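    """Normalize one frame of 21 landmarks (21,3) into pose-invariant coords:
    wrist to origin, mirror left hands, rotate middle-finger MCP to +Y,
    scale by the max pairwise 2D distance."""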
    pts = pts.astype(np.float32).copy()
    pts[:, :2] -= pts[0, :2]
    if handed and handed.lower().startswith("left"):
        pts[:, 0] *= -1.0
    v = pts[9, :2]
    ang = math.atan2(v[1], v[0])
    c, s = math.cos(math.pi/2 - ang), math.sin(math.pi/2 - ang)
    R = np.array([[c, -s], [s, c]], np.float32)
    pts[:, :2] = pts[:, :2] @ R.T
    xy = pts[:, :2]
    d = np.max(np.linalg.norm(xy[None, :, :] - xy[:, None, :], axis=-1))
    if d < 1e-6: d = 1.0
    pts[:, :2] /= d; pts[:, 2] /= d
    return pts


def next_idx(folder: Path, prefix="clip_"):
    pat = re.compile(rf"^{re.escape(prefix)}(\d+)\.npz$")
    mx = 0
    if folder.exists():
        for n in os.listdir(folder):
            m = pat.match(n)
            if m: mx = max(mx, int(m.group(1)))
    return mx + 1


def countdown(cap, seconds=3):
    """Display a centered countdown before starting capture."""
    for i in range(seconds, 0, -1):
        start = time.time()
        while time.time() - start < 1.0:
            ok, frame = cap.read()
            if not ok:
                continue
            h, w = frame.shape[:2]

            # Main big number in center
            text = str(i)
            font_scale = 5
            thickness = 10
            (tw, th), _ = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, font_scale, thickness)
            cv2.putText(frame, text,
                        ((w - tw)//2, (h + th)//2),
                        cv2.FONT_HERSHEY_SIMPLEX, font_scale, (0,0,255), thickness, cv2.LINE_AA)

            # Smaller message above
            msg = "Starting in..."
            font_scale_msg = 1.2
            thickness_msg = 3
            (mw, mh), _ = cv2.getTextSize(msg, cv2.FONT_HERSHEY_SIMPLEX, font_scale_msg, thickness_msg)
            cv2.putText(frame, msg,
                        ((w - mw)//2, (h//2) - th - 20),
                        cv2.FONT_HERSHEY_SIMPLEX, font_scale_msg, (0,255,255), thickness_msg, cv2.LINE_AA)

            cv2.imshow("sequence capture", frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                cap.release(); cv2.destroyAllWindows(); raise SystemExit("Aborted during countdown")


def draw_progress_bar(img, frac_remaining, bar_h=16, margin=12):
    """
    Draw a top progress bar that starts full and depletes to empty.
    frac_remaining: 1.0 at start → 0.0 at end.
    """
    h, w = img.shape[:2]
    x0, x1 = margin, w - margin
    y0, y1 = margin, margin + bar_h

    # Background bar
    cv2.rectangle(img, (x0, y0), (x1, y1), (40, 40, 40), -1)  # dark gray
    cv2.rectangle(img, (x0, y0), (x1, y1), (90, 90, 90), 2)   # border

    # Foreground (remaining)
    rem_w = int((x1 - x0) * max(0.0, min(1.0, frac_remaining)))
    if rem_w > 0:
        cv2.rectangle(img, (x0, y0), (x0 + rem_w, y1), (0, 200, 0), -1)  # green


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--label", required=True, help="Letter label (A..Z)")
    ap.add_argument("--split", required=True, choices=["train","val"])
    ap.add_argument("--seconds", type=float, default=0.8, help="Clip length (s)")
    ap.add_argument("--camera", type=int, default=0)
    ap.add_argument("--width", type=int, default=640)
    ap.add_argument("--height", type=int, default=480)
    ap.add_argument("--count", type=int, default=None,
                    help="How many clips (default=100 train, 20 val)")
    args = ap.parse_args()

    if args.count is None:
        args.count = 100 if args.split == "train" else 20

    L = args.label.upper().strip()
    if not (len(L) == 1 and "A" <= L <= "Z"):
        raise SystemExit("Use --label A..Z")

    out_dir = Path("sequences") / args.split / L
    out_dir.mkdir(parents=True, exist_ok=True)
    idx = next_idx(out_dir)

    hands = mp.solutions.hands.Hands(
        static_image_mode=False, max_num_hands=1, min_detection_confidence=0.5
    )
    cap = cv2.VideoCapture(args.camera)
    if not cap.isOpened():
        raise SystemExit(f"Could not open camera {args.camera}")
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, args.width)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, args.height)

    print(f"Recording {args.count} clips for {L}/{args.split}, {args.seconds}s each.")
    countdown(cap, 3)

    for n in range(args.count):
        seq_X, seq_tip = [], []
        start_t = time.time()
        end_t = start_t + args.seconds

        while True:
            now = time.time()
            if now >= end_t:
                break

            ok, fr = cap.read()
            if not ok:
                break

            rgb = cv2.cvtColor(fr, cv2.COLOR_BGR2RGB)
            res = hands.process(rgb)
            if res.multi_hand_landmarks:
                ih = res.multi_hand_landmarks[0]
                handed = None
                if res.multi_handedness:
                    handed = res.multi_handedness[0].classification[0].label
                pts = np.array([[lm.x, lm.y, lm.z] for lm in ih.landmark], np.float32)
                pts = normalize_frame(pts, handed)
                seq_X.append(pts.reshape(-1))
                seq_tip.append(pts[8, :2])

                # draw fingertip marker (for feedback): use the raw image-space landmark,
                # since the normalized coords are no longer pixel coordinates
                cv2.circle(fr,
                           (int(fr.shape[1] * ih.landmark[8].x), int(fr.shape[0] * ih.landmark[8].y)),
                           6, (0, 255, 0), -1)

            # overlay progress + status
            frac_remaining = (end_t - now) / max(1e-6, args.seconds)  # 1 → 0
            draw_progress_bar(fr, frac_remaining, bar_h=16, margin=12)
            cv2.putText(fr, f"{L} {args.split} Clip {n+1}/{args.count}",
                        (20, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0,255,0), 2, cv2.LINE_AA)

            cv2.imshow("sequence capture", fr)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                cap.release(); cv2.destroyAllWindows(); return

        if seq_X:
            X = np.stack(seq_X, 0)
            tip = np.stack(seq_tip, 0)
            path = out_dir / f"clip_{idx:03d}.npz"
            np.savez_compressed(path, X=X, tip=tip)
            print(f"💾 saved {path} frames={X.shape[0]}")
            idx += 1
        else:
            print("⚠️ No hand detected; skipped clip.")

    print("✅ Done recording.")
    cap.release(); cv2.destroyAllWindows()


if __name__ == "__main__":
    main()
60
first_attempt_landmark_hands/eval_val.py
Executable file
@@ -0,0 +1,60 @@
#!/usr/bin/env python3
# eval_val.py
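# Evaluate a trained sequence checkpoint on the validation split:
# prints a confusion matrix and a classification report.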
import os, json, argparse
import numpy as np
import torch, torch.nn as nn
from sklearn.metrics import classification_report, confusion_matrix


class SeqGRU(nn.Module):
    def __init__(self, input_dim=63, hidden=128, num_classes=26):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(hidden*2, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        h, _ = self.gru(x)
        h_last = h[:, -1, :]
        return self.head(h_last)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--landmarks", default="landmarks_seq32")
    ap.add_argument("--model", default="asl_seq32_gru_ABJZ.pt")
    args = ap.parse_args()

    vaX = np.load(os.path.join(args.landmarks, "val_X.npy"))  # (N, T, 63)
    vaY = np.load(os.path.join(args.landmarks, "val_y.npy"))
    classes = json.load(open(os.path.join(args.landmarks, "class_names.json")))
    meta = json.load(open(os.path.join(args.landmarks, "meta.json")))
    T = int(meta.get("frames", 32))

    state = torch.load(args.model, map_location="cpu", weights_only=False)
    X_mean, X_std = state["X_mean"], state["X_std"]
    if isinstance(X_mean, torch.Tensor): X_mean = X_mean.numpy()
    if isinstance(X_std, torch.Tensor): X_std = X_std.numpy()
    X_mean = X_mean.astype(np.float32)
    X_std = (X_std.astype(np.float32) + 1e-6)

    vaXn = (vaX - X_mean) / X_std

    device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
    model = SeqGRU(63, 128, num_classes=len(classes))
    model.load_state_dict(state["model"])
    model.eval().to(device)

    with torch.no_grad():
        xb = torch.from_numpy(vaXn).float().to(device)
        logits = model(xb)
        pred = logits.argmax(1).cpu().numpy()

    cm = confusion_matrix(vaY, pred)
    print("Classes:", classes)
    print("\nConfusion matrix (rows=true, cols=pred):\n", cm)
    print("\nReport:\n", classification_report(vaY, pred, target_names=classes))


if __name__ == "__main__":
    main()
198
first_attempt_landmark_hands/infer_seq_webcam.py
Executable file
@@ -0,0 +1,198 @@
#!/usr/bin/env python3
"""
infer_seq_webcam.py
Live webcam demo: detect a hand with MediaPipe, normalize landmarks,
classify with a trained sequence GRU model (multiclass).

Examples:
  python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.8 --smooth 0.7
  python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.85 --smooth 1.0 --url https://www.google.com
"""

import os, math, argparse, time, webbrowser
import numpy as np
import cv2
import torch
import mediapipe as mp

# --- Quiet logs ---
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
os.environ["GLOG_minloglevel"] = "2"
import absl.logging
absl.logging.set_verbosity(absl.logging.ERROR)
cv2.setLogLevel(0)


# ---------- geometry helpers ----------
def _angle(v): return math.atan2(v[1], v[0])

def _rot2d(t):
    c, s = math.cos(t), math.sin(t)
    return np.array([[c, -s], [s, c]], dtype=np.float32)


def normalize_landmarks(pts, handedness_label=None):
    """
    pts: (21,3) MediaPipe normalized coords in [0..1]
    Steps: translate wrist->origin, mirror left to right, rotate to +Y, scale by max pairwise distance.
    Returns: (63,) float32
    """
    pts = pts.astype(np.float32).copy()
    pts[:, :2] -= pts[0, :2]
    if handedness_label and handedness_label.lower().startswith("left"):
        pts[:, 0] *= -1.0
    v = pts[9, :2]  # middle MCP
    R = _rot2d(math.pi/2 - _angle(v))
    pts[:, :2] = pts[:, :2] @ R.T
    xy = pts[:, :2]
    d = np.linalg.norm(xy[None, :, :] - xy[:, None, :], axis=-1).max()
    d = 1.0 if d < 1e-6 else float(d)
    pts[:, :2] /= d; pts[:, 2] /= d
    return pts.reshape(-1)


# ---------- sequence model ----------
class SeqGRU(torch.nn.Module):
    def __init__(self, input_dim=63, hidden=128, num_classes=26):
        super().__init__()
        self.gru = torch.nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.head = torch.nn.Sequential(
            torch.nn.Linear(hidden*2, 128),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.2),
            torch.nn.Linear(128, num_classes),
        )

    def forward(self, x):
        h, _ = self.gru(x)    # (B,T,2H)
        h_last = h[:, -1, :]  # or h.mean(1)
        return self.head(h_last)


# ---------- main ----------
def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--model", required=True, help="Path to trained .pt model")
    ap.add_argument("--camera", type=int, default=0)
    ap.add_argument("--threshold", type=float, default=0.8)
    ap.add_argument("--smooth", type=float, default=0.7,
                    help="EMA smoothing window in seconds (0 disables smoothing)")
    ap.add_argument("--width", type=int, default=640)
    ap.add_argument("--height", type=int, default=480)
    ap.add_argument("--url", type=str, default="https://www.google.com",
                    help="URL to open when the sequence W→E→B is detected")
    args = ap.parse_args()

    if not os.path.exists(args.model):
        raise SystemExit(f"❌ Model file not found: {args.model}")

    # Load checkpoint (support numpy or tensor stats; support 'frames' if present)
    state = torch.load(args.model, map_location="cpu", weights_only=False)
    classes = state["classes"]
    T = int(state.get("frames", 32))

    X_mean, X_std = state["X_mean"], state["X_std"]
    if isinstance(X_mean, torch.Tensor): X_mean = X_mean.cpu().numpy()
    if isinstance(X_std, torch.Tensor): X_std = X_std.cpu().numpy()
    X_mean = X_mean.astype(np.float32)
    X_std = (X_std.astype(np.float32) + 1e-6)

    device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
    model = SeqGRU(63, 128, num_classes=len(classes)).to(device)
    model.load_state_dict(state["model"])
    model.eval()

    hands = mp.solutions.hands.Hands(
        static_image_mode=False, max_num_hands=1, min_detection_confidence=0.5
    )

    cap = cv2.VideoCapture(args.camera)
    if not cap.isOpened():
        raise SystemExit(f"❌ Could not open camera index {args.camera}")
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, args.width)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, args.height)

    print(f"✅ Loaded {args.model} frames={T} classes={classes}")
    print("Press 'q' to quit.")

    seq_buffer, ema_probs = [], None
    last_ts = time.time()
    last_emitted_letter = None

    # Rolling history of emitted letters to detect the sequence "WEB"
    detected_history = []  # only stores emitted letters (deduped by change)

    while True:
        ok, frame = cap.read()
        if not ok: break
        now = time.time()
        dt = max(1e-6, now - last_ts)
        last_ts = now

        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        res = hands.process(rgb)

        overlay_text = "No hand"
        current_letter = None

        if res.multi_hand_landmarks:
            ih = res.multi_hand_landmarks[0]
            handed = None
            if res.multi_handedness:
                handed = res.multi_handedness[0].classification[0].label
            pts = np.array([[lm.x, lm.y, lm.z] for lm in ih.landmark], dtype=np.float32)
            feat = normalize_landmarks(pts, handedness_label=handed)
            seq_buffer.append(feat)
            if len(seq_buffer) > T: seq_buffer.pop(0)

            if len(seq_buffer) == T:
                X = np.stack(seq_buffer, 0)
                Xn = (X - X_mean) / X_std
                xt = torch.from_numpy(Xn).float().unsqueeze(0).to(device)
                with torch.no_grad():
                    logits = model(xt)
                    probs = torch.softmax(logits, dim=1)[0].cpu().numpy()

                if args.smooth > 0:
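                    # Frame-rate-independent EMA: blend in more of the new frame
                    # the longer it has been since the last one (time constant = --smooth).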
                    alpha = 1.0 - math.exp(-dt / args.smooth)
                    if ema_probs is None: ema_probs = probs
                    else: ema_probs = (1.0 - alpha) * ema_probs + alpha * probs
                    use_probs = ema_probs
                else:
                    use_probs = probs

                top_idx = int(np.argmax(use_probs))
                top_p = float(use_probs[top_idx])
                top_cls = classes[top_idx]

                if top_p >= args.threshold:
                    overlay_text = f"{top_cls} {top_p*100:.1f}%"
                    current_letter = top_cls
        else:
            seq_buffer, ema_probs = [], None

        # Only emit when a *letter* changes (ignore no-hand and repeats)
        if current_letter is not None and current_letter != last_emitted_letter:
            print(f"Detected: {current_letter}")
            last_emitted_letter = current_letter

            # Update rolling history
            detected_history.append(current_letter)
            if len(detected_history) > 3:
                detected_history.pop(0)

            # Check for special sequence "WEB"
            if detected_history == ["W", "E", "B"]:
                print("🚀 Detected WEB! Time to open the web browser app.")
                try:
                    webbrowser.open(args.url)
                except Exception as e:
                    print(f"⚠️ Failed to open browser: {e}")
                detected_history.clear()  # fire once per occurrence

        # On-screen overlay (still shows "No hand" when nothing is detected)
        cv2.putText(frame, overlay_text, (20, 40),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.1, (0,255,0), 2)
        cv2.imshow("ASL sequence demo", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()


if __name__ == "__main__":
    main()
19
first_attempt_landmark_hands/make_seq_dirs.sh
Executable file
@@ -0,0 +1,19 @@
#!/usr/bin/env bash
# Create sequences/<train|val>/<LETTER>/ for the given letters.
# Example: ./make_seq_dirs.sh A B J Z

set -euo pipefail

if [ "$#" -lt 1 ]; then
  echo "Usage: $0 LETTER [LETTER ...] e.g. $0 A B J Z"
  exit 1
fi

ROOT="sequences"
for SPLIT in train val; do
  for L in "$@"; do
    mkdir -p "$ROOT/$SPLIT/$L"
  done
done

echo "✅ Created $ROOT/train and $ROOT/val for: $*"
71
first_attempt_landmark_hands/prep_sequence_resampled.py
Executable file
@@ -0,0 +1,71 @@
#!/usr/bin/env python3
# prep_sequence_resampled.py
# Build a fixed-length (N frames) multiclass dataset from sequences/<split>/<CLASS>/clip_*.npz
import argparse, os, glob, json
from pathlib import Path
import numpy as np


def resample_sequence(X, N=32):
    # X: (T,63) -> (N,63) by linear interpolation along frame index
    T = len(X)
    if T == 0:
        return np.zeros((N, X.shape[1]), np.float32)
    if T == 1:
        return np.repeat(X, N, axis=0)
    src = np.linspace(0, T-1, num=T)
    dst = np.linspace(0, T-1, num=N)
    out = np.zeros((N, X.shape[1]), np.float32)
    for d in range(X.shape[1]):
        out[:, d] = np.interp(dst, src, X[:, d])
    return out.astype(np.float32)


def load_classes(seq_root: Path):
    # classes are subdirs in sequences/train/
    classes = sorted([p.name for p in (seq_root/"train").iterdir() if p.is_dir()])
    classes = [c for c in classes if len(c)==1 and "A"<=c<="Z"]
    if not classes:
        raise SystemExit("No letter classes found in sequences/train/")
    return classes


def collect_split(seq_root: Path, split: str, classes, N):
    Xs, ys = [], []
    for ci, cls in enumerate(classes):
        for f in sorted(glob.glob(str(seq_root/split/cls/"clip_*.npz"))):
            d = np.load(f)
            Xi = d["X"].astype(np.float32)  # (T,63)
            XiN = resample_sequence(Xi, N)  # (N,63)
            Xs.append(XiN); ys.append(ci)
    if Xs:
        X = np.stack(Xs, 0)
        y = np.array(ys, np.int64)
    else:
        X = np.zeros((0, N, 63), np.float32); y = np.zeros((0,), np.int64)
    return X, y


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--in", dest="in_dir", default="sequences", help="Root sequences/ with train/ and val/")
    ap.add_argument("--out", default="landmarks_seq32", help="Output folder with npy files")
    ap.add_argument("--frames", type=int, default=32, help="Frames per clip after resampling (default: 32)")
    args = ap.parse_args()

    seq_root = Path(args.in_dir)
    outdir = Path(args.out)
    outdir.mkdir(parents=True, exist_ok=True)

    classes = load_classes(seq_root)
    trX, trY = collect_split(seq_root, "train", classes, args.frames)
    vaX, vaY = collect_split(seq_root, "val", classes, args.frames)

    np.save(outdir/"train_X.npy", trX)
    np.save(outdir/"train_y.npy", trY)
    np.save(outdir/"val_X.npy", vaX)
    np.save(outdir/"val_y.npy", vaY)
    json.dump(classes, open(outdir/"class_names.json", "w"))
    json.dump({"frames": args.frames, "input_dim": 63}, open(outdir/"meta.json","w"))

    print(f"Saved dataset → {outdir}")
    print(f"  train {trX.shape}, val {vaX.shape}, classes={classes}")


if __name__ == "__main__":
    main()
136
first_attempt_landmark_hands/train_seq.py
Executable file
@@ -0,0 +1,136 @@
#!/usr/bin/env python3
# train_seq.py
import os, json, argparse
import numpy as np
import torch, torch.nn as nn
from torch.utils.data import Dataset, DataLoader


def get_device():
    return torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")


class SeqDataset(Dataset):
    def __init__(self, X, y, augment=False):
        self.X = X.astype(np.float32)  # (Nclip, T, 63)
        self.y = y.astype(np.int64)
        self.augment = augment

    def __len__(self): return len(self.y)

    def _augment(self, seq):  # seq: (T,63)
        T = seq.shape[0]
        pts = seq.reshape(T, 21, 3).copy()
        # small 2D rotation (±7°) + scale (±10%) + Gaussian noise (σ=0.01)
        ang = np.deg2rad(np.random.uniform(-7, 7))
        c, s = np.cos(ang), np.sin(ang)
        R = np.array([[c,-s],[s,c]], np.float32)
        scale = np.random.uniform(0.9, 1.1)
        pts[:, :, :2] = (pts[:, :, :2] @ R.T) * scale
        pts += np.random.normal(0, 0.01, size=pts.shape).astype(np.float32)
        return pts.reshape(T, 63)

    def __getitem__(self, i):
        xi = self.X[i]
        if self.augment:
            xi = self._augment(xi)
        return torch.from_numpy(xi).float(), int(self.y[i])


class SeqGRU(nn.Module):
    def __init__(self, input_dim=63, hidden=128, num_classes=26):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(hidden*2, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):         # x: (B,T,63)
        h, _ = self.gru(x)        # (B,T,2H)
        h_last = h[:, -1, :]      # or mean over time: h.mean(1)
        return self.head(h_last)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--landmarks", default="landmarks_seq32", help="Folder from prep_sequence_resampled.py")
    ap.add_argument("--epochs", type=int, default=40)
    ap.add_argument("--batch", type=int, default=64)
    ap.add_argument("--lr", type=float, default=1e-3)
    ap.add_argument("--out", default="asl_seq32_gru.pt")
    args = ap.parse_args()

    # Load dataset
    trX = np.load(os.path.join(args.landmarks, "train_X.npy"))  # (N, T, 63)
    trY = np.load(os.path.join(args.landmarks, "train_y.npy"))
    vaX = np.load(os.path.join(args.landmarks, "val_X.npy"))
    vaY = np.load(os.path.join(args.landmarks, "val_y.npy"))
    classes = json.load(open(os.path.join(args.landmarks, "class_names.json")))
    meta = json.load(open(os.path.join(args.landmarks, "meta.json")))
    T = int(meta["frames"])

    print(f"Loaded: train {trX.shape} val {vaX.shape} classes={classes}")

    # Global mean/std over train (time+batch)
    X_mean = trX.reshape(-1, trX.shape[-1]).mean(axis=0, keepdims=True).astype(np.float32)  # (1,63)
    X_std = trX.reshape(-1, trX.shape[-1]).std(axis=0, keepdims=True).astype(np.float32) + 1e-6
    trXn = (trX - X_mean) / X_std
    vaXn = (vaX - X_mean) / X_std

    tr_ds = SeqDataset(trXn, trY, augment=True)
    va_ds = SeqDataset(vaXn, vaY, augment=False)
    tr_dl = DataLoader(tr_ds, batch_size=args.batch, shuffle=True)
    va_dl = DataLoader(va_ds, batch_size=args.batch, shuffle=False)

    device = get_device()
    model = SeqGRU(input_dim=63, hidden=128, num_classes=len(classes)).to(device)
    crit = nn.CrossEntropyLoss()
    opt = torch.optim.AdamW(model.parameters(), lr=args.lr, weight_decay=1e-4)
    sch = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=args.epochs)

    best_acc, best_state = 0.0, None
    for epoch in range(1, args.epochs+1):
        # Train
        model.train()
        tot, correct, loss_sum = 0, 0, 0.0
        for xb, yb in tr_dl:
            xb, yb = xb.to(device), yb.to(device)
            opt.zero_grad(set_to_none=True)
            logits = model(xb)
            loss = crit(logits, yb)
            loss.backward()
            opt.step()
            loss_sum += loss.item() * yb.size(0)
            correct += (logits.argmax(1)==yb).sum().item()
            tot += yb.size(0)
        tr_loss = loss_sum / max(1, tot)
        tr_acc = correct / max(1, tot)

        # Validate
        model.eval()
        vtot, vcorrect = 0, 0
        with torch.no_grad():
            for xb, yb in va_dl:
                xb, yb = xb.to(device), yb.to(device)
                logits = model(xb)
                vcorrect += (logits.argmax(1)==yb).sum().item()
                vtot += yb.size(0)
        va_acc = vcorrect / max(1, vtot)
        sch.step()

        print(f"Epoch {epoch:02d}: train_loss={tr_loss:.4f} train_acc={tr_acc:.3f} val_acc={va_acc:.3f}")

        if va_acc > best_acc:
            best_acc = va_acc
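            # Keep the train-set normalization stats with the weights so eval and
            # live inference can apply the identical (X - X_mean) / X_std transform.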
            best_state = {
                "model": model.state_dict(),
                "classes": classes,
                "frames": T,
                "X_mean": torch.from_numpy(X_mean),  # tensors → future-proof
                "X_std": torch.from_numpy(X_std),
            }
            torch.save(best_state, args.out)
            print(f"  ✅ Saved best → {args.out} (val_acc={best_acc:.3f})")

    print("Done. Best val_acc:", best_acc)


if __name__ == "__main__":
    main()
24
first_attempt_landmark_hands/what_to_do.txt
Normal file
@@ -0,0 +1,24 @@
# 1) Create dirs
# ./make_seq_dirs.sh A B J Z

# 2) Capture clips (0.8s each by default)
python capture_sequence.py --label A --split train
python capture_sequence.py --label A --split val
python capture_sequence.py --label B --split train
python capture_sequence.py --label B --split val
python capture_sequence.py --label J --split train
python capture_sequence.py --label J --split val
python capture_sequence.py --label Z --split train
python capture_sequence.py --label Z --split val

# 3) Preprocess to 32 frames (auto-picks classes from sequences/train/*)
python prep_sequence_resampled.py --in sequences --out landmarks_seq32 --frames 32

# 4) Train GRU (multiclass on A/B/J/Z)
python train_seq.py --landmarks landmarks_seq32 --epochs 40 --batch 64 --lr 1e-3 --out asl_seq32_gru_ABJZ.pt

# 5) Live inference
python infer_seq_webcam.py --model asl_seq32_gru_ABJZ.pt --threshold 0.6 --smooth 0.2

# If you later add more letters (e.g., C, D),
# just create those folders, record clips, re-run the prep step, then train again — the pipeline will include whatever letters exist under sequences/train/.