Let’s add a custom gesture for the **ASL letter “B”** (flat hand, fingers together, thumb folded across the palm) using MediaPipe **Gesture Recognizer (Model Maker)**.
# Plan (what you’ll build)

* A custom model with a new class label, e.g. `ASL_B`, plus the required `none` class.
* A small, labeled image dataset (Model Maker will extract hand landmarks for you).
* A trained `.task` file you can drop into your Python/JS app and allowlist.

---
# 1) Pick labels

Use:

* `ASL_B` ← your new gesture
* `none` ← anything that’s not one of your target gestures (mandatory)

Folder layout:

```
dataset/
  ASL_B/
    ...images...
  none/
    ...images...
```
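Before collecting hundreds of shots, it can help to sanity-check the layout and per-label counts. A minimal sketch (the `dataset` path and image extensions are assumptions matching the layout above):

```python
from pathlib import Path

# Count images per label folder so you can spot an empty or misnamed class early.
DATA_DIR = Path("dataset")
EXTS = {".jpg", ".jpeg", ".png"}

for label_dir in sorted(p for p in DATA_DIR.iterdir() if p.is_dir()):
    n = sum(1 for f in label_dir.rglob("*") if f.suffix.lower() in EXTS)
    print(f"{label_dir.name}: {n} images")
```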
---
# 2) Collect the right data (what to capture)

Target handshape for **B**:

* **Fingers**: index–pinky fully extended and **pressed together**
* **Thumb**: folded across the palm (not sticking out to the side)
* **Palm**: facing the camera (front), plus a few other angles
Suggested minimums (per label):

| Bucket | Shots |
| --- | --- |
| Distances: close (~40–60 cm), medium (~80–120 cm) | 80 |
| View angles: front, ~30°, ~60° yaw | 80 |
| Rotations: slight roll/tilt | 40 |
| Lighting: bright, dim, backlit | 40 |
| Backgrounds: plain wall, cluttered office/outdoor | 40 |
| Hands: left & right (both) | included across all |
| Skin tones / several people | as many as practical |
Collect **at least ~300–500** `ASL_B` images to start.

For **`none`**, include: open palm (“High-Five”), slightly spread fingers, thumbs-up, fist, pointing, random objects/background frames, and other ASL letters, especially **Open_Palm** look-alikes, so the model learns “not B”.
Quick ways to get images:

* Record short clips on laptop/phone and extract frames (e.g., 2 fps).
* Ask 3–5 colleagues to contribute a short 10–20 s clip each.
Frame extraction example:

```bash
# Extract 2 frames/sec from a video into dataset/ASL_B/
ffmpeg -i b_sign.mov -vf fps=2 dataset/ASL_B/b_%05d.jpg

# Do the same for negatives into dataset/none/
```
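If ffmpeg isn’t handy, roughly the same sampling can be done with OpenCV. A sketch (the file name and ~2 fps rate mirror the ffmpeg example; everything else is an assumption):

```python
import cv2

# Sample ~2 frames per second from b_sign.mov into dataset/ASL_B/.
cap = cv2.VideoCapture("b_sign.mov")
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
step = max(int(fps / 2), 1)              # keep every Nth frame for ~2 fps

i = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if i % step == 0:
        cv2.imwrite(f"dataset/ASL_B/b_{saved:05d}.jpg", frame)
        saved += 1
    i += 1
cap.release()
print(f"Saved {saved} frames")
```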
---
# 3) Train with Model Maker (Python)

Create and activate a venv, then:

```bash
pip install --upgrade pip
pip install mediapipe-model-maker
```

Training script (save as `train_asl_b.py` and run it):
```python
from mediapipe_model_maker import gesture_recognizer as gr

DATA_DIR = "dataset"
EXPORT_DIR = "exported_model"

# Load & auto-preprocess (runs hand detection, keeps images with a detected hand)
data = gr.Dataset.from_folder(
    dirname=DATA_DIR,
    hparams=gr.HandDataPreprocessingParams(  # you can tweak these if needed
        min_detection_confidence=0.5
    ),
)

# Split: 80% train, then half of the rest each for validation and test
train_data, rest = data.split(0.8)
validation_data, test_data = rest.split(0.5)

# Hyperparameters (start small; bump epochs if needed)
hparams = gr.HParams(
    export_dir=EXPORT_DIR,
    epochs=12,
    batch_size=16,
    learning_rate=0.001,
)

# Optional model head size & dropout
options = gr.GestureRecognizerOptions(
    hparams=hparams,
    model_options=gr.ModelOptions(layer_widths=[128, 64], dropout_rate=0.1),
)

model = gr.GestureRecognizer.create(
    train_data=train_data,
    validation_data=validation_data,
    options=options,
)

# Evaluate
loss, acc = model.evaluate(test_data, batch_size=1)
print(f"Test loss={loss:.4f}, acc={acc:.4f}")

# Export .task
model.export_model()  # writes exported_model/gesture_recognizer.task
print("Exported:", EXPORT_DIR + "/gesture_recognizer.task")
```
Tips:

* If many `ASL_B` images get dropped at load time (no hand detected), back up the camera a little or ensure the whole hand is visible.
* If `none` is weak, add more “near-miss” negatives: open palm with fingers slightly apart, thumb slightly out, partial occlusions.
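The script above only reports aggregate accuracy. For the per-class breakdown mentioned in §5, one option is to run the exported `.task` over held-out images with the Tasks API and tally a confusion matrix. A rough sketch, assuming a hypothetical `holdout/` folder that mirrors the `dataset/` layout:

```python
from collections import Counter
from pathlib import Path
import mediapipe as mp

BaseOptions = mp.tasks.BaseOptions
vision = mp.tasks.vision

options = vision.GestureRecognizerOptions(
    base_options=BaseOptions(model_asset_path="exported_model/gesture_recognizer.task"),
    running_mode=vision.RunningMode.IMAGE,
)

confusion = Counter()
with vision.GestureRecognizer.create_from_options(options) as recognizer:
    for label_dir in Path("holdout").iterdir():  # holdout/ASL_B, holdout/none
        if not label_dir.is_dir():
            continue
        for img_path in label_dir.glob("*.jpg"):
            result = recognizer.recognize(mp.Image.create_from_file(str(img_path)))
            if result.gestures:
                # Top gesture for the first detected hand; name may be empty below threshold
                pred = result.gestures[0][0].category_name or "none"
            else:
                pred = "none"  # no hand detected
            confusion[(label_dir.name, pred)] += 1

for (truth, pred), n in sorted(confusion.items()):
    print(f"true={truth:8s} pred={pred:8s} count={n}")
```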
---
# 4) Plug it into your app

**Python (Tasks API example):**
```python
import mediapipe as mp

BaseOptions = mp.tasks.BaseOptions
GestureRecognizer = mp.tasks.vision.GestureRecognizer
GestureRecognizerOptions = mp.tasks.vision.GestureRecognizerOptions
VisionRunningMode = mp.tasks.vision.RunningMode
ClassifierOptions = mp.tasks.components.processors.ClassifierOptions

def on_result(result, output_image, timestamp_ms):
    # LIVE_STREAM mode delivers results asynchronously through this callback
    if result.gestures:
        top = result.gestures[0][0]
        print(timestamp_ms, top.category_name, f"{top.score:.2f}")

options = GestureRecognizerOptions(
    base_options=BaseOptions(model_asset_path="exported_model/gesture_recognizer.task"),
    running_mode=VisionRunningMode.LIVE_STREAM,
    result_callback=on_result,  # required in LIVE_STREAM mode
    custom_gesture_classifier_options=ClassifierOptions(
        score_threshold=0.6,          # tighten until false positives drop
        category_allowlist=["ASL_B"]  # only report your class
    ),
)
recognizer = GestureRecognizer.create_from_options(options)
```
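To actually feed it frames, LIVE_STREAM mode expects `recognize_async` calls with monotonically increasing timestamps. A minimal webcam-loop sketch continuing from the snippet above (the camera index and loop structure are assumptions):

```python
import time
import cv2

cap = cv2.VideoCapture(0)  # default webcam
try:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects SRGB; OpenCV delivers BGR
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb)
        recognizer.recognize_async(mp_image, int(time.monotonic() * 1000))
finally:
    cap.release()
    recognizer.close()
```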
**Web (JS):**
```js
const recognizer = await GestureRecognizer.createFromOptions(fileset, {
  baseOptions: { modelAssetPath: "exported_model/gesture_recognizer.task" },
  runningMode: "VIDEO", // the JS API has no LIVE_STREAM mode; use VIDEO + recognizeForVideo()
  customGesturesClassifierOptions: {
    scoreThreshold: 0.6,
    categoryAllowlist: ["ASL_B"]
  }
});
```
---
# 5) Troubleshooting & tuning

* **False positives with Open Palm:** Add more `none` examples where the fingers are together but the **thumb is visible** to the side. The model needs to see “almost B but not B.”
* **Left vs. right hand:** Include both in training. If you only trained on right hands, left hands may underperform.
* **Distance issues:** If far-away hands fail, capture more medium/far shots. Landmarks get noisier when the hand is small in frame.
* **Thresholds:** Raise `score_threshold` to reduce spurious detections; lower it if you miss true B’s.
* **Confusion matrix:** If accuracy is fine but live results wobble, collect more data from the exact camera/lighting you’ll use (a simple smoothing sketch follows below).
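
Temporal smoothing isn’t part of Model Maker or the Tasks API; it’s a common add-on for stabilizing wobbly live output. The idea: keep the last N per-frame predictions and only report a gesture once it dominates the window. A minimal sketch (the window size and vote threshold are arbitrary assumptions to tune):

```python
from collections import Counter, deque

class GestureSmoother:
    """Report a gesture only when it wins a majority of recent frames."""

    def __init__(self, window=10, min_votes=7):
        self.history = deque(maxlen=window)
        self.min_votes = min_votes

    def update(self, label):
        # label is e.g. "ASL_B", or None for frames with no recognized gesture
        self.history.append(label)
        winner, votes = Counter(self.history).most_common(1)[0]
        return winner if winner is not None and votes >= self.min_votes else None

# Usage inside the LIVE_STREAM result callback from §4:
# smoother = GestureSmoother()
# stable = smoother.update(top.category_name if result.gestures else None)
# if stable:
#     print("stable gesture:", stable)
```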
---