VidCraft supports two AI avatar platforms with contrasting content models:
| Aspect | HeyGen | Synthesia |
|---|---|---|
| Model | Scene-based | Slide-based |
| Background | 1 per scene | Variable per slide |
| Char limit | ~5,000 / scene (API) | ~1,000 / slide |
| Languages | 40+ | 130+ |
| Avatars | 100+ premium, plus custom | 160+, plus custom |
| Strength | Cinematic, brand videos | Slide decks, multi-lingual |
| Skill | heygen-engineer | synthesia-engineer |
Platform-specific constraints are version-controlled in knowledge/platform-checklist.md; skills reference this file instead of duplicating its rules.
| Constraint | Detail |
|---|---|
| One background per scene | One image, video, or color per HeyGen scene. Multiple backgrounds = multiple scenes. |
| No timed text overlays | Overlays are visible for the entire scene. Timed overlays = post-production. |
| Max 5,000 characters per scene | Hard API limit. AI Studio auto-splits at ~1,000 chars/segment — no manual splitting needed for length. Only split manually for background or avatar changes. |
| Max scenes per video | Plan-dependent; check current limit. |
| One avatar per scene | No multi-avatar scenes. |
Important: Pause markers and SSML tags only work with Custom Voices (voice clones, ElevenLabs, OpenAI Voices). The public HeyGen Voice Library silently ignores all pause syntax.
| Marker | HeyGen behavior | Requirement |
|---|---|---|
| `[pause 0.5s]` | 0.5-second pause | Custom Voice only |
| `[pause 1s]` | 1-second pause | Custom Voice only |
| Paragraph break | ~0.5-second pause (default) | All voices |
Fallback for public voices: Use punctuation-based pacing instead:
- `,` → short pause (~300 ms)
- `.` → longer pause (~600 ms) with falling intonation
- `-` → syllable break for pronunciation clarity
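For public Library voices, another workable fallback is to convert pause markers into paragraph breaks, which all voices honor at roughly 0.5 s each (see the table above). A minimal sketch; the helper name is hypothetical, and the assumption that consecutive breaks stack additively is untested:

```python
import re

def degrade_pauses_for_public_voice(script: str) -> str:
    """Replace [pause Ns] markers with paragraph breaks (~0.5 s each).

    Public HeyGen Library voices silently ignore pause syntax but honor
    paragraph breaks. Stacking breaks to approximate longer pauses is an
    assumption; verify on a short test scene first.
    """
    def to_breaks(match: re.Match) -> str:
        seconds = float(match.group(1))
        n_breaks = max(1, round(seconds / 0.5))  # one break ≈ 0.5 s
        return "\n\n" * n_breaks

    return re.sub(r"\s*\[pause\s+([0-9.]+)s\]\s*", to_breaks, script)

print(degrade_pauses_for_public_voice("Ready? [pause 1s] Let's go."))
```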
Split a scene only when:
- the background changes, or
- the avatar changes.

The 5,000-char API limit is rarely hit in practice. AI Studio auto-splits long segments at ~1,000 chars, so no manual intervention is needed for length alone.

When splitting, cut at a sentence boundary so the voice cadence stays natural (see the sketch below).
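A simplified illustration of sentence-boundary splitting under a character budget; this is a sketch, not HeyGen's actual auto-split algorithm:

```python
import re

def split_scene(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack whole sentences into segments under max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) > max_chars and current:
            segments.append(current)
            current = sentence  # start a new segment with this sentence
        else:
            current = candidate  # a single oversized sentence stays whole
    if current:
        segments.append(current)
    return segments
```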
These effects cannot be created in HeyGen; they must go to post-production in Shotcut/Kdenlive/Premiere. The most notable case is timed text overlays (see the constraints table above). Mark them in the script with [post-production ...] syntax.
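Because these effects happen outside HeyGen, it helps to pull the markers out of the script before the edit session. A small sketch built on the [post-production ...] syntax; the function name is hypothetical:

```python
import re

def collect_post_production_notes(script: str) -> list[str]:
    """Return the contents of every [post-production ...] marker."""
    return re.findall(r"\[post-production\s+([^\]]+)\]", script)

notes = collect_post_production_notes(
    "Install the plugin. [post-production timed overlay: 'Step 1' at 0:04]"
)
print(notes)  # ["timed overlay: 'Step 1' at 0:04"]
```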
Control vocal tone per scene via emotion presets or natural language prompts (HeyGen AI Studio → Voice → Voice Director).
| Preset | Best For |
|---|---|
| Casual | Tutorials, developer content |
| Calm | Support, step-by-step explanations |
| Excited | Product launches, CTAs |
| Serious | Compliance, authoritative content |
| Cool | Thought leadership, brand |
Free-form alternative: "Speak in a warm, encouraging tone." — set as a natural language prompt.
Always set Voice Director explicitly; the default neutral tone rarely matches the content mood.
HeyGen Avatar IV (May 2025) supports custom gesture control via natural language Motion Prompts.
Syntax: `[Body part] + [Action] + [Emotion/Intensity]`

- "Right arm raises to wave enthusiastically."
- "Nods gently to emphasize agreement."
- "Points forward with confidence."
- "Looks surprised and raises eyebrows."
- "Avatar smiles softly while raising a hand."
HeyGen Template API supports personalized video generation via {{variable_name}} placeholders.
Syntax in script:

```
{{first_name}}, welcome to {{company_name}}!
Your plan: {{plan_name}} — renews on {{renewal_date}}.
```
| Variable type | Use case |
|---|---|
| `text` | Names, dates, plan names, any dynamic text |
| `image` | Logo, product shot per recipient |
| `video` | Personalized intro clip |
| `audio` | Custom greeting |
| `avatar` | Different avatar per recipient |
Naming convention: always {{snake_case}} — no spaces, no camelCase.
heygen_format_script automatically detects {{variables}} and lists them in the output. All variables in the script must be declared in HeyGen Template API before generating — undeclared variables render as literal {{variable_name}} text on screen.
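As a rough sketch of what that detection step can look like (illustrative; not the actual heygen_format_script implementation):

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def find_template_variables(script: str) -> dict[str, bool]:
    """Map each {{variable}} in the script to whether it is snake_case."""
    names = set(re.findall(r"\{\{\s*([^{}]+?)\s*\}\}", script))
    return {name: bool(SNAKE_CASE.match(name)) for name in names}

script = "{{first_name}}, welcome to {{company_name}}! Plan: {{planName}}."
for name, ok in sorted(find_template_variables(script).items()):
    print(f"{name}: {'ok' if ok else 'violates snake_case'}")
```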
⚠️ Community-verified only — NOT in official HeyGen documentation. Works with Custom Voice Clones, ElevenLabs, OpenAI Voices only. Test with a short scene before applying to a full video.
```xml
<prosody rate="x-slow">...</prosody>    <!-- x-slow, slow, medium, fast, x-fast -->
<prosody pitch="high">...</prosody>     <!-- x-low, low, medium, high, x-high -->
<prosody volume="loud">...</prosody>    <!-- silent, x-soft, soft, medium, loud, x-loud -->
<emphasis level="strong">...</emphasis> <!-- strong, moderate, reduced -->
<p>...</p>                              <!-- Paragraph pause (~400-800ms) -->
<s>...</s>                              <!-- Sentence pause (~200-400ms) -->
```

Not supported: `<phoneme>`, `<audio>`, `<lang>` (partial).
Use `<emphasis>` instead of ALL-CAPS for emphasis — more portable, cleaner script. Only use prosody tags when the user explicitly opts in; always include the community disclaimer in output when these tags are used.
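Given that `<phoneme>`, `<audio>`, and `<lang>` are unsupported, a pre-flight cleanup can strip them while keeping the inner text. A minimal sketch; the helper is hypothetical and not part of VidCraft:

```python
import re

UNSUPPORTED_TAGS = ("phoneme", "audio", "lang")  # per the list above

def strip_unsupported_ssml(script: str) -> str:
    """Remove unsupported SSML tags but keep the text they wrap."""
    for tag in UNSUPPORTED_TAGS:
        script = re.sub(rf"</?{tag}\b[^>]*>", "", script)
    return script

print(strip_unsupported_ssml('<lang xml:lang="de-DE">Hallo</lang> and welcome.'))
# -> "Hallo and welcome."
```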
Each scene needs YAML front matter declaring the avatar (selected via /vidcraft:avatar-selector), voice, Voice Director preset, background, avatar position, speed, and an optional motion prompt:

```markdown
---
scene_id: 01-hook
heygen_avatar: "Anna_Professional_Front"
voice: "en-US-JennyNeural"
voice_director: "Calm"
background: "office-modern.jpg"
avatar_position: "left"
speed: 1.0
motion_prompt: "Nods gently to emphasize agreement."
---
# Scene 01 — Hook
In this video I'll show you how to install and configure
the OXID Gallery plugin in under ten minutes.
[pause 0.5s]
Ready? Let's go.
```
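A scene file in this shape can be read with standard YAML tooling. A minimal sketch assuming PyYAML is installed; the file path is hypothetical:

```python
from pathlib import Path

import yaml  # PyYAML

def parse_scene(path: str) -> tuple[dict, str]:
    """Split a scene file into its YAML front matter and script body."""
    raw = Path(path).read_text(encoding="utf-8")
    _, front_matter, body = raw.split("---", 2)  # leading "---" yields ""
    return yaml.safe_load(front_matter), body.strip()

meta, script = parse_scene("scenes/01-hook.md")
print(meta["heygen_avatar"], meta["voice_director"])
```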
| Constraint | Detail |
|---|---|
| Max ~1,000 characters per slide | Slide-based; split at sentence boundary if exceeded. |
| Max slides per video | 150 (PowerPoint import: also 150 slides). |
| Languages | 130+ supported. |
| Slide-based scene structure | 1 scene typically maps to 1 slide. |
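A minimal pre-flight check against these limits might look like the following; the numbers come from the table above, the function itself is illustrative:

```python
MAX_SLIDES = 150
MAX_CHARS_PER_SLIDE = 1000

def validate_synthesia_limits(slides: list[str]) -> list[str]:
    """Return a human-readable list of limit violations."""
    errors = []
    if len(slides) > MAX_SLIDES:
        errors.append(f"{len(slides)} slides exceeds the {MAX_SLIDES}-slide limit")
    for i, text in enumerate(slides, start=1):
        if len(text) > MAX_CHARS_PER_SLIDE:
            errors.append(
                f"slide {i}: {len(text)} chars, split at a sentence boundary"
            )
    return errors
```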
Embed inline gesture tags in script text to trigger avatar animations:

- `[gesture:nod]` — Nod (agreement)
- `[gesture:headyes]` — Head up/down twice
- `[gesture:headno]` — Head left/right (disagreement)
- `[gesture:eyebrowsup]` — Raised eyebrows (surprise/emphasis)
- `[gesture:increase]` — Arm gesture for growth/expansion

Example: "We are seeing [gesture:increase] huge growth this quarter."
Important: Gesture tags are for Express-1 avatars only. Express-2 generates gestures automatically — do not add gesture tags to Express-2 scripts.
Synthesia released Express-2 — a Diffusion Transformer-based model that changes how gestures and expressions work.
| Feature | Express-1 | Express-2 |
|---|---|---|
| Gestures | Manual `[gesture:tag]` syntax | Automatic from script context |
| Expressions | Sentiment-driven | Full-body co-speech gestures |
| Body language | Upper-body only | Full-body movement |
| Script requirements | Explicit gesture tags | Strong verbs + concrete actions |
Writing for Express-2: Use active, concrete language — passive/abstract scripts produce no gestures (avatar appears stiff). Example: "Click the button" (active, gestures triggered) vs. "The button should be clicked" (passive, stiff avatar).
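When moving an Express-1 script to Express-2, leftover gesture tags should be removed rather than rendered. A small sketch (not a Synthesia API):

```python
import re

GESTURE_TAG = re.compile(r"\[gesture:[a-z]+\]\s?")

def prepare_for_express2(script: str) -> str:
    """Strip Express-1 gesture tags; Express-2 derives gestures from context."""
    return GESTURE_TAG.sub("", script)

print(prepare_for_express2("We are seeing [gesture:increase] huge growth this quarter."))
# -> "We are seeing huge growth this quarter."
```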
Each Synthesia slide needs YAML front matter declaring the avatar, voice, layout, background, text overlay, and optional media (see the example below).
Split when over 1,000 characters: cut at a sentence boundary, as noted in the constraints table above.
| Scene Type | Recommended Layout |
|---|---|
| Intro / Outro | Avatar center, branded background |
| Explanation | Avatar left, key points right |
| Screencast | Screen recording full, avatar overlay corner |
| Comparison | Split screen, before/after |
| Summary | Text-only with bullet points |
| CTA | Avatar center, CTA text overlay |
```markdown
---
scene_id: 01-hook
synthesia_avatar: "Mia_Casual"
voice: "en-US-Mia"
layout: "avatar-left-text-right"
background: "solid-color-#2563EB"
text_overlay: "Install OXID Gallery"
media: ""
---
# Slide 01 — Hook
Want to add a modern gallery feature to your OXID shop?
In the next few minutes I'll show you how.
```
For large projects it can make sense to render the same episodes on both platforms. VidCraft makes this easy: the source script stays the same; only the engineer skills (heygen-engineer vs. synthesia-engineer) produce platform-specific outputs.
The skill /vidcraft:avatar-selector recommends avatars based on:
| Criterion | Influence |
|---|---|
| Audience | Demographics, industry, age, language |
| Video type | Tutorial = factual; marketing = energetic |
| Brand | Predefined personas, voice IDs |
| Language | Native voice match |
| Platform | Platform-specific avatar IDs |
| Avatar generation | Express-1 vs. Express-2 for Synthesia; Avatar IV for HeyGen |
Recommended avatars for "OXID Gallery Tutorial" (HeyGen, EN):
1. Anna_Professional_Front (HeyGen Avatar IV)
- Persona: factual, trustworthy
- Voice: en-US-JennyNeural
- Voice Director: Calm
- Best for: tutorials, trainings, B2B
- Motion Prompt suggestion: "Nods gently to emphasize agreement."
2. Marcus_Casual_Side (HeyGen)
- Persona: relaxed, approachable
- Voice: en-US-GuyNeural
- Voice Director: Casual
- Best for: onboarding, how-to, explainers
Recommendation: Anna_Professional_Front
Reason: tutorial type + technical audience + B2B context.
These behaviors are the same on both platforms:
`validate_platform_limits` (MCP tool) checks scripts against these constraints:

```
/vidcraft:pre-generation-check oxid-gallery-tutorial 01-installation
```
Output on violation:

```
❌ HeyGen Validation Failed
Scene 07: 2 backgrounds detected
  → Background "office.jpg" AND "screen-recording.mp4"
  → Fix: split scene or remove one background
Scene 12: Avatar switch inside scene
  → "Anna_Professional" → "Marcus_Casual"
  → Fix: split scene
⚠️ Gate 10: SSML prosody tags found but no Custom Voice set
  → Prosody tags require a Custom Voice (clone, ElevenLabs, OpenAI)
⚠️ Gate 11: Undeclared variables: {{company_name}}, {{plan_name}}
  → Declare all variables in HeyGen Template API before generating
```
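The tool's internals are not documented here, but a simplified sketch of per-scene checks like the ones above could look as follows (illustrative only):

```python
from dataclasses import dataclass, field

MAX_SCENE_CHARS = 5000  # HeyGen API hard limit

@dataclass
class Scene:
    scene_id: str
    backgrounds: list[str] = field(default_factory=list)
    avatars: list[str] = field(default_factory=list)
    text: str = ""

def check_heygen_scene(scene: Scene) -> list[str]:
    """Flag violations of the one-background, one-avatar, 5,000-char rules."""
    problems = []
    if len(scene.backgrounds) > 1:
        problems.append(f"{scene.scene_id}: {len(scene.backgrounds)} backgrounds, max is 1")
    if len(set(scene.avatars)) > 1:
        problems.append(f"{scene.scene_id}: avatar switch inside scene")
    if len(scene.text) > MAX_SCENE_CHARS:
        problems.append(f"{scene.scene_id}: over {MAX_SCENE_CHARS} characters")
    return problems
```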