
WAN 2.2 Speech-to-Video

WAN 2.2 Speech-to-Video costs $1.20/clip on FairStack — an audio-driven model for talking head video creation, podcast video generation, and virtual presenter content. No subscription required. Pay per generation with full REST API access. FairStack applies a transparent 20% margin on infrastructure cost so you always see the real price.

FairStack price
$1.20/clip

What is WAN 2.2 Speech-to-Video?

WAN 2.2 Speech-to-Video is Alibaba's talking head generation model that creates realistic video from speech audio and a reference portrait image. It synthesizes natural facial expressions, lip movements, head motion, and subtle micro-expressions synchronized to the provided audio, producing convincing presenter-style video. The output goes beyond basic lip sync: head tilts, eye movements, brow expressions, and other facial dynamics make it more lifelike, and audio-to-video synchronization is tight, with lip movements closely tracking the phonemes in the speech input. Compared to basic lip sync models that only animate the mouth area, WAN 2.2 Speech-to-Video generates fuller facial animation for more engaging output; compared to text-to-video talking head models, the audio-driven approach preserves the speaker's actual voice and delivery. It is best suited for talking head video creation, podcast video generation, and virtual presenter content where realistic facial animation synchronized to speech audio creates convincing presenter videos. Available on FairStack at infrastructure cost plus a 20% platform fee.

Key Features

Speech-driven video generation
Natural facial expressions
Audio-driven lip sync

What are WAN 2.2 Speech-to-Video's strengths?

Realistic talking head generation
Good audio synchronization

What are WAN 2.2 Speech-to-Video's limitations?

Requires both image and audio input
Limited to talking head content

What is WAN 2.2 Speech-to-Video best for?

Talking head video creation
Podcast video generation
Virtual presenter content

How much does WAN 2.2 Speech-to-Video cost?

Metric                 FairStack      Details
Price per generation   $1.20          Includes 20% margin
Per-second rate        $0.2000/sec    Billed per second of output
Subscription           None           Pay per generation only
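The per-second rate above implies that the $1.20/clip headline price corresponds to a 6-second clip. A minimal sketch of estimating a clip's cost from its duration — the helper name and the rounding behavior are assumptions for illustration, not part of any FairStack SDK:

```python
# Estimate the FairStack cost of a WAN 2.2 Speech-to-Video clip.
# The rate comes from the pricing table above; how FairStack handles
# fractional seconds is an assumption here (simple rounding to cents).

PER_SECOND_RATE = 0.20  # USD per second of output video

def estimate_clip_cost(duration_seconds: float) -> float:
    """Return the estimated price in USD, billed per second of output."""
    return round(duration_seconds * PER_SECOND_RATE, 2)

print(estimate_clip_cost(6))   # 1.2  -- matches the $1.20/clip headline price
print(estimate_clip_cost(10))  # 2.0
```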

How do I use the WAN 2.2 Speech-to-Video API?

curl
curl -X POST https://api.fairstack.ai/v1/generations/talkingHead \
  -H "Authorization: Bearer $FAIRSTACK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wan-2-2-speech-to-video",
    "image_url": "https://example.com/portrait.jpg",
    "audio_url": "https://example.com/speech.mp3"
  }'
Python
import requests

response = requests.post(
    "https://api.fairstack.ai/v1/generations/talkingHead",
    headers={
        "Authorization": f"Bearer {FAIRSTACK_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "model": "wan-2-2-speech-to-video",
        "image_url": "https://example.com/portrait.jpg",
        "audio_url": "https://example.com/speech.mp3",
    },
)

result = response.json()
print(result["url"])
Node.js
const response = await fetch(
  "https://api.fairstack.ai/v1/generations/talkingHead",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.FAIRSTACK_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "wan-2-2-speech-to-video",
      image_url: "https://example.com/portrait.jpg",
      audio_url: "https://example.com/speech.mp3",
    }),
  }
);

const result = await response.json();
console.log(result.url);

What parameters does WAN 2.2 Speech-to-Video support?

Parameter   Type     Default   Details
image_url   string   —         Required; URL of the reference portrait image
audio_url   string   —         Required; URL of the speech audio that drives the animation
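Since both parameters are required (see limitations above), it can be useful to validate inputs before sending a request. A minimal sketch of building the request body — the helper function is a hypothetical convenience, not part of the FairStack API:

```python
# Build the JSON payload for a WAN 2.2 Speech-to-Video request.
# Field names come from the parameters table above; the validation
# helper itself is a hypothetical convenience, not a FairStack SDK call.

def build_payload(image_url: str, audio_url: str) -> dict:
    """Return a request body, raising if either required input is missing."""
    if not image_url or not audio_url:
        raise ValueError("WAN 2.2 Speech-to-Video requires both image and audio input")
    return {
        "model": "wan-2-2-speech-to-video",
        "image_url": image_url,
        "audio_url": audio_url,
    }

payload = build_payload(
    "https://example.com/portrait.jpg",  # reference portrait image
    "https://example.com/speech.mp3",    # speech audio driving the animation
)
print(payload["model"])  # wan-2-2-speech-to-video
```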

Frequently Asked Questions

How much does WAN 2.2 Speech-to-Video cost?

WAN 2.2 Speech-to-Video costs $1.20/clip on FairStack as of 2026-03-23. This price includes FairStack's transparent 20% margin on infrastructure cost. No subscription or monthly fee — you pay per generation only. Minimum deposit is $1.

What is WAN 2.2 Speech-to-Video and what is it best for?

WAN 2.2 Speech-to-Video is Alibaba's talking head generation model that creates realistic video from speech audio and a reference portrait image, synthesizing natural facial expressions, lip movements, head motion, and micro-expressions synchronized to the audio. It is best for talking head video creation, podcast video generation, and virtual presenter content where realistic facial animation synchronized to speech creates convincing presenter videos. Available via FairStack's REST API with curl, Python, and Node.js examples.

Does WAN 2.2 Speech-to-Video have an API?

Yes. WAN 2.2 Speech-to-Video is available via FairStack's REST API at api.fairstack.ai. Send a POST request to /v1/generations/talkingHead with your API key and prompt. Works with curl, Python requests, Node.js fetch, and any HTTP client. No SDK installation required.

How does WAN 2.2 Speech-to-Video compare to other talking head models?

WAN 2.2 Speech-to-Video excels at talking head video creation, podcast video generation, and virtual presenter content. It is an audio-driven model priced at $1.20/clip on FairStack. Key strengths: realistic talking head generation and good audio synchronization. Compare all talking head models at fairstack.ai/models.

What makes WAN 2.2 Speech-to-Video stand out from other video models?

WAN 2.2 Speech-to-Video is distinguished by realistic talking head generation and good audio synchronization.

What are the known limitations of WAN 2.2 Speech-to-Video?

Key limitations include: requires both image and audio input; limited to talking head content. FairStack documents these transparently so you can choose the right model for your workflow.

What video capabilities does WAN 2.2 Speech-to-Video offer?

WAN 2.2 Speech-to-Video offers: speech-driven video generation; natural facial expressions; audio-driven lip sync. All capabilities are accessible through both the FairStack web interface and REST API.

Start using WAN 2.2 Speech-to-Video today

$1.20/clip. Full API access. No subscription.

Start Creating