Voice Cloning
Clone a timbre from a reference audio, then synthesize any text (v2 endpoint, tested live)
Voice cloning takes a reference audio clip (the target timbre) plus the text to synthesize, and returns
speech that reads the text in that voice. The call is asynchronous: submit the request to get a task_id, then poll
the unified status endpoint for the result.
Tested live: the examples below are real calls to
api.aiclonevoicefree.com, completing in ~7s.
POST /api/v2/voice/clone
Auth: Authorization: Bearer sk_... (see Authentication)
Base parameters (common to all models)
These are all you need to get going. Each model's extra parameters are listed under Which model should I use? below.
| Field | Type | Required | Notes |
|---|---|---|---|
| text | string | ✅ | Text to synthesize |
| reference_audio_url | string | ✅ | Reference timbre URL (wav/mp3, publicly reachable) |
| model | string | ⬜ | Model, default v2-emotion; also v1-real / v3-qwen / omni / voxcpm |
| speed_ratio | number | ⬜ | Speed, default 1.0 |
| pitch_ratio | number | ⬜ | Pitch, default 0 |
| volume_ratio | number | ⬜ | Volume, default 1.0 |
| generate_subtitle | bool | ⬜ | Produce subtitles |
| subtitle_language | string | ⬜ | Subtitle language |
| compact_mode | bool | ⬜ | Compact mode (trim inter-sentence silence) |
| compact_max_silence_ms | int | ⬜ | Max silence ms under compact mode |
| metadata | object | ⬜ | Passthrough fields, echoed back in the result |
There's also voice_id (a saved voice id; the voice catalog is Phase 2, so use reference_audio_url for now). The model-specific instructions and omni_options are documented in each model's section.
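The base parameters above can be assembled into a request body with a small helper. This is a minimal sketch, not an official client: the function name `build_clone_payload` and the client-side model check are assumptions for illustration, and the `metadata` value is made up.

```python
import json

# Model names as listed in the base parameter table.
MODELS = {"v1-real", "v2-emotion", "v3-qwen", "omni", "voxcpm"}


def build_clone_payload(text, reference_audio_url, model="v2-emotion", **options):
    """Assemble the JSON body for POST /api/v2/voice/clone.

    Optional base parameters (speed_ratio, metadata, ...) are passed as keyword
    arguments; any left as None are omitted so the server defaults apply.
    """
    if not text:
        raise ValueError("text is required")
    if not reference_audio_url:
        raise ValueError("reference_audio_url is required")
    if model not in MODELS:  # assumption: checking locally before sending
        raise ValueError(f"unknown model: {model!r}")
    payload = {"text": text, "reference_audio_url": reference_audio_url, "model": model}
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


body = build_clone_payload(
    "Hello, this is a v2 voice-clone test.",
    "https://oss.aiclonevoicefree.com/trump.wav",
    speed_ratio=1.0,
    metadata={"order_id": "demo-1"},  # hypothetical passthrough data
)
print(json.dumps(body, indent=2))
```

The resulting dictionary is what goes into the `-d` argument of the curl call, serialized as JSON.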
curl -X POST https://api.aiclonevoicefree.com/api/v2/voice/clone \
-H "Authorization: Bearer sk_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"text": "Hello, this is a v2 voice-clone test.",
"reference_audio_url": "https://oss.aiclonevoicefree.com/trump.wav",
"model": "v2-emotion"
}'
Response 202 Accepted (unified response format, see Tasks)
{
"task_id": "b28f0341-3045-42e5-ad63-a5ccf4a1088e",
"status": "pending",
"capability": "voice",
"action": "clone",
"model": "v2-emotion",
"error": null,
"created_at": 1780704348,
"completed_at": null
}
Getting the result
Poll GET /api/v2/tasks/{task_id}; when status becomes completed, the result fields
are on the top level (no nested result wrapper):
{
"status": "completed",
"capability": "voice",
"action": "clone",
"audioUrl": "https://oss.aiclonevoicefree.com/tts/2026-06-06/xxxx.wav",
"format": "wav",
"degraded": false,
"providerUsed": "tts-v2-emotion-1",
"completed_at": 1780704355
}
| Result field | Notes |
|---|---|
| audioUrl | Downloadable URL of the synthesized audio |
| format | wav / mp3 |
| degraded | true means a fallback model was used; quality may be slightly lower |
| providerUsed | The service actually used to generate |
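The submit-then-poll flow could be wrapped in a loop like the one below. It is a sketch using only the Python standard library. The polling interval, the timeout, and the failure check are assumptions: this page only shows the `pending` and `completed` states, so a non-null `error` or a `failed` status is treated as failure here.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.aiclonevoicefree.com"


def fetch_task(task_id, api_key):
    """GET /api/v2/tasks/{task_id} and return the decoded JSON body."""
    req = urllib.request.Request(
        f"{BASE_URL}/api/v2/tasks/{task_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_task(task_id, api_key, fetch=fetch_task, interval=2.0, timeout=120.0):
    """Poll until the task completes, reports an error, or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        task = fetch(task_id, api_key)
        if task.get("status") == "completed":
            return task  # result fields such as audioUrl sit at the top level
        # Assumption: a failed task carries a non-null "error" or status "failed".
        if task.get("error") or task.get("status") == "failed":
            raise RuntimeError(f"task {task_id} failed: {task.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {task.get('status')!r} after {timeout}s")
        time.sleep(interval)
```

Calling `wait_for_task(task_id, "sk_your_api_key")["audioUrl"]` would then give the download URL. The `fetch` parameter exists only so the loop can be exercised without network access.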
Which model should I use?
| You want | Use | In one line |
|---|---|---|
| Sound as close to the reference as possible | v1-real | Most stable realistic clone |
| Emotional delivery | v2-emotion (default) | Clone with emotion |
| Multilingual / give reading instructions | v3-qwen | Speaks many languages, follows instructions |
| Design a voice from text, no recording needed | omni / voxcpm | Just write "young female, high pitch" |
Every model needs at least text + reference_audio_url; speed/pitch/volume are common options. Below, each model only lists the extra params it needs.
v1-real — Realistic clone
Closest to the reference timbre, most stable. No extra params — just set model to v1-real in the basic example.
v2-emotion — Emotional voice (default)
The default model; output carries emotion. With no extra params, emotion follows the reference
audio. To control emotion explicitly, pass an emotion object:
| emotion field | Notes |
|---|---|
| mode | Emotion source: same_as_reference (follow reference, default) / vector (8-d vector) / text (text description) / reference_audio (separate emotion reference) / random |
| vector | For mode=vector: eight values in the range 0–1, in the order [joy, anger, sorrow, fear, excitement, depression, surprise, calm]; components should sum to ≤ 1.4 (matches the official UI); the hard cap is 1.5, and a larger sum returns emo_vec sum exceeds 1.5 |
| text | For mode=text: emotion description, e.g. "say it happily" |
| reference_audio_url | For mode=reference_audio: emotion reference audio URL |
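Because the vector is positional, it is easy to put a weight in the wrong slot or exceed the cap. The helper below is a sketch that builds the `emotion` object from named weights and checks the limits from the table before anything is sent. The function name `emotion_vector` and the local validation are assumptions, not part of the API.

```python
import warnings

# Component order as documented for mode=vector.
EMOTIONS = ("joy", "anger", "sorrow", "fear", "excitement", "depression", "surprise", "calm")
RECOMMENDED_SUM = 1.4  # limit used by the official UI
HARD_CAP = 1.5         # above this the API rejects the request


def emotion_vector(**weights):
    """Build an emotion object for mode=vector from named weights."""
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    vector = [float(weights.get(name, 0.0)) for name in EMOTIONS]
    if any(not 0.0 <= v <= 1.0 for v in vector):
        raise ValueError("each component must be between 0 and 1")
    total = sum(vector)
    if total > HARD_CAP:
        raise ValueError(f"components sum to {total:.2f}, above the hard cap of {HARD_CAP}")
    if total > RECOMMENDED_SUM:
        warnings.warn(f"components sum to {total:.2f}, above the recommended {RECOMMENDED_SUM}")
    return {"mode": "vector", "vector": vector}


emotion = emotion_vector(joy=0.9, excitement=0.1)
print(emotion)
```

The returned dictionary can be used directly as the value of the `emotion` field in the request body.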
# emotion vector: high "joy" + a bit of "excitement"
curl -X POST https://api.aiclonevoicefree.com/api/v2/voice/clone \
-H "Authorization: Bearer sk_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"text": "I am so happy today!",
"reference_audio_url": "https://oss.aiclonevoicefree.com/trump.wav",
"model": "v2-emotion",
"emotion": { "mode": "vector", "vector": [0.9, 0, 0, 0, 0.1, 0, 0, 0] }
}'
v3-qwen — Multilingual + instructions
Speaks Chinese/English/Japanese/Korean and more; use instructions to tell it how to read.
Extra param: instructions (subtitle language uses the base subtitle_language).
curl -X POST https://api.aiclonevoicefree.com/api/v2/voice/clone \
-H "Authorization: Bearer sk_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"text": "Hello, nice weather today.",
"reference_audio_url": "https://oss.aiclonevoicefree.com/trump.wav",
"model": "v3-qwen",
"instructions": "read in a calm tone",
"subtitle_language": "en"
}'
omni — Design a voice from text
The reference timbre doesn't matter — you "shape" the voice with a list of descriptors.
Extra param: omni_options (a JSON string whose key field is instruct).
Pick descriptors from the lists below and join them with commas (EN or ZH, but don't mix EN accents with ZH dialects):
- Gender: male / female (男 / 女)
- Age: child / teenager / young adult / middle-aged / elderly
- Pitch: very low / low / moderate / high / very high pitch
- Special: whisper (耳语)
- ZH dialects: 四川话, 东北话, 河南话, 陕西话, 贵州话, 云南话 …
- EN accents: american accent, british accent, japanese accent …
Example: female, young adult, high pitch (or ZH 女, 青年, 高音调).
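Since omni_options is a JSON string nested inside the JSON request body, hand-escaping the quotes is error-prone. The sketch below builds the string with `json.dumps` and applies the "don't mix EN accents with ZH dialects" rule. The helper name is an assumption, and the two sets contain only the values spelled out above, so they are partial.

```python
import json

# Only the values named in the lists above; the full lists are longer.
ZH_DIALECTS = {"四川话", "东北话", "河南话", "陕西话", "贵州话", "云南话"}
EN_ACCENTS = {"american accent", "british accent", "japanese accent"}


def omni_options(*descriptors):
    """Join descriptors with commas and wrap them as the omni_options JSON string."""
    chosen = [d.strip() for d in descriptors if d.strip()]
    if not chosen:
        raise ValueError("at least one descriptor is required")
    if ZH_DIALECTS & set(chosen) and EN_ACCENTS & set(chosen):
        raise ValueError("do not mix EN accents with ZH dialects")
    # ensure_ascii=False keeps ZH descriptors readable in the encoded string
    return json.dumps({"instruct": ", ".join(chosen)}, ensure_ascii=False)


print(omni_options("female", "young adult", "high pitch"))
```

The returned string is the value of the `omni_options` field; serializing the whole request body afterwards takes care of the second level of escaping.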
curl -X POST https://api.aiclonevoicefree.com/api/v2/voice/clone \
-H "Authorization: Bearer sk_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"text": "Hello, this voice is described, not recorded.",
"reference_audio_url": "https://oss.aiclonevoicefree.com/trump.wav",
"model": "omni",
"omni_options": "{\"instruct\": \"female, young adult, high pitch\"}"
}'
voxcpm — Dialect / instruction voice
Similar to omni, geared toward dialects. Extra params: omni_options (as above) or instructions.
Billing
Charged per character: each CJK character (Chinese/Japanese/Korean) costs 2 credits, every other character costs 1.
Examples:
- "你好世界" (4 CJK) = 4 × 2 = 8 credits
- "Hello" (5 latin) = 5 × 1 = 5 credits
- "你好,world" (3 CJK + 5 latin + 1 punct) ≈ 3×2 + 6×1 = 12 credits
Credits settle on completion — failed tasks are not charged (see Conventions). Uses voice credits, separate from video credits.
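For budgeting, the per-character rule can be approximated locally. This is only an estimate under stated assumptions: the Unicode ranges below are one guess at what counts as a CJK character, and the service's own classification (for example of full-width punctuation) may differ.

```python
# Assumed definition of "CJK character"; the service's exact rule is not documented here.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(ch):
    """Return True if the character falls in one of the assumed CJK ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def estimate_credits(text):
    """Estimate voice credits: 2 per CJK character, 1 per other character."""
    return sum(2 if is_cjk(ch) else 1 for ch in text)


print(estimate_credits("你好世界"))  # 8
print(estimate_credits("Hello"))  # 5
```

The amount actually charged is whatever the task settles at on completion, so treat this as a rough pre-check only.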