AI Audio

Voice Cloning

Clone a timbre from a reference audio, then synthesize any text (v2 endpoint, tested live)

Voice cloning takes a reference audio clip (the target timbre) plus the text to synthesize, and returns speech that reads the text in that voice. The call is asynchronous: submit the request to get a task_id, then poll the unified status endpoint for the result.

Tested live: the examples below are real calls to api.aiclonevoicefree.com, completing in ~7s.

POST /api/v2/voice/clone

Auth: Authorization: Bearer sk_... (see Authentication)

Base parameters (common to all models)

These are all you need to get going. Each model's extra parameters are listed under Which model should I use? below.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| text | string | Yes | Text to synthesize |
| reference_audio_url | string | Yes | Reference timbre URL (wav/mp3, publicly reachable) |
| model | string | No | Model, default v2-emotion; also v1-real / v3-qwen / omni / voxcpm |
| speed_ratio | number | No | Speed, default 1.0 |
| pitch_ratio | number | No | Pitch, default 0 |
| volume_ratio | number | No | Volume, default 1.0 |
| generate_subtitle | bool | No | Produce subtitles |
| subtitle_language | string | No | Subtitle language |
| compact_mode | bool | No | Compact mode (trim inter-sentence silence) |
| compact_max_silence_ms | int | No | Max silence in ms under compact mode |
| metadata | object | No | Passthrough fields, echoed back in the result |

A voice_id field (the id of a saved voice) also exists, but the voice catalog is planned for Phase 2; use reference_audio_url for now. The model-specific fields instructions and omni_options are documented in each model's section.

curl -X POST https://api.aiclonevoicefree.com/api/v2/voice/clone \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is a v2 voice-clone test.",
    "reference_audio_url": "https://oss.aiclonevoicefree.com/trump.wav",
    "model": "v2-emotion"
  }'

Response 202 Accepted (unified response format, see Tasks)

{
  "task_id": "b28f0341-3045-42e5-ad63-a5ccf4a1088e",
  "status": "pending",
  "capability": "voice",
  "action": "clone",
  "model": "v2-emotion",
  "error": null,
  "created_at": 1780704348,
  "completed_at": null
}
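For callers not using curl, the same submission can be sketched in Python with only the standard library. The helper names build_clone_request and submit_clone are illustrative, not part of the API; the endpoint, headers, and field names are the ones documented above.

```python
import json
import urllib.request

API_BASE = "https://api.aiclonevoicefree.com"


def build_clone_request(api_key, text, reference_audio_url, model="v2-emotion", **extra):
    """Build (but do not send) the POST /api/v2/voice/clone request."""
    payload = {"text": text, "reference_audio_url": reference_audio_url, "model": model}
    payload.update(extra)  # optional base parameters, e.g. speed_ratio=1.1
    return urllib.request.Request(
        f"{API_BASE}/api/v2/voice/clone",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_clone(request, timeout=30):
    """Send the request and return the task_id from the 202 response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["task_id"]
```

Building the request separately from sending it makes it possible to inspect the payload before any credits are at stake.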

Getting the result

Poll GET /api/v2/tasks/{task_id}; when status becomes completed, the result fields are on the top level (no nested result wrapper):

{
  "status": "completed",
  "capability": "voice",
  "action": "clone",
  "audioUrl": "https://oss.aiclonevoicefree.com/tts/2026-06-06/xxxx.wav",
  "format": "wav",
  "degraded": false,
  "providerUsed": "tts-v2-emotion-1",
  "completed_at": 1780704355
}

| Result field | Notes |
| --- | --- |
| audioUrl | Downloadable URL of the synthesized audio |
| format | wav / mp3 |
| degraded | true means a fallback model was used; quality may be slightly lower |
| providerUsed | The service that actually generated the audio |
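
The polling step can be sketched as a small loop. This is a minimal sketch under stated assumptions: the status string "failed" and the need for the Bearer header on the tasks endpoint are guesses (only "pending" and "completed" appear on this page), and the names wait_for_task and make_http_fetcher are illustrative.

```python
import json
import time
import urllib.request


def make_http_fetcher(api_key, base="https://api.aiclonevoicefree.com"):
    """Return a function that GETs /api/v2/tasks/{task_id} and decodes the JSON."""
    def fetch_status(task_id):
        request = urllib.request.Request(
            f"{base}/api/v2/tasks/{task_id}",
            headers={"Authorization": f"Bearer {api_key}"},  # assumed to be required
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
    return fetch_status


def wait_for_task(fetch_status, task_id, interval_s=1.0, timeout_s=60.0, sleep=time.sleep):
    """Poll until the task completes, fails, or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        task = fetch_status(task_id)
        status = task.get("status")
        if status == "completed":
            return task  # audioUrl, format, degraded, ... are top-level keys
        if status == "failed":  # assumed name of the failure status
            raise RuntimeError(f"task {task_id} failed: {task.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

A call would look like wait_for_task(make_http_fetcher("sk_your_api_key"), task_id), after which the returned dict's audioUrl can be downloaded.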

Which model should I use?

| You want | Use | In one line |
| --- | --- | --- |
| Sound as close to the reference as possible | v1-real | Most stable realistic clone |
| Emotional delivery | v2-emotion (default) | Clone with emotion |
| Multilingual / give reading instructions | v3-qwen | Speaks many languages, follows instructions |
| Design a voice from text, no recording needed | omni / voxcpm | Just write "young female, high pitch" |

Every model needs at least text + reference_audio_url; speed/pitch/volume are common options. Below, each model only lists the extra params it needs.

v1-real — Realistic clone

Closest to the reference timbre, most stable. No extra params — just set model to v1-real in the basic example.

v2-emotion — Emotional voice (default)

The default model; output carries emotion. With no extra params, emotion follows the reference audio. To control emotion explicitly, pass an emotion object:

| emotion field | Notes |
| --- | --- |
| mode | Emotion source: same_as_reference (follow reference, default) / vector (8-d vector) / text (text description) / reference_audio (separate emotion reference) / random |
| vector | For mode=vector: eight values in 0–1, in the order [joy, anger, sorrow, fear, excitement, depression, surprise, calm]. Keep the sum ≤ 1.4 (the limit the official UI applies); the hard cap is 1.5, above which the API returns "emo_vec sum exceeds 1.5" |
| text | For mode=text: emotion description, e.g. "say it happily" |
| reference_audio_url | For mode=reference_audio: emotion reference audio URL |

# emotion vector: high "joy" + a bit of "excitement"
curl -X POST https://api.aiclonevoicefree.com/api/v2/voice/clone \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "I am so happy today!",
    "reference_audio_url": "https://oss.aiclonevoicefree.com/trump.wav",
    "model": "v2-emotion",
    "emotion": { "mode": "vector", "vector": [0.9, 0, 0, 0, 0.1, 0, 0, 0] }
  }'
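
Because the vector is positional, it is easy to put a weight in the wrong slot. The sketch below builds the emotion object from named weights and checks the documented limits on the client side before sending. The helper names are illustrative; the order, the 0–1 range, and the 1.4 / 1.5 sums come from the emotion fields described above.

```python
EMOTION_ORDER = ["joy", "anger", "sorrow", "fear",
                 "excitement", "depression", "surprise", "calm"]
RECOMMENDED_SUM = 1.4  # limit applied by the official UI
HARD_CAP_SUM = 1.5     # above this the API rejects the request


def emotion_vector(**weights):
    """Return {"mode": "vector", "vector": [...]} from named weights."""
    unknown = set(weights) - set(EMOTION_ORDER)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    vector = [float(weights.get(name, 0.0)) for name in EMOTION_ORDER]
    if any(not 0.0 <= value <= 1.0 for value in vector):
        raise ValueError("each component must be between 0 and 1")
    if sum(vector) > HARD_CAP_SUM:
        raise ValueError("emo_vec sum exceeds 1.5")
    return {"mode": "vector", "vector": vector}


def within_recommended(emotion):
    """True when the vector sum stays at or below the recommended 1.4."""
    return sum(emotion["vector"]) <= RECOMMENDED_SUM
```

For instance, emotion_vector(joy=0.9, excitement=0.1) produces the same vector as the curl example above.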

v3-qwen — Multilingual + instructions

Speaks Chinese/English/Japanese/Korean and more; use instructions to tell it how to read. Extra param: instructions (subtitle language uses the base subtitle_language).

curl -X POST https://api.aiclonevoicefree.com/api/v2/voice/clone \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, nice weather today.",
    "reference_audio_url": "https://oss.aiclonevoicefree.com/trump.wav",
    "model": "v3-qwen",
    "instructions": "read in a calm tone",
    "subtitle_language": "en"
  }'

omni — Design a voice from text

The reference timbre doesn't matter here: you "shape" the voice with a list of descriptors. reference_audio_url is still sent, as in the example below, since every model needs it. Extra param: omni_options (a JSON string whose key field is instruct).

Pick descriptors from the lists below and join them with commas (EN or ZH, but don't mix EN accents with ZH dialects):

  • Gender: male / female (男 / 女)
  • Age: child / teenager / young adult / middle-aged / elderly
  • Pitch: very low / low / moderate / high / very high pitch
  • Special: whisper (耳语)
  • ZH dialects: 四川话, 东北话, 河南话, 陕西话, 贵州话, 云南话 …
  • EN accents: american accent, british accent, japanese accent …

Example: female, young adult, high pitch (or ZH 女, 青年, 高音调).

curl -X POST https://api.aiclonevoicefree.com/api/v2/voice/clone \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this voice is described, not recorded.",
    "reference_audio_url": "https://oss.aiclonevoicefree.com/trump.wav",
    "model": "omni",
    "omni_options": "{\"instruct\": \"female, young adult, high pitch\"}"
  }'
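
Since omni_options is a JSON string nested inside the JSON body, hand-escaping the quotes is error-prone. A minimal sketch that lets json.dumps do the escaping (omni_payload is an illustrative helper name, not part of the API):

```python
import json


def omni_payload(text, reference_audio_url, descriptors, model="omni"):
    """Build the request body with omni_options encoded as a JSON string."""
    instruct = ", ".join(descriptors)  # descriptors are joined with commas
    return {
        "text": text,
        "reference_audio_url": reference_audio_url,
        "model": model,
        # ensure_ascii=False keeps ZH descriptors readable in the string
        "omni_options": json.dumps({"instruct": instruct}, ensure_ascii=False),
    }
```

Passing the result through json.dumps once more yields the doubly encoded body that the curl example writes by hand.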

voxcpm — Dialect / instruction voice

Similar to omni, geared toward dialects. Extra params: omni_options (as above) or instructions.

Billing

Charged per character: each CJK character (Chinese/Japanese/Korean) costs 2 credits, every other character costs 1.

Examples:

  • "你好世界" (4 CJK) = 4 × 2 = 8 credits
  • "Hello" (5 latin) = 5 × 1 = 5 credits
  • "你好,world" (2 CJK + 5 latin + 1 punct) = 2×2 + 6×1 = 10 credits

Credits settle on completion — failed tasks are not charged (see Conventions). Uses voice credits, separate from video credits.
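
To estimate the cost before submitting, the per-character rule can be sketched as below. The Unicode ranges treated as CJK are an assumption, since this page does not say exactly which code points the service counts, so treat the result as an estimate rather than the billed amount.

```python
# Assumed definition of "CJK"; the service's exact ranges are not documented here.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(char):
    """True when the character falls in one of the assumed CJK ranges."""
    code_point = ord(char)
    return any(low <= code_point <= high for low, high in CJK_RANGES)


def estimate_credits(text):
    """2 credits per CJK character, 1 credit for every other character."""
    return sum(2 if is_cjk(char) else 1 for char in text)


print(estimate_credits("你好世界"))  # 8
print(estimate_credits("Hello"))     # 5
```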
