Voice Cloning

Clone a timbre from a reference audio clip, then synthesize any text in that voice (v2 endpoint, tested live)

Voice cloning takes a reference audio clip (the target timbre) plus the text to synthesize, and returns speech that reads the text in that voice. The call is asynchronous: submit the request to get a task_id, then poll the unified status endpoint for the result.

Tested live: the examples below are real calls to api.aiclonevoicefree.com, completing in ~7s.

`POST /api/v2/voice/clone`

Auth: Authorization: Bearer sk_... (see Authentication)

Base parameters (common to all models)

These are all you need to get started. Each model's extra parameters are listed under "Which model should I use?" below.

Field	Type	Required	Notes
`text`	string	✅	Text to synthesize
`reference_audio_url`	string	✅	Reference timbre URL (wav/mp3, publicly reachable)
`model`	string	⬜	Model, default `v2-emotion`; also `v1-real` / `v3-qwen` / `omni` / `voxcpm`
`speed_ratio`	number	⬜	Speed, default 1.0
`pitch_ratio`	number	⬜	Pitch, default 0
`volume_ratio`	number	⬜	Volume, default 1.0
`generate_subtitle`	bool	⬜	Produce subtitles
`subtitle_language`	string	⬜	Subtitle language
`compact_mode`	bool	⬜	Compact mode (trim inter-sentence silence)
`compact_max_silence_ms`	int	⬜	Max silence ms under compact mode
`metadata`	object	⬜	Passthrough fields, echoed back in the result

A voice_id field (the ID of a saved voice) also exists, but the voice catalog is Phase 2; use reference_audio_url for now. The model-specific fields emotion, instructions, and omni_options are documented in each model's section.

curl -X POST https://api.aiclonevoicefree.com/api/v2/voice/clone \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is a v2 voice-clone test.",
    "reference_audio_url": "https://oss.aiclonevoicefree.com/trump.wav",
    "model": "v2-emotion"
  }'

Response 202 Accepted (unified response format, see Tasks)

{
  "task_id": "b28f0341-3045-42e5-ad63-a5ccf4a1088e",
  "status": "pending",
  "capability": "voice",
  "action": "clone",
  "model": "v2-emotion",
  "error": null,
  "created_at": 1780704348,
  "completed_at": null
}
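
The same submission can be made from Python. The sketch below is a minimal illustration using only the standard library, not an official SDK: the helper names build_clone_request and submit_clone are hypothetical, and it assumes the 202 response body has the shape shown above.

```python
import json
import urllib.request

API_BASE = "https://api.aiclonevoicefree.com"


def build_clone_request(api_key, text, reference_audio_url,
                        model="v2-emotion", **extra):
    """Build (but do not send) the POST /api/v2/voice/clone request.

    Hypothetical helper: extra keyword arguments are passed through as
    optional body fields such as speed_ratio or metadata.
    """
    payload = {
        "text": text,
        "reference_audio_url": reference_audio_url,
        "model": model,
    }
    payload.update(extra)
    return urllib.request.Request(
        f"{API_BASE}/api/v2/voice/clone",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_clone(request):
    """Send the request and return the task_id from the response body."""
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.loads(resp.read())["task_id"]
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any credits are at stake.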

Getting the result

Poll GET /api/v2/tasks/{task_id}. When status becomes completed, the result fields appear at the top level of the response (there is no nested result wrapper):

{
  "status": "completed",
  "capability": "voice",
  "action": "clone",
  "audioUrl": "https://oss.aiclonevoicefree.com/tts/2026-06-06/xxxx.wav",
  "format": "wav",
  "degraded": false,
  "providerUsed": "tts-v2-emotion-1",
  "completed_at": 1780704355
}

Result field	Notes
`audioUrl`	Downloadable URL of the synthesized audio
`format`	`wav` / `mp3`
`degraded`	`true` means a fallback model was used; quality may be slightly lower
`providerUsed`	The service actually used to generate

Which model should I use?

You want	Use	In one line
Sound as close to the reference as possible	`v1-real`	Most stable realistic clone
Emotional delivery	`v2-emotion` (default)	Clone with emotion
Multilingual / give reading instructions	`v3-qwen`	Speaks many languages, follows instructions
Design a voice from text, no recording needed	`omni` / `voxcpm`	Just write "young female, high pitch"

Every model needs at least text + reference_audio_url; speed/pitch/volume are common options. Below, each model only lists the extra params it needs.

v1-real — Realistic clone

Closest to the reference timbre, most stable. No extra params — just set model to v1-real in the basic example.

v2-emotion — Emotional voice (default)

The default model; output carries emotion. With no extra params, emotion follows the reference audio. To control emotion explicitly, pass an emotion object:

`emotion` field	Notes
`mode`	Emotion source: `same_as_reference` (follow reference, default) / `vector` (8-d vector) / `text` (text description) / `reference_audio` (separate emotion reference) / `random`
`vector`	For `mode=vector`: eight 0–1 values, order `[joy, anger, sorrow, fear, excitement, depression, surprise, calm]`; components should sum to ≤ 1.4 (matches the official UI; the hard cap is 1.5 — more returns `emo_vec sum exceeds 1.5`)
`text`	For `mode=text`: emotion description, e.g. "say it happily"
`reference_audio_url`	For `mode=reference_audio`: emotion reference audio URL

# emotion vector: high "joy" + a bit of "excitement"
curl -X POST https://api.aiclonevoicefree.com/api/v2/voice/clone \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "I am so happy today!",
    "reference_audio_url": "https://oss.aiclonevoicefree.com/trump.wav",
    "model": "v2-emotion",
    "emotion": { "mode": "vector", "vector": [0.9, 0, 0, 0, 0.1, 0, 0, 0] }
  }'
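
Because the vector is positional, it is easy to put a weight in the wrong slot. The sketch below builds the emotion object from named weights and checks the limits from the table above before anything is sent. The helper names are hypothetical, and the checks only mirror the documented limits; the server's own validation may differ.

```python
# Order documented for mode=vector.
EMOTION_ORDER = ["joy", "anger", "sorrow", "fear",
                 "excitement", "depression", "surprise", "calm"]
RECOMMENDED_SUM = 1.4  # soft limit from the table
HARD_CAP = 1.5         # sums above this are rejected by the API


def emotion_vector(**weights):
    """Return the 8-value vector for the given named weights."""
    unknown = set(weights) - set(EMOTION_ORDER)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    vector = [float(weights.get(name, 0.0)) for name in EMOTION_ORDER]
    if any(not 0.0 <= value <= 1.0 for value in vector):
        raise ValueError("each component must be between 0 and 1")
    if sum(vector) > HARD_CAP:
        raise ValueError(f"components sum to {sum(vector)}, above {HARD_CAP}")
    return vector


def exceeds_recommended(vector):
    """True when the sum is above the recommended 1.4 limit."""
    return sum(vector) > RECOMMENDED_SUM


def emotion_payload(**weights):
    """Build the emotion object for the request body."""
    return {"mode": "vector", "vector": emotion_vector(**weights)}
```

For instance, emotion_payload(joy=0.9, excitement=0.1) yields the same emotion object as the curl example above.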

v3-qwen — Multilingual + instructions

Speaks Chinese/English/Japanese/Korean and more; use instructions to tell it how to read. Extra param: instructions (subtitle language uses the base subtitle_language).

curl -X POST https://api.aiclonevoicefree.com/api/v2/voice/clone \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, nice weather today.",
    "reference_audio_url": "https://oss.aiclonevoicefree.com/trump.wav",
    "model": "v3-qwen",
    "instructions": "read in a calm tone",
    "subtitle_language": "en"
  }'

omni — Design a voice from text

The reference timbre doesn't matter: you shape the voice with a list of descriptors. reference_audio_url is still a required base field, so the example below passes one anyway. Extra param: omni_options (a JSON string whose key field is instruct).

Pick descriptors from the lists below and join them with commas (EN or ZH, but don't mix EN accents with ZH dialects):

Gender: male / female (男 / 女)
Age: child / teenager / young adult / middle-aged / elderly
Pitch: very low / low / moderate / high / very high pitch
Special: whisper (耳语)
ZH dialects: 四川话, 东北话, 河南话, 陕西话, 贵州话, 云南话 …
EN accents: american accent, british accent, japanese accent …

Example: female, young adult, high pitch (or ZH 女, 青年, 高音调).

curl -X POST https://api.aiclonevoicefree.com/api/v2/voice/clone \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this voice is described, not recorded.",
    "reference_audio_url": "https://oss.aiclonevoicefree.com/trump.wav",
    "model": "omni",
    "omni_options": "{\"instruct\": \"female, young adult, high pitch\"}"
  }'
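
Since omni_options is a JSON string rather than a nested object, it has to be serialized once before the outer body is serialized. A minimal sketch of that double encoding, with a hypothetical helper name omni_payload:

```python
import json


def omni_payload(text, reference_audio_url, descriptors, model="omni"):
    """Build a request body whose omni_options is a JSON-encoded string."""
    instruct = ", ".join(descriptors)  # descriptors are joined with commas
    return {
        "text": text,
        "reference_audio_url": reference_audio_url,
        "model": model,
        # Inner encoding: the field value is itself a JSON document.
        # ensure_ascii=False keeps ZH descriptors readable in the string.
        "omni_options": json.dumps({"instruct": instruct}, ensure_ascii=False),
    }


payload = omni_payload(
    "Hello, this voice is described, not recorded.",
    "https://oss.aiclonevoicefree.com/trump.wav",
    ["female", "young adult", "high pitch"],
)
body = json.dumps(payload)  # outer encoding, ready to send as the request body
```

Passing a plain dict for omni_options would produce a nested object instead of the string this page describes.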

voxcpm — Dialect / instruction voice

Similar to omni, geared toward dialects. Extra params: omni_options (as above) or instructions.

Billing

Billing is per character: each CJK character (Chinese/Japanese/Korean) costs 2 credits; every other character costs 1.

Examples:

"你好世界" (4 CJK) = 4 × 2 = 8 credits
"Hello" (5 latin) = 5 × 1 = 5 credits
"你好，world" (2 CJK + 5 latin + 1 punct) ≈ 2×2 + 6×1 = 10 credits

Credits settle on completion, so failed tasks are not charged (see Conventions). Voice cloning draws on voice credits, which are separate from video credits.
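
To estimate cost before submitting, the per-character rule can be sketched as below. This is an approximation, not the service's billing code: the Unicode ranges treated as CJK are an assumption, as is counting punctuation (including full-width punctuation) and whitespace as "other" characters.

```python
# Assumed CJK ranges; the service's exact definition is not documented here.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(ch):
    """True if the character falls in one of the assumed CJK ranges."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in CJK_RANGES)


def estimate_credits(text):
    """Estimate voice credits: 2 per CJK character, 1 per other character."""
    return sum(2 if is_cjk(ch) else 1 for ch in text)


print(estimate_credits("你好世界"))  # 8
print(estimate_credits("Hello"))     # 5
```

Treat the result as an estimate only; the settled amount on the completed task is authoritative.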
