Voice Convert
Re-synthesize the spoken content of one audio in the timbre of another
Voice conversion (timbre transfer) takes a source audio (the spoken content to keep) and a reference audio (the target timbre), and returns the original content spoken in the target voice. This is an async task: submit the request, then poll for its status.
POST /api/v2/voice/convert
Auth: Authorization: Bearer sk_...
| Field | Type | Required | Notes |
|---|---|---|---|
| source_audio_url | string | ✅ | Original speech audio URL |
| reference_audio_url | string | ✅ | Target timbre reference URL |
| diffusion_steps | int | ⬜ | Inference steps; higher values improve quality but run slower |
| length_adjust | number | ⬜ | Duration scaling |
| inference_cfg_rate | number | ⬜ | CFG strength |
| auto_f0_adjust | bool | ⬜ | Auto pitch adjust |
| pitch_shift | int | ⬜ | Semitone shift |
| return_format | string | ⬜ | wav / mp3 |
| enable_separation | bool | ⬜ | Separate vocals first |
| vocals_gain | number | ⬜ | Vocals gain |
| accompaniment_gain | number | ⬜ | Accompaniment gain |
| metadata | object | ⬜ | Passthrough fields |
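
As a rough illustration of how the fields above fit into a request, here is a minimal Python sketch using only the standard library. The function names `build_convert_request` and `submit` and the `OPTIONAL_FIELDS` set are hypothetical helpers, not part of any official SDK; the host is taken from the curl example in this page, and rejecting unknown fields is a choice of this sketch rather than documented API behavior.

```python
import json
import urllib.request

# Host taken from the curl example on this page.
API_BASE = "https://api.aiclonevoicefree.com"

# Optional field names copied from the parameter table.
OPTIONAL_FIELDS = {
    "diffusion_steps", "length_adjust", "inference_cfg_rate", "auto_f0_adjust",
    "pitch_shift", "return_format", "enable_separation", "vocals_gain",
    "accompaniment_gain", "metadata",
}


def build_convert_request(api_key, source_audio_url, reference_audio_url, **options):
    """Build (but do not send) a POST /api/v2/voice/convert request."""
    unknown = set(options) - OPTIONAL_FIELDS
    if unknown:
        # Client-side guard only; this is an assumption of the sketch.
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    payload = {
        "source_audio_url": source_audio_url,
        "reference_audio_url": reference_audio_url,
    }
    # Leave out optional fields that were not set, so server defaults apply.
    payload.update({k: v for k, v in options.items() if v is not None})
    return urllib.request.Request(
        f"{API_BASE}/api/v2/voice/convert",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(request):
    """Send the request and return the decoded JSON body (the task descriptor)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes it possible to inspect the payload without a network call. `submit(build_convert_request(...))["task_id"]` would then give the ID to poll.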
curl -X POST https://api.aiclonevoicefree.com/api/v2/voice/convert \
-H "Authorization: Bearer sk_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"source_audio_url": "https://oss.aiclonevoicefree.com/noise_removal/1751766508261_mandarin_speech_16kHz.wav",
"reference_audio_url": "https://oss.aiclonevoicefree.com/trump.wav",
"return_format": "wav"
}'
Response 202 Accepted
{
"task_id": "99ae1ea9-01e3-4b15-9f89-d35c280f44b1",
"status": "pending",
"capability": "voice",
"action": "vocal-conversion",
"model": "voice-convert",
"created_at": 1780704377
}
Getting the result
This is an async task: using the task_id from the response above, poll GET /api/v2/tasks/{task_id}. When
status becomes completed, the result holds the converted audio:
curl https://api.aiclonevoicefree.com/api/v2/tasks/99ae1ea9-01e3-4b15-9f89-d35c280f44b1 \
-H "Authorization: Bearer sk_your_api_key"
{
"status": "completed",
"capability": "voice",
"action": "vocal-conversion",
"audioUrl": "https://oss.aiclonevoicefree.com/...converted.wav",
"format": "wav",
"degraded": false,
"providerUsed": "voice-convert-1",
"completed_at": 1780704999
}
See Tasks for the polling cadence, status values, and the unified response format.
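
The submit-then-poll flow described above could be sketched in Python roughly as follows. Several details are assumptions, not documented values: the 2-second interval and 300-second timeout are placeholders (the real cadence is on the Tasks page), `"failed"` is a guessed failure status, and `fetch_task` and `wait_for_task` are hypothetical helper names.

```python
import json
import time
import urllib.request

API_BASE = "https://api.aiclonevoicefree.com"  # host from the curl example

# Assumed failure status; check the Tasks page for the real status values.
FAILURE_STATUSES = {"failed"}


def fetch_task(api_key, task_id):
    """GET /api/v2/tasks/{task_id} and return the decoded JSON body."""
    req = urllib.request.Request(
        f"{API_BASE}/api/v2/tasks/{task_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_task(fetch, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call fetch() until the task completes, fails, or the timeout elapses.

    fetch is any zero-argument callable returning the task JSON as a dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        task = fetch()
        status = task.get("status")
        if status == "completed":
            return task
        if status in FAILURE_STATUSES:
            raise RuntimeError(f"task ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not complete before the timeout")
        sleep(interval)
```

A call such as `wait_for_task(lambda: fetch_task(api_key, task_id))["audioUrl"]` would then return the converted audio URL, assuming the completed response has the shape shown in the example above.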
Billing
Billing is based on output audio duration: 2 credits per second, with the duration rounded up to the next whole second (minimum 1s).
Examples:
- ~30s output = 30 × 2 = 60 credits
- ~12.4s output = rounded up to 13s = 13 × 2 = 26 credits
Credits are settled on completion; failed tasks are not charged (see Conventions). This endpoint uses voice credits.
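
The billing rule can be checked with a short Python sketch. The function name `voice_convert_credits` is hypothetical; the rate, rounding, and minimum come from the rule stated above.

```python
import math

CREDITS_PER_SECOND = 2   # rate stated in the billing rule
MIN_BILLED_SECONDS = 1   # minimum billed duration


def voice_convert_credits(duration_seconds: float) -> int:
    """Estimate the credits charged for a given output duration in seconds."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    # Round up to whole seconds, then apply the 1-second minimum.
    billed_seconds = max(MIN_BILLED_SECONDS, math.ceil(duration_seconds))
    return billed_seconds * CREDITS_PER_SECOND


print(voice_convert_credits(30))    # 60
print(voice_convert_credits(12.4))  # 26
print(voice_convert_credits(0.3))   # 2
```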