
Voice Convert

Re-synthesize the spoken content of one audio in the timbre of another

Voice conversion (timbre transfer) takes a source audio (the spoken content to keep) and a reference audio (the target timbre), and returns the original content spoken in the target voice. It runs as an async task: submit the request, then poll for status.

POST /api/v2/voice/convert

Auth: Authorization: Bearer sk_...

| Field | Type | Required | Notes |
|---|---|---|---|
| source_audio_url | string | | Original speech audio URL |
| reference_audio_url | string | | Target timbre reference URL |
| diffusion_steps | int | | Inference steps; higher = better but slower |
| length_adjust | number | | Duration scaling |
| inference_cfg_rate | number | | CFG strength |
| auto_f0_adjust | bool | | Auto pitch adjust |
| pitch_shift | int | | Semitone shift |
| return_format | string | | wav / mp3 |
| enable_separation | bool | | Separate vocals first |
| vocals_gain | number | | Vocals gain |
| accompaniment_gain | number | | Accompaniment gain |
| metadata | object | | Passthrough fields |

curl -X POST https://api.aiclonevoicefree.com/api/v2/voice/convert \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "source_audio_url": "https://oss.aiclonevoicefree.com/noise_removal/1751766508261_mandarin_speech_16kHz.wav",
    "reference_audio_url": "https://oss.aiclonevoicefree.com/trump.wav",
    "return_format": "wav"
  }'

Response 202 Accepted

{
  "task_id": "99ae1ea9-01e3-4b15-9f89-d35c280f44b1",
  "status": "pending",
  "capability": "voice",
  "action": "vocal-conversion",
  "model": "voice-convert",
  "created_at": 1780704377
}
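The submit step above could be sketched in Python roughly as follows. This is a minimal illustration using only the standard library, not an official client: the helper names `build_convert_request` and `submit_convert` are hypothetical, the host is copied from the curl example, and dropping `None` options is a design assumption rather than documented behavior.

```python
import json
import urllib.request

API_BASE = "https://api.aiclonevoicefree.com"  # host taken from the curl example


def build_convert_request(api_key, source_audio_url, reference_audio_url, **options):
    """Build (but do not send) the POST /api/v2/voice/convert request."""
    payload = {
        "source_audio_url": source_audio_url,
        "reference_audio_url": reference_audio_url,
    }
    # Optional fields from the parameter table (e.g. return_format, pitch_shift).
    # Options passed as None are left out of the JSON body entirely.
    payload.update({k: v for k, v in options.items() if v is not None})
    return urllib.request.Request(
        f"{API_BASE}/api/v2/voice/convert",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_convert(api_key, source_audio_url, reference_audio_url, **options):
    """Send the request and return the task_id from the 202 response body."""
    req = build_convert_request(api_key, source_audio_url, reference_audio_url, **options)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["task_id"]
```

Splitting request construction from sending keeps the payload logic checkable without network access; `submit_convert("sk_your_api_key", source_url, reference_url, return_format="wav")` would then return the `task_id` used for polling.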

Getting the result

This is an async task: use the task_id returned above to poll GET /api/v2/tasks/{task_id}. Once status becomes completed, result holds the converted audio:

curl https://api.aiclonevoicefree.com/api/v2/tasks/99ae1ea9-01e3-4b15-9f89-d35c280f44b1 \
  -H "Authorization: Bearer sk_your_api_key"

{
  "status": "completed",
  "capability": "voice",
  "action": "vocal-conversion",
  "audioUrl": "https://oss.aiclonevoicefree.com/...converted.wav",
  "format": "wav",
  "degraded": false,
  "providerUsed": "voice-convert-1",
  "completed_at": 1780704999
}

See Tasks for the polling cadence, status values and the unified response format.
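The polling loop described above might look like the sketch below. Several things here are assumptions, not documented values: the 2-second interval and 300-second timeout are placeholders (the Tasks page defines the real cadence), `"failed"` is an assumed status name, and `fetch_task` and `wait_for_task` are hypothetical helper names. Only `pending` and `completed` appear in the samples on this page.

```python
import json
import time
import urllib.request

API_BASE = "https://api.aiclonevoicefree.com"  # host taken from the curl example


def fetch_task(api_key, task_id):
    """GET /api/v2/tasks/{task_id} and return the parsed JSON body."""
    req = urllib.request.Request(
        f"{API_BASE}/api/v2/tasks/{task_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_task(task_id, fetch, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call fetch(task_id) repeatedly until the task completes.

    fetch is any callable that returns the task JSON as a dict.
    interval and timeout are illustrative defaults, not documented values.
    """
    deadline = time.monotonic() + timeout
    while True:
        task = fetch(task_id)
        status = task.get("status")
        if status == "completed":
            return task
        if status == "failed":  # assumed status name; see Tasks for the real list
            raise RuntimeError(f"task {task_id} failed: {task}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {status!r} after {timeout}s")
        sleep(interval)
```

A call such as `wait_for_task(task_id, lambda tid: fetch_task("sk_your_api_key", tid))` returns the completed task dict; in the sample response above the converted file is exposed as `audioUrl`.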

Billing

Charged by output audio duration: 2 credits per second (duration rounded up, min 1s).

Examples:

  • ~30s output = 30 × 2 = 60 credits
  • ~12.4s output = rounded up to 13s = 13 × 2 = 26 credits

Credits are settled when the task completes; failed tasks are not charged (see Conventions). Charges are drawn from voice credits.
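The billing rule above can be written out as a small calculation. This is only a sketch for estimating cost on the client side; the function name `conversion_cost` is hypothetical, and the server's own settlement remains authoritative.

```python
import math

CREDITS_PER_SECOND = 2  # rate stated in the Billing section


def conversion_cost(output_seconds):
    """Estimate credits for a given output duration in seconds.

    The duration is rounded up to a whole second, with a 1-second minimum.
    """
    billed_seconds = max(1, math.ceil(output_seconds))
    return billed_seconds * CREDITS_PER_SECOND
```

With this, `conversion_cost(30)` gives 60 and `conversion_cost(12.4)` gives 26, matching the two examples above.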
