
Cartesia Audio Cutting #100

Open
abdulrahmanmajid opened this issue Dec 28, 2024 · 10 comments
Labels: bug Something isn't working

@abdulrahmanmajid

Cartesia TTS is cutting off every single word; there is no continuous, smooth speech.

Please share the correct payload for using Cartesia.

Is this correct?

"synthesizer": {
                        "provider": "cartesia",
                        "stream": true,
                        "caching": true,
                        "buffer_size": 100,
                        "sampling_rate": "16000",
                        "provider_config": {
                            "voice": "Sarah",
                            "model": "sonic-english",
                            "voice_id": "694f9389-aac1-45b6-b726-9d9369183238"
                        }
                    },
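
One thing I'm unsure about: here I pass sampling_rate as the string "16000", while in my other payloads I use the integer 16000. I don't know whether the string form matters, but the integer form would be:

"synthesizer": {
    "provider": "cartesia",
    "stream": true,
    "caching": true,
    "buffer_size": 100,
    "sampling_rate": 16000,
    "provider_config": {
        "voice": "Sarah",
        "model": "sonic-english",
        "voice_id": "694f9389-aac1-45b6-b726-9d9369183238"
    }
}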
@prateeksachan prateeksachan self-assigned this Dec 28, 2024
@prateeksachan prateeksachan added the bug Something isn't working label Dec 28, 2024
@prateeksachan
Member

hey @abdulrahmanmajid, thanks for reporting this. This is currently being worked on.

@abdulrahmanmajid
Author

Thanks for letting me know, @prateeksachan. May I know the ETA?

@prateeksachan
Member

This should get done this week itself.

@abdulrahmanmajid
Author

abdulrahmanmajid commented Jan 23, 2025

Hey @prateeksachan ,

The issue I mentioned earlier is happening on Azure TTS as well. I did a fresh install, but the problem persists. Sometimes it works perfectly, but other times it becomes super choppy, breaking every single word as if it doesn’t have enough time to speak, and then it gets cut off by the next word.

Could you take a look and see if you can fix it? Or let me know where the issue might be, and I'll try to fix it and create a PR.

@prateeksachan
Member

Hi @abdulrahmanmajid, can you please share the payload you're using? I'll take a look.

@abdulrahmanmajid
Author

Sorry for the late reply, @prateeksachan.

I've been testing it with two repos and two different payloads.

This is the payload I used on a fresh installation on my VM:

{
  "agent_config": {
      "agent_name": "Alfred",
      "agent_type": "other",
      "agent_welcome_message": "How are you doing Bruce?",
      "tasks": [
          {
              "task_type": "conversation",
              "toolchain": {
                  "execution": "parallel",
                  "pipelines": [
                      [
                          "transcriber",
                          "llm",
                          "synthesizer"
                      ]
                  ]
              },
              "tools_config": {
                  "input": {
                      "format": "wav",
                      "provider": "twilio"
                  },
                  "llm_agent": {
                      "agent_type": "simple_llm_agent",
                      "agent_flow_type": "streaming",
                      "routes": null,
                      "llm_config": {
                          "agent_flow_type": "streaming",
                          "provider": "openai",
                          "request_json": true,
                          "model": "gpt-4o-mini"
                      }
                  },
                  "output": {
                      "format": "wav",
                      "provider": "twilio"
                  },
                  "synthesizer": {
                      "audio_format": "wav",
                      "provider": "azuretts",
                      "stream": true,
                      "provider_config": {
                          "voice": "Sonia",
                          "model": "neural",
                          "language": "en-GB"
                      },
                      "buffer_size": 100.0
                  },
                  "transcriber": {
                      "encoding": "linear16",
                      "language": "en",
                      "provider": "deepgram",
                      "stream": true
                  }
              },
              "task_config": {
                  "hangup_after_silence": 30.0
              }
          }
      ]
  },
  "agent_prompts": {
      "task_1": {
          "system_prompt": "Why Do We Fall, Sir? So That We Can Learn To Pick Ourselves Up."
      }
  }
}

And this is the second payload I use for the custom repo (the // comments are just annotations for this issue; the actual payload doesn't include them, since JSON has no comment syntax):

{
    "agent_config": {
        "agent_name": "Alfred",
        "agent_type": "other",
        "call_direction": "outbound", // or inbound will run webhook
        "inbound_phone_number": "+15028733842", // required if call_direction is "inbound"
        "timezone": "Europe/London",
        "country": "GB",
        "agent_welcome_message": "How are you doing Bruce?",
        "tasks": [
            {
                "task_type": "conversation",
                "toolchain": {
                    "execution": "parallel",
                    "pipelines": [
                        [
                            "transcriber",
                            "llm",
                            "synthesizer"
                        ]
                    ]
                },
                "tools_config": {
                    "input": {
                        "format": "wav",
                        "provider": "twilio"
                    },
                    "llm_agent": {
                        "agent_type": "simple_llm_agent",
                        "agent_flow_type": "streaming",
                        "routes": null,
                        "llm_config": {
                            "provider": "custom",
                            "base_url": "53454354354/",
                            "llm_key": "543534534543",
                            "model": "534543i",
                            "api_version": "5435435",
                            "max_tokens": 250,
                            "temperature": 0.2,
                            "top_p": 0.5,
                            "presence_penalty": 0,
                            "frequency_penalty": 0
                        }
                    },
                    "output": {
                        "format": "wav",
                        "provider": "twilio"
                    },
                    "synthesizer": {
                        "provider": "azuretts",
                        "provider_config": {
                            "voice": "AmandaMultilingual",
                            "language": "en-US",
                            "model": "Neural"
                        },
                        "stream": true,
                        "buffer_size": 150,
                        "sampling_rate": 16000,
                        "caching": true
                    },
                    "transcriber": {
                        "provider": "deepgram",
                        "model": "nova-2",
                        "language": "en",
                        "detect_language": true,
                        "stream": true,
                        "sampling_rate": 16000,
                        "encoding": "linear16",
                        "process_interim_results": "true",
                        "endpointing": 150,
                        "task": "transcribe"
                    }
                },
                "task_config": {
                    "optimize_latency": true,
                    "ambient_noise": true,
                    "ambient_noise_track": "call-center",
                    "incremental_delay": 200, // Reduced from 900ms to 200ms
                    "interruption_backoff_period": 50, // Reduced from 100ms to 50ms
                    "backchanneling": false, // Disabled to reduce overhead
                    "use_fillers": false, // Disabled to reduce latency
                    "number_of_words_for_interruption": 2
                }
            }
        ]
    },
    "agent_prompts": {
        "task_1": {
            "system_prompt": "You are {{ agent.name }}, a {{ agent.department }} representative from {{ company.name }}. You're calling about {{ company.product }}.\n\nCustomer Details:\nName: {{ customer.name }}\nAccount ID: {{ customer.account_id }}\n\nPlease maintain a professional and friendly tone throughout the conversation."
        }
    }
}

I also had a question.

How exactly does backchanneling work? When the user speaks and the LLM stays silent, do we stream words from constants? Are those words relevant to the conversation? Are they in the voice we chose? Second, are they pulled from S3 or from local files? How exactly does it work?

Could you also explain how use_fillers works? I can't hear the agent using fillers during the conversation when this is turned on.

The ambient audio wasn't working until someone recently created a PR to fix it.
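
To make the backchanneling question concrete, this is roughly the mechanism I'm imagining. It is purely a guess on my part, not Bolna's actual implementation:

import asyncio
import random

# Hypothetical: a fixed set of short acknowledgement clips stored as local
# files. Whether the real implementation uses local files or S3 is exactly
# what I'm asking.
BACKCHANNEL_CLIPS = ["mm-hmm.wav", "right.wav", "i-see.wav"]

async def backchannel_loop(user_is_speaking, play_audio, interval=3.0):
    """While the user is speaking and the LLM stays silent, occasionally
    play a short acknowledgement clip on the output stream.

    user_is_speaking (a callable) and play_audio (an async callable) are
    hypothetical stand-ins for the VAD state and the output pipeline.
    """
    while user_is_speaking():
        # Wait a few seconds plus jitter so the acknowledgements
        # don't sound mechanical.
        await asyncio.sleep(interval + random.random())
        if user_is_speaking():
            await play_audio(random.choice(BACKCHANNEL_CLIPS))

Is that close to what actually happens?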

@abdulrahmanmajid
Author

Both have the same issue.

Sometimes it speaks perfectly, then it randomly starts chopping up, as if you're playing a YouTube video on super slow Wi-Fi.

Could this be an issue with the VM? But the VM has 40 Gbps speeds.

@abdulrahmanmajid
Author

abdulrahmanmajid commented Jan 24, 2025

So I've tested it with a different VM and the issue is still there. There's an issue with the transcriber as well, where it would stop listening or start transcribing gibberish even when you stay silent. As for the audio-cutting issue, I've attached a video; I can't attach the audio recording from Twilio, but the same issue is present in the recording as well.

@prateeksachan
Member

Can you try with ambient_noise: false?

We released ambient_noise, backchanneling & fillers some time back, but they had issues with some voice providers.
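
For reference, those flags live in task_config. A minimal snippet (ambient_noise is the one to flip first; the other two can be disabled the same way if problems persist, keeping your other settings as they are):

"task_config": {
    "ambient_noise": false,
    "backchanneling": false,
    "use_fillers": false
}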

@abdulrahmanmajid
Author

OK, that pretty much fixed it, but it still happens occasionally. What about the other questions I had?

"How exactly does backchanneling work? When the user speaks and the LLM stays silent, do we stream words from constants? Are those words relevant to the conversation? Are they in the voice we chose? Second, are they pulled from S3 or from local files? How exactly does it work?

Also, could you explain how use_fillers works? I can't hear the agent using fillers during the conversation when this is turned on."
