
Cartesia Audio Cutting #100

Open
abdulrahmanmajid opened this issue Dec 28, 2024 · 10 comments
Labels: bug Something isn't working

@abdulrahmanmajid

Cartesia TTS is cutting off every single word; there is no continuous, smooth speech.

Please share the correct payload for using Cartesia.

Is this correct?

"synthesizer": {
                        "provider": "cartesia",
                        "stream": true,
                        "caching": true,
                        "buffer_size": 100,
                        "sampling_rate": "16000",
                        "provider_config": {
                            "voice": "Sarah",
                            "model": "sonic-english",
                            "voice_id": "694f9389-aac1-45b6-b726-9d9369183238"
                        }
                    },
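
One thing I'm unsure about: here I pass sampling_rate as the string "16000", while in my other payloads I use the integer 16000. I don't know whether the string form matters, but the integer form would be:

"synthesizer": {
    "provider": "cartesia",
    "stream": true,
    "caching": true,
    "buffer_size": 100,
    "sampling_rate": 16000,
    "provider_config": {
        "voice": "Sarah",
        "model": "sonic-english",
        "voice_id": "694f9389-aac1-45b6-b726-9d9369183238"
    }
}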
@prateeksachan prateeksachan self-assigned this Dec 28, 2024
@prateeksachan prateeksachan added the bug Something isn't working label Dec 28, 2024
@prateeksachan
Member

hey @abdulrahmanmajid, thanks for reporting this. This is currently being worked on.

@abdulrahmanmajid
Author

Thanks for letting me know, @prateeksachan. May I know the ETA?

@prateeksachan
Member

This should get done this week itself.

@abdulrahmanmajid
Author

abdulrahmanmajid commented Jan 23, 2025

Hey @prateeksachan ,

The issue I mentioned earlier is happening on Azure TTS as well. I did a fresh install, but the problem persists. Sometimes it works perfectly, but other times it becomes super choppy, breaking every single word as if it doesn’t have enough time to speak, and then it gets cut off by the next word.

Could you take a look and see if you can fix it? Or let me know where the issue might be, and I'll try to fix it and create a PR.

@prateeksachan
Member

Hi @abdulrahmanmajid, can you please share the payload you're using? I'll take a look.

@abdulrahmanmajid
Author

Sorry for the late reply, @prateeksachan.

I've been testing it with two repos and two different payloads.

This is the payload I used on a fresh installation on my VM:

{
  "agent_config": {
      "agent_name": "Alfred",
      "agent_type": "other",
      "agent_welcome_message": "How are you doing Bruce?",
      "tasks": [
          {
              "task_type": "conversation",
              "toolchain": {
                  "execution": "parallel",
                  "pipelines": [
                      [
                          "transcriber",
                          "llm",
                          "synthesizer"
                      ]
                  ]
              },
              "tools_config": {
                  "input": {
                      "format": "wav",
                      "provider": "twilio"
                  },
                  "llm_agent": {
                      "agent_type": "simple_llm_agent",
                      "agent_flow_type": "streaming",
                      "routes": null,
                      "llm_config": {
                          "agent_flow_type": "streaming",
                          "provider": "openai",
                          "request_json": true,
                          "model": "gpt-4o-mini"
                      }
                  },
                  "output": {
                      "format": "wav",
                      "provider": "twilio"
                  },
                  "synthesizer": {
                      "audio_format": "wav",
                      "provider": "azuretts",
                      "stream": true,
                      "provider_config": {
                          "voice": "Sonia",
                          "model": "neural",
                          "language": "en-GB"
                      },
                      "buffer_size": 100.0
                  },
                  "transcriber": {
                      "encoding": "linear16",
                      "language": "en",
                      "provider": "deepgram",
                      "stream": true
                  }
              },
              "task_config": {
                  "hangup_after_silence": 30.0
              }
          }
      ]
  },
  "agent_prompts": {
      "task_1": {
          "system_prompt": "Why Do We Fall, Sir? So That We Can Learn To Pick Ourselves Up."
      }
  }
}

And this is the second payload I use for the custom repo (the // comments are just annotations for this issue; the actual payload doesn't include them, since JSON has no comment syntax):

{
    "agent_config": {
        "agent_name": "Alfred",
        "agent_type": "other",
        "call_direction": "outbound", // or inbound will run webhook
        "inbound_phone_number": "+15028733842", // required if call_direction is "inbound"
        "timezone": "Europe/London",
        "country": "GB",
        "agent_welcome_message": "How are you doing Bruce?",
        "tasks": [
            {
                "task_type": "conversation",
                "toolchain": {
                    "execution": "parallel",
                    "pipelines": [
                        [
                            "transcriber",
                            "llm",
                            "synthesizer"
                        ]
                    ]
                },
                "tools_config": {
                    "input": {
                        "format": "wav",
                        "provider": "twilio"
                    },
                    "llm_agent": {
                        "agent_type": "simple_llm_agent",
                        "agent_flow_type": "streaming",
                        "routes": null,
                        "llm_config": {
                            "provider": "custom",
                            "base_url": "53454354354/",
                            "llm_key": "543534534543",
                            "model": "534543i",
                            "api_version": "5435435",
                            "max_tokens": 250,
                            "temperature": 0.2,
                            "top_p": 0.5,
                            "presence_penalty": 0,
                            "frequency_penalty": 0
                        }
                    },
                    "output": {
                        "format": "wav",
                        "provider": "twilio"
                    },
                    "synthesizer": {
                        "provider": "azuretts",
                        "provider_config": {
                            "voice": "AmandaMultilingual",
                            "language": "en-US",
                            "model": "Neural"
                        },
                        "stream": true,
                        "buffer_size": 150,
                        "sampling_rate": 16000,
                        "caching": true
                    },
                    "transcriber": {
                        "provider": "deepgram",
                        "model": "nova-2",
                        "language": "en",
                        "detect_language": true,
                        "stream": true,
                        "sampling_rate": 16000,
                        "encoding": "linear16",
                        "process_interim_results": "true",
                        "endpointing": 150,
                        "task": "transcribe"
                    }
                },
                "task_config": {
                    "optimize_latency": true,
                    "ambient_noise": true,
                    "ambient_noise_track": "call-center",
                    "incremental_delay": 200, // Reduced from 900ms to 200ms
                    "interruption_backoff_period": 50, // Reduced from 100ms to 50ms
                    "backchanneling": false, // Disabled to reduce overhead
                    "use_fillers": false, // Disabled to reduce latency
                    "number_of_words_for_interruption": 2
                }
            }
        ]
    },
    "agent_prompts": {
        "task_1": {
            "system_prompt": "You are {{ agent.name }}, a {{ agent.department }} representative from {{ company.name }}. You're calling about {{ company.product }}.\n\nCustomer Details:\nName: {{ customer.name }}\nAccount ID: {{ customer.account_id }}\n\nPlease maintain a professional and friendly tone throughout the conversation."
        }
    }
}

I also had a question.

How exactly does backchanneling work? When the user speaks and the LLM stays silent, do we stream words from constants? Are those words relevant to the conversation? Are they in the voice we chose? Second, are they pulled from S3 or from local files? How exactly does it work?

Could you also explain how use_fillers works? I can't hear the agent using fillers during the conversation when this is turned on.

The ambient audio wasn't working until someone recently created a PR to fix it.
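
To make the backchanneling question concrete, this is roughly the mechanism I'm imagining. It is purely a guess on my part, not Bolna's actual implementation:

import asyncio
import random

# Hypothetical: a fixed set of short acknowledgement clips stored as local
# files. Whether the real implementation uses local files or S3 is exactly
# what I'm asking.
BACKCHANNEL_CLIPS = ["mm-hmm.wav", "right.wav", "i-see.wav"]

async def backchannel_loop(user_is_speaking, play_audio, interval=3.0):
    """While the user is speaking and the LLM stays silent, occasionally
    play a short acknowledgement clip on the output stream.

    user_is_speaking (a callable) and play_audio (an async callable) are
    hypothetical stand-ins for the VAD state and the output pipeline.
    """
    while user_is_speaking():
        # Wait a few seconds plus jitter so the acknowledgements
        # don't sound mechanical.
        await asyncio.sleep(interval + random.random())
        if user_is_speaking():
            await play_audio(random.choice(BACKCHANNEL_CLIPS))

Is that close to what actually happens?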

@abdulrahmanmajid
Author

Both have the same issue.

Sometimes it speaks perfectly, then it randomly starts chopping up, as if you're playing a YouTube video on super slow Wi-Fi.

Could this be an issue with the VM? But the VM has 40 Gbps speeds.

@abdulrahmanmajid
Author

abdulrahmanmajid commented Jan 24, 2025

So I've tested it with a different VM and the issue is still there. There's an issue with the transcriber as well, where it would stop listening or start transcribing gibberish even when you stay silent. As for the audio-cutting issue, I've attached a video; I can't attach the audio recording from Twilio, but the same issue is present in the recording as well.

@prateeksachan
Member

Can you try with ambient_noise: false?

We released ambient_noise, backchanneling & fillers some time back, but they had issues with some voice providers.
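
For reference, those flags live in task_config. A minimal snippet (ambient_noise is the one to flip first; the other two can be disabled the same way if problems persist, keeping your other settings as they are):

"task_config": {
    "ambient_noise": false,
    "backchanneling": false,
    "use_fillers": false
}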

@abdulrahmanmajid
Author

OK, that pretty much fixed it, but it still happens occasionally. What about the other questions I had?

"How exactly does backchanneling work? When the user speaks and the LLM stays silent, do we stream words from constants? Are those words relevant to the conversation? Are they in the voice we chose? Second, are they pulled from S3 or from local files? How exactly does it work?

Also, could you explain how use_fillers works? I can't hear the agent using fillers during the conversation when this is turned on."
