This is Bespoken's open-source benchmarking project. It provides a general mechanism for testing and evaluating NLP platforms.
We have conducted two tests so far:
- Voice assistant open-domain question answering - see the results
- Digital contact center ASR - see the results
We interact with the voice assistants using the Bespoken Device Service, which allows us to interact exactly as a real person would with an actual device. Read more here.
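For illustration, here is a minimal sketch of sending a single utterance through a virtual device. It assumes the virtual-device-sdk package and a placeholder token; see the linked docs for the authoritative API:

```typescript
// Minimal sketch: send a spoken utterance to a voice assistant through
// a Bespoken virtual device and print the reply. The class and method
// follow our reading of the virtual-device-sdk; the token is a
// placeholder obtained from the Bespoken dashboard.
import { VirtualDevice } from "virtual-device-sdk";

async function ask(question: string): Promise<void> {
  const device = new VirtualDevice("<YOUR_VIRTUAL_DEVICE_TOKEN>");
  const result = await device.message(question);
  console.log(`Q: ${question}`);
  console.log(`A: ${result.transcript}`); // the assistant's reply as text
}

ask("who wrote the iliad").catch(console.error);
```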
For running the tests and collecting the results, we leverage our batch testing framework:
https://gitlab.com/bespoken/batch-tester
Results are meant to be published on a bi-monthly basis. The table below summarizes our tests and results to date:
| Date | Test Type | Data Set | Platforms | Results |
| --- | --- | --- | --- | --- |
| 7/26/2020 | General Knowledge | ComQA | Alexa, Google Assistant, Siri | Link |
| 11/20/2020 | Speech Recognition | DefinedCrowd | Amazon Connect, Google Dialogflow, Twilio Voice | Link |
The published results are viewable here:
https://benchmark.bespoken.io
We classify a response as correct or incorrect based on whether it contains the answer from the dataset. Where the dataset provides multiple acceptable answers, we count the response as correct if any one of them is present. Read more here.
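To make the scoring rule concrete, here is a minimal sketch of that containment check (the function name and normalization are ours, for illustration only, not taken from the actual test code):

```typescript
// Sketch of the scoring rule described above: a platform's response is
// judged correct if it contains any of the dataset's accepted answers.
function isCorrect(response: string, acceptedAnswers: string[]): boolean {
  const normalized = response.toLowerCase();
  return acceptedAnswers.some(
    (answer) => normalized.includes(answer.toLowerCase()),
  );
}

// Example: a ComQA-style entry with multiple accepted answers
console.log(isCorrect(
  "The Iliad is attributed to Homer.",
  ["homer", "homerus"],
)); // true - "homer" appears in the response
```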
For speech recognition, we take datasets from DefinedCrowd and run them through the various platforms using our Virtual Devices for IVR.
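At a high level, each recording goes through the same loop: play the audio into the platform, collect the transcript, and compare it with the reference text. Below is an illustrative sketch; `sendAudio` is a hypothetical stand-in for the platform-specific virtual device call, and exact string match is used here only for simplicity:

```typescript
// Illustrative sketch of the ASR benchmark loop: play each dataset
// utterance into a platform's IVR endpoint, collect the transcript,
// and compare it with the reference text.
interface Utterance {
  audioUrl: string;     // recording from the DefinedCrowd dataset
  expectedText: string; // reference transcription
}

// Hypothetical stand-in for the platform-specific virtual device call
type SendAudio = (audioUrl: string) => Promise<string>;

async function runAsrBenchmark(
  utterances: Utterance[],
  sendAudio: SendAudio,
): Promise<number> {
  let correct = 0;
  for (const utterance of utterances) {
    const transcript = await sendAudio(utterance.audioUrl);
    if (
      transcript.trim().toLowerCase() ===
      utterance.expectedText.trim().toLowerCase()
    ) {
      correct++;
    }
  }
  // Fraction of utterances transcribed exactly as expected
  return correct / utterances.length;
}
```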
We appreciate all feedback. Open an issue to suggest additional datasets as well as improvements to our methodology.
Contact us at [email protected].