Polyfill integration #79
-
Hi James, great write-up.
I definitely think aligning to the spec is the way to go. My only hesitancy is that we're making the assumption that the spec will ever make it out of draft. I'm glad that Edge has implemented it too now, but given that was likely trivial for Microsoft thanks to Azure, I don't know how much of an indication it is that we'll see support from, for example, Firefox or Chromium derivatives any time soon.

I don't want to come across as overly negative/pessimistic, but audio in the browser is a notoriously complex and slow-moving area. Case in point: createScriptProcessor was deprecated over 5 years ago, but its successor is still in the working-draft phase with very poor browser support, leaving no reliable option for developers.

Given the above, if it turns out there will be a large amount of effort involved in maintaining spec-consistent APIs, we may want to look at the cost-benefit - assuming others share my low expectations of anything like full support in the next few years at least (and assuming the spec doesn't change in the process). For now, I can't think of a better way to standardise, so I guess we just cross our fingers :)
I'm a bit hesitant about patching global objects - I don't think it's too uncommon for polyfills to be exposed via a scoped/different package name (e.g. bluebird promises). Assuming we were spec-compliant, the benefits you're talking about around the implementation not needing to change would still pretty much apply; you'd just import/reference a different class.
I think the issue is that I've called my library a polyfill whereas it's more of a general-purpose library. I wanted to create a simple API that makes speech recognition easy to use. In a way, it's essentially the same as this library, except not React-based. I think the most sensible thing might be to split my repo up and move the "friendly" API downstream, so that we can maintain a general-purpose, spec-compliant polyfill that anyone can use to patch their speech recognition implementations. In the same vein, I think it's fair to assume that other polyfills will also attempt to mirror the spec, so hopefully adapters won't be required.
I think this option works better personally - although I get that it's probably more work from your point of view. I just can't think of any elegant way to implicitly fetch config etc. without polluting the global namespace or introducing needless complexity. It's yet another concession in the "make the implementations identical" objective, but again, it's probably only going to be one line at the top of a file, so hopefully not a huge issue.

In terms of next steps, are you happy for me to go away and refactor my repo to conform to the parts of the spec listed above? I think it will make it easier to consume for not just this library but any similar ones that might pop up in the future.
-
I agree that the draft spec can't be completely relied upon. That said, I think it's the best option we have for now - it's a well-documented API out of the box and has the potential to become standard, though I'm not sure how that could happen until the big cloud providers open up access to their speech recognition services, or perhaps a company like Mozilla builds their own and shares it with other browser vendors. The W3C group did briefly toy with the idea of using streaming APIs (rather like what you're doing in your polyfill) in SpeechRecognition, as well as making the service URI configurable (this was apparently supported in Chrome for a while before being dropped). They seemed to abandon the idea to avoid drastically rewriting the spec (their discussion here). If they ever return to this idea, there's a good chance the spec would change a fair bit.
Makes sense to me. I think it's reasonable to support the spec in the polyfill and build more user-friendly APIs around it elsewhere.
Either approach works for me and would present an equally simple setup for consumers. In terms of writing an MVP polyfill that we can test with this library early on, some of the properties could be simplified or constrained:
I'll leave it up to you to decide how to handle consumers setting unsupported values for these properties. If the given values are invalid, you could fail gracefully and fall back to sensible supported values, or you could fail loudly and throw an error to force the consumer to provide a supported value. I'm not too fussed either way - any property constraints could be documented in your polyfill's README.
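For example, the graceful option might look something like this sketch (the function and supported values are made up for illustration):

```ts
// Sketch of the "fail gracefully" option (supported values are illustrative)
const SUPPORTED_LANGS = ['en-US', 'en-GB']
const DEFAULT_LANG = 'en-US'

const resolveLang = (requested: string): string => {
  if (SUPPORTED_LANGS.includes(requested)) {
    return requested
  }
  // Warn the consumer, then fall back to a sensible supported value
  console.warn(`lang "${requested}" is not supported; falling back to ${DEFAULT_LANG}`)
  return DEFAULT_LANG
}
```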
Yes, that would be great. While you do that, I can make the SpeechRecognition object configurable in this library so your polyfill can be passed in.
-
Really interesting, thanks
Veering towards the former, but we'll see. I think realistically it'll be January before I have a first pass ready to share. It's exciting to be working on this - Gartner predicts a huge proliferation of browser-based voice-enabled apps in 2021, as voice search overtook text search for the first time this year. The web is absolutely not ready for that right now, and being part of laying the foundations in preparation for wider adoption is a great place to be. Maybe you'd be up for looking at speech synthesis afterwards too? :) Anyway, I'll get started and let you know on here when I have something to show. Cheers.
-
How's the AWS Transcribe polyfill coming along, @ceuk? I've just made a release with polyfill support - we've currently got one polyfill working (more or less) for Azure Cognitive Services.
-
Right, apologies for the delay. It's almost at a good first version, I think - just need to write some tests. Thought it would be good to get some eyes on it while I do.

The polyfill now behaves more or less identically to the native implementation, with the main caveats being that you have to import and use the polyfill directly, and that when you instantiate it you have to provide some AWS config (region and identity pool ID). As you advised in your previous post, I've been unable to support some of the properties and have chosen not to support others in this version. You can see the full support table here. Most should be self-explanatory, though there was one property that confused me. In terms of future versions, there are a few properties that look like the best candidates for implementing next.

I've also re-written it in TypeScript, so we have definitions out of the box now too, which is nice. Anyway, let me know your thoughts. Cheers
-
Yes, exactly.
Aha, totally missed that. That's a shame.
Off the top of my head, a couple of ways I'd try doing this:
-
Hi James,

That all makes sense. I already have an [isSupported](https://github.com/ceuk/speech-recognition-aws-polyfill/blob/master/src/recognizers/aws.ts#L24) prop but happy to rename. The language and isomorphism stuff should both be pretty trivial, so we may as well stick them in the next version in addition to the Safari issue.

With regards to the quality of the recognition: I might have messed around with the transcoding stuff since you last tested. I did a comparison again when I was doing the continuous stuff and it seems a lot better now, so it might be worth seeing what you think.

I'll try to get all the above done as soon as I can, but it probably won't be for a couple of weeks. I just need a day free on the weekend at some point and should be able to crack through it. I'll let you know when it's ready anyway.
-
Problem
`react-speech-recognition` depends on the SpeechRecognition part of the Web Speech API (W3C spec) to collect audio from the microphone and transcribe it. Unfortunately, this is an experimental API that is almost exclusively implemented by Google browsers, which make calls to a Google speech recognition service. For browsers that aren't owned by giant tech companies that can afford to use their own speech recognition services, this isn't an option without money being exchanged behind closed doors. The frustration this has caused the developers of Chromium-based browsers in particular is nicely captured in this post. This limited support has two outcomes:

Solution
Ideally, `react-speech-recognition` could be utilised on any major browser to enable voice-driven web experiences everywhere and encourage more developers to experiment with this technology. Furthermore, any audio data produced by the users of such experiences should be processed by the owners of the web apps, rather than Google.

One solution is to polyfill SpeechRecognition with implementations that use popular cloud services to perform the audio processing. In other words, fill in that missing feature on browsers that don't support SpeechRecognition. While this means that web developers will need to pay for and deploy their own speech recognition services, it will free them from the existing constraints of the API. At a high level, this will entail the following:

- polyfills that replicate the SpeechRecognition interface on top of cloud speech services
- integration with `react-speech-recognition` so that the transcription can be injected into the developer's React app

A starting point
A first pass at an AWS Transcribe polyfill has been created here. This handles all the interactions with the AWS SDK and the WebSockets-based audio streaming, presenting it in a simple API.
Given that AWS is the cloud provider of choice for most developers, it seems reasonable for this polyfill to be the first to be integrated with `react-speech-recognition`.

How to integrate polyfills with React Speech Recognition?
This is where this discussion comes in. My thoughts on this are as follows...
Reusing the W3C spec
One decision to make is what interface should be used for communication between this library and a SpeechRecognition polyfill. I'm of the opinion that, if possible, we should utilise the one that's already been established and well documented by Mozilla. `react-speech-recognition` is already tightly coupled to this API and its own interface reflects this (e.g. the options for configuring the processing of "interim results" and "continuous mode"). If other polyfill authors come along, they will have a well-defined standard from the W3C (albeit a draft that is heavily influenced by Google) to base their implementations on.

If a polyfill (a) implemented the existing spec, (b) patched the implementation into the `window` object, and (c) had sensible fallbacks or warnings for the parts that were not implemented yet, then the integration with `react-speech-recognition` could be very simple: the interface would not change, the implementation would be available in the same place, and the two libraries would not need to know anything about each other. In an ideal world, an example usage might look like this:
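Something like the following sketch - the polyfill package name and the way it patches `window` on import are assumptions for illustration:

```tsx
// Hypothetical polyfill package: importing it patches window.SpeechRecognition
// when no native implementation exists (the package name is illustrative).
import 'speech-recognition-aws-polyfill'
import React from 'react'
import { useSpeechRecognition } from 'react-speech-recognition'

const Dictaphone = () => {
  // react-speech-recognition finds the (polyfilled) implementation on window as usual
  const { transcript, listening } = useSpeechRecognition()
  return <p>{listening ? 'Listening: ' : ''}{transcript}</p>
}

export default Dictaphone
```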
Configuring the client for the cloud provider

One challenge would be configuring the polyfill with the credentials needed for the given cloud provider. A couple of options come to mind:

- The polyfill takes the credentials itself and applies the patch to `window`, either unconditionally (`fallbackOnly: false`) or just as a fallback when the native browser implementation is not available (`fallbackOnly: true`). The logic in the polyfill could be something like the first sketch below.
- The polyfill is passed into `react-speech-recognition` as an override for SpeechRecognition, as in the second sketch after that.
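A sketch of that fallback logic, under assumed names (`initRecognitionPolyfill`, `createAwsSpeechRecognition` and the config shape are all hypothetical, not real exports):

```ts
// Sketch of a setup function the polyfill might expose (names are hypothetical)
interface PolyfillConfig {
  region: string
  identityPoolId: string
  fallbackOnly?: boolean
}

// Hypothetical factory implemented elsewhere in the polyfill
declare function createAwsSpeechRecognition(config: PolyfillConfig): unknown

const initRecognitionPolyfill = (config: PolyfillConfig) => {
  const globals = window as any
  const native = globals.SpeechRecognition || globals.webkitSpeechRecognition
  if (config.fallbackOnly && native) {
    return // a native implementation exists - leave it untouched
  }
  // Otherwise patch the cloud-backed implementation into the global scope
  globals.SpeechRecognition = createAwsSpeechRecognition(config)
}
```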
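And a sketch of the override option - the `applyPolyfill` method on `react-speech-recognition` is hypothetical here, since no such API exists yet:

```ts
// Hypothetical override API (method and package export names are illustrative)
import AwsSpeechRecognition from 'speech-recognition-aws-polyfill'
import SpeechRecognition from 'react-speech-recognition'

// Tell react-speech-recognition to use the polyfill instead of window.SpeechRecognition
SpeechRecognition.applyPolyfill(AwsSpeechRecognition)
```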
How much of the spec needs to be implemented?

Definitely not the whole thing. `react-speech-recognition` only uses a subset, which consists of the following:

- `continuous` (property)
- `lang` (property)
- `interimResults` (property)
- `onresult` (property). On the events received, the following properties are used:
  - `event.resultIndex`
  - `event.results[i].isFinal`
  - `event.results[i][0].transcript`
  - `event.results[i][0].confidence`
- `onend` (property)
- `start` (method)
- `stop` (method)
- `abort` (method)

Even amongst these, some could be skipped in a basic polyfill as long as the missing pieces were documented. For example, if the values for `continuous` and `lang` were limited, warnings or even errors could be thrown if a user tried to set them to an unsupported value. The concept of an "interim result" could also be optional.
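For illustration, the throwing variant could be as simple as this sketch (the class name and supported values are made up):

```ts
// Sketch: a polyfill property that throws on unsupported values (illustrative)
const SUPPORTED_LANGS = ['en-US', 'en-GB']

class PolyfillSpeechRecognition {
  private _lang = 'en-US'

  get lang(): string {
    return this._lang
  }

  set lang(value: string) {
    if (!SUPPORTED_LANGS.includes(value)) {
      // Fail loudly so the consumer knows to provide a supported value
      throw new Error(`lang "${value}" is not supported by this polyfill`)
    }
    this._lang = value
  }
}
```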
Standardising polyfills across cloud providers

AWS Transcribe is not the only service that could be used for this purpose - there are others (Azure, GCP, maybe IBM?). AWS may not be the best choice for a company that has gone all-in on Azure services, for example. So each of these could have their own polyfill to suit the needs of each consumer.

If the polyfills were based on the W3C spec, the spec could potentially be represented by a stub class with all properties and methods provided with "not implemented" warnings. Then polyfills could extend this class, fill in whatever parts they can, and add their own warnings to parts for which they have limited support (e.g. maybe only a limited set of languages can be supported by a polyfill via the `lang` property).
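A rough sketch of what that stub could look like (everything here is illustrative):

```ts
// Sketch of a shared stub base class (names and warning style are illustrative)
class SpeechRecognitionStub {
  continuous = false
  lang = 'en-US'
  interimResults = false
  onresult: ((event: unknown) => void) | null = null
  onend: (() => void) | null = null

  start(): void {
    this.warnNotImplemented('start')
  }

  stop(): void {
    this.warnNotImplemented('stop')
  }

  abort(): void {
    this.warnNotImplemented('abort')
  }

  protected warnNotImplemented(name: string): void {
    console.warn(`SpeechRecognition.${name} is not implemented by this polyfill`)
  }
}

// A concrete polyfill overrides what it supports and inherits warnings for the rest
class AwsTranscribeRecognition extends SpeechRecognitionStub {
  start(): void {
    // ...begin streaming microphone audio to AWS Transcribe
  }
}
```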
One suggestion I have is for all these polyfills to eventually live in the same repo and share this stub class. This would help ensure they share a standardised interface and have consistent behaviour. Then all consumers of these polyfills (not just `react-speech-recognition`) could safely swap one for another whenever they change their cloud provider. This might take the form of a small Lerna monorepo, with each polyfill being published as a separate package but sharing the base class under the hood. If that base changed, all the polyfills could be updated simultaneously.
Alternative: Adapters

The polyfill authors may choose to design their own APIs and diverge from the W3C spec. In that case, `react-speech-recognition` would need to maintain Adapters for each polyfill. While this does give the polyfill authors freedom to create APIs that better suit the features offered by their cloud providers (maybe AWS Transcribe can do things that the SpeechRecognition API has no method for), it does create a few challenges in this repo, such as bloating `react-speech-recognition` with cloud provider SDKs in Webpack bundles.

Thinking about this more: if the Adapters simply converted each polyfill's API to the W3C interface (which `react-speech-recognition` currently consumes), they could potentially be owned by the polyfill authors. This would make more sense given the polyfill authors would know better than anyone how to write an Adapter for their polyfills.
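For what it's worth, an Adapter along those lines might be shaped like this sketch - the polyfill interface it wraps is imagined purely for illustration:

```ts
// Sketch of an Adapter mapping an imagined polyfill API onto the W3C shape
interface HypotheticalPolyfill {
  start(options: { language: string }): void
  stop(): void
  onTranscript: ((text: string, isFinal: boolean) => void) | null
}

class PolyfillAdapter {
  lang = 'en-US'
  onresult: ((event: { resultIndex: number; results: unknown[] }) => void) | null = null
  onend: (() => void) | null = null

  constructor(private inner: HypotheticalPolyfill) {}

  start(): void {
    this.inner.onTranscript = (transcript, isFinal) => {
      // Re-shape the polyfill's callback into a W3C-style result event
      const result = Object.assign([{ transcript, confidence: 1 }], { isFinal })
      this.onresult?.({ resultIndex: 0, results: [result] })
    }
    this.inner.start({ language: this.lang })
  }

  stop(): void {
    this.inner.stop()
    this.onend?.()
  }
}
```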
TL;DR

I'm in favour of this AWS speech recognition polyfill being modified to implement enough of this spec to enable the most basic functionality in `react-speech-recognition`. I would be happy to collaborate on this.

And of course, I'm interested to hear others' opinions and alternative designs. Polyfill integration is an exciting proposition, as it unlocks speech recognition experiences for more of the web and gives developers a means of building these experiences cross-platform.