Make Your Web Page Talk With espeakng.js

RSS  •  Permalink  •  Created 13 Nov 2016  •  Written by Alberto Pettarin

A few weeks ago I stumbled upon this demo by Eitan Isaacson, which shows a text-to-speech engine written in Javascript, running entirely on the client side and outputting audio via the Web Audio API.

Eitan used emscripten to cross-compile the eSpeak text-to-speech (TTS) engine from C++ to Javascript. The process simply required him to write a couple of "glue" files, translating things like namespaces, function names, and types into their JS counterparts. The heavy lifting was done by emscripten itself, which took the eSpeak code base and compiled it into a JS library, composed of the main espeak.js, a JS worker (espeak.worker.js), and a data file (espeak.worker.data) containing the binary data needed to synthesize text in the various supported languages.
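To give a flavor of what such glue code looks like, here is a minimal sketch using Module.cwrap, the emscripten helper that wraps a compiled C function in a callable JS function. The wrapped function, espeak_SetVoiceByName, is part of the eSpeak C API; the snippet is purely illustrative, as the actual glue files of the port do quite a bit more (callbacks, buffers, worker messaging).

// Minimal sketch of emscripten "glue": Module.cwrap wraps a compiled C
// function in a JS function. Illustrative only; the real glue files
// also handle callbacks, audio buffers, and worker messaging.
var espeak_SetVoiceByName = Module.cwrap(
    'espeak_SetVoiceByName', // the C symbol compiled into the JS library
    'number',                // the C return value (an error code) maps to a JS number
    ['string']               // a C "const char *" argument maps to a JS string
);

var err = espeak_SetVoiceByName('en'); // call into the compiled C code from JS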

From a technical point of view, I was impressed by the smoothness of the C++-to-JS translation and by the relatively modest size of the compiled JS+data (less than 3 MB). More importantly, I was excited by the potential applications that a client-side TTS library can enable, especially as an accessibility tool. For example, such a TTS library can be the basic building block for a tool similar to ReadSpeaker, but open source/free software, and even able to work offline!

Hence, I wondered whether the same translation to JS could be done for eSpeak-ng, the evolution of eSpeak led by Reece Dunn after Jonathan Duddington (the original developer of eSpeak) disappeared roughly a year and a half ago.

Since Eitan did the original port, the eSpeak(-ng) code base has been refactored into pure C, and the build process has moved from a monolithic, hand-written Makefile to autotools. These changes required some edits to the emscripten build process; on the other hand, since the eSpeak-ng C API is backward-compatible with the eSpeak C API, no significant changes were required in the "glue" code.

I had never worked with emscripten before, so I had to spend a couple of hours understanding how it works. Once that was done, it took me less than an hour to figure out how to compile eSpeak-ng into Javascript and to use the resulting JS files in Eitan's original demo.

At this point, I decided to publish the resulting code, so that others could benefit from it. I contacted Eitan because I wanted to share credit with him, since most of his code was basically unchanged except for some cosmetic edits, and he was very helpful in ironing out the last wrinkles. Eventually, we contacted Reece Dunn, who agreed to merge the emscripten port into the main eSpeak-ng repository.

(Note: Reece has already merged an initial commit of the emscripten port into the master branch of the eSpeak-ng repository. A few days later, however, I opened a new pull request with a few clarifications to the inner README about using the JS port; it is currently still open, but Reece will merge it at the first opportunity. Until then, you can refer to my personal repo.)

If you just want to play with espeakng.js, I put a working demo here. For it to work, you need JS enabled and a browser supporting Web Workers and the Web Audio API.
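If you want to check those two requirements programmatically before loading the library, a standard feature-detection snippet like the following will do (webkitAudioContext covers older WebKit-based browsers):

// Detect the two features the demo needs: Web Workers and Web Audio.
var hasWorkers = (typeof Worker !== 'undefined');
var AudioContextClass = window.AudioContext || window.webkitAudioContext;

if (hasWorkers && AudioContextClass) {
    // OK: load espeakng.js and start synthesizing
} else {
    // fall back gracefully, e.g. hide the TTS controls
}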

The README on GitHub and the provided example should be sufficient to get you started using espeakng.js for your own project.
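For the impatient, here is a minimal usage sketch. I am writing the constructor and method names (eSpeakNG, set_voice, synthesize) from memory of the README, so treat them as assumptions and double-check them against the repository:

// Minimal sketch; names quoted from memory of the README, check the repo.
var tts = new eSpeakNG('js/espeakng.worker.js', function () {
    // the worker has loaded its voice data and is ready
    tts.set_voice('en');
    tts.synthesize('Hello, world!', function (samples, events) {
        // "samples" holds raw audio data to be fed to the Web Audio API
    });
});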

Finally, please note that the latest compiled version of espeakng.js can be downloaded from the jsdelivr CDN:

https://cdn.jsdelivr.net/espeakng.js/latest/espeakng.min.js
https://cdn.jsdelivr.net/espeakng.js/latest/espeakng.worker.js
https://cdn.jsdelivr.net/espeakng.js/latest/espeakng.worker.data

but remember that you cannot load the Web Worker directly from the CDN: hence, you need to download the above three files from the CDN into your development or production environment, and serve them from your own domain/origin, like the rest of your JS files.
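The restriction comes from the same-origin policy: a (classic) Web Worker script must be loaded from the same origin as the page, and a cross-origin URL makes the Worker constructor throw a SecurityError. In practice, assuming you copied the three files into a local js/ directory, you would do something like:

// Fails: the worker script cannot be loaded cross-origin from the CDN.
// new Worker('https://cdn.jsdelivr.net/espeakng.js/latest/espeakng.worker.js');

// Works: the three files are served from your own origin,
// and the library is pointed at the local worker script.
var tts = new eSpeakNG('js/espeakng.worker.js', onReady); // onReady: your callback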

A Comment on the Web Speech API

Since I started exploring technologies for text synthesis on the client side, I have seen Jiminy Panoz experimenting with the Web Speech API.

This W3C API aims to standardize the interfaces of the text-to-speech (TTS) and speech-to-text (STT) functions that browsers make available to JS engines.
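On the TTS side, using the API is pleasantly simple; for instance, the following snippet asks the browser to speak a sentence with whatever engine sits behind it:

// Speak a sentence via the (TTS half of the) Web Speech API.
var utterance = new SpeechSynthesisUtterance('Hello, world!');
utterance.lang = 'en-US'; // request an English voice
utterance.rate = 1.0;     // normal speaking rate
window.speechSynthesis.speak(utterance);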

One might think that such an API makes projects like espeakng.js irrelevant: once those TTS functions are made available "natively" in browsers, there will be no need for external libraries like espeakng.js.

I do not think this is exactly the case.

Let's ignore for a moment the fact that support for this experimental API is currently sparse (see MDN), while espeakng.js works in today's browsers.

There is still a key observation to be made. The specification reads:

The API itself is agnostic of the underlying speech recognition and synthesis implementation and can support both server-based and client-based/embedded recognition and synthesis.

which basically means that the user of those APIs has no direct way of knowing (let alone controlling) which TTS engine is actually used to synthesize text, unless the browser vendor provides full documentation for it (e.g., whether the TTS engine is embedded in the browser, or whether the browser uses TTS services from an online provider or from the operating system).
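Indeed, the most the API will tell you about a voice is a handful of descriptive attributes; the engine behind it stays opaque:

// Enumerate the available voices: each one exposes a name, a language,
// and a localService flag, but nothing about the actual synthesizer.
// (In some browsers the list is populated asynchronously.)
window.speechSynthesis.getVoices().forEach(function (voice) {
    console.log(voice.name, voice.lang, voice.localService ? 'local' : 'remote');
});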

While I find it extremely interesting that STT and TTS functions are coming to the Web as standardized APIs, their actual implementations matter a lot. Entire companies like Nuance, Ivona, or Speechmatics exist because of subtle differences in their STT or TTS software, which give them an edge in one application domain or another.

Moreover, since eSpeak-ng (and, in turn, espeakng.js) is free software maintained and improved by volunteers around the globe, it covers a lot of "minor" languages that are not palatable to TTS companies, as they represent tiny markets. Examples include Amharic, Kyrgyz, and Icelandic.

For these reasons, I believe that, even when the Web Speech API is fully supported by all major browsers, there will still be room for libraries like espeakng.js.

If you give it a try, let me know what you think!