Audio-eBooks: using Media Overlays in reflowable EPUB 3 eBooks

Created 22 Jul 2014  •  Updated 24 Jul 2014  •  Written by Alberto Pettarin

I share some lessons learned working with "Audio-eBooks", which are reflowable EPUB 3 eBooks with embedded audio and Media Overlays.

Note: this article was originally written for EPUBZone at their invitation. More than six weeks after submitting it, I had not heard back from them, so I decided to publish it here.

EDIT 2014-07-24: today I got an email saying that my article is "too commercial" for EPUBZone. Alright, no problem, you can still read it here.

Media Overlays 101

First, let me briefly recall the abstract concept of Media Overlays (MO). Given a list of text fragments and a corresponding audio narration, an MO is a list of assertions like:

  • "the first fragment of the text is narrated between 0s and 11s in the audio file",
  • "the second fragment between 11s and 23s",
  • "the third fragment between 23s and 29s", etc.

The "fragment" might be a paragraph, a sentence, a group of words, or even a single word. The EPUB 3 MO specification requires this information to be described using a SMIL file embedded inside the EPUB container.
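To make the three assertions above concrete, here is a minimal sketch of how they could be serialized as an EPUB 3 Media Overlay document, where each `<par>` pairs a `<text>` reference with an `<audio>` clip. The function name `build_smil`, the fragment ids (`f001`, ...), and the file names are assumptions made up for this illustration; a real production file would also need to be declared in the package document.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"


def build_smil(fragments, text_href, audio_href):
    """Serialize (fragment_id, begin_s, end_s) triples as a SMIL string.

    fragment_id is the id of an element in the XHTML file text_href;
    begin_s and end_s are offsets in seconds into the audio file.
    """
    # Emit the SMIL namespace as the default namespace (no prefix).
    ET.register_namespace("", SMIL_NS)
    smil = ET.Element(f"{{{SMIL_NS}}}smil", {"version": "3.0"})
    body = ET.SubElement(smil, f"{{{SMIL_NS}}}body")
    for index, (fragment_id, begin_s, end_s) in enumerate(fragments, start=1):
        # One <par> per assertion: "this text is narrated in this clip".
        par = ET.SubElement(body, f"{{{SMIL_NS}}}par", {"id": f"par{index}"})
        ET.SubElement(
            par, f"{{{SMIL_NS}}}text", {"src": f"{text_href}#{fragment_id}"}
        )
        ET.SubElement(
            par,
            f"{{{SMIL_NS}}}audio",
            {
                "src": audio_href,
                "clipBegin": f"{begin_s}s",
                "clipEnd": f"{end_s}s",
            },
        )
    return ET.tostring(smil, encoding="unicode")


# Hypothetical file names and ids, with the timings of the example above.
EXAMPLE = build_smil(
    [("f001", 0, 11), ("f002", 11, 23), ("f003", 23, 29)],
    "chapter1.xhtml",
    "audio/chapter1.mp3",
)
```

Printing `EXAMPLE` shows one `<par>` element per text fragment, which is all a reading system needs to know which text to highlight while a given audio clip plays.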

For a detailed introduction to Media Overlays, see the excellent post by Matt Garrish.

Media Overlays and the benefits of reading+listening

"Multimodal reading" or "double exposure" (i.e., reading and listening at the same time) has been the subject of dozens of research studies, which demonstrated that it yields a deeper and longer-lasting comprehension of the text when compared to just reading or just listening.

Moreover, the narration of a (professional) human speaker is very helpful in learning contexts, for example foreign language learning, where text-to-speech (TTS) approaches are suboptimal since they miss many of the features of the human voice (e.g., proper intonation and correct pronunciation) that are essential for the learning process.

Additionally, until TTS engines become aware of the text semantics and are able to mimic the voice of the best human narrators, the users who read and listen to books for leisure will surely continue to prefer the recorded performances of their favorite narrators over audio synthesized by a TTS engine.

But what is the role of MO and why should we include them in a digital book with audio?

The information contained in MO allows a reading system (RS) to enhance the simultaneous visual and aural rendition of the publication, for example by highlighting the words being spoken or by enabling the tap-text-to-play-it feature. This makes the user experience:

  • more engaging and emotively compelling,
  • easier (more ergonomic) to enjoy, and
  • accessible to a wider spectrum of people.

For example, people affected by dyslexia find the synchronous highlighting very helpful to better understand the book contents, not just thanks to the double exposure but also thanks to the possibility of navigating the text by simply touching a written word and having the audio restart from there. (And "normal" readers like this feature too!)
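Both features reduce to simple lookups on the timing information carried by the MO. The following is a minimal sketch, not the implementation of any particular RS: the in-memory `SYNC_MAP`, the fragment ids, and the function names are assumptions made up for this illustration, and the timings are those of the example in the first section.

```python
from bisect import bisect_right

# Hypothetical sync map: (fragment id, clip begin, clip end), in seconds,
# sorted by clip begin.
SYNC_MAP = [
    ("f001", 0.0, 11.0),
    ("f002", 11.0, 23.0),
    ("f003", 23.0, 29.0),
]

BEGINS = [begin for _, begin, _ in SYNC_MAP]
BEGIN_BY_ID = {fragment_id: begin for fragment_id, begin, _ in SYNC_MAP}


def fragment_at(time_s):
    """Synchronous highlighting: id of the fragment narrated at time_s.

    Returns None if no fragment covers that position in the audio.
    """
    # Index of the last fragment whose clip begins at or before time_s.
    index = bisect_right(BEGINS, time_s) - 1
    if index < 0:
        return None
    fragment_id, begin, end = SYNC_MAP[index]
    return fragment_id if begin <= time_s < end else None


def seek_time_for(fragment_id):
    """Tap-text-to-play: audio position to jump to for a tapped fragment."""
    return BEGIN_BY_ID[fragment_id]
```

A player would call `fragment_at` on each playback time update to decide which fragment to highlight, and `seek_time_for` when the user taps a fragment. The binary search keeps the lookup cheap even for books with tens of thousands of fragments.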

The state of the art

So far, MO have seen wide commercial adoption only in Fixed Layout (FXL) documents (especially children's books), as they feature a limited amount of text (15 to 60 minutes, once narrated).

In fact, from an authoring perspective, the two major problems are:

  • Problem 1: producing high quality audio recordings is resource-intensive, requiring the work of professional narrators and sound editors, and expensive equipment.
  • Problem 2: creating the SMIL files is a tedious, manual process.

Let me note that, for a good number of titles, Problem 1 is immaterial, since the audiobook version of the book has already been recorded. On this point, see also a previous post by Abigail Fenton of HarperCollins UK.

However, Problem 2 remained a major technical obstacle to the mass production of longer, MO-enabled publications, like unabridged long novels.

At least, until now.

Media Overlays for reflowable EPUB 3: a pilot project

In December 2012, my company ReadBeyond (at the time called Smuuks) started producing the series "The Voices of The Classics" in Audio-eBook format for the Italian audiobook publisher il Narratore audiolibri. Each of these reflowable EPUB 3 eBooks features the audio recording of the work performed by a professional narrator, and includes Media Overlays to synchronize the audio with the text.

The most interesting technical aspect of the project is perhaps the production of the SMIL files, especially considering that we had to deal with very long, unabridged works like "Moby Dick" or "I promessi sposi", which are 25+ hours long, and have 30,000+ SMIL fragments (at sub-sentence level).

Taking a manual approach was not feasible. Instead, we developed Sync, a tool that takes a text file and an audio file (containing its narration) as input, and outputs the corresponding SMIL file. The computation has no human in the loop, so we can scale massively by processing different audio tracks in parallel. (If you are interested in playing with it, Sync is currently in free beta, and it supports 17 languages.)

IMPORTANT UPDATE (2015-05-25): ReadBeyond Sync was retired. The source code of the audio/text aligner has been published on GitHub, so you can use it for free on your own machine. See this post for details.

Problem 2 solved.

But this was just the beginning of the story.

Challenges and lessons learned

A first design choice we faced was the SMIL fragment granularity for our Audio-eBooks. After some testing with the publisher and the users, we found that sub-sentence level granularity was the best option. In fact, word level highlighting proved distracting for long texts like ours, due to an annoying "catch me if you can" effect.

Audio-eBooks have a potential problem: they are very large files. Even when compressed to Mono/44.1 kHz/128 kbps MP3 format, several hours of audio can easily amount to hundreds of megabytes. Dealing with such large files might prove problematic for the distribution platform and/or the online stores. (In our case, the distribution platform required minor modifications to accept our large files.)
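A quick back-of-the-envelope calculation shows where these sizes come from. The sketch below assumes a constant bitrate and ignores metadata and the (much smaller) text, SMIL, and image files; the function name is made up for this illustration.

```python
def mp3_size_megabytes(hours, bitrate_kbps=128):
    """Approximate size of a constant-bitrate MP3, in MB (10**6 bytes)."""
    # 1 kbps = 1000 bits per second; 8 bits per byte.
    bytes_per_second = bitrate_kbps * 1000 / 8
    return bytes_per_second * hours * 3600 / 1_000_000


ONE_HOUR_MB = mp3_size_megabytes(1)       # one hour of narration
LONG_NOVEL_MB = mp3_size_megabytes(25)    # a 25-hour unabridged work
```

At 128 kbps the audio takes 16,000 bytes per second, that is, about 57.6 MB per hour, so a 25-hour recording like the ones mentioned above weighs roughly 1.4 GB.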

I want to note that the publisher smartly chose to sell their Audio-eBooks without any DRM or watermark, as they were already doing for their audiobooks in MP3 format. This choice offers several advantages:

  1. the Audio-eBooks are readable by any RS accepting EPUB 3 files;
  2. the risk of running into trouble with DRM/watermark systems (due to the large file size) is entirely avoided;
  3. the user is trusted and respected, and retains the freedom of enjoying her Audio-eBook on an unlimited number of devices.

Speaking of RS, the major problem we have faced so far has been the lack of support for MO in reflowable EPUB 3 eBooks. In December 2012, basically only two RS (Readium and AZARDI) were able to open a reflowable eBook and activate the synchronous highlighting described by MO, and they were available only for desktop systems. A few more apps like iBooks supported MO, but only for FXL eBooks. In 2013 the situation improved a bit, and several new apps started supporting reflowable EPUB 3 with MO, including some apps for mobile platforms.

Unfortunately, the user experience that these apps provided did not match our expectations, and the comments of the customers were clear. While they praised the quality of the "product" Audio-eBook, they complained about UI/UX difficulties with the existing RS, especially on mobile platforms. In September 2013, ReadBeyond decided to take action, and started developing Menestrello, an app for Android and iOS specifically designed for EPUB 3 Audio-eBooks. (Menestrello is the Italian word for "minstrel".) The app is now quite mature and it is freely available on Google Play and the App Store. It is not bound to any store or publisher, allowing readers (and practitioners) to freely and anonymously load their (DRM-free) EPUB 3 Audio-eBooks.

I mention Menestrello because we learned that crafting a great eBook is not sufficient to create a great eBook experience: the RS plays a big role, and it must be carefully designed. This is especially true for Audio-eBooks, where the app needs to manage functionalities and user interactions which are more complex than those required by "text only" eBooks. The native support for MO that Readium SDK will soon ship is a very important contribution indeed, but, as noted above, the UI/UX of the RS is even more crucial to create a pleasant user experience.

Speaking of user experience, if you want to try first-hand what an Audio-eBook is, ReadBeyond recently published a series of 111 English Short Stories, freely available at this Web page.

Open problems and final remarks

Due to space limitations, I omitted many technical details, and focused on the high level challenges and lessons learned working with reflowable EPUB 3 and Media Overlays. Please feel free to contact me if you want more information on our projects.

I want to conclude by listing some open problems in the area, hoping that many will join us in advocating for and tackling them.

  1. Wider support for MO in reflowable EPUB 3 eBooks in popular RS.
  2. Support for defining multi-level SMIL files, allowing the user to select the desired level of granularity of the text fragments "at read time".
  3. Support in MO for multiple <text> elements in the same <par>, which would enable interesting applications like multi-language Audio-eBooks.
  4. Support for loading audio from a remote location, possibly with local caching if offline utilization is desired at a later time.
  5. Mixed text- and audio-driven automated splitting of the input text.
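To illustrate the idea behind item 2, here is a minimal sketch of how a coarser level could be derived "at read time" from a finer one, by merging consecutive fragments that belong to the same sentence. This is only an illustration of the concept, not something defined by the MO specification or implemented by an existing RS; the data layout (a sentence number attached to each fragment) and the function name are assumptions.

```python
from itertools import groupby

# Hypothetical sub-sentence sync map:
# (sentence number, clip begin, clip end), in seconds, in reading order.
FINE = [
    (1, 0.0, 4.5),
    (1, 4.5, 11.0),
    (2, 11.0, 23.0),
    (3, 23.0, 26.0),
    (3, 26.0, 29.0),
]


def merge_by_sentence(fine):
    """Merge consecutive sub-sentence fragments into sentence-level ones."""
    coarse = []
    for sentence, group in groupby(fine, key=lambda item: item[0]):
        items = list(group)
        # The merged clip spans from the first begin to the last end.
        coarse.append((sentence, items[0][1], items[-1][2]))
    return coarse


COARSE = merge_by_sentence(FINE)
```

With the sample data, the five sub-sentence fragments collapse into three sentence-level fragments covering 0-11s, 11-23s, and 23-29s. The same grouping could be repeated with a paragraph number to obtain a third level.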

Finally, a bold statement: the potential of Audio-eBooks has not yet been fully explored by the industry, but I believe that a bright and shining future awaits them.