This post describes the steps needed
to create EPUB 3 eBooks with Media Overlays,
also known as "read aloud" eBooks,
with tips and tricks,
and a full EPUB 3 demo.
Media Overlays 101
First, let me briefly recall
that Media Overlays (MO)
is the technical term
for the part of the EPUB 3 specification
specifying that the (visual) rendition of the text of the eBook
must be accompanied by, and synchronized to,
the (aural) rendition of some pre-recorded audio file,
embedded in the eBook container.
This function is commonly called "read aloud",
assuming that the words in the pre-recorded audio
match the written text.
(But this does not need to always be the case.
For instance, an author might want to synchronize
a certain portion of the text with background music,
or, for "artistic" reasons, the audio and the text might have different words.)
Usually, when reading systems support MO,
they do so by activating three functions:
the audio rendition (hopefully with dedicated audio controls),
the synchronous highlighting of the text fragment being narrated, and
the "tap-to-play" function, which allows the user
to tap (or click) on the text, and have the aural rendition
restart from the touched (or clicked) text fragment.
How do Media Overlays work?
The abstract concept is fairly simple:
given a list of text fragments
and a corresponding audio narration,
an MO is a list of assertions like:
"the first fragment of the text is narrated
between 0s and 11s in the audio file",
"the second fragment between 11s and 23s",
"the third fragment between 23s and 29s", etc.
Here "fragment" might be a paragraph,
a sentence, a group of words, or even a single word:
it is up to the author of the eBook
to decide which one best suits the intended user experience.
For example, if you are crafting a children's book with slow narration,
you might want to highlight single words.
If you try to do the same for a "normal speed" Audio-eBook,
it will result in a nasty "catch-me-if-you-can" effect,
effectively forcing you to use a coarser (e.g., sentence level) granularity.
Moreover, if you are going to manually produce the SMIL files,
the finer the granularity, the more work you will need to do.
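The list of assertions described above can be modelled as a small table of time intervals. Here is a minimal Python sketch of that idea, not taken from any reading system: the names FRAGMENTS and active_fragment are hypothetical, and the timings are the illustrative ones quoted in the text.

```python
from bisect import bisect_right

# (fragment name, begin, end) in seconds; illustrative values from the text.
FRAGMENTS = [
    ("first", 0.0, 11.0),
    ("second", 11.0, 23.0),
    ("third", 23.0, 29.0),
]


def active_fragment(fragments, t):
    """Return the name of the fragment narrated at time t, or None."""
    begins = [begin for _, begin, _ in fragments]
    # Index of the last fragment whose begin time is <= t.
    i = bisect_right(begins, t) - 1
    if i < 0:
        return None
    name, begin, end = fragments[i]
    return name if begin <= t < end else None


print(active_fragment(FRAGMENTS, 12.5))
```

Conceptually, highlighting is this lookup repeated as the playback time advances, while "tap-to-play" is the inverse: from a fragment to its begin time.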
The SMIL File Syntax
The EPUB 3 MO specification requires
the sync information to be described using a SMIL file,
similar to the following:
Each of the 15 fragments in this SMIL file specifies four things:
an identifier for the text fragment (e.g., f001 in p001.xhtml);
the corresponding audio file (../Audio/p001.mp3);
the begin time (00:00:00.000);
the end time (00:00:02.680).
In plain English, the SMIL file above says:
f001 in p001.xhtml is "active" between 0s and 2.680s of the playback of audio file p001.mp3,
f002 in p001.xhtml is "active" between 2.680s and 5.480s of the playback of audio file p001.mp3,
and so on.
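To make the four pieces of information concrete, the following Python sketch formats a single <par> element from them. It is only an illustration under stated assumptions: the helper name par_element and the par001 identifier are hypothetical, while the <text>/<audio> children and the clipBegin/clipEnd attributes are the ones defined by the EPUB 3 MO specification.

```python
from xml.sax.saxutils import quoteattr


def par_element(index, text_src, audio_src, clip_begin, clip_end):
    """Format one <par> element pairing a text fragment with an audio clip."""
    return (
        f'<par id="par{index:03d}">\n'  # hypothetical id scheme
        f'  <text src={quoteattr(text_src)}/>\n'
        f'  <audio src={quoteattr(audio_src)} '
        f'clipBegin={quoteattr(clip_begin)} clipEnd={quoteattr(clip_end)}/>\n'
        f'</par>'
    )


# Values of the first fragment described in the text.
first = par_element(
    1, "p001.xhtml#f001", "../Audio/p001.mp3", "00:00:00.000", "00:00:02.680"
)
print(first)
```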
The SMIL representation looks a bit redundant
because the specification allows for very complex scenarios,
where you might reference multiple XHTML/audio files or
have deeper nestings of <seq> and <par> elements.
For details about Media Overlays, see the
by Matt Garrish,
and then RTFM!
In practice, you will almost always see eBooks where
each SMIL file references just one XHTML file and one audio file,
with a simple, linear structure like the one in the example above:
a <seq> (sequence) element that specifies the serial playback of its children, and
a set of <par> (parallel) elements, each specifying the synchronism between its <text> and <audio> children.
A Concrete Example
It is worth noting that a "read aloud" eBook
might have a "Fixed Layout" or a "reflowable" rendition
(or be a hybrid of the two).
Suppose you want to produce a Fixed Layout EPUB 3 with the "read aloud" feature,
containing the first three Sonnets of Shakespeare,
read by Elizabeth Klett for LibriVox.
Where do Media Overlays come into play?
What files should you add to your eBook?
You need to perform the following steps, for each XHTML file you want to attach audio to:
produce an audio file,
add suitable id attributes to elements in the page, one for each desired SMIL fragment,
create a SMIL file specifying the sync information between the text and the audio,
add the highlighting class to the CSS,
add the suitable metadata to the OPF file of the eBook.
Step 1: Creating the audio file
I assume you have already recorded or
you have been given the audio file, so
I will just add a few suggestions from my experience in the trenches:
use MP3 or MP4/AAC formats;
use 44100 Hz or 48000 Hz;
if you are creating an Audio-eBook with hours of audio, use mono 64-128 kbps;
if you are creating a Fixed Layout eBook, use stereo 128 kbps or better;
if your eBook has multiple pages/chapters but you have been given a single audio file,
you should consider splitting it into several audio files, one per page/chapter.
You can easily do that using Audacity,
a free audio editor which has a nice "split into multiple files" function.
This will reduce loading time, provide better publication granularity,
and potentially increase the number of reading systems capable of correctly rendering your eBook.
Step 2: Adding id attributes to the XHTML file
Now you need to decide the granularity of your SMIL fragments.
As mentioned before, you can choose finer (single word)
or coarser (sentence/paragraph) granularity,
depending on your eBooks and the effect you want to produce.
For our example, since we deal with poetry, we choose verse-level granularity.
Suppose you start with this XHTML page:
You need to add an id attribute to each verse,
possibly by creating <div> or <span> elements,
obtaining the following markup:
Clearly this step requires some effort
if you have to perform it on an already existing XHTML file.
If you control your own workflow,
adding <span> and id to the markup automatically is relatively easy,
and the same goes for the splitting of the text,
based on the punctuation or XHTML elements.
(In ReadBeyond we clearly adopt the latter approach.)
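As a rough idea of how the automatic tagging mentioned above might look, here is a minimal Python sketch. It assumes the text is already split into one string per verse; the function name tag_verses is hypothetical, and the f prefix with three digits is just the id scheme used in this post.

```python
from xml.sax.saxutils import escape


def tag_verses(verses, prefix="f", start=1):
    """Wrap each verse in a <span> carrying a sequential id such as f001."""
    tagged = []
    for n, verse in enumerate(verses, start=start):
        # escape() protects &, < and > that may occur in the text.
        tagged.append(f'<span id="{prefix}{n:03d}">{escape(verse)}</span>')
    return tagged


verses = [
    "I",
    "From fairest creatures we desire increase,",
    "That thereby beauty's rose might never die,",
]
for line in tag_verses(verses):
    print(line)
```

A real workflow would also have to decide where the generated <span> elements go inside the existing page markup, which this sketch leaves out.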
Step 3: Creating the SMIL file
Now that you have the audio file and the text fragments,
it is time to create the SMIL file.
Besides the formatting of the actual file,
which is just a matter of respecting the SMIL syntax
(feel free to copy and paste from above!),
the hard part of this step consists in defining
the time intervals in the audio file,
each containing the spoken version of the text of each text fragment.
You have two options:
mark the time labels manually, or
use an automated tool.
3.1 The manual approach
You will basically listen to the audio file,
and create a marker each time the current fragment ends.
In principle, you can directly author the SMIL file,
preparing a template like this:
and playing the audio file: as soon as the next text has been spoken,
pause the audio, and fill the current time in:
However, usually it is easier to use an audio editor
which can export time labels.
I will illustrate the process using Audacity,
an excellent free audio editor.
This is the same procedure suggested in the iBooks Asset Guide.
3.1.1 Open the audio file
Open the audio file (File > Open). You will see a waveform like this:
3.1.2 Create the time labels
Then, select the portion of the audio corresponding to the first text fragment.
In our case, the audio says "One" (whereas the text reads "I").
Press CTRL+B (Linux/Windows) or CMD+B (Mac) to create the label track and the first label:
Now you can input an identifier for the selected time interval.
To make the SMIL formatting faster,
it is useful to give it the id of the associated text fragment
or a value from which the latter can be easily computed.
In our case, we will use 001:
(Note that our id is f001, since id must be a valid XML identifier,
and hence cannot start with a digit.
We will add the prefix f to all the fragments
when formatting the final SMIL file.)
Taking advantage of the "auto-snap" function of Audacity,
which allows you to precisely start a selection
from the boundary of the previous selection,
keep repeating the above procedure for all the remaining fragments:
3.1.3 Export the time labels
Now you need to export the time labels.
Select the menu File > Export Labels:
And then choose a name for the exported file, for example timelabels.txt:
The resulting timelabels.txt is a tab-separated plain text file,
containing on each line:
the begin of the interval (in seconds),
the end of the interval (in seconds), and
the label you chose.
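Reading such a file programmatically takes only a few lines. The Python sketch below parses the three tab-separated fields described above; the function name parse_labels is hypothetical, and the sample string only mimics the layout (the exact number of decimals written by Audacity is an assumption).

```python
def parse_labels(text):
    """Parse label lines of the form begin<TAB>end<TAB>label."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        begin, end, label = float(fields[0]), float(fields[1]), fields[2].strip()
        rows.append((begin, end, label))
    return rows


# Sample mimicking the first two labels of our example.
SAMPLE = "0.000000\t2.680000\t001\n2.680000\t5.480000\t002\n"
print(parse_labels(SAMPLE))
```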
3.1.4 Formatting the SMIL file
Formatting the SMIL file is just a regular expression away.
If you use Vim, open the file of the previous step and give the command:
(note that ^I is the tab character), obtaining:
Adding the header and the footer yields a working SMIL file p001.xhtml.smil:
If you want a prettier output, e.g. with human-readable timings,
you might want to use a more complex regex or a Python script.
Timings finer than the millisecond are generally worthless,
due to the audio latency in the hardware/OS/reading system
of the user device, and the human response time.
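As an example of the Python route, the sketch below turns a list of (begin, end, label) tuples into a complete SMIL document with hh:mm:ss.mmm timings, rounded to the millisecond. Treat it as a starting point rather than a reference implementation: the function names, the seq001/parNNN identifiers, and the f prefix default are assumptions, while the namespaces and attribute names are those of the EPUB 3 MO specification.

```python
def clock(seconds):
    """Format seconds as hh:mm:ss.mmm, rounding to the millisecond."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def labels_to_smil(labels, text_file, audio_file, prefix="f"):
    """Build a SMIL document from (begin, end, label) tuples."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<smil xmlns="http://www.w3.org/ns/SMIL" '
        'xmlns:epub="http://www.idpf.org/2007/ops" version="3.0">',
        ' <body>',
        f'  <seq id="seq001" epub:textref="{text_file}">',  # hypothetical id
    ]
    for n, (begin, end, label) in enumerate(labels, start=1):
        lines += [
            f'   <par id="par{n:03d}">',  # hypothetical id scheme
            f'    <text src="{text_file}#{prefix}{label}"/>',
            f'    <audio src="{audio_file}" '
            f'clipBegin="{clock(begin)}" clipEnd="{clock(end)}"/>',
            '   </par>',
        ]
    lines += ['  </seq>', ' </body>', '</smil>']
    return "\n".join(lines) + "\n"


smil = labels_to_smil(
    [(0.0, 2.68, "001"), (2.68, 5.48, "002")], "p001.xhtml", "../Audio/p001.mp3"
)
print(smil)
```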
Note that Audacity is not the only choice you have,
as there are other tools with a specialized GUI
to help you speed up the production of SMIL files.
These include TOBI (audio + EPUB 3/DAISY authoring tool, Windows only)
and PubCoder (EPUB 3 FXL authoring tool, Mac only).
However, both tools just provide a more "ergonomic" interface,
and you still need to listen to audio and mark the labels manually.
Clearly, this approach is suitable only
for a limited number of fragments.
If you need to process hundreds or thousands of fragments,
e.g. for Audio-eBooks,
you really want to use an automated tool.
3.2 Using ReadBeyond Sync
IMPORTANT UPDATE (2015-05-25):
ReadBeyond Sync was retired.
The source code of the audio/text aligner
has been published on GitHub,
so you can use it for free on your own machine.
See this post for details.
The goal of ReadBeyond Sync is to free you
from the long, boring, and error-prone task of producing SMIL files.
All you need to do is to upload the XHTML file, the audio file,
and specify some options (the language, the output format, etc.),
and hit the "Process" button.
Sync is optimized for computing SMIL files for EPUB 3 eBooks,
and is designed to easily integrate into your workflow,
thanks to lots of options, including flexible output and
batch import of several tasks at once.
In our example, we have three XHTML pages,
each with its corresponding MP3 file.
The easiest way to use Sync is to create a ZIP file
including the following files:
The config.txt file contains the Sync configuration options:
As you can see, it is quite straightforward to read:
it instructs Sync to parse the XHTML files in assets/,
reading the text from elements with id matching f[0-9]+,
and pairing each XHTML file with the corresponding MP3 file.
For each pair (text, audio), it will capture the name (e.g., p001) into $PREFIX,
and it will output a SMIL file named $PREFIX.xhtml.smil,
with the desired src references:
the text reference points to the same directory,
while the audio reference points to ../Audio/,
because we like our EPUB structured like that.
(Sync has lots of options, please see the documentation.)
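To illustrate the pairing logic just described, here is a small Python sketch. It is not Sync's code and does not use Sync's configuration syntax: the function names and the file name pattern are hypothetical, and I assume that ids must match f[0-9]+ in full. It only mimics the idea of capturing a prefix from each XHTML file name, pairing it with the MP3 of the same name, and deriving the SMIL file name from it.

```python
import re

ID_PATTERN = re.compile(r"f[0-9]+")        # id regex quoted in the text
PAGE_PATTERN = re.compile(r"(.+)\.xhtml")  # hypothetical: captures the prefix


def plan_tasks(filenames):
    """Pair each XHTML page with its MP3 and name the SMIL output."""
    names = set(filenames)
    tasks = []
    for name in sorted(names):
        match = PAGE_PATTERN.fullmatch(name)
        if match is None:
            continue  # not an XHTML page
        prefix = match.group(1)
        audio = f"{prefix}.mp3"
        if audio in names:
            tasks.append((name, audio, f"{prefix}.xhtml.smil"))
    return tasks


def fragment_ids(ids):
    """Keep only the ids that match the fragment pattern in full."""
    return [i for i in ids if ID_PATTERN.fullmatch(i)]


files = ["p001.xhtml", "p001.mp3", "p002.xhtml", "p002.mp3", "config.txt"]
print(plan_tasks(files))
print(fragment_ids(["f001", "title", "f002"]))
```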
Observe that you can reuse the above configuration
for other projects, provided that you are consistent with the naming of the files.
Even modifying it on-the-fly is simple!
After processing the uploaded materials,
Sync will generate a ZIP file containing:
For example, p001.xhtml.smil contains:
And the best part is that you can immediately copy the generated SMIL files
inside your EPUB 3 working directory and forget about SMIL authoring!
If you want to play around with Sync
(which is in free Beta),
these are the input and output ZIP files:
I guess I do not need to emphasize the fact that Sync
is the only sane way to go, if you need to author SMIL files
with hundreds or thousands of fragments,
either because you have very long audio
or you are doing word-level sync.
Pause a minute, and think:
do you really want to waste hours producing SMIL files,
when you can use our automated tool Sync for a few bucks per hour of processed audio?
Step 4: Modifying the CSS
Great, the hardest part of the job has been done,
and now you just need to add the last touches to your EPUB eBook.
The first is defining a CSS class for the MO highlighting.
For example, if you want to highlight the text with a yellow background,
you can add the following:
You can change the name -epub-media-overlay-active to whatever you prefer.
The mechanism is simple: once an element (i.e., a certain id)
becomes the "active" MO fragment,
the reading system adds the class -epub-media-overlay-active to it,
and it removes the same class when the element is no longer the active one.
Step 5: Modifying the OPF file
Finally, you need to add some elements and metadata to the OPF file.
5.1 Media Overlays metadata
In the <metadata> section of your OPF file,
you need to specify a media:duration item for each of the audio files in your eBook
(3 in our case, note the usage of refines),
and one item for their sum (the whole publication).
The value of each element is the running time of the corresponding audio.
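Since the total must be the sum of the individual durations, it is safer to compute it than to type it by hand. The Python sketch below does so; the running times in DURATIONS are made-up placeholders (not the real durations of the demo audio), the function names are hypothetical, and the ids s001 to s003 are assumed to be the manifest ids of the three SMIL items.

```python
# Hypothetical running times, keyed by the manifest id of each SMIL item.
DURATIONS = {
    "s001": "00:00:53.320",
    "s002": "00:00:56.040",
    "s003": "00:00:54.480",
}


def to_ms(clock_value):
    """Convert hh:mm:ss.mmm to integer milliseconds."""
    hours, minutes, seconds = clock_value.split(":")
    return (int(hours) * 60 + int(minutes)) * 60_000 + round(float(seconds) * 1000)


def to_clock(total_ms):
    """Convert integer milliseconds back to hh:mm:ss.mmm."""
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def duration_metas(durations):
    """Return one refined meta per SMIL item, plus the unrefined total."""
    lines = [
        f'<meta property="media:duration" refines="#{item_id}">{value}</meta>'
        for item_id, value in durations.items()
    ]
    total = to_clock(sum(to_ms(value) for value in durations.values()))
    lines.append(f'<meta property="media:duration">{total}</meta>')
    return lines


for line in duration_metas(DURATIONS):
    print(line)
```

Working in integer milliseconds avoids floating-point drift when many durations are added up.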
Optionally, you can specify the name of the narrator(s) in one or more media:narrator metas;
I encourage you to do so, because reading systems can read this information and display it
to the user (e.g., Menestrello shows the narrator name in its Library View).
Although optional, you probably want to specify
the CSS class name defined in the previous section,
using the media:active-class property.
Be aware that not all reading systems properly support it,
for example Kobo apps rely on a naming convention (kobo-smil-highlight)
and ignore the value specified in the OPF file,
while other reading systems allow the user to override
the author's choice for the highlighting style
(e.g., Readium or Menestrello).
Note that media:active-class cannot be refined, as it is assumed to
apply to every MO in the publication.
Also note that EPUB 3.0.1 introduced another class media:playback-active-class,
which you can define similarly to media:active-class.
Refer to Section 3.4 of the MO specification for details.
(Moreover, I suggest we should also have media-paused-class.)
5.2 Manifest items
For each chapter/page, in the <manifest> of your OPF file you need to add:
the media-overlay attribute to the <item> corresponding to the XHTML document, referencing the SMIL <item>,
an <item> for each SMIL file, and
an <item> for each audio file.
Note that the refines of media:duration explained in the previous section
need to point to the id of the SMIL <item> elements (e.g., s001 in the example above).
Also note that each XHTML document can be associated with at most one SMIL file.
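If you have many pages, the three manifest entries per page listed above can be generated with a short script. The following Python sketch is one possible way to do it: the function name, the p/a id prefixes, and the Text and Audio directory names are assumptions for illustration, while the s prefix follows the s001 id mentioned above and the media types are the standard ones for XHTML, SMIL, and MP3.

```python
def manifest_items(page, text_dir="Text", audio_dir="Audio"):
    """Return the three <item> lines for one page, e.g. page="001"."""
    return [
        # XHTML document, pointing at its SMIL item via media-overlay.
        f'<item id="p{page}" href="{text_dir}/p{page}.xhtml" '
        f'media-type="application/xhtml+xml" media-overlay="s{page}"/>',
        # SMIL file carrying the sync information.
        f'<item id="s{page}" href="{text_dir}/p{page}.xhtml.smil" '
        f'media-type="application/smil+xml"/>',
        # Audio file referenced by the SMIL file.
        f'<item id="a{page}" href="{audio_dir}/p{page}.mp3" '
        f'media-type="audio/mpeg"/>',
    ]


for page in ("001", "002", "003"):
    for item in manifest_items(page):
        print(item)
```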
5.3 Putting it all together
You will end up with something like the following:
(Note: no effort has been made to make this demo look "pretty";
after all, this demo is about MO, not CSS/FXL gimmicks.
I chose to show a FXL sample because
unfortunately MO are better supported for FXL eBooks,
and ReadBeyond has plenty of
reflowable EPUB 3 eBooks with MO
freely available already.)