This post is a practical introduction to aeneas,
with concrete examples of how to use it
to compute audio/text sync maps.
aeneas is a Python library and a set of tools
to automagically synchronize audio and text.
In other words, the main function of this software is to automate the computation of a synchronization map file ("sync map" for short) between an audio file and a list of text fragments. Sync maps have a variety of uses, including reflowable EPUB 3 Audio-eBooks or FXL Read-aloud EPUB 3 ebooks (SMIL files) and closed captions for videos (SRT/WebVTT/TTML files).
In abstract terms, a sync map associates each text fragment with the time interval, in the audio file, when that text fragment is spoken:
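As a rough illustration (the timings below are made up for the example, not produced by aeneas, and the `SyncMapFragment` name is hypothetical), a sync map can be pictured in Python 3 as a list of records, each holding a fragment id, its text, and the begin/end times in seconds:

```python
from typing import List, NamedTuple, Optional


class SyncMapFragment(NamedTuple):
    """One entry of a sync map: a text fragment and its time interval."""
    identifier: str
    text: str
    begin: float  # seconds from the start of the audio file
    end: float    # seconds from the start of the audio file


# Illustrative values only: these timings are invented for the sketch.
sync_map = [
    SyncMapFragment("f002", "From fairest creatures we desire increase,", 2.6, 5.9),
    SyncMapFragment("f003", "That thereby beauty's rose might never die,", 5.9, 9.0),
    SyncMapFragment("f004", "But as the riper should by time decease,", 9.0, 11.8),
]


def fragment_at(fragments: List[SyncMapFragment], t: float) -> Optional[SyncMapFragment]:
    """Return the fragment being spoken at time t, or None if there is none."""
    for fragment in fragments:
        if fragment.begin <= t < fragment.end:
            return fragment
    return None
```

A player could call `fragment_at(sync_map, current_time)` to decide which fragment to highlight while the audio is playing.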
The major advantage of
aeneas is to eliminate
the need for human labor to produce the timings
(which usually involves painfully long "listen-and-mark" sessions),
while still producing a "correct" output,
that is, sync maps indistinguishable from those that a human operator
would produce manually.
Assuming you have Python 2.7.x and Git on your machine,
installing aeneas is easy:
Note that you might need to install
espeak first,
if you do not have it already.
If you are running an (old) stable version of Debian,
you might get an error when installing
the scikits.audiolab Python package.
In that case, please see this thread.
(I will see if I can remove the dependency from this library,
by switching to a less-problematic-to-get one.)
Right now the only supported OS is Linux (Debian),
but I have
aeneas configured and running on my Mac Mini (OS X) and
it was confirmed to be working on a Windows 8 machine too.
Please see the online documentation for more information.
Computing A Sync Map
In aeneas jargon, a
Task represents the atomic unit of work,
that is, an audio file and a list of text fragments
to be synchronized, and for which you want to obtain a sync map file,
in the format (SMIL, SRT, TXT, etc.) you need.
To generate the sync map file, you can use the
execute_task script included in the package:
The script takes the following parameters:
- the path to the audio file
- the path to the file containing the text fragments
- the configuration string
- the path to the sync map file to be created
Let's examine each argument.
The Audio File
The audio file contains the narration of the text to be synchronized.
Any format readable by
ffmpeg can be used, including the popular
MP3, MP4, AAC, OGG, WAV, FLAC, WebM.
(Make sure you have the relevant codecs installed.)
The Text File
The text file contains the text fragments to be synchronized. Currently, three formats are supported: plain, parsed, and unparsed.
In all three cases, the file must be encoded using UTF-8 (without BOM).
The first format,
plain, simply lists the fragments, one per line.
For example, if
text.txt contains the following 15 lines:
execute_task will align 15 fragments, one for the title
and 14 others, one for each verse.
If, instead, text.txt contains the following 107 lines:
execute_task will align 107 fragments,
at word-level granularity.
If you specify the text fragments using the
plain text file format,
aeneas will automatically assign an id to each fragment,
in the same order they appear in the input text file.
This is done because for certain sync map formats, like SMIL,
you need a (unique)
id for each text fragment.
The second format,
parsed, is similar, but
it allows the user to explicitly provide the
id of each text fragment.
To do so, each line still corresponds to a text fragment but now
it must contain the id, the
| (pipe) character as the separator,
and the text of the fragment.
For example, the following
is equivalent to the first plain example above.
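To make the two line-based formats concrete, here is a minimal Python 3 sketch of how such files could be read. It is not the parser used by aeneas: the function names are hypothetical, and the `f{:03d}` template used to number plain fragments is an assumption chosen only for the example.

```python
def parse_parsed_lines(lines):
    """Split 'id|text' lines into (id, text) pairs.

    Only the first pipe is treated as the separator, so the text itself
    may contain further pipe characters. Blank lines are skipped.
    """
    fragments = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        identifier, separator, text = line.partition("|")
        if not separator:
            raise ValueError("missing '|' separator in line: %r" % line)
        fragments.append((identifier.strip(), text.strip()))
    return fragments


def plain_to_parsed(lines, template="f{:03d}"):
    """Number plain fragments in order of appearance (hypothetical id template)."""
    texts = [line.strip() for line in lines if line.strip()]
    return [(template.format(i), text) for i, text in enumerate(texts, start=1)]
```

With these helpers, `plain_to_parsed(["a", "b"])` and `parse_parsed_lines(["f001|a", "f002|b"])` yield the same list of pairs, which is the sense in which a parsed file can be equivalent to a plain one.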
Clearly, a best practice consists in generating the ids
as valid XML ids (i.e., as shown above,
one letter followed by a fixed number of digits,
forming progressive, consecutive numbers).
However, nothing prevents you from providing ids that follow a different scheme
(whatever logic is behind the choice of the ids).
The third format,
unparsed, is meant for text fragments that are already marked up in an XML file: it allows
aeneas to extract the text fragments by directly parsing the XML DOM.
Suppose you have the following
Clearly, you must instruct
aeneas to identify
the elements that contain the text to be actually used for the synchronization.
In the above example, you want to extract the text from the elements
whose id attribute matches a regular expression
(here, f followed by three digits).
To do so, you will specify that regular expression
in the configuration string (see below).
If not ambiguous (know your source!),
you can also use wildcard characters.
In the above example, you can match
f followed by one or more digits.
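The idea of selecting elements by matching their id against a regular expression can be sketched in Python 3 as follows. This is only an illustration, not the code aeneas runs: the sample markup, the `extract_fragments` name, and the choice to require the whole id to match are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input, invented for this sketch.
SAMPLE = """<body>
  <h1 id="f001">Title</h1>
  <p>
    <span id="f002">From fairest creatures we desire increase,</span>
    <span id="note1">not part of the narration</span>
  </p>
</body>"""


def extract_fragments(xml_text, id_regex):
    """Return (id, text) pairs for elements whose id fully matches id_regex.

    Elements are visited in document order.
    """
    pattern = re.compile(id_regex)
    root = ET.fromstring(xml_text)
    fragments = []
    for element in root.iter():
        identifier = element.get("id")
        if identifier is not None and pattern.fullmatch(identifier):
            text = "".join(element.itertext()).strip()
            fragments.append((identifier, text))
    return fragments
```

Here both `f[0-9][0-9][0-9]` and the looser `f[0-9]+` select `f001` and `f002` while skipping `note1`; the looser pattern is safe only when no unrelated id has the same shape.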
To reduce ambiguity, you might also instruct
aeneas to look for elements with
a given value in their
class attribute. If your input file is:
you might want to specify both the following requirements:
- class must match (that is, must contain the value)
Similarly to the previous case, your configuration string will contain the corresponding parameters.
aeneas asks you to specify the order in which the extracted
text fragments should be aligned.
In fact, the order in which the elements appear in the DOM
might differ from their order in the audio file.
For example, you might have the following portion of DOM:
and you want the extracted fragments to appear in this order:
- f002 (From fairest creatures we desire increase,)
- f003 (That thereby beauty's rose might never die,)
- f004 (But as the riper should by time decease,)
- f005 (His tender heir might bear his memory:)
In this case, you will specify the following parameter in the configuration string:
which will instruct
aeneas to disregard any non-digit appearing in the ids
and sort the text fragments according to the remaining numeric part (leading zeroes are ignored).
The other options are
unsorted (do not reorder the text fragments) and
lexicographic (sort the
ids based on their lexicographic order).
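The three sorting behaviours described above can be sketched in a few lines of Python 3. This is an illustration of the described logic rather than the aeneas implementation; the function name is hypothetical, and calling the digit-based option `numeric` is an assumption made for the example.

```python
import re


def sort_ids(ids, algorithm):
    """Order fragment ids according to one of the three described strategies."""
    if algorithm == "unsorted":
        # Keep the order in which the ids were found.
        return list(ids)
    if algorithm == "lexicographic":
        # Plain string comparison.
        return sorted(ids)
    if algorithm == "numeric":
        # Drop every non-digit, then compare the remaining number;
        # int() makes leading zeroes irrelevant.
        def numeric_key(identifier):
            digits = re.sub(r"[^0-9]", "", identifier)
            return int(digits) if digits else 0
        return sorted(ids, key=numeric_key)
    raise ValueError("unknown sorting algorithm: %r" % algorithm)
```

The difference shows up as soon as the ids are not zero-padded: for `["f10", "f9", "f100"]`, the numeric strategy gives `f9, f10, f100`, while the lexicographic one gives `f10, f100, f9`.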
The Configuration String
As mentioned above, there are a few parameters you must pass to
execute_task, in order to have your input files processed correctly.
To that end, you need to write a configuration string,
which is a UTF-8 encoded string that looks like this:
The order of the
key=value pairs does not matter,
but you must use the
| (pipe) character to separate them.
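Since the format is just key=value pairs joined by pipes, building and reading such a string programmatically is straightforward. The following Python 3 sketch uses hypothetical helper names and does not handle values that themselves contain a pipe; the keys in the usage note are the ones that appear elsewhere in this post.

```python
def build_config_string(params):
    """Join a dict of parameters into a 'key=value|key=value' string."""
    return "|".join("{}={}".format(key, value) for key, value in params.items())


def parse_config_string(config):
    """Split a 'key=value|key=value' string back into a dict.

    Only the first '=' of each pair separates key and value, so values
    such as regular expressions may contain '=' themselves.
    """
    params = {}
    for pair in config.split("|"):
        if not pair:
            continue
        key, separator, value = pair.partition("=")
        if not separator:
            raise ValueError("missing '=' in pair: %r" % pair)
        params[key] = value
    return params
```

For example, `build_config_string({"is_text_type": "plain", "is_audio_file_head_length": "20"})` produces `is_text_type=plain|is_audio_file_head_length=20`, and parsing that string returns the original dictionary regardless of the order of the pairs.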
(I know this syntax looks a bit clumsy and cumbersome,
but it is very compact and it can be directly passed to APIs,
like we did in ReadBeyond Sync.
If I have time, I will enhance
execute_job with an argument parser,
allowing the user to specify parameters using switches like
--language en.)
You need to specify at least three parameters:
- the language of your input materials
- the format of the text file
- the format of the sync map to be output
The resulting string is:
For example, assuming you have an audio file and
a plain text file
/tmp/subs.txt, both in English,
and you want to output a file
/tmp/subs.srt in SRT format,
you will issue the following command:
If you need to run several tasks sharing the same configuration string,
you might want to assign the latter to a shell variable.
This mechanism is adequate as long as you have few tasks and/or
you want to run them one-by-one.
A handier mechanism leverages the
execute_job program, described below.
The configuration string might have additional, optional parameters.
The two most useful ones are:
- is_audio_file_head_length=X: ignore the first X seconds of the audio file
- is_audio_file_process_length=Y: synchronize only Y seconds of the audio file
which allow you to "cut" (for the synchronization purposes) the head of the audio file,
its tail or both. For example, if you have an audio file of total length 60s:
- is_audio_file_head_length=20: sync from 20s to 60s in the audio file
- is_audio_file_process_length=50: sync from 0s to 50s in the audio file
- is_audio_file_head_length=20|is_audio_file_process_length=30: sync from 20s to 20s+30s=50s in the audio file
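The arithmetic behind these two parameters can be written down as a small Python 3 function. It is a sketch of the rule described above, with a hypothetical function name; clamping the end of the interval to the length of the file is an assumption, not something this post states about aeneas.

```python
def sync_interval(total_length, head_length=None, process_length=None):
    """Return the (begin, end) interval, in seconds, that gets synchronized.

    head_length skips the start of the audio; process_length limits how
    many seconds are processed after the head.
    """
    begin = head_length if head_length is not None else 0
    if process_length is not None:
        # Assumption: never run past the end of the audio file.
        end = min(begin + process_length, total_length)
    else:
        end = total_length
    return (begin, end)
```

For a 60-second file, `sync_interval(60, head_length=20)` gives `(20, 60)`, `sync_interval(60, process_length=50)` gives `(0, 50)`, and `sync_interval(60, head_length=20, process_length=30)` gives `(20, 50)`.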
As discussed above while describing the
unparsed text format,
when you specify the
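is_text_type=unparsed parameter, you must also specify:
- the regular expression matching the id attribute of the elements to extract
- the algorithm for sorting the ids
- optionally, you might also set the value to look for in the class attribute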
When you want to output in SMIL format,
you must also specify the values for the
src attribute of:
For example, you might have the following configuration string:
For the sake of clarity, I will break it down into pairs:
which will instruct
aeneas to produce a SMIL file like this:
Please note that, for
<audio> elements, the relative path
../audio/audio.mp3 has been used,
as specified in the configuration string.
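To show how the two src values end up in the output, here is a Python 3 sketch that formats a single `<par>` entry in the general shape used by EPUB 3 Media Overlays. The exact markup emitted by aeneas may differ; the function names, the `p001` id, and the `../xhtml/page.xhtml` page reference are hypothetical, while `../audio/audio.mp3` is the path discussed above.

```python
from xml.sax.saxutils import quoteattr


def format_clock(seconds):
    """Format a time in seconds as HH:MM:SS.mmm."""
    millis = int(round(seconds * 1000))
    hours, remainder = divmod(millis, 3600000)
    minutes, remainder = divmod(remainder, 60000)
    secs, millis = divmod(remainder, 1000)
    return "{:02d}:{:02d}:{:02d}.{:03d}".format(hours, minutes, secs, millis)


def smil_par(par_id, fragment_id, begin, end, page_ref, audio_ref):
    """Build one <par> element pairing a text fragment with an audio clip."""
    # The text src points at the fragment inside the page: page_ref#fragment_id.
    text_src = "{}#{}".format(page_ref, fragment_id)
    return (
        "<par id={}>"
        "<text src={}/>"
        '<audio clipBegin="{}" clipEnd="{}" src={}/>'
        "</par>"
    ).format(
        quoteattr(par_id),
        quoteattr(text_src),
        format_clock(begin),
        format_clock(end),
        quoteattr(audio_ref),
    )
```

Because both references are written verbatim into the attributes, they should be expressed relative to the location where the SMIL file will live inside the final package.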
References To The Documentation
- Languages: docs
- Input text formats: docs
- Output sync map formats: docs
- ID sorting algorithms: docs
- Parameter keys: docs
Please also refer to the examples you can find
in the cloned repo.
Computing Multiple Sync Maps At Once
As briefly mentioned above, especially if you work with EPUB 3 eBooks, you might have dozens of tasks to run, all with the same configuration parameters.
In this case, you can create a
Job, that is, a set of Tasks,
and process them in batch using the execute_job program.
In its simplest form, this command takes two arguments:
- /path/to/job.zip is a ZIP file containing all the input assets (i.e., a pair of audio/text files for each task) and a special configuration file (config.txt or config.xml) containing the runtime instructions
- /path/to/output/dir/ is the directory where the output archive, containing the output sync maps, one for each task, should be created
Note that, instead of creating an input ZIP file, you can also pass a path
to an uncompressed directory.
In the aeneas/tests/res/example_jobs directory you can find
several examples of job directories, with different ways
of arranging the input files inside the input container directory hierarchy,
and with different runtime parameters.
In what follows, I will describe the contents of the config.txt
textual/INI-like configuration file,
which is the simplest way of specifying a job configuration,
yet it should cover the vast majority of use cases.
If you need finer control over the job configuration,
for example, if you have different tasks with different languages,
you can create a
config.xml XML configuration file:
see the documentation for more details.
config.txt Configuration File: Flat Case
Suppose your job container holds the audio and text files of three tasks, and its
config.txt file contains the following:
If you run the following command:
you will get a ZIP file
/tmp/output_flat.zip containing three SMIL files,
one for each of the three tasks found:
The flat hierarchy tells aeneas that the assets
in the input container are contained within the same directory.
(Note that all the paths for the input assets
are relative to the
The audio and text files for each task are identified
by matching their names against regular expressions, such as
is_audio_file_name_regex for the audio files.
A task is created only if both the audio file and the text file
are matched and they share the same name prefix.
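The pairing rule can be sketched in Python 3 as follows. This is not how aeneas is implemented: the function name is hypothetical, and treating the "name prefix" as the file name without its extension is an assumption made for the example.

```python
import re


def find_tasks(file_names, audio_regex, text_regex):
    """Pair audio and text files that share the same name prefix.

    Returns a dict mapping each prefix to an (audio file, text file) tuple.
    A prefix with only one of the two files does not produce a task.
    """
    audio_pattern = re.compile(audio_regex)
    text_pattern = re.compile(text_regex)
    audio_files, text_files = {}, {}
    for name in file_names:
        # Assumption: the prefix is the file name without its extension.
        prefix = name.rsplit(".", 1)[0]
        if audio_pattern.fullmatch(name):
            audio_files[prefix] = name
        elif text_pattern.fullmatch(name):
            text_files[prefix] = name
    return {
        prefix: (audio_files[prefix], text_files[prefix])
        for prefix in sorted(audio_files)
        if prefix in text_files
    }
```

Given `sonnet001.mp3`, `sonnet001.txt`, and a stray `sonnet002.mp3` with no matching text file, only `sonnet001` becomes a task, which mirrors the rule that both files must be matched.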
Other parameters specify the desired output directory hierarchy.
Note that the
$PREFIX placeholder will be replaced
by each task name
(e.g., sonnet003 in the example).
Finally, please note that the language is set (for all the tasks)
to English by the language setting in the configuration file.
config.txt Configuration File: Paged Case
If your tasks are divided into subdirectories of the main directory,
you must specify the
paged hierarchy in your config.txt file.
Running execute_job on such a job will create the output ZIP file.
Please note that if you use the paged hierarchy,
you must provide a regex
which will be used to identify the tasks
by matching the subdirectory names
(is_task_dir_name_regex=[0-9]+ in the example).