Dictionaries for Cybook Odyssey and Kobo

Acknowledgments, donations, etc.
If you enjoyed reading this page or using my conversion script, you can send me a "thank-you" email.
If you really enjoyed this work and you feel really grateful to me for writing the conversion script, I would really love to receive a (reasonably recent) 9-inch e-reader or tablet for testing purposes.
If you really really enjoyed my work on this project and you think my brain can help you, I am always glad to hear about job collaborations!
In all three cases, contact me via email, thanks!

Abstract
On this page I describe Penelope, a Python tool for converting dictionaries into the format accepted by Bookeen Cybook Odyssey and Kobo eReaders. I provide two Python scripts (2.x and 3.x) to convert from/to the following formats: StarDict (R/W), XML (R/W), CSV (R/W), Bookeen Cybook Odyssey (R/W), and Kobo (R index only, W unencrypted / unobfuscated only). Additionally, you can merge multiple dictionaries (of the same format) into a single one. You can also use Penelope to output an EPUB file containing a navigable index of the input dictionary.
I also report my findings about the dictionary management of the Bookeen Cybook Odyssey eReader, introduced with firmware 1485. Bear in mind that no official specifications are published by Bookeen (as of 2013-04-23), so what follows is the result of my speculations.
The format used by Kobo has also been reverse-engineered by several users of the MobileRead Forum; see the Links section below.
Comments are welcome, especially if you point out mistakes or you have useful suggestions: just drop me an email.

UPDATE 2013-05-01: Please note that Kobo firmware 2.5.0 has broken support for unencrypted/unobfuscated dictionaries. If you need them, you must have firmware 2.5.1 or 2.4.0.

I do not assume any legal liability or responsibility for any damage, data loss or inconvenience that you might cause to yourself or to other people by following the procedures below. RTFM, first.

Download
To get started, please download the files from the Google Code page of this project. Make sure to clone the Mercurial repository, or download the handy ZIP archive from the Downloads page. Alternatively, you can download all the source files (.py, .idx), in raw format (not as HTML pages!).
You need a version of Python, 2.x (preferred) or 3.x, installed on your system to run these scripts.
You might need dictzip installed on your system for opening StarDict dictionaries.
You might need a compiled version of MARISA for outputting in Kobo format. In that case, please modify the values of the variables MARISA_BUILD_PATH and MARISA_REVERSE_LOOKUP_PATH in penelope.py or penelope3.py so that they point to the marisa-build and marisa-reverse-lookup executables (see the comments in the source code).
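If you are not sure where the MARISA executables live, the following minimal sketch (Python 3.3 or later) looks for them on your PATH using shutil.which from the standard library. The helper name find_marisa_tools is hypothetical and is not part of Penelope; you still have to copy the reported paths into penelope.py or penelope3.py by hand.

```python
import shutil


def find_marisa_tools():
    """Map each Penelope variable name to the executable path, or None if missing."""
    return {
        "MARISA_BUILD_PATH": shutil.which("marisa-build"),
        "MARISA_REVERSE_LOOKUP_PATH": shutil.which("marisa-reverse-lookup"),
    }


if __name__ == "__main__":
    for name, path in find_marisa_tools().items():
        # A missing executable is reported rather than treated as an error.
        print(name, "=", repr(path) if path else "(not found on PATH)")
```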

Script usage
The script must be invoked with at least the following three parameters:

-p prefix_list : prefix_list is a list of comma-separated names of your input dictionaries, without extension
-f xx : xx is the ISO 639-1 code of the language "from" of the dictionary
-t yy : yy is the ISO 639-1 code of the language "to" of the dictionary

The following optional parameters are available:

-h : print usage message and exit
-d : enable debug mode and do not delete temporary files
-i : ignore word case while building the dictionary index
-z : create the .install zip file containing the dictionary and the index
--sd : input dictionary in StarDict format (default)
--odyssey : input dictionary in Bookeen Cybook Odyssey format
--xml : input dictionary in XML format
--kobo : input dictionary in Kobo format (reads the index only!)
--csv : input dictionary in CSV format
--output-odyssey : output dictionary in Bookeen Cybook Odyssey format (default)
--output-sd : output dictionary in StarDict format
--output-xml : output dictionary in XML format
--output-kobo : output dictionary in Kobo format
--output-csv : output dictionary in CSV format
--output-epub : output EPUB file containing the index of the input dictionary
--title string : set the title string shown on the Odyssey screen to string
--license string : set the license string to string
--copyright string : set the copyright string to string
--description string : set the description string to string
--year string : set the year string to string
--parser parser.py : use parser.py to parse the input dictionary
--collation coll.py : use coll.py as collation function when outputting in Bookeen Cybook Odyssey format
--fs string : use string as CSV field separator, escaping ASCII sequences (default: \t)
--ls string : use string as CSV line separator, escaping ASCII sequences (default: \n)

The order of the parameters is irrelevant.
Examples (use penelope3.py instead of penelope.py if you have Python 3.x installed):

$ python penelope.py -h
Print usage message and exit
$ python penelope.py -p foo -f en -t en
Create English monolingual dictionary en.foo.dict and en.foo.dict.idx from StarDict files foo.*
$ python penelope.py -p bar -f en -t it
Create English-to-Italian dictionary en-it.dict and en-it.dict.idx from StarDict files bar.*
$ python penelope.py -p "bar,foo,zam" -f en -t it
Create English-to-Italian dictionary en-it.dict and en-it.dict.idx merging together StarDict dictionaries bar, foo, and zam.
$ python penelope.py --xml -p foo -f en -t en
Create English monolingual dictionary en.foo.dict and en.foo.dict.idx, but the input dictionary foo.xml is in XML format
$ python penelope.py --xml -p foo -f en -t en --output-sd
As above, but output in StarDict format instead of Bookeen Cybook Odyssey format
$ python penelope.py -p bar -f en -t it --output-kobo
Create English-to-Italian dictionary from StarDict files bar.*, but output in Kobo format, creating dicthtml-en-it.zip
$ python penelope.py -p bar -f en -t it --output-xml -i
Reads from StarDict format and outputs in XML format, creating bar.xml, lowercasing all the keywords
$ python penelope.py --kobo -p bar -f it -t it --output-epub
Reads from Kobo format and outputs the dictionary index in EPUB format, creating bar.epub
$ python penelope.py --odyssey -p bar -f en -t en --output-epub
As above, but input is in Bookeen Cybook Odyssey format
$ python penelope.py -p bar -f en -t it --title "My EN-IT dictionary" --year 2012 --license "CC-BY-NC-SA 3.0"
Create English-to-Italian dictionary but also set title, year and license metadata
$ python penelope.py -p foo -f en -t en --parser foo_parser.py --title "Custom EN dictionary"
As above but set its title and use foo_parser.py to parse the input dictionary definitions

Dictionary management on the Odyssey

Dictionaries are located in the Dictionaries/ directory in the root directory of the Odyssey.
Each dictionary has two files: an index ($NAME.dict.idx) and a definition file ($NAME.dict), where $NAME is the dictionary name.
For a monolingual dictionary, $NAME must be $XX.$STRING, where $XX is the ISO 639-1 code of the language, and $STRING is an arbitrary label.
Example: en.foo.dict and en.foo.dict.idx.
For a bilingual dictionary, $NAME must be $XX-$YY, where $XX (resp., $YY) is the ISO 639-1 code of the language from (resp., to) of the dictionary.
Example 1: en-it.dict and en-it.dict.idx is the English-to-Italian dictionary.
Example 2: it-fr.dict and it-fr.dict.idx is the Italian-to-French dictionary.
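The naming rules above can be sketched as a small Python function. The function name odyssey_file_names and the choice of treating "same language on both sides" as monolingual are my assumptions for illustration; this is not code taken from penelope.py.

```python
def odyssey_file_names(lang_from, lang_to, label=None):
    """Return (definition_file, index_file) following the naming rules above."""
    if lang_from == lang_to:
        # Monolingual: $XX.$STRING, where the label is arbitrary but required.
        if not label:
            raise ValueError("a monolingual dictionary needs a label")
        name = "%s.%s" % (lang_from, label)
    else:
        # Bilingual: $XX-$YY, with no label.
        name = "%s-%s" % (lang_from, lang_to)
    return name + ".dict", name + ".dict.idx"
```

For instance, odyssey_file_names("en", "en", "foo") gives en.foo.dict and en.foo.dict.idx, while odyssey_file_names("en", "it") gives en-it.dict and en-it.dict.idx.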
Right now, the selection of a dictionary is done in the following way. If the book you are reading has no language metadatum, then the default French dictionary is used. (This dictionary is stored in the system partition of the Odyssey, which is not accessible by the user.) Otherwise, let $XX be the language of the book, and let $YY be the language of the Odyssey's interface. If the bilingual dictionary $XX-$YY.dict exists, then it is used. Otherwise, if a monolingual dictionary $XX.*.dict exists, then it is used. If neither exists, the default French dictionary is used.
Right now, the user cannot directly select the dictionary to be used. I hope Bookeen will implement this feature in a future firmware. Apparently, you can have only one bilingual dictionary for each language pair ($XX-$YY) while it is not clear to me which dictionary is used if you have two monolingual dictionaries $XX.1.dict and $XX.2.dict for the same language $XX.
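The selection procedure described above can be written down as a short Python sketch. It only models my observations: the function name, the DEFAULT_DICTIONARY placeholder, and in particular the tie-break among several monolingual dictionaries (first in sorted order) are assumptions, since the actual firmware behavior in that case is unknown.

```python
import fnmatch

# Placeholder for the built-in dictionary stored in the system partition.
DEFAULT_DICTIONARY = "<default French dictionary>"


def select_dictionary(book_lang, ui_lang, available):
    """Pick a dictionary file name from the names found in Dictionaries/."""
    names = list(available)
    if not book_lang:
        # No language metadatum in the book.
        return DEFAULT_DICTIONARY
    bilingual = "%s-%s.dict" % (book_lang, ui_lang)
    if bilingual in names:
        return bilingual
    monolingual = sorted(fnmatch.filter(names, book_lang + ".*.dict"))
    if monolingual:
        # Arbitrary tie-break: the real device behavior is not known.
        return monolingual[0]
    return DEFAULT_DICTIONARY
```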
Apparently, when you select a word, the index is queried for a stemmed version of the word. The rules applied might vary depending on the language. Unfortunately, I have not been able to look at this issue extensively, but I noticed that, for example, plurals are recognized in English and French, but not in Italian. However, you can bypass this issue by inserting declined/conjugated forms in the index, making them point to the definition of the base form.
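As an illustration of that workaround, here is a minimal sketch that adds inflected forms to an in-memory index. The data layout (a dict mapping each word to an (offset, size, chunk) triple) and the function name are hypothetical and chosen only for this example.

```python
def add_inflected_forms(index, forms):
    """Return a copy of index where each inflected form points to its base form.

    index: dict mapping word -> (offset, size, chunk_number)
    forms: dict mapping inflected form -> base form
    """
    extended = dict(index)
    for inflected, base in forms.items():
        # Skip forms whose base is unknown, and never overwrite a real entry.
        if base in index and inflected not in extended:
            extended[inflected] = index[base]
    return extended
```

With index = {"gatto": (0, 40, 1)} and forms = {"gatti": "gatto"}, the result contains a second key "gatti" pointing to the same definition bytes as "gatto".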

Format of the definition file

The dictionary file (say, en.foo.dict) is simply a zip file of plain text files, c_1, c_2, ..., c_n.
Each chunk file c_i contains utf-8 encoded definitions of words, concatenated one after another. Two consecutive definitions do not need to be separated by newlines or other special separator, since the index specifies the boundaries of each definition as an offset and a length, in bytes, from the beginning of the chunk (see below).
Apparently, each definition is an HTML fragment. Hence, you can use HTML tags to specify bold or italic face, divs, etc. I have not performed an exhaustive search for the supported tags yet.
Each chunk file has (uncompressed) size between 2^18 = 262,144 bytes and 2^19 = 524,288 bytes. This is probably due to the memory management of the device, and it is consistent with the EPUB requirement of having single files of at most 300 KB. In fact, my script closes the current chunk (and opens a new one) whenever its size reaches 2^18 bytes.
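The chunking rule can be sketched as follows. This is a simplified illustration written for Python 3, not the code in penelope.py: the function names are hypothetical, and numbering the chunks from 1 simply follows the c_1, ..., c_n naming used in the text above.

```python
import zipfile

CHUNK_LIMIT = 2 ** 18  # close the current chunk once it reaches this size


def build_chunks(entries, limit=CHUNK_LIMIT):
    """Pack (word, definition) pairs into chunks.

    Returns (chunks, index): chunks is a list of bytes objects, index is a
    list of (word, offset, size, chunk_number) with offsets and sizes in bytes.
    """
    chunks, index = [], []
    current = bytearray()
    for word, definition in entries:
        data = definition.encode("utf-8")
        index.append((word, len(current), len(data), len(chunks) + 1))
        current += data
        if len(current) >= limit:
            chunks.append(bytes(current))
            current = bytearray()
    if current:
        chunks.append(bytes(current))
    return chunks, index


def write_dict_file(path, chunks):
    """Store the chunks as c_1, c_2, ... inside a zip archive."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as archive:
        for number, data in enumerate(chunks, start=1):
            archive.writestr("c_%d" % number, data)
```

Note that offsets and sizes are counted on the UTF-8 encoded bytes, not on characters, so a definition such as "è" occupies 2 bytes.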

Format of the index file

The index file (say, en.foo.dict.idx) is an sqlite3 database, with four tables (T_DictVersion, T_DictInfo, T_DictIndex, T_RefKey) and an index (F_WordIndex) based on a collation (IcuNoCase).
Table T_DictVersion contains two fields: F_Version (INTEGER) and F_DictType (TEXT). There is only one record, and it seems to me that the latter is used for documentation reasons only.
Table T_DictInfo contains the metadata associated with the dictionary. It has only one record, with the following TEXT fields:
- F_Title
- F_Description
- F_Licence
- F_Copyright
- F_Year
- F_LanguageFrom
- F_LanguageTo
- F_Alphabet
- F_xhtmlHeader
Field names are quite self-explanatory. Let me observe that F_Title represents the string shown on the Odyssey as the dictionary's heading. I do not know what F_Alphabet means (it always has value "Z"); perhaps it represents the encoding used in the dictionary definitions.
Table T_DictIndex contains the dictionary lookup table. It has one record for each word, with the following fields:
- F_Key (INTEGER)
- F_Word (TEXT)
- F_Offset (INTEGER)
- F_Size (INTEGER)
- F_ChunckNum (INTEGER)
For example (0, foo, 350, 45, 7) means that the definition of word foo starts at byte 350 of file c_7 and it has length 45 bytes.
Table T_RefKey contains two fields: F_Key (INTEGER) and F_RefKey (INTEGER). In all the index files from Bookeen that I have seen, this table is empty, and its meaning is unknown to me.
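To make the relation between the index and the definition file concrete, here is a hedged Python 3 sketch of a lookup using the standard sqlite3 and zipfile modules. Two things are assumptions: the IcuNoCase collation is not available in a stock SQLite, so a simple lowercase comparison is registered under that name as a stand-in (it may order words differently from the real ICU collation), and make_toy_dictionary builds a tiny in-memory example only so that the sketch can be run without a device.

```python
import io
import sqlite3
import zipfile


def nocase_compare(a, b):
    """Stand-in for IcuNoCase: compare lowercased strings, return -1, 0 or 1."""
    a, b = a.lower(), b.lower()
    return (a > b) - (a < b)


def open_dictionary(idx_path, dict_path):
    """Open a real index/definition pair (file names such as en.foo.dict.idx)."""
    connection = sqlite3.connect(idx_path)
    connection.create_collation("IcuNoCase", nocase_compare)
    return connection, zipfile.ZipFile(dict_path)


def lookup(connection, archive, word):
    """Return the definition of word as text, or None if it is not indexed."""
    row = connection.execute(
        "SELECT F_Offset, F_Size, F_ChunckNum FROM T_DictIndex WHERE F_Word = ?",
        (word,),
    ).fetchone()
    if row is None:
        return None
    offset, size, chunk = row
    data = archive.read("c_%d" % chunk)
    return data[offset:offset + size].decode("utf-8")


def make_toy_dictionary():
    """Build a one-word index and archive in memory, for demonstration only."""
    connection = sqlite3.connect(":memory:")
    connection.create_collation("IcuNoCase", nocase_compare)
    connection.execute(
        "CREATE TABLE T_DictIndex (F_Key INTEGER, F_Word TEXT COLLATE IcuNoCase, "
        "F_Offset INTEGER, F_Size INTEGER, F_ChunckNum INTEGER)"
    )
    connection.execute("INSERT INTO T_DictIndex VALUES (0, 'foo', 3, 10, 1)")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("c_1", b"xyz<b>foo</b>tail")
    return connection, zipfile.ZipFile(buffer)
```

In the toy example, the record (0, foo, 3, 10, 1) says that the definition of foo is the 10 bytes starting at byte 3 of file c_1.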

Converting a StarDict dictionary into the Odyssey format

To start, get penelope.py, dictEPUB.py, and empty.idx from Google Code and copy them into the same directory. Make sure to download the raw files, not the syntax-highlighted Google Code pages!
Copy your StarDict dictionary files (say, mydict.ifo, mydict.dict[.dz], mydict.idx[.gz]) into the same directory.
Run the script:
$ python penelope.py -p mydict -f en -t en
The above command should create two files in the working directory: en.mydict.dict and en.mydict.dict.idx.
Copy the two files to the Dictionaries directory of your Odyssey and you are done! You should be able to use your en-en dictionary on your Odyssey!

Converting a StarDict dictionary into the Kobo format

As above, but run the script with the --output-kobo flag:
$ python penelope.py -p mydict -f en -t en --output-kobo
The above command should create in the working directory a file named dicthtml.zip.
Copy the file to the .kobo/dict/ directory of your Kobo and you are done! You should be able to use your en-en dictionary on your Kobo!

Converting an XML dictionary into the Odyssey format

If you want to create your own dictionary from some data source, you might want to output your dictionary data with the following format:
- Each entry is represented by an entry tag.
- Each entry has two child tags: key and def, representing the keyword to be inserted in the dictionary index and its definition, respectively.
- Please see this DTD for reference.
Click here to see the example file test.xml.
Note that my "XML parser" for the input dictionary just performs the following actions:
- looks for the beginning of the next entry tag, say at byte x of the input file;
- looks for the beginning and end of the next key tag after x and extracts the text between them, identifying it as the word to be included in the index;
- looks for the beginning and end of the next def tag after x and extracts the text between them, identifying it as the definition of the current word;
- repeats, starting from byte x+1 of the input file.
This approach is quite fragile (my code does not check that your input file is well-formed!), but it also yields fast code and it allows you to ignore other tags and newlines in the definitions. The assumption is that you know what you are feeding into my script!
Note that you do not need to provide a valid XML file, with the proper header and DTD-compliant as in the example: a sequence of entry tags is enough for my script. However, providing a valid file helps when checking the input in XML-picky browsers such as Firefox.
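The scanning strategy described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the code in penelope.py; the function names are hypothetical, and it assumes key and def tags without attributes.

```python
def between(text, open_tag, close_tag, start):
    """Return the text between the next open_tag/close_tag pair after start."""
    begin = text.find(open_tag, start)
    if begin < 0:
        return None
    begin += len(open_tag)
    end = text.find(close_tag, begin)
    if end < 0:
        return None
    return text[begin:end]


def parse_entries(text):
    """Yield (key, definition) pairs by plain string scanning, with no validation."""
    position = 0
    while True:
        start = text.find("<entry", position)
        if start < 0:
            return
        key = between(text, "<key>", "</key>", start)
        definition = between(text, "<def>", "</def>", start)
        if key is not None and definition is not None:
            yield key, definition
        # Resume the scan one position after the entry tag just processed.
        position = start + 1
```

Because nothing is validated, any markup inside a definition (for example bold tags) is passed through untouched, which is exactly why the approach is both fast and fragile.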
As above, get penelope.py and empty.idx from Google Code and copy them into the same directory.
Copy your XML dictionary file (say, mydict.xml) into the same directory.
Run the script:
$ python penelope.py --xml -p mydict -f en -t en
The above command should create two files in the working directory: en.mydict.dict and en.mydict.dict.idx.
Copy the two files to the Dictionaries directory of your Odyssey and you are done! You should be able to use your en-en dictionary on your Odyssey!

Converting an XML dictionary into the Kobo format

As above, but run the script with the --output-kobo flag:
$ python penelope.py --xml -p mydict -f en -t en --output-kobo
The above command should create in the working directory a file named dicthtml.zip.
Copy the file to the .kobo/dict/ directory of your Kobo and you are done! You should be able to use your en-en dictionary on your Kobo!

Converting a CSV dictionary into the Odyssey format

Suppose you have a CSV file mydict.csv, where each line of the file contains a keyword and its definition, separated by a tab (0x09). You want to convert it into the Odyssey format.
Run the script with the --csv flag:
$ python penelope.py --csv -p mydict -f en -t en
The above command should create two files in the working directory: en.mydict.dict and en.mydict.dict.idx.
Copy the two files to the Dictionaries directory of your Odyssey and you are done! You should be able to use your en-en dictionary on your Odyssey!
The default field separator is a tab (0x09), while the default line separator is a newline (0x0a). You can change them by using the command line parameters --fs and --ls, respectively.
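If you generate the CSV file yourself, a sketch like the following may help. It is my own illustration (hypothetical function names, Python 3), and it makes two assumptions: that a keyword or definition must not contain either separator, and that each line is split at the first field separator only.

```python
def to_csv(entries, field_separator="\t", line_separator="\n"):
    """Serialize (word, definition) pairs, one per line."""
    lines = []
    for word, definition in entries:
        for text in (word, definition):
            if field_separator in text or line_separator in text:
                raise ValueError("separator found inside %r" % text)
        lines.append(word + field_separator + definition)
    return line_separator.join(lines) + line_separator


def from_csv(text, field_separator="\t", line_separator="\n"):
    """Parse the text back into (word, definition) pairs, skipping empty lines."""
    entries = []
    for line in text.split(line_separator):
        if field_separator in line:
            word, definition = line.split(field_separator, 1)
            entries.append((word, definition))
    return entries
```

If your definitions contain tabs or newlines, pick separators that never occur in your data and pass the same values to the script with --fs and --ls.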

Creating the index of a dictionary as EPUB file

As above, but run the script with the --output-epub flag:
$ python penelope.py -p mydict -f en -t en --output-epub
The above command should create in the working directory a file named mydict.epub.
Copy the file to your eReader device and you are done! Just open it and you can emulate a search function if your device does not have one! (Please refer to this post for some screenshots.)
A collection of EPUB index dictionaries for several languages is available at the Google Code page.

Custom parser for the input dictionary

By default, the script will just convert the given StarDict dictionary to the Cybook Odyssey format. In other words, it will create the same index of words as it appears in the input dictionary, and it will simply copy the associated definitions with their original formatting.
However, you might want to aggregate different definitions for the same word into a single index entry, even if in the original dictionary they appeared as separate entries. (Example: "Word (1)" and "Word (2)", etc.) Moreover, you might want to perform some changes in the formatting of the definitions. Clearly this operation is input-dependent, as different StarDict dictionaries have different formatting.
To do so, you can issue the optional argument --parser parser.py to instruct my script to process the input dictionary with the parser defined in file parser.py.
Your parser will contain a function
parse(data, type_sequence, ignore_case)
that will take the input dictionary data (as a list of pairs [word, definition]), the type_sequence of the input dictionary and the ignore_case switch.
The output of your parse function is a list of tuples with the following format:
[ word, include, synonyms, substitutions, definition ]
where:
- word is the index key (STRING).
- include is a BOOLEAN telling the script if the current record should be included in the index.
- synonyms is a LIST of STRINGs that will be added to the index and will point to the current definition. It will be used only if include is True. This is useful if you can extract declined/conjugated forms from the input definition and you want them to point to the base form word.
- substitutions is a LIST of pairs [replace_what, replace_with]. Each "replace_what" will be added to the index and will point to "replace_with", if the latter exists in the dictionary. It will be used only if include is False. This is useful if you can infer that the current word is a declined/conjugated form and you want to directly refer to its base form instead of showing a rather uninformative definition like "cats is the plural of cat".
- definition is the STRING containing the text of the definition for the current word.
Please see the included webster_parser.py parser for the Webster 1913 StarDict dictionary (you can find it as StarDict-comn_sdict_axm05_webster_1913-2.4.2.tar.bz2 on the Web) to get an idea of how the parser is supposed to work. Reading the source code of webster_parser.py will help as well.
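As a much smaller illustration than webster_parser.py, here is a hypothetical parser.py that merges entries such as "Word (1)" and "Word (2)" into a single entry. The numbering pattern, the use of "<br/>" to join definitions, and the decision to ignore type_sequence are assumptions made for this example; adapt them to your input dictionary.

```python
import re

# Matches a trailing sense number such as " (1)" or " (12)".
SENSE_SUFFIX = re.compile(r"\s*\(\d+\)$")


def parse(data, type_sequence, ignore_case):
    """Merge numbered senses of the same word into one index entry."""
    merged = {}
    order = []
    for word, definition in data:
        base = SENSE_SUFFIX.sub("", word)
        if ignore_case:
            base = base.lower()
        if base not in merged:
            merged[base] = []
            order.append(base)
        merged[base].append(definition)
    # Each record: [word, include, synonyms, substitutions, definition]
    return [[base, True, [], [], "<br/>".join(merged[base])] for base in order]
```

You would then run the script with --parser parser.py, as described above.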

Custom collation function when outputting to Bookeen Cybook Odyssey format

When outputting to Bookeen Cybook Odyssey format, it is possible to change the collation function of the SQLite index being built.
To do so, you can issue the optional argument --collation coll.py to instruct my script to use the collation function defined in file coll.py.
Your Python file will contain a function
collate_function(string1, string2)
that will compare the two given strings, returning 0 if and only if string1 and string2 should be considered as equal, -1 if string1 precedes string2, and 1 otherwise.
If the user does not specify a custom collation function, Penelope applies the default collation function, which implements the NOCASE collation as defined in SQLite documentation.
Please see the included files: default_collation.py and the specialized collation function for German collation_de.py.
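For illustration only, here is a hypothetical coll.py (Python 3) that ignores both case and diacritics by stripping combining marks after Unicode decomposition. It is not one of the included files, and note the design consequence: words differing only in accents compare as equal.

```python
import unicodedata


def fold(text):
    """Lowercase the text and drop combining marks (accents)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


def collate_function(string1, string2):
    """Return 0 if the folded strings are equal, -1 if string1 precedes, 1 otherwise."""
    key1, key2 = fold(string1), fold(string2)
    if key1 == key2:
        return 0
    return -1 if key1 < key2 else 1
```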

Notes and Comments

I tried to comment every key point of my script and it should be easy to follow. I took this as a "practical exercise" to learn Python, so please forgive me if you find my code "naive", and drop me an email with your advice to improve it, thanks!
I chose to use as few external modules as possible to minimize porting problems across different platforms. Please let me know if you experience problems running my script, by sending an email with a brief description of your environment (OS, Python version, ...) and I will try to help you.
If you find bugs in my script or errors on this page, or if you want to contribute some code, please let me know by sending an email, thank you!
If you wish to convert a MOBI/PRC dictionary, use MobiUnpack to unpack it, parse the resulting HTML file and output to the XML format described above. Then, you can use my script to create your own Odyssey dictionary.
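Once you have extracted the keyword/definition pairs from the HTML file (that step depends entirely on the structure of your MOBI/PRC dictionary, so it is not shown), writing them out can be sketched as below. The function name to_xml is hypothetical; the output is a bare sequence of entry tags without header, and definitions are deliberately not escaped because they are treated as HTML fragments.

```python
def to_xml(entries):
    """Serialize (key, definition) pairs as a sequence of entry tags."""
    parts = []
    for key, definition in entries:
        parts.append(
            "<entry>\n<key>%s</key>\n<def>%s</def>\n</entry>\n" % (key, definition)
        )
    return "".join(parts)
```

Save the resulting text as, say, mydict.xml (UTF-8 encoded) and convert it with the --xml flag.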

Links