Penelope
Abstract
Penelope is a multi-tool for creating, editing, converting, and merging electronic dictionaries, especially for eReader devices, like Kobo or Bookeen Cybook Odyssey devices.
I do not assume any legal liability or responsibility for any damage, data loss or inconvenience that you might cause to yourself or to other people by following the procedures below. RTFM, first.
Updates
IMPORTANT UPDATE (2013-04-27) Kobo issued a new firmware 2.5.1 (thanks!), which allows you to use unencrypted/unobfuscated dictionaries again, including those produced by Penelope. Some minor bugs in the UI/UX are still present, but at least the custom dictionaries are back!
UPDATE (2013-04-23) It seems that Kobo, with firmware 2.5.0, requires the dictionaries to be encrypted/obfuscated. Hence, the dictionaries output by Penelope no longer work on Kobo devices. I contacted Kobo staff via Twitter, and they forwarded the notice to their development team. I hope they will fix the issue with a new firmware release soon. Meanwhile, if you need your custom-made dictionaries, you must stay with or revert to firmware 2.4.0.
Features
With the current version (v. 1.19, 2013-04-23) of Penelope you can:
- convert a dictionary from/to the following formats:
- Bookeen Cybook Odyssey (R/W)
- Kobo (R index only, W unencrypted/unobfuscated only)
- StarDict (R/W)
- XML (R/W)
- CSV (R/W)
- merge several dictionaries (of the same type) into a single dictionary
- define your own parser for each word/definition
- define your own collation function when outputting to Bookeen Cybook Odyssey format
- generate an EPUB file containing the index of a given dictionary (e.g., to cope with the lack of a search function on your eReader)
Future versions will include:
- support for reading/writing PRC/MOBI dictionaries
- substantial code refactoring
- support for variants in Kobo format
Download
Please download the files from the Google Code page.
You can either:
- download the handy ZIP archive from the Downloads page (preferred option);
- clone the repository using Mercurial (hg); or
- download all the source files into the same directory, in raw format (not as HTML pages!).
You need Python, either version 2.x or 3.x, installed on your system to run Penelope.
You might need dictzip installed on your system to read from/write to StarDict dictionaries.
If you want to read from/write to Kobo format, you need a compiled version of MARISA.
If so, you must set the variables MARISA_BUILD_PATH and MARISA_REVERSE_LOOKUP_PATH
in penelope.py (Python 2.x) or penelope3.py (Python 3.x)
so that they point to the marisa-build and marisa-reverse-lookup
executables (see the corresponding comments in the source code).
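Before editing those two variables, it can help to confirm that the paths you intend to use really point to files your user may execute. The following is a minimal Python 3 sketch and is not part of Penelope: `check_executable` is a hypothetical helper, and the two paths are placeholders that you must replace with the location of your own MARISA build.

```python
import os


def check_executable(path):
    """Return True if path is an existing regular file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


if __name__ == "__main__":
    # Placeholder paths (assumptions): replace them with your own MARISA build location.
    candidates = {
        "MARISA_BUILD_PATH": "/usr/local/bin/marisa-build",
        "MARISA_REVERSE_LOOKUP_PATH": "/usr/local/bin/marisa-reverse-lookup",
    }
    for name, path in candidates.items():
        status = "ok" if check_executable(path) else "missing or not executable"
        print("%s = %s (%s)" % (name, path, status))
```

If a path is reported as missing or not executable, fix it before running Penelope with Kobo input or output.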
Usage
In a terminal, issue:
$ python penelope.py -h
to get the list of available options:
$ python penelope.py -p <prefix list> -f <language_from> -t <language_to> [OPTIONS]
Required arguments:
-p <prefix list> : list of the dictionaries to be merged/converted (without extension, comma separated)
-f <language_from> : ISO 639-1 code language_from of the dictionary to be converted
-t <language_to> : ISO 639-1 code language_to of the dictionary to be converted
Optional arguments:
-d : enable debug mode and do not delete temporary files
-h : print this usage message and exit
-i : ignore word case while building the dictionary index
-z : create the .install zip file containing the dictionary and the index
--sd : input dictionary in StarDict format (default)
--odyssey : input dictionary in Bookeen Cybook Odyssey format
--xml : input dictionary in XML format
--kobo : input dictionary in Kobo format (reads the index only!)
--csv : input dictionary in CSV format
--output-odyssey : output dictionary in Bookeen Cybook Odyssey format (default)
--output-sd : output dictionary in StarDict format
--output-xml : output dictionary in XML format
--output-kobo : output dictionary in Kobo format
--output-csv : output dictionary in CSV format
--output-epub : output EPUB file containing the index of the input dictionary
--title <string> : set the title string shown on the Odyssey screen to <string>
--license <string> : set the license string to <string>
--copyright <string> : set the copyright string to <string>
--description <string> : set the description string to <string>
--year <string> : set the year string to <string>
--parser <parser.py> : use <parser.py> to parse the input dictionary
--collation <coll.py> : use <coll.py> as collation function when outputting in Bookeen Cybook Odyssey format
--fs <string> : use <string> as CSV field separator, escaping ASCII sequences (default: \t)
--ls <string> : use <string> as CSV line separator, escaping ASCII sequences (default: \n)
Examples:
$ python penelope.py -h
$ python penelope.py -p foo -f en -t en
$ python penelope.py -p bar -f en -t it
$ python penelope.py -p "bar,foo,zam" -f en -t it
$ python penelope.py --xml -p foo -f en -t en
$ python penelope.py --xml -p foo -f en -t en --output-sd
$ python penelope.py -p bar -f en -t it --output-kobo
$ python penelope.py -p bar -f en -t it --output-xml -i
$ python penelope.py --kobo -p bar -f it -t it --output-epub
$ python penelope.py --odyssey -p bar -f en -t en --output-epub
$ python penelope.py -p bar -f en -t it --title "My EN->IT dictionary" --year 2012 --license "CC-BY-NC-SA 3.0"
$ python penelope.py -p foo -f en -t en --parser foo_parser.py --title "Custom EN dictionary"
$ python penelope.py -p foo -f en -t en --collation custom_collation.py
$ python penelope.py --xml -p foo -f en -t en --output-csv --fs "\t\t" --ls "\n"
$ python penelope.py --csv -p foo -f en -t en --output-xml --fs "\t\t" --ls "\n"
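If you convert many dictionaries, you may prefer to assemble the command line in a small script. The sketch below is only a convenience wrapper around the switches listed above and is not shipped with Penelope: `build_command` is a hypothetical helper, and it assumes that penelope.py sits in the current directory.

```python
import sys


def build_command(prefixes, lang_from, lang_to, extra_options=(), script="penelope.py"):
    """Build the argument list for one Penelope run from the documented switches."""
    command = [
        sys.executable, script,
        "-p", ",".join(prefixes),  # comma-separated prefixes, without extension
        "-f", lang_from,
        "-t", lang_to,
    ]
    command.extend(extra_options)
    return command


if __name__ == "__main__":
    # Same arguments as: python penelope.py -p "bar,foo,zam" -f en -t it
    cmd = build_command(["bar", "foo", "zam"], "en", "it")
    print(" ".join(cmd))
    # To actually run it, pass the list to subprocess.call(cmd).
```

Passing the arguments as a list avoids shell quoting problems with titles or separators that contain spaces or backslashes.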
Notes
- If you use Python 3.x, replace penelope.py with penelope3.py.
- You must have the Python executable (or a directory containing it) listed in your PATH environment variable, or you need to supply its full path.
- If you get an error about MARISA, check that you have compiled it correctly, and that your user has execute permission on its executables.
- Bear in mind that no official specifications are published by either Bookeen or Kobo, hence the dictionaries produced by Penelope for Bookeen Cybook Odyssey and Kobo devices work as far as their specifications have been reverse-engineered, by others and myself. (See, for example, the following MobileRead forum threads: T1 T2 T3 T4)
- I tried to comment every key point of my script and it should be easy to follow. I took this as a practical exercise to learn Python, so please forgive me if you find my code naive, and drop me an email with your advice to improve it, thanks!
Commented Examples
Example 1
$ python penelope.py -h
Print usage message and exit
Example 2
$ python penelope.py -p foo -f en -t en
Create English monolingual dictionary en.foo.dict and en.foo.dict.idx from StarDict files foo.*
Example 3
$ python penelope.py -p bar -f en -t it
Create English-to-Italian dictionary en-it.dict and en-it.dict.idx from StarDict files bar.*
Example 4
$ python penelope.py -p "bar,foo,zam" -f en -t it
Create English-to-Italian dictionary en-it.dict and en-it.dict.idx merging together StarDict dictionaries bar, foo, and zam
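Conceptually, merging means collecting the entries of several dictionaries under a single index. The sketch below only illustrates that idea and is not Penelope's actual code: it assumes each dictionary has already been loaded as a list of (word, definition) pairs, and `merge_dictionaries` is a hypothetical name. It keeps every definition of a headword that occurs more than once instead of overwriting the earlier ones.

```python
from collections import OrderedDict


def merge_dictionaries(dictionaries):
    """Merge lists of (word, definition) pairs, keeping all definitions per word."""
    merged = OrderedDict()
    for entries in dictionaries:
        for word, definition in entries:
            # A repeated headword accumulates definitions rather than replacing them.
            merged.setdefault(word, []).append(definition)
    return merged


if __name__ == "__main__":
    # Made-up sample data standing in for the dictionaries bar and foo.
    bar = [("cat", "gatto"), ("dog", "cane")]
    foo = [("cat", "micio"), ("bird", "uccello")]
    for word, definitions in merge_dictionaries([bar, foo]).items():
        print(word, definitions)
```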
Example 5
$ python penelope.py --xml -p foo -f en -t en
Create English monolingual dictionary en.foo.dict and en.foo.dict.idx, but the input dictionary foo.xml is in XML format
Example 6
$ python penelope.py --xml -p foo -f en -t en --output-sd
As above, but output in StarDict format instead of Bookeen Cybook Odyssey format
Example 7
$ python penelope.py -p bar -f en -t it --output-kobo
As Example 3, but output in Kobo format, creating dicthtml-en-it.zip
Example 8
$ python penelope.py -p bar -f en -t it --output-xml -i
Reads from StarDict format and outputs in XML format, creating bar.xml, lowercasing all the keywords
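Lowercasing the keywords means that headwords differing only in case end up under the same key. Before using -i, you may want to know which headwords would be affected. The sketch below is an illustration under stated assumptions, not Penelope's code: `find_case_collisions` is a hypothetical helper that works on a plain list of headwords you have extracted yourself.

```python
def find_case_collisions(words):
    """Map each lowercased key to its original spellings, when there are several."""
    groups = {}
    for word in words:
        groups.setdefault(word.lower(), set()).add(word)
    return {key: sorted(variants)
            for key, variants in groups.items()
            if len(variants) > 1}


if __name__ == "__main__":
    # Made-up sample headwords.
    print(find_case_collisions(["Polish", "polish", "cat", "March", "march"]))
```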
Example 9
$ python penelope.py --kobo -p bar -f it -t it --output-epub
Reads from Kobo format and outputs the dictionary index in EPUB format, creating bar.epub
Example 10
$ python penelope.py --odyssey -p bar -f en -t en --output-epub
As above, but input is in Bookeen Cybook Odyssey format
Example 11
$ python penelope.py -p bar -f en -t it --title "My EN-IT dictionary" --year 2012 --license "CC-BY-NC-SA 3.0"
Create English-to-Italian dictionary but also set title, year and license metadata
Example 12
$ python penelope.py -p foo -f en -t en --parser foo_parser.py --title "Custom EN dictionary"
As above but set its title and use foo_parser.py to parse the input dictionary definitions.
A detailed description of custom parser/collation
can be found in the old page.
Example 13
$ python penelope.py -p foo -f en -t en --collation custom_collation.py
As above but use custom_collation.py to perform key collation.
A detailed description of custom parser/collation
can be found in the old page.
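The interface that Penelope expects from a collation file is described on the old page and is not reproduced here. Purely to illustrate what a collation function does, the Python 3 sketch below defines a sort key that folds German umlauts and ß before comparing. The name `german_sort_key` and the folding table are assumptions made for this example; they are neither Penelope's API nor its German collation code.

```python
# Hypothetical folding table: map special German letters to plain ASCII.
FOLDING = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}


def german_sort_key(word):
    """Return a comparison key that ignores case and folds umlauts and ß."""
    return "".join(FOLDING.get(ch, ch) for ch in word.lower())


if __name__ == "__main__":
    words = ["Zug", "Ärger", "Apfel", "Öl"]
    print(sorted(words))                       # plain code-point order
    print(sorted(words, key=german_sort_key))  # order after folding
```

With plain code-point ordering, words starting with an umlaut sort after "Zug"; with the folded key they sort next to their unaccented neighbours.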
Example 14
$ python penelope.py --xml -p foo -f en -t en --output-csv --fs "\t\t" --ls "\n"
Create CSV English dictionary foo.csv from XML dictionary foo.xml, using a double tab as field separator and a newline as line separator
Example 15
$ python penelope.py --csv -p foo -f en -t en --output-xml --fs "\t\t" --ls "\n"
Create XML English dictionary foo.xml from CSV dictionary foo.csv, using a double tab as field separator and a newline as line separator
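On the command line the separators are given in escaped form ("\t\t", "\n"). The sketch below shows one way such strings could be decoded and then used to write and read word/definition pairs. It is an illustration only, not Penelope's implementation: `decode_separator`, `format_csv`, and `parse_csv` are hypothetical names, and the sketch assumes one word and one definition per line, with no separator occurring inside a field.

```python
def decode_separator(escaped):
    """Turn an escaped ASCII string such as r"\t\t" into the characters it denotes."""
    return escaped.encode("ascii").decode("unicode_escape")


def format_csv(entries, field_sep, line_sep):
    """Serialize (word, definition) pairs using the given separators."""
    return "".join(word + field_sep + definition + line_sep
                   for word, definition in entries)


def parse_csv(text, field_sep, line_sep):
    """Parse text produced by format_csv back into (word, definition) pairs."""
    entries = []
    for line in text.split(line_sep):
        if not line:
            continue  # skip the empty piece after the final line separator
        word, definition = line.split(field_sep, 1)
        entries.append((word, definition))
    return entries


if __name__ == "__main__":
    fs = decode_separator("\\t\\t")  # what the shell passes for --fs "\t\t"
    ls = decode_separator("\\n")     # what the shell passes for --ls "\n"
    text = format_csv([("cat", "a small feline"), ("dog", "a loyal animal")], fs, ls)
    print(repr(text))
    print(parse_csv(text, fs, ls))
```

Plain string splitting is used here because Python's csv module accepts only a single character as delimiter, so a double tab cannot be passed to it directly.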
Support and Contribution
The current version runs under both Python 2 and Python 3, and it has been tested under Linux (Debian, Fedora) and Windows (XP, 7). Unfortunately, since I do not have any financial support for the project, I cannot offer support for all the possible values of the tuple (OS, Python version, console encoding). Therefore, only problems running Penelope in a Linux environment will receive full priority.
If you want to contribute some code or you have suggestions, please let me know by sending an email containing the word "Penelope" in the subject. Thanks!
Acknowledgments
Many thanks to:
- uwelovesdonna for contributing ideas for improving the code and for setting up many pages of the project wiki;
- Jens Sadowski for pointing out a bug with Unicode file names and for suggesting using multiset dict() instead of set dict();
- oldnat for pointing out a bug under Windows and Python 3.x;
- Wolfgang Miller-Reichling for providing the code for reading CSV dictionaries;
- branok for providing the idea and initial code for the German collation function;
- pal for suggesting passing the -l switch to MARISA_BUILD.
If you enjoyed reading this page or using my conversion script, you can send me a "thank-you" email.
If you really enjoyed this work and you feel really grateful to me for writing the conversion script, I would really love to receive a (reasonably recent) 9-inch e-reader or tablet for testing purposes.
If you really really enjoyed my work on this project and you think my brain can help you, I am always glad to hear about job collaborations!
In all three cases, contact me via email, thanks!
Links
- The project files at Google Code
- Related SBF Thread (Italian)
- MobileRead Thread about the dictionaries for Odyssey
- MobileRead Thread about the dictionaries for Kobo 1
- MobileRead Thread about the dictionaries for Kobo 2
- StarDict format
- XDXF format
- Bookeen homepage
- Kobo homepage
- List of ISO 639-1 codes for languages
- Old page, containing a detailed discussion of the dictionary format used by Bookeen Odyssey and Kobo devices