EPUB OCF, mimetype, and kittens

Created 07 Jun 2014  •  Written by Alberto Pettarin

The EPUB Open Container Format (OCF) specification has quite a long list of requirements about the ZIP container encapsulating all the resources of an EPUB Publication, but they all boil down to some (arguably useful) agreed conventions to ease the work of processors dealing with EPUB files (e.g., reading systems).

Perhaps the first "strange" requirement novices crash into is that the mimetype file inside the EPUB must:

  1. contain only the ASCII string application/epub+zip (no trailing newline or other whitespace), resulting in a file of exactly 20 bytes,
  2. be stored uncompressed, and,
  3. be the first entry of the ZIP container.

In plain English, if you just zip assets up:

$ ls
META-INF mimetype OEBPS

$ zip -r my_first_ebook.epub *

and then try to validate the resulting file with EpubCheck, you will get an error:

$ epubcheck my_first_ebook.epub
Epubcheck Version 3.0.1

ERROR: my_first_ebook.epub: Mimetype entry missing or not the first in archive
Validating against EPUB version 3.0

Check finished with warnings or errors
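To see what the validator is unhappy about, here is a rough sketch in Python that checks the three requirements listed above. It is not EpubCheck's actual logic: the function name mimetype_problems and the wording of its messages are made up for illustration, and it uses only the standard zipfile module.

```python
import zipfile

EXPECTED = b"application/epub+zip"  # the 20 ASCII bytes required by the OCF spec


def mimetype_problems(epub_path):
    """Return a list of human-readable problems with the mimetype entry."""
    problems = []
    with zipfile.ZipFile(epub_path) as zf:
        entries = zf.infolist()
        # The physically first entry is the one with the smallest header offset.
        first = min(entries, key=lambda e: e.header_offset) if entries else None
        if first is None or first.filename != "mimetype":
            problems.append("mimetype is not the first entry")
            return problems
        if first.compress_type != zipfile.ZIP_STORED:
            problems.append("mimetype is compressed")
        if zf.read(first) != EXPECTED:
            problems.append("mimetype content is not exactly application/epub+zip")
    return problems
```

An empty list means the mimetype entry looks fine; it says nothing about the rest of the EPUB, so a real validator is still needed.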

OK, so how should I compress the EPUB to make the validator happy?

Well, just follow the specification:

  • first, just store (= no compression) the mimetype file inside a new ZIP file (for convenience, with the .epub extension);
  • then, add the remaining files (which the specs allow you to store in compressed form and in any order).

Using the console:

$ zip -X0 my_second_ebook.epub "mimetype"

$ zip -rX9 my_second_ebook.epub "META-INF/" "OEBPS/"

$ epubcheck my_second_ebook.epub
Epubcheck Version 3.0.1

Validating against EPUB version 3.0
No errors or warnings detected.
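If you would rather script this than type two zip commands, the same two-step recipe can be sketched with Python's standard zipfile module. This is only a minimal sketch: build_epub, src_dir and epub_path are names chosen for this example, and the mimetype content is written from a constant instead of being read from disk.

```python
import os
import zipfile

MIMETYPE = b"application/epub+zip"  # exactly 20 ASCII bytes, no trailing newline


def build_epub(src_dir, epub_path):
    """Write mimetype first and uncompressed, then everything else deflated."""
    with zipfile.ZipFile(epub_path, "w") as zf:
        # Step 1: the mimetype entry, stored (no compression), first in the archive.
        # A fresh ZipInfo carries no extra field, so the content follows the name directly.
        zf.writestr(zipfile.ZipInfo("mimetype"), MIMETYPE,
                    compress_type=zipfile.ZIP_STORED)
        # Step 2: the remaining files, compressed, visited in a stable order.
        for root, dirs, files in os.walk(src_dir):
            dirs.sort()
            for name in sorted(files):
                full = os.path.join(root, name)
                arcname = os.path.relpath(full, src_dir).replace(os.sep, "/")
                if arcname == "mimetype":
                    continue  # already written in step 1
                zf.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

For example, build_epub(".", "../my_third_ebook.epub") run from the directory containing META-INF, mimetype and OEBPS would produce the container one level up (writing the output inside src_dir would make the walk pick up the half-written file). The result should still be run through EpubCheck.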

Great, mission accomplished, let's have a break watching a kitten video!

Wait a minute. Do you know what -X0 and -rX9 mean? Or are you one of those bad, bad folks who try stuff randomly googled from the Internet until they find the "magic one" which solves their problem? Perusing the zip manual (man zip) is always very informative, but here is a summary for you, lazy-ass:

  • -0 means "do not compress, just store the file"
  • -9 means "use the highest compression possible"
  • -r means "add files and directories recursively"
  • -X means "do not save extra file attributes inside the ZIP"

I suggest using -X (unless you are an evil person, and know it...), because OCF processors are not required to honor extra file information, and including those extra bits might cause trouble for some processors. KISS.

(-D is also interesting, but I will leave it to you, as an exercise, to find out what it does and why you might want to use it.)

Thank you for the lesson, can we watch kitten videos now?

No!

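I promised to explain the reason for this strange convention: thanks to the way a ZIP file is laid out (the first entry starts with a 30-byte local file header, immediately followed by the entry name and then, with no extra field and no compression, the raw content) and some additional constraints from the EPUB OCF specs that I would not touch with a ten-foot pole, we will have the following magic numbers: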

  • 0x50 0x4b 0x03 0x04 (at bytes 0-3)
  • mimetype (at bytes 30-37)
  • application/epub+zip (at bytes 38-57)

Indeed, if you open an EPUB with a hex editor, you will see something like:

$ hexdump -Cv my_second_ebook.epub | head -n 8
00000000  50 4b 03 04 0a 00 02 00  00 00 aa 5e 91 44 6f 61  |PK.........^.Doa|
00000010  ab 2c 14 00 00 00 14 00  00 00 08 00 00 00 6d 69  |.,............mi|
00000020  6d 65 74 79 70 65 61 70  70 6c 69 63 61 74 69 6f  |metypeapplicatio|
00000030  6e 2f 65 70 75 62 2b 7a  69 70 50 4b 03 04 14 00  |n/epub+zipPK....|
00000040  02 00 08 00 aa 5e 91 44  af 22 ca f2 b0 00 00 00  |.....^.D."......|
00000050  f6 00 00 00 16 00 00 00  4d 45 54 41 2d 49 4e 46  |........META-INF|
00000060  2f 63 6f 6e 74 61 69 6e  65 72 2e 78 6d 6c 55 8e  |/container.xmlU.|
00000070  41 0e 82 30 14 44 d7 72  0a d2 ad 81 82 2c 84 a6  |A..0.D.r.....,..|

In theory, this allows an OCF processor to recognize that a given file might be an EPUB file, by just looking at those magic number bytes.
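As a minimal sketch of such a check, the following Python functions compare the first 58 bytes of a file against the three magic numbers listed above. The names has_epub_magic and looks_like_epub are invented for this example; a positive answer only means the file might be an EPUB, not that it is a valid one.

```python
def has_epub_magic(head):
    """Check the three magic numbers in the first 58 bytes of a file."""
    return (
        head[0:4] == b"PK\x03\x04"                   # ZIP local file header signature
        and head[30:38] == b"mimetype"               # name of the first entry
        and head[38:58] == b"application/epub+zip"   # its uncompressed content
    )


def looks_like_epub(path):
    """Read just enough of the file to apply has_epub_magic."""
    with open(path, "rb") as f:
        return has_epub_magic(f.read(58))
```

Note that only 58 bytes are read: the ZIP central directory at the end of the file is never touched, which is the whole point of the convention.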

(Note: the usefulness and modernity of such a convention might be the subject of a heated debate. I personally think that relying on these "old days tricks" is dangerous, but it is also true that complying is quite easy and cheap, once you know the "trick".)

Finally, the application/epub+zip Media Type is registered here.

You said: "Finally", so are we done now?

Well, the constraints on the mimetype and the OCF zipping are just some of the many rules governing the EPUB OCF. If you want to delve into the details, or you need specific features (e.g., asset obfuscation), go read the specs. If you want a TL;DR version, follow this recipe:

  1. download this mimetype file and always use it
  2. download this container.xml file, put it into your META-INF directory, and always use it
  3. the previous point implies that your OPF file should be OEBPS/content.opf
  4. put your eBook assets inside the OEBPS directory
  5. stick to ASCII names for your asset files, using only [0-9A-Za-z.] to avoid characters that might need escaping and/or might not be supported by (old) EPUB processors (e.g., space, slash, non-ASCII, etc.)
  6. decide a naming convention for your asset files (e.g., OEBPS/Text, p001.xhtml, etc.)
  7. compress the ZIP container properly, as explained above
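Point 5 is easy to check mechanically. Here is a small sketch, assuming you only want to test the file name itself (the last path component) against [0-9A-Za-z.]; the function name unsafe_asset_names is made up for this example.

```python
import posixpath
import re

# Only ASCII digits, ASCII letters, and the dot, as suggested in point 5.
SAFE_NAME = re.compile(r"[0-9A-Za-z.]+")


def unsafe_asset_names(paths):
    """Return the paths whose file name uses characters outside [0-9A-Za-z.]."""
    return [
        p for p in paths
        if not SAFE_NAME.fullmatch(posixpath.basename(p))
    ]
```

For instance, OEBPS/Text/p001.xhtml passes, while a name containing a space, an underscore, or an accented letter is reported.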

If you made it all the way down here, you earned it: go watch some cute kitten videos.