EPUB reading systems vs invalid OCFs

Created 08 Jun 2014  •  Written by Alberto Pettarin

Prompted by a Twitter conversation generated by my previous post, I investigated whether some popular, real-world EPUB reading systems actually check for OCF conformance when loading an EPUB file. You know, for science.

Methodology

I started by creating a valid EPUB file.

Then, I derived from it three variations:

  • testocf1.epub by just zipping all the files with zip -r testocf1.epub *,
  • testocf2.epub by also deleting the mimetype file,
  • testocf3.epub by also deleting the mimetype file and the META-INF directory.

(I also modified the title and the UUID of each EPUB file, to avoid caching effects.)
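
The difference between a conformant package and the plain zip -r one can be sketched in Python. This is only an illustration, not the procedure used for the test files: the package_epub function and its ocf_compliant flag are hypothetical names, and the non-compliant branch merely approximates zip -r by writing every entry compressed, in sorted order.

```python
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def package_epub(target, files, ocf_compliant=True):
    """Write `files` (archive name -> bytes) into a ZIP archive.

    `target` may be a path or a writable binary file object.
    """
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        if ocf_compliant:
            # OCF wants mimetype as the first entry, stored uncompressed.
            zf.writestr("mimetype", EPUB_MIMETYPE,
                        compress_type=zipfile.ZIP_STORED)
        for name in sorted(files):
            if ocf_compliant and name == "mimetype":
                continue  # already written above
            # Everything else (and, in the naive case, mimetype too)
            # is written compressed, in sorted order.
            zf.writestr(name, files[name])
```

Calling package_epub(path, files, ocf_compliant=False) yields an archive comparable to testocf1.epub; leaving "mimetype" (and the META-INF entries) out of files gives the analogues of the other two.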

Clearly, these files are detected as invalid by EpubCheck:

$ epubcheck testocf1.epub 
Epubcheck Version 3.0.1

ERROR: testocf1.epub: Mimetype entry missing or not the first in archive
Validating against EPUB version 2.0

Check finished with warnings or errors

$ epubcheck testocf2.epub 
Epubcheck Version 3.0.1

ERROR: testocf2.epub: Mimetype entry missing or not the first in archive
Validating against EPUB version 2.0

Check finished with warnings or errors

$ epubcheck testocf3.epub 
Epubcheck Version 3.0.1

ERROR: testocf3.epub: Mimetype entry missing or not the first in archive
ERROR: testocf3.epub: Required META-INF/container.xml resource is missing

Check finished with warnings or errors

Please observe that in the third case content validation is not performed: since container.xml is not present, the OPF cannot be located and read.
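
The two container-level errors reported above can be approximated with a few lines of Python. This is a minimal sketch, not EpubCheck's actual logic: check_ocf is a hypothetical helper, and it assumes that the order returned by namelist() reflects the order of the entries in the archive.

```python
import zipfile


def check_ocf(source):
    """Return a list of container-level problems for an EPUB archive.

    `source` may be a path or a readable binary file object.
    """
    problems = []
    with zipfile.ZipFile(source) as zf:
        names = zf.namelist()
    # The mimetype entry must exist and come first.
    if not names or names[0] != "mimetype":
        problems.append("mimetype entry missing or not the first in archive")
    # Without container.xml there is no declared path to the OPF.
    if "META-INF/container.xml" not in names:
        problems.append("required META-INF/container.xml resource is missing")
    return problems
```

An empty list means that both checks passed; it says nothing about the validity of the content itself.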

I tried to sideload these three files on the following reading systems:

  • Blio (iOS)
  • Bluefire Reader (iOS)
  • Calibre (Debian)
  • Kobo Glo (eReader)
  • iBooks (iOS)
  • Lektz (iOS)
  • Lucifox (plugin Mozilla Firefox, Debian)
  • Lydhor (iOS)
  • Marvin (iOS)
  • Menestrello (Android, iOS)
  • Readium (plugin Google Chrome, Debian)
  • txtr (iOS)

I then verified whether each file was sideloaded successfully and made available to the user.

I chose these because I had them readily at hand this morning, they all allow sideloading, and they are quite popular. Building an extensive survey was not the purpose of this experiment.

Results

All the above reading systems open testocf1.epub (incorrect zipping).

All the above reading systems open testocf2.epub (no mimetype), except iBooks, which alerts the user that the file is invalid.

On testocf3.epub (no mimetype and no META-INF), there are several different behaviors:

  • Blio does not open the book and it does not alert the user,
  • Bluefire Reader creates a dummy "ePub" item in the library that cannot be opened (but it can be deleted),
  • Calibre opens the file (I guess it searches for an OPF file in the container),
  • iBooks does not open the book and it does not alert the user,
  • Kobo Glo reports the file as "protected by Adobe DRM" (and it does not open it, as my test device is not registered with Adobe),
  • Lektz alerts the user that the file is invalid,
  • Lucifox alerts the user that the file is invalid,
  • Lydhor crashes,
  • Marvin opens the file (I guess it searches for an OPF file in the container),
  • Menestrello alerts the user that the file is invalid,
  • Readium alerts the user that the file is invalid,
  • txtr crashes.

Comments

It looks like none of the tested reading systems actually checks whether an EPUB file has been properly zipped. The same observation, except for iBooks, holds for the presence of the mimetype file.

These two facts support the argument of those proposing to ditch these requirements in a future version of the EPUB specification (at least as far as reading systems are concerned).

(Note: iBooks seems to also check that mimetype actually starts with the application/epub+zip string, but it does not check if it contains only that. See testocf4.epub, testocf5.epub, and testocf6.epub.)
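
The distinction in the note above is the one between a prefix check and an exact comparison. The sketch below only illustrates that distinction; both function names are hypothetical, and nothing here is taken from iBooks itself.

```python
EPUB_MIMETYPE = b"application/epub+zip"


def mimetype_ok_lenient(data):
    """Accept any content that starts with the expected string."""
    return data.startswith(EPUB_MIMETYPE)


def mimetype_ok_strict(data):
    """Accept only content that is exactly the expected string."""
    return data == EPUB_MIMETYPE
```

Content with anything appended after the expected string passes the lenient check but fails the strict one.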

On the other hand, most of the reading systems do not render the third test file, since they cannot find the location of the OPF file, which must be declared in the (missing) META-INF/container.xml. The exceptions are Calibre and Marvin, which I guess take a heuristic approach, looking for an OPF file in the EPUB container anyway. (And I do not like it. At least, alert the user that the file is malformed.)
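
A sketch of the two strategies, under stated assumptions: locate_opf and its allow_fallback flag are hypothetical, and the fallback (picking the first entry whose name ends in .opf) is only a guess at what a heuristic reader might do, not the actual behavior of Calibre or Marvin.

```python
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"


def locate_opf(source, allow_fallback=False):
    """Return (opf_path, used_fallback); opf_path is None if not found."""
    with zipfile.ZipFile(source) as zf:
        names = zf.namelist()
        if "META-INF/container.xml" in names:
            root = ET.fromstring(zf.read("META-INF/container.xml"))
            # Take the first rootfile element declared in the container.
            rootfile = root.find(f".//{{{CONTAINER_NS}}}rootfile")
            if rootfile is not None:
                return rootfile.get("full-path"), False
        if allow_fallback:
            # Heuristic: scan the archive for any OPF file.
            for name in names:
                if name.lower().endswith(".opf"):
                    return name, True
    return None, False
```

Returning the used_fallback flag lets the caller open the book and still warn the user that the container is malformed.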

It is unclear to me whether multiple renditions (say, XHTML+SVG or XHTML+PDF) are actually popular, or whether we are really going to have multiple OPF files in the same container, so that META-INF/container.xml will actually prove useful. (In EPUB2, having multiple OPF-based rootfile elements was discouraged, and only the first one was to be processed.)

Open problems and future work

What do you think about this issue? Is the complexity of this (and similar) mechanisms necessary? What usage scenarios does it serve? Should the IDPF consider simplifying the OCF (and other aspects of EPUB)?