Wrong symbol display with certain epubs #4105

Open
Pocokk opened this issue Feb 25, 2024 · 5 comments

@Pocokk

Pocokk commented Feb 25, 2024

SumatraPDF version

  • v3.6.15966 64 bit (pre release)

Describe the bug
Instead of the — symbol, the reader shows â��

To Reproduce

  1. Open the provided book
  2. Scroll down a bit and see the issue

Expected behavior
It should properly display the — symbol

File that reproduces the problem
https://pixeldrain.com/u/uoX1M1gU (using this file host because GitHub doesn't allow uploading epubs directly here, and this host retains files for up to 90 days, so it's durable)

Screenshots
Title Page https://i.imgur.com/Qn16Qif.png
Begin Reading segment https://i.imgur.com/WojW3bn.png

Additional context
I noticed a strange  symbol at the beginning of "Begin Reading" bookmark also, that shouldn't be displayed either.
Also tested on other readers, the original symbol is correctly displayed.

@GitHubRulesOK
Collaborator

GitHubRulesOK commented Feb 25, 2024

The Ablah is a common sign of Unicode/UTF-8 character problems, where the mix of encodings is unexpected and the characters are shown as boxes, as a diamond with ?, not at all, as a random following character, or simply as ?

Primarily it stems from how the MuPDF core handles HTML, as seen in the slightly different rendering by the MuPDF version in the screenshot (interesting that Edge does not show anything for the chapter number). The raw source HTML as shown in Edge is below.

So the title is <h2 class="chapterTitle" style="text-indent: 0%;"> </h2> and that is what is shown
The Text issues <span class="smallCaps">M</span>. Central Standard Time, on a crisp evening in November—the same night that Major Gene Gavin’s tenure as ersatz leader of Woodbury is terminated with extreme prejudice—the leading edge etc.

So basically it is poor Authorship

For a World Wide Web audience, plain HTML should look the same in Edge and SumatraPDF.

@Pocokk
Author

Pocokk commented Mar 9, 2024

How can I manually modify the text to show what it's supposed to?

@GitHubRulesOK
Collaborator

@Pocokk

If you wish to run through the files, you can simply unpack A COPY of them in a work folder with command-line TAR, as I show in the screenshot:

TAR -xf filename.epub

We delete the copy of the .epub from the work folder because we don't want it there later when repacking.
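
For anyone without a TAR that reads ZIP archives, the unpacking step can be sketched in Python, since an .epub is a ZIP container. This is only a sketch: `unpack_epub` is a hypothetical helper name, and the file and folder names are placeholders. Because it extracts into a separate work folder, the original .epub never lands inside that folder in the first place.

```python
import zipfile
from pathlib import Path


def unpack_epub(epub_path, workdir):
    """Extract every file of the EPUB (a ZIP archive) into workdir.

    Hypothetical helper; the original .epub is only read, never modified.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(workdir)
    # Return the extracted files as sorted POSIX-style relative paths.
    return sorted(
        p.relative_to(workdir).as_posix()
        for p in workdir.rglob("*")
        if p.is_file()
    )
```

Usage would be something like `unpack_epub("filename.epub", "work")`, after which the HTML files can be edited inside `work`.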

You go into the folder with the HTML files and edit them.
But editing will need a bit of find-and-replace skill based on knowledge of UTF encoding, which in my case comes from years of messing about with these files. See how you get on with any F&R tool you're familiar with, as most readers just need the plainer text for reading.

Once you are happy they are done, simply add the whole working directory to a zip folder and rename it to file.epub
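
The repacking step can also be sketched in Python. This is a sketch under assumptions, not the method used above: `repack_epub` is a hypothetical helper name. It additionally follows the EPUB container convention of storing the `mimetype` file first and uncompressed, which a plain "zip and rename" does not guarantee; many readers tolerate either.

```python
import zipfile
from pathlib import Path


def repack_epub(workdir, epub_path):
    """Zip the contents of workdir into epub_path (hypothetical helper)."""
    workdir = Path(workdir)
    out_path = Path(epub_path).resolve()
    mimetype = workdir / "mimetype"
    with zipfile.ZipFile(out_path, "w") as archive:
        if mimetype.is_file():
            # EPUB convention: "mimetype" is the first entry, uncompressed.
            archive.write(mimetype, "mimetype",
                          compress_type=zipfile.ZIP_STORED)
        for path in sorted(workdir.rglob("*")):
            if not path.is_file() or path == mimetype:
                continue
            if path.resolve() == out_path:
                continue  # never pack the output archive into itself
            archive.write(path, path.relative_to(workdir).as_posix(),
                          compress_type=zipfile.ZIP_DEFLATED)
```

Note that the archive is built from the contents of the work folder, not from the folder itself, so `mimetype` and the other top-level entries sit at the root of the archive.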

@Pocokk
Author

Pocokk commented Mar 10, 2024

Thank you, figured out how to "fix" it: as I've noticed, the main issue is this line in each .xhtml file:

<meta content="text/html; charset=iso-8859-1" http-equiv="content-type"/>

if I switch that iso-8859-1 to utf-8, all good!
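
For books with many .xhtml files, the same switch could be scripted rather than done by hand. A minimal sketch, assuming the files were unpacked into a work folder and that their bytes really are UTF-8 (as this book's turned out to be); `fix_declared_charset` is a hypothetical helper name, and it only touches `charset=iso-8859-1` declarations, not an `encoding="..."` attribute in an XML prolog.

```python
import re
from pathlib import Path

# Matches the declared charset only; the text bytes are left untouched.
CHARSET_RE = re.compile(rb"(charset\s*=\s*)iso-8859-1", re.IGNORECASE)


def fix_declared_charset(workdir):
    """Rewrite charset=iso-8859-1 to charset=utf-8 in every .xhtml file.

    Hypothetical helper. Works on raw bytes so the (already UTF-8)
    content is never re-encoded. Returns the names of changed files.
    """
    changed = []
    for path in sorted(Path(workdir).rglob("*.xhtml")):
        raw = path.read_bytes()
        fixed = CHARSET_RE.sub(rb"\g<1>utf-8", raw)
        if fixed != raw:
            path.write_bytes(fixed)
            changed.append(path.name)
    return changed
```

Running it on a copy first is safer, since it edits the files in place.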

My question is, what's the deal here? Why can't Sumatra (MuPDF) properly display something if it's NOT utf-8?

@GitHubRulesOK
Collaborator

GitHubRulesOK commented Mar 10, 2024

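ISO-8859-1 means the file is declared as NOT Unicode, so each byte is to be shown as its own single character. The declaration in those files was the wrong way round (the text is actually UTF-8), so the reader shows the bad characters exactly as directed by that format: the multi-byte sequence for one symbol comes out as several single characters, none of which is plain ASCII.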
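
The effect can be reproduced in a few lines of Python. This is an illustration only, not what MuPDF does internally; `render_as` is a hypothetical name. It encodes the dash as UTF-8 (as the book's files are) and then decodes those bytes the way the declared charset asks for.

```python
def render_as(text, declared_encoding):
    """Decode UTF-8 bytes as if they were in declared_encoding."""
    return text.encode("utf-8").decode(declared_encoding)


dash = "\u2014"                         # the dash symbol from the book
utf8_bytes = dash.encode("utf-8")       # three bytes: E2 80 94
wrong = render_as(dash, "iso-8859-1")   # one character per byte
right = render_as(dash, "utf-8")        # the original single symbol

print(utf8_bytes.hex(" "))
print(len(wrong), repr(wrong))
print(len(right))
```

Under ISO-8859-1 the first byte (E2) becomes â and the other two become invisible control characters, which is why the reader shows â followed by junk. The bytes in the file are fine; only the declaration is wrong, which is why changing it to utf-8 fixes the display.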
