Inline links seem broken when converting de-drm epub to org file #8470

Closed
tillydray opened this issue Dec 4, 2022 · 6 comments · May be fixed by #8482

tillydray commented Dec 4, 2022

First off, I love pandoc, and every time I read its documentation I learn new tricks it can do :D Thank you for all your hard work!

Explain the problem

I realize this may not be a pandoc issue: several pieces of software are involved in going from a DRMed epub file to an org file, and any one of them may be causing this problem. I suspect pandoc because the epub file looks and works fine in Apple Books, Emacs, and Calibre, so I believe the input file is fine. The org file also looks fine but does not work, i.e., clicking on links doesn't work. So it seems to me, perhaps naively, that pandoc isn't quite creating the org file correctly.

I may be missing a command line flag or something obvious, but I spent a couple of hours reading the docs and trying to figure it out so it isn't obvious to me 😅

What Happened

In Emacs, pressing RET on a link produces this error: No match for custom ID: hcp-nrsvuebib-0010.xhtml#otpt

What Did I Expect to Happen

I expected to jump to the link's target.

Inputs

My input file is NRSVue, Holy Bible. If you need a copy to reproduce this, let me know and I can provide one. I used Calibre with the DeACSM and DeDRM plugins to remove the DRM.

Command Line Inputs

Below are various commands I used, all producing the same issue, copied and pasted from my terminal. I was grasping at straws to try to solve the problem, and read through nearly all of the man pages but didn't see anything that might help.

  • $ pandoc -s NRSVue,\ Holy\ Bible\ -\ Zondervan,.epub --from=epub --to=org --output=nrsvue.org
  • $ pandoc -s NRSVue,\ Holy\ Bible\ -\ Zondervan,.epub --from=epub --to=org --output=nrsvue.org --file-scope
  • $ pandoc -s NRSVue,\ Holy\ Bible\ -\ Zondervan,.epub --from=epub --to=org --output=nrsvue.org --file-scope --normalize
  • $ pandoc -s NRSVue,\ Holy\ Bible\ -\ Zondervan,.epub --from=epub --to=org --output=nrsvue.org --file-scope --extract-media=./
  • $ pandoc -s NRSVue,\ Holy\ Bible\ -\ Zondervan,.epub --from=epub --to=org --output=nrsvue.org --toc
  • $ pandoc -s NRSVue,\ Holy\ Bible\ -\ Zondervan,.epub --from=epub --to=org --output=nrsvue.org --reference-links

Minimal Output Example

[[#hcp-nrsvuebib-0005.xhtml#otbooks][Old Testament Table of Contents]]

--------------

[[#hcp-nrsvuebib-0010.xhtml#otpt][OLD TESTAMENT]]

--------------

[[#hcp-nrsvuebib-0011.xhtml#bk01][Genesis]]

[[#hcp-nrsvuebib-0011.xhtml#ch01001][1]] |

[[#hcp-nrsvuebib-0010.xhtml#ot][The Old Testament]]

| [[#hcp-nrsvuebib-0011.xhtml#bk01][Genesis]]       | [[#hcp-nrsvuebib-0011.xhtml#bk01][Gen]]     |

<<hcp-nrsvuebib-0010.xhtml>>

<<hcp-nrsvuebib-0010.xhtml#otpt>>

<<hcp-nrsvuebib-0010.xhtml#ot>>
[[#hcp-nrsvuebib-0005.xhtml#otbooks][The Old Testament]]

<<hcp-nrsvuebib-0011.xhtml>>

<<hcp-nrsvuebib-0011.xhtml#bk01>>

<<hcp-nrsvuebib-0011.xhtml#ch01001>>
[[#hcp-nrsvuebib-0005.xhtml#rbk01][Genesis]]

[[#hcp-nrsvuebib-0005.xhtml#rbk01][Genesis 1]]

Six Days of Creation and the Sabbath

1When God began to create[[#hcp-nrsvuebib-0013.xhtml#fn01001001-1][a]] the heavens and the earth, 2the earth was complete

<<hcp-nrsvuebib-0011.xhtml#ch01002>>

Software versions

pandoc 2.19.2
Compiled with pandoc-types 1.22.2.1, texmath 0.12.5.2, skylighting 0.13,
citeproc 0.8.0.1, ipynb 0.2, hslua 2.2.1
Scripting engine: Lua 5.4
User data directory: ~/.local/share/pandoc
Copyright (C) 2006-2022 John MacFarlane. Web:  https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.
tillydray added the bug label Dec 4, 2022
jgm (Owner) commented Dec 4, 2022

I think the issue is really in the EPUB reader and it can be shown with this simple example:

% pandoc -o my.epub
# One

link to [twosub](#twosub)

# Two

ok

## twosub

ok
[WARNING] This document format requires a nonempty <title> element.
  Defaulting to '-' as the title.
  To specify a title, use 'title' in metadata or --metadata title="...".

% pandoc my.epub -t html
<p><span id="ch001.xhtml"></span></p>
<section id="ch001.xhtml#one" class="level1" data-number="1">
<h1 data-number="1">One</h1>
<p>link to <a href="#ch002.xhtml#twosub">twosub</a></p>
</section>
<p><span id="ch002.xhtml"></span></p>
<section id="ch002.xhtml#two" class="level1" data-number="2">
<h1 data-number="2">Two</h1>
<p>ok</p>
<section id="ch002.xhtml#twosub" class="level2" data-number="2.1">
<h2 data-number="2.1">twosub</h2>
<p>ok</p>
</section>
</section>

Here we get things like a reference to #ch002.xhtml#twosub.
The fragment shouldn't contain the character #. I don't know if that's the only issue for org, but it may be one issue.
You could try changing

[[#hcp-nrsvuebib-0005.xhtml#otbooks][Old Testament Table of Contents]]

in your org output to

[[#hcp-nrsvuebib-0005.xhtml_otbooks][Old Testament Table of Contents]]

and changing

<<hcp-nrsvuebib-0005.xhtml#otbooks>>

to

<<hcp-nrsvuebib-0005.xhtml_otbooks>>

and see if that fixes the link. That would be good for me to know.
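The substitution suggested above can be applied to a whole org file at once. Below is a minimal sketch, assuming the links always have the form [[#path][description]] and the anchors the form <<path>> as in the output shown in this issue. The function name replace_inner_hash and the sep parameter are hypothetical, and this is a post-processing workaround for testing, not pandoc's own behavior.

```python
import re

# Path part of an Org link that starts with "[[#", e.g. "[[#file.xhtml#frag]".
LINK_RE = re.compile(r"\[\[#([^\]\[]+)\]")
# Body of a dedicated target, e.g. "<<file.xhtml#frag>>".
TARGET_RE = re.compile(r"<<([^<>\n]+)>>")


def replace_inner_hash(org_text: str, sep: str = "_") -> str:
    """Replace every '#' inside link paths and targets with `sep`.

    The leading '#' of a link is kept, as in the suggested manual edit.
    """
    def fix_link(match: re.Match) -> str:
        return "[[#" + match.group(1).replace("#", sep) + "]"

    def fix_target(match: re.Match) -> str:
        return "<<" + match.group(1).replace("#", sep) + ">>"

    return TARGET_RE.sub(fix_target, LINK_RE.sub(fix_link, org_text))


if __name__ == "__main__":
    sample = (
        "[[#hcp-nrsvuebib-0005.xhtml#otbooks][Old Testament Table of Contents]]\n"
        "<<hcp-nrsvuebib-0005.xhtml#otbooks>>\n"
    )
    print(replace_inner_hash(sample), end="")
```

Links and targets are rewritten with the same separator so that a link and its anchor still agree after the change.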

tillydray (Author) commented Dec 10, 2022 via email

jgm (Owner) commented Dec 11, 2022

I'm copying the code from above since replies from email don't render as markdown:

[[#hcp-nrsvuebib-0005.xhtml_otbooks][Old Testament Table of Contents]]

--------------

[[#hcp-nrsvuebib-0010.xhtml_otpt][OLD TESTAMENT]]

--------------

[[#hcp-nrsvuebib-0011.xhtml_bk01][Genesis]]

[[#hcp-nrsvuebib-0011.xhtml_ch01001][1]] |

[[#hcp-nrsvuebib-0010.xhtml_ot][The Old Testament]]

| [[#hcp-nrsvuebib-0011.xhtml_bk01][Genesis]]       | [[#hcp-nrsvuebib-0011.xhtml_bk01][Gen]]     |

<<hcp-nrsvuebib-0010.xhtml>>

<<hcp-nrsvuebib-0010.xhtml_otpt>>

<<hcp-nrsvuebib-0010.xhtml_ot>>
[[#hcp-nrsvuebib-0005.xhtml_otbooks][The Old Testament]]

<<hcp-nrsvuebib-0011.xhtml>>

<<hcp-nrsvuebib-0011.xhtml_bk01>>

<<hcp-nrsvuebib-0011.xhtml_ch01001>>
[[#hcp-nrsvuebib-0005.xhtml_rbk01][Genesis]]

[[#hcp-nrsvuebib-0005.xhtml_rbk01][Genesis 1]]

Six Days of Creation and the Sabbath

1When God began to create[[#hcp-nrsvuebib-0013.xhtml_fn01001001-1][a]] the heavens and the earth, 2the earth was complete

<<hcp-nrsvuebib-0011.xhtml_ch01002>>

tillydray (Author) commented

The problem is that the link [[#ch002.xhtml#twosub][twosub]] should be [[ch002.xhtml_twosub][twosub]]: remove the leading # and replace the internal # with _. Once I do that, it works as expected.
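As far as I understand Org's link syntax, a link path that starts with # is looked up as a CUSTOM_ID property, while a plain path such as [[name]] can match a dedicated <<name>> target. That would be consistent with the "No match for custom ID" error reported earlier. The link conversion described here can be sketched as follows; the function name to_target_link is hypothetical, and only links are rewritten.

```python
import re

# An Org link whose path starts with '#', with an optional [description].
CUSTOM_ID_LINK = re.compile(r"\[\[#([^\]\[]+)\](\[[^\]]*\])?\]")


def to_target_link(org_text: str, sep: str = "_") -> str:
    """Drop the leading '#' of each link path and replace inner '#' with `sep`."""
    def repl(match: re.Match) -> str:
        path = match.group(1).replace("#", sep)
        description = match.group(2) or ""
        return f"[[{path}]{description}]"

    return CUSTOM_ID_LINK.sub(repl, org_text)


if __name__ == "__main__":
    print(to_target_link("link to [[#ch002.xhtml#twosub][twosub]]"))
    # prints: link to [[ch002.xhtml_twosub][twosub]]
```

The matching <<...>> targets need the same # to _ replacement, otherwise the rewritten links have nothing to resolve to.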

tillydray added a commit to tillydray/pandoc that referenced this issue Dec 12, 2022
tillydray (Author) commented

I ran two naive find-and-replace commands, :%s/#hcp/hcp/g and :%s/xhtml#/xhtml_/g, which fixed some of the links but not all.

For example, the link in Messiah,[[hcp-nrsvuebib-0137.xhtml_fn40001001-3][c]] the son of David, is supposed to jump to [[hcp-nrsvuebib-0136.xhtml_rfn40001001-3][c]] 1.1 Or /Jesus Christ/ but doesn't. When I reconcile their differences, it still doesn't jump, and I get this error: No match for fuzzy expression: hcp-nrsvuebib-0137.xhtml_fn40001001-3
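One way to narrow down which links remain broken after the find-and-replace is to list every link path that has no <<target>> with the same name. This is a rough diagnostic sketch: the function name dangling_links is hypothetical, and it compares strings exactly, which is stricter than whatever matching Org itself performs.

```python
import re

TARGET_RE = re.compile(r"<<([^<>\n]+)>>")
# Any Org link; group 1 is the path, the [description] part is optional.
LINK_RE = re.compile(r"\[\[([^\]\[]+)\](?:\[[^\]]*\])?\]")

# Paths that are not meant to resolve to a <<target>> in the same file.
SKIPPED_PREFIXES = ("http:", "https:", "file:", "#", "*")


def dangling_links(org_text: str) -> list[str]:
    """Return link paths with no identically named <<target>>, in order of appearance."""
    targets = set(TARGET_RE.findall(org_text))
    missing: list[str] = []
    for path in LINK_RE.findall(org_text):
        if path.startswith(SKIPPED_PREFIXES):
            continue
        if path not in targets and path not in missing:
            missing.append(path)
    return missing


if __name__ == "__main__":
    sample = "[[a.xhtml_one][1]] <<a.xhtml_one>> [[b.xhtml_two][2]]"
    print(dangling_links(sample))  # prints: ['b.xhtml_two']
```

A path reported here means the anchor was never emitted under that name, which separates missing anchors from links that are merely written in the wrong form.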

jgm (Owner) commented Apr 21, 2024

I believe the issues here have been fixed by now (esp. the misplaced #). Closing this issue as stale.

jgm closed this as completed Apr 21, 2024