pdftotext -layout #39

Open
MikhailKlemin opened this issue Nov 2, 2017 · 3 comments
Comments

@MikhailKlemin

Hello!
I would like to use the -layout option of pdftotext. For that, I guess I have to change
body, err := exec.Command("pdftotext", "-q", "-nopgbrk", "-enc", "UTF-8", "-eol", "unix", f.Name(), "-").Output() to add -layout. Am I correct?

@mish15
Member

mish15 commented Nov 3, 2017

@MikhailKlemin yes, you would. However, it may be worth making that the default for everyone. Do you have any text examples showing the difference with and without the -layout option?
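
For reference, a minimal sketch of the change discussed above, assuming -layout is simply added to the existing argument list ahead of the input file. The helper name pdftotextArgs and the file name input.pdf are hypothetical and only there for illustration; the flags themselves are the ones from the original command.

```go
package main

import (
	"fmt"
	"os/exec"
)

// pdftotextArgs rebuilds the argument list from the command quoted in this
// thread and optionally adds -layout. The helper name is hypothetical.
func pdftotextArgs(path string, layout bool) []string {
	args := []string{"-q", "-nopgbrk", "-enc", "UTF-8", "-eol", "unix"}
	if layout {
		// Placed with the other options, ahead of the input file.
		args = append(args, "-layout")
	}
	// "-" as the output file makes pdftotext write the text to stdout.
	return append(args, path, "-")
}

func main() {
	args := pdftotextArgs("input.pdf", true) // input.pdf is a placeholder
	fmt.Println("pdftotext", args)

	// Running this needs pdftotext on PATH and an existing input.pdf.
	body, err := exec.Command("pdftotext", args...).Output()
	if err != nil {
		fmt.Println("pdftotext failed:", err)
		return
	}
	fmt.Printf("%s", body)
}
```

Keeping the argument list in one function makes it easy to compare output with and without -layout on the same file.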

@MikhailKlemin
Author

MikhailKlemin commented Nov 3, 2017

Hi
For me it makes a lot of sense: I usually apply a lot of regexes after converting to TXT, and -layout really helps to tame the mess. I attached an example with screenshots.
Here are the source PDF and the conversions to TXT with and without the -layout option:
https://transfer.sh/WJzz/examples.zip

@mish15
Member

mish15 commented Nov 6, 2017

@MikhailKlemin we normally have to clean up the whitespace, so we'd need to test this internally to see what happens. I think it's worth adding as an option. I would look at adding some ENV options to control this. What's your timeframe?
