Support PDF in GDrive #56

Open
yuvalsteuer opened this issue Mar 24, 2023 · 8 comments
Labels
feature, good first issue

Comments

@yuvalsteuer
Contributor

yuvalsteuer commented Mar 24, 2023

See similar parsers here:

@rishi003

I will be happy to take this one

@yuvalsteuer
Contributor Author

Great! When are you expecting to finish this?

@rishi003

I am trying to set up the dev environment right now, but I am running into issues with the following setup:

  • OS: WSL(Windows 10)
  • Nvidia: No

The following line:

STORAGE_PATH = Path('/opt/storage/') if IS_IN_DOCKER else Path(f'/home/{os.getlogin()}/.gerev/storage/')

raises this error:

FileNotFoundError: [Errno 2] No such file or directory

After some research, I found that os.getlogin() is the culprit.

If you cannot help with this issue, could you describe an environment setup that is suitable for development?

@yuvalsteuer
Contributor Author

Just hardcode any path that is valid in WSL.
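
As an alternative to hardcoding, here is a minimal sketch of a lookup that avoids os.getlogin(). Path.home() resolves the home directory from the HOME environment variable (falling back to the password database), so it does not need a controlling terminal, which is the usual reason os.getlogin() fails. The function name resolve_storage_path and its in_docker parameter are hypothetical and not taken from the project; the sketch also assumes the user's home directory is the intended base, as in the original /home/<user>/ path.

```python
from pathlib import Path


def resolve_storage_path(in_docker: bool) -> Path:
    """Return the storage directory without calling os.getlogin().

    Hypothetical helper: the name and parameter are illustrative only.
    """
    if in_docker:
        # Same container path as in the original line.
        return Path('/opt/storage/')
    # Path.home() expands '~', so no controlling terminal is required.
    return Path.home() / '.gerev' / 'storage'


# Assumption: the real project computes IS_IN_DOCKER elsewhere.
STORAGE_PATH = resolve_storage_path(in_docker=False)
```

On a typical Linux or WSL account this yields the same directory as the original expression, but it may differ where the home directory is not under /home/.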

@rishi003

Currently, it is possible to parse the entire content of a PDF file as text, but judging from your existing parsers, the program needs the content in the following form:

Some title: related text
Some other title: related text

Am I right?

There is already a pull request that parses the entire pdf document as text.

If you have any enhancements or suggestions for that, I'll be more than willing to implement them.

Meanwhile, I am also researching how to parse PDFs while keeping the hierarchical information intact.

@Roey7
Contributor

Roey7 commented Mar 26, 2023

Hey!
As I commented on that other PR, it should be pdf -> html,
then we parse html -> text.
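
A rough sketch of the second step (html -> text), using only the Python standard library. It groups text under the nearest preceding heading and emits lines in the "Some title: related text" form mentioned earlier in the thread. The pdf -> html step is left out because the thread does not name a converter. The names SectionExtractor and html_to_sections are hypothetical, and the assumption that headings arrive as h1 to h6 tags depends on whichever converter is used.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionExtractor(HTMLParser):
    """Collects sections; each heading tag starts a new section."""

    def __init__(self):
        super().__init__()
        # Each section is a pair: (title_parts, text_parts).
        self.sections = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self.sections.append(([], []))

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if not self.sections:
            # Text before the first heading goes into an untitled section.
            self.sections.append(([], []))
        title_parts, text_parts = self.sections[-1]
        (title_parts if self._in_heading else text_parts).append(text)


def html_to_sections(html):
    """Return one 'title: text' string per section (hypothetical helper)."""
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    lines = []
    for title_parts, text_parts in parser.sections:
        title = " ".join(title_parts)
        body = " ".join(text_parts)
        lines.append(f"{title}: {body}" if title else body)
    return lines


sample = "<h1>Some title</h1><p>related text</p><h2>Some other title</h2><p>related text</p>"
print(html_to_sections(sample))
```

This flattens the heading hierarchy into a single level; keeping the nesting would mean tracking the heading depth as well.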

@Roey7
Contributor

Roey7 commented Mar 26, 2023

@rishi003 let's chat on Discord! I could guide you a little bit :)

@rishi003

Sure, shall we discuss it in the Discord thread?


3 participants