Support full system-wide document searching #13

Open
rheasman opened this issue Apr 26, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@rheasman

The current version of Semantra (0.1.3) is a good start, but unfortunately cannot be used as a generic PDF search engine (though it is very close!).

I indexed every PDF on my Linux machine by running the following command:
find / -iname "*.pdf" -type f -print0 | xargs -0 semantra --no-server

I note that:

  • Only one CPU core is used.
  • It is not clear to me that I could rewrite my find command to safely run 10 copies of Semantra in parallel, so I didn't.
  • Once everything is indexed, there is no way to tell Semantra to just search over all already-indexed files.
  • Using the same command for indexing and viewing violates separation of concerns.

Also, I note that when run without arguments, Semantra's usage message indicates that the filename is optional (using []), but it does not actually accept being run without one:

$ semantra
Usage: semantra [OPTIONS] [FILENAME]...
Try 'semantra --help' for help.

Error: Must provide a filename to process/query
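For illustration, here is a minimal sketch of the behaviour the usage line implies, written with Python's standard argparse. It is not Semantra's actual CLI code: the `semantra-sketch` program name, the `list_indexed_files` helper, and the `*.index` file layout are all assumptions made up for this example.

```python
import argparse
from pathlib import Path
from typing import List


def list_indexed_files(index_dir: Path) -> List[str]:
    """Hypothetical helper: names of documents that already have an index.

    Assumes one "<document name>.index" file per indexed document.
    """
    if not index_dir.is_dir():
        return []
    return sorted(p.stem for p in index_dir.glob("*.index"))


def resolve_targets(argv: List[str], index_dir: Path) -> List[str]:
    """Return the files to search: those named, or everything indexed."""
    parser = argparse.ArgumentParser(prog="semantra-sketch")
    # nargs="*" makes the positional genuinely optional, matching "[FILENAME]...".
    parser.add_argument("filename", nargs="*")
    args = parser.parse_args(argv)
    # No filenames given: fall back to every already-indexed document.
    return args.filename or list_indexed_files(index_dir)
```

With this shape, running the tool with no arguments would search everything already indexed instead of exiting with an error.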

I appreciate the work done so far, and have the following suggestions:

  • If nothing else, provide a command line option to Semantra to make it search all already-indexed files, or default to this behaviour when no file name is provided.
  • Separate Semantra into two programs: one for indexing and one for searching. Allow indexers to run in parallel.
  • Allow the search webapp to run independently of indexers, so I can add files to the index without fear of breaking the webapp's search capabilities, and can leave the search window open 24/7. Hopefully this could be as simple as moving the index files out of a temporary directory and into the Semantra search directory only after indexing is completed.
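To make the last suggestion concrete, here is a rough sketch of the "build in a temporary directory, then move" idea. It assumes the staging and search directories are on the same filesystem, so that `os.replace` is an atomic rename. The function name, directory names, and index file name are hypothetical and not part of Semantra.

```python
import os
from pathlib import Path


def publish_index(data: bytes, staging_dir: Path, search_dir: Path, name: str) -> Path:
    """Write an index in staging_dir, then move it into search_dir in one step.

    A searcher that only reads search_dir never sees a half-written file,
    provided both directories live on the same filesystem.
    """
    staging_dir.mkdir(parents=True, exist_ok=True)
    search_dir.mkdir(parents=True, exist_ok=True)

    staged = staging_dir / name
    with open(staged, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before publishing

    final = search_dir / name
    os.replace(staged, final)  # atomic rename on the same filesystem
    return final
```

Several indexers could each work in their own staging directory and publish independently, since each rename touches only one finished file.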
@freedmand added the enhancement (New feature or request) label Apr 27, 2023
@freedmand
Owner

Thanks for the detailed write-up. This is clearly the way to go for 0.2.0 (and it was not obvious to me when I first started creating Semantra less than a month ago that it could actually be useful in this sense). I think this separation of concerns also ties into the idea of being able to add and remove files from the frontend itself (e.g. you could launch semantra without args and then open things in the UI).

This will take some design/thought to do elegantly, so I'll think further on this. But I'm interested in any more detailed design ideas you or anyone else might have on it.

@martoiu

martoiu commented May 1, 2023

It would be nice if one could create indexes that cover only selected folders, as in dtSearch. I am writing a book and have folders for books, articles, documents, and news. I also have other projects with more folders. I would therefore like to search only within the folders of a given project.
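As a rough sketch of what I mean, a named collection could simply map to a list of folders, and a search would be restricted to indexed files under those folders. Everything here (the `collections` mapping, the function name, the example paths) is hypothetical and only illustrates the idea. Paths are assumed to be absolute and already normalized.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def files_in_collection(
    collections: Dict[str, List[Path]],
    name: str,
    indexed_files: Iterable[Path],
) -> List[Path]:
    """Return the indexed files that live under any folder of the named collection."""
    roots = collections[name]
    # A file belongs to the collection if one of its ancestor directories is a root.
    return [f for f in indexed_files if any(root in f.parents for root in roots)]


collections = {
    "book": [Path("/docs/books"), Path("/docs/articles")],
    "other-project": [Path("/projects/x")],
}
indexed = [
    Path("/docs/books/a.pdf"),
    Path("/docs/articles/2023/b.pdf"),
    Path("/projects/x/c.pdf"),
]
book_files = files_in_collection(collections, "book", indexed)
```

Here `book_files` contains only the two files under `/docs/books` and `/docs/articles`.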

@freedmand added this to the Semantra 0.2.0 milestone May 4, 2023
@yych42
Contributor

yych42 commented May 26, 2023

https://github.com/jdagdelen/hyperDB

This might be a relevant option for on-device vector search.
