Support full system-wide document searching #13

rheasman · 2023-04-26T19:05:07Z

The current version of Semantra (0.1.3) is a good start, but unfortunately cannot be used as a generic PDF search engine (though it is very close!).

I indexed every PDF on my linux machine by running the following command:
find / -iname "*.pdf" -type f -print0| xargs -0 semantra --no-server

I note that:

Only one CPU core is used.
It is not clear to me that I could rewrite my find command to safely run 10 copies of Semantra in parallel, so I didn't.
Once everything is indexed, there is no way to tell Semantra to just search over all already-indexed files.
Using the same command for indexing and viewing violates separation of concerns.

Also, I note that when run without arguments, Semantra indicates that it accepts an optional filename (using []), but does not actually accept such input:

$ semantra
Usage: semantra [OPTIONS] [FILENAME]...
Try 'semantra --help' for help.

Error: Must provide a filename to process/query

I appreciate the work done so far, and have the following suggestions:

If nothing else, provide a command line option to Semantra to make it search all already-indexed files, or default to this behaviour when no file name is provided.
Separate Semantra into two files, one for indexing and one for searching. Allow indexers to run in parallel.
Allow the search webapp to run independently of indexers, so I can add files to the index without fear of breaking the webapp's search capabilities, and can leave the search window open 24/7. This could hopefully be as simple as only moving the index files out of a temporary directory and into the semantra search directory after indexing is completed.
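
The "index in a temporary directory, then move into place" idea from the last suggestion can be sketched as follows. This is only an illustration of the pattern, not Semantra's actual code: `publish_index`, `build_index`, and the directory layout are hypothetical names, and the sketch assumes the staging directory lives on the same filesystem as the search directory so that the final rename is atomic on POSIX systems.

```python
import os
import tempfile
from pathlib import Path


def publish_index(build_index, search_dir: Path, name: str) -> Path:
    """Build an index file in a staging directory, then move it into place.

    `build_index` is a hypothetical callback that writes the index to the
    path it is given; it stands in for whatever the real indexer does.
    """
    search_dir.mkdir(parents=True, exist_ok=True)
    # Stage inside search_dir so both paths share one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=".staging-", dir=search_dir))
    tmp_file = staging / name
    try:
        build_index(tmp_file)
        final = search_dir / name
        # A reader sees either the old file or the complete new one.
        os.replace(tmp_file, final)
        return final
    finally:
        # Clean up whether or not indexing succeeded.
        if tmp_file.exists():
            tmp_file.unlink()
        staging.rmdir()
```

With this layout, a search process would need to skip the dot-prefixed staging directories, and several indexers could run at once because each one gets its own staging directory.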


freedmand · 2023-04-27T04:48:45Z

Thanks for the detailed write-up. This is clearly the way to go for 0.2.0 (and it was not obvious to me when I first started creating Semantra <1mo ago that it could actually be useful in this sense). I think this separation of concerns also ties into an idea of being able to add/remove files from the frontend itself (e.g. you could launch semantra without args and then open things in the UI).

This will take some design/thought to do elegantly, so I'll think further on this. But I'm interested in any more detailed design ideas you or anyone else might have on it.

martoiu · 2023-05-01T06:42:15Z

It would be nice if one could make indexes that contain only selected folders, like dtSearch. I am writing a book and have folders for books, articles, documents, and news. I also have other projects with more folders, so I would like to search only within the folders of a given project.
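
One way to picture folder-scoped searching is as a filter over the list of already-indexed files. The sketch below is an assumption about how that could look, not an existing Semantra feature: `in_selected_folders` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def in_selected_folders(indexed_files, folders):
    """Return the indexed files that live under any of the selected folders."""
    roots = [PurePosixPath(folder) for folder in folders]
    return [
        f
        for f in indexed_files
        # Compare whole path components, so /book does not match /bookkeeping.
        if any(root in PurePosixPath(f).parents for root in roots)
    ]


files = [
    "/home/me/book/articles/a.pdf",
    "/home/me/bookkeeping/b.pdf",
    "/home/me/other/c.pdf",
]
book_only = in_selected_folders(files, ["/home/me/book"])
```

Here `book_only` contains only the file under `/home/me/book`, so a search restricted to that project would ignore the other two.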

yych42 · 2023-05-26T15:18:39Z

https://github.com/jdagdelen/hyperDB

This might be a relevant option for on-device vector search.

freedmand added the enhancement New feature or request label Apr 27, 2023

freedmand mentioned this issue Apr 30, 2023

Fantastic. Enhancement ideas #24

Open

freedmand added this to the Semantra 0.2.0 milestone May 4, 2023

freedmand mentioned this issue May 30, 2023

Newbie question. I'm not sure how to run the Web UI server without the embedding process. #41

Closed
