Auto-detect encoding? #50

endolith · 2023-07-06T21:09:03Z

Documentation says

--encoding: Encoding to use for reading text files [default: utf-8]

But different files have different encodings. A Chinese PDF is read correctly and its characters show up correctly, but a .txt file in the same folder that's encoded in GB2312 is garbled in both the search results and the file display.

Probably it should default to detecting the encoding of each file independently and then converting each file internally to whatever the embedding expects (UTF-8?)

https://pypi.org/project/chardet/
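
To illustrate the per-file idea, here is a rough sketch that uses only the Python standard library. It is not Semantra's code: the names `decode_bytes`, `read_text_detected`, and `CANDIDATES` are hypothetical, and the candidate list is an assumption chosen to cover the GB2312 case described above (GB18030 is a superset of GB2312). The optional `detector` argument is where a statistical guesser such as chardet could be plugged in.

```python
from pathlib import Path

# Hypothetical candidate list. Order matters: strict UTF-8 decoding fails
# quickly on most non-UTF-8 input, so it is tried before the legacy codec.
CANDIDATES = ("utf-8", "gb18030")


def decode_bytes(raw, candidates=CANDIDATES, detector=None):
    """Decode raw bytes, returning (text, encoding_name).

    detector is an optional callable taking bytes and returning an encoding
    name or None, for example: lambda b: chardet.detect(b)["encoding"]
    (assuming the third-party chardet package is installed).
    """
    to_try = list(candidates)
    if detector is not None:
        guess = detector(raw)
        if guess:
            # Try the detector's guess before the fixed candidates.
            to_try.insert(0, guess)
    for encoding in to_try:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding or unknown codec name: move to the next one.
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8"


def read_text_detected(path, candidates=CANDIDATES, detector=None):
    """Read one file as bytes and decode it independently of other files."""
    return decode_bytes(Path(path).read_bytes(), candidates, detector)
```

Once decoded, the text is an ordinary Python `str`, so downstream code no longer needs to care what the file's original encoding was. Trial decoding like this can still pick the wrong codec for short or ambiguous files, so an explicit `--encoding` override would remain useful.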


freedmand · 2023-07-08T03:33:28Z

Yep, you're absolutely right. This should be granular on a per-file basis. I can look into auto-detecting encoding, but that might be time-consuming for every file, and it might be error-prone. In any case, v0.2 should have better controls for customizing how Semantra works per file.

freedmand added this to the Semantra 0.2.0 milestone Jul 8, 2023

freedmand added the enhancement (New feature or request) label Jul 8, 2023
