Replies: 1 comment 1 reply
-
Hey @fin-lay-b, nice to see you diving into the depths of LlamaIndex again! How's everything on your end?

LlamaIndex handles the embedding of input text by tokenizing it with padding and truncation:

```python
encoded_input = self._tokenizer(
    sentences,
    padding=True,
    max_length=self.max_length,
    truncation=True,
    return_tensors="pt",
)
```

In this code snippet, `truncation=True` together with `max_length` caps each tokenized sequence at the model's token limit.

For loading `.docx` files specifically, LlamaIndex uses the `DocxReader` class:

```python
class DocxReader(BaseReader):
    """Docx parser."""

    def load_data(
        self, file: Path, extra_info: Optional[Dict] = None
    ) -> List[Document]:
        """Parse file."""
        try:
            import docx2txt
        except ImportError:
            raise ImportError(
                "docx2txt is required to read Microsoft Word files: "
                "`pip install docx2txt`"
            )

        text = docx2txt.process(file)
        metadata = {"file_name": file.name}
        if extra_info is not None:
            metadata.update(extra_info)

        return [Document(text=text, metadata=metadata or {})]
```

You can find this in the `llama-index-legacy/llama_index/legacy/readers/file/docs_reader.py` file.

To handle tokenization limits, LlamaIndex splits loaded documents into smaller chunks (nodes) before embedding, so that each chunk fits within the model's maximum token count.

I hope this helps! If you have any more questions, feel free to ask.
-
This question is particularly aimed at loading `.docx` files.
My understanding was that nodes are embedded to generate indexes; however, `.docx` files are loaded as a single node. So how does LlamaIndex embed a `.docx` file with more tokens than the model maximum (8,191 for `text-embedding-ada-002` (version 2), as an example)?
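To picture what the splitting step has to do here: the loaded document text is cut into (optionally overlapping) chunks whose token counts stay under the model limit, and each chunk is embedded separately. The sketch below is illustrative only, with token counts approximated by whitespace-separated words; the function name and parameters are hypothetical, not LlamaIndex API (LlamaIndex's actual node parsers, such as `SentenceSplitter`, count real tokens and split on sentence boundaries):

```python
def split_into_chunks(text, max_tokens=8191, overlap=200):
    """Split text into chunks of at most max_tokens 'tokens'
    (approximated here as whitespace-separated words), with
    consecutive chunks sharing `overlap` tokens of context."""
    words = text.split()
    chunks = []
    step = max_tokens - overlap  # how far the window advances each time
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this chunk already reached the end of the text
    return chunks
```

Each chunk can then be embedded on its own, so a single oversized `.docx` document becomes several nodes, each within the 8,191-token budget.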