Replies: 1 comment 1 reply
-
Hey @fin-lay-b, nice to see you diving into the depths of LlamaIndex again! How's everything on your end?

LlamaIndex handles the embedding of input text by tokenizing it with padding and truncation:

```python
encoded_input = self._tokenizer(
    sentences,
    padding=True,
    max_length=self.max_length,
    truncation=True,
    return_tensors="pt",
)
```

In this code snippet, `truncation=True` together with `max_length` caps each tokenized sequence at the model's token limit.

For loading `.docx` files specifically, LlamaIndex uses the `DocxReader` class:

```python
class DocxReader(BaseReader):
    """Docx parser."""

    def load_data(
        self, file: Path, extra_info: Optional[Dict] = None
    ) -> List[Document]:
        """Parse file."""
        try:
            import docx2txt
        except ImportError:
            raise ImportError(
                "docx2txt is required to read Microsoft Word files: "
                "`pip install docx2txt`"
            )

        text = docx2txt.process(file)
        metadata = {"file_name": file.name}
        if extra_info is not None:
            metadata.update(extra_info)

        return [Document(text=text, metadata=metadata or {})]
```

You can find this in the `llama-index-legacy/llama_index/legacy/readers/file/docs_reader.py` file.

To handle tokenization limits, LlamaIndex splits loaded documents into smaller chunks (nodes) before embedding, so that each chunk fits within the model's maximum token count.

I hope this helps! If you have any more questions, feel free to ask.
-
This question is particularly aimed at loading `.docx` files.
My understanding was that nodes are embedded to generate indexes; however, `.docx` files are loaded as a single node. So how does LlamaIndex embed a `.docx` file with more tokens than the model maximum (8,191 for `text-embedding-ada-002` (version 2), as an example)?
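To picture what the splitting step has to do here: the loaded document text is cut into (optionally overlapping) chunks whose token counts stay under the model limit, and each chunk is embedded separately. The sketch below is illustrative only, with token counts approximated by whitespace-separated words; the function name and parameters are hypothetical, not LlamaIndex API (LlamaIndex's actual node parsers, such as `SentenceSplitter`, count real tokens and split on sentence boundaries):

```python
def split_into_chunks(text, max_tokens=8191, overlap=200):
    """Split text into chunks of at most max_tokens 'tokens'
    (approximated here as whitespace-separated words), with
    consecutive chunks sharing `overlap` tokens of context."""
    words = text.split()
    chunks = []
    step = max_tokens - overlap  # how far the window advances each time
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this chunk already reached the end of the text
    return chunks
```

Each chunk can then be embedded on its own, so a single oversized `.docx` document becomes several nodes, each within the 8,191-token budget.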