Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keepSeparator doesn't work with RecursiveCharacterTextSplitter #5151

Open
5 tasks done
guidev opened this issue Apr 19, 2024 · 4 comments
Open
5 tasks done

keepSeparator doesn't work with RecursiveCharacterTextSplitter #5151

guidev opened this issue Apr 19, 2024 · 4 comments
Labels
auto:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@guidev
Copy link
Contributor

guidev commented Apr 19, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain.js documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain.js rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

        
const text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\nQuisque placerat nibh ex, eleifend varius dolor vulputate nec.\nNunc euismod ligula in leo tempus, vel lobortis leo accumsan. Phasellus elementum, nisi id tincidunt imperdiet, felis massa luctus diam, id scelerisque nisi mi et libero.\nDonec pulvinar turpis eu ipsum ullamcorper scelerisque.\nMauris non nibh nisi. Cras urna nulla, maximus sed mattis vel, gravida vel sapien. Aenean in diam a libero ultrices elementum.\nPellentesque ac ex nunc. Praesent nunc magna, viverra at urna vitae, aliquam placerat libero. Aenean vel turpis porta, semper velit eget, hendrerit ipsum.";

const document = new Document({pageContent: text});

const textSplitter = new RecursiveCharacterTextSplitter({chunkSize: 200, chunkOverlap: 40, keepSeparator: true});

const splitDocs = await textSplitter.splitDocuments([document], {
   chunkHeader: "CHUNK HEADER",
   appendChunkOverlapHeader: true,
});

console.log(JSON.stringify(splitDocs, null, 2));

Error Message and Stack Trace (if applicable)

No response

Description

I get these chunks:

You can see that second chunks starts with CHUNK HEADER(cont'd) Nunc, but I would expect it to start with CHUNK HEADER(cont'd) Nunc OR the first chunk should end with \n.

The \n between vulputate nec.\nNunc euismod is lost.

[
   {
      "pageContent":"CHUNK HEADERLorem ipsum dolor sit amet, consectetur adipiscing elit.\nQuisque placerat nibh ex, eleifend varius dolor vulputate nec.",
      "metadata":{
         "loc":{
            "lines":{
               "from":1,
               "to":2
            }
         }
      }
   },
   {
      "pageContent":"CHUNK HEADER(cont'd) Nunc euismod ligula in leo tempus, vel lobortis leo accumsan. Phasellus elementum, nisi id tincidunt imperdiet, felis massa luctus diam, id scelerisque nisi mi et libero.",
      "metadata":{
         "loc":{
            "lines":{
               "from":3,
               "to":3
            }
         }
      }
   },
   {
      "pageContent":"CHUNK HEADER(cont'd) Donec pulvinar turpis eu ipsum ullamcorper scelerisque.\nMauris non nibh nisi. Cras urna nulla, maximus sed mattis vel, gravida vel sapien. Aenean in diam a libero ultrices elementum.",
      "metadata":{
         "loc":{
            "lines":{
               "from":4,
               "to":5
            }
         }
      }
   },
   {
      "pageContent":"CHUNK HEADER(cont'd) Pellentesque ac ex nunc. Praesent nunc magna, viverra at urna vitae, aliquam placerat libero. Aenean vel turpis porta, semper velit eget, hendrerit ipsum.",
      "metadata":{
         "loc":{
            "lines":{
               "from":6,
               "to":6
            }
         }
      }
   }
]

System Info

langchain@0.1.34
macOS
node 18.20.2

Copy link

dosubot bot commented Apr 19, 2024

It looks like the issue you're encountering with the keepSeparator option not working as expected in the RecursiveCharacterTextSplitter is due to how separators are handled during the text splitting process. Specifically, when keepSeparator is set to true, the implementation should ensure that the separator (\n in your case) is preserved between chunks. However, it seems like the current logic mistakenly omits the separator from the final output.

To address this, you can modify the part of the code where the separator is managed. The correct approach would be to include the separator in the merged text only when keepSeparator is true. This involves adjusting the logic in the mergeSplits method or similar, where the decision to append the separator to the text is made.

Here's a brief outline of what the corrected code might look like:

mergeSplits(splits: string[], separator: string): string[] {
    // Adjust the logic here to ensure separator is included based on the keepSeparator flag
    const doc = this.joinDocs(currentDoc, this.keepSeparator ? separator : "");
    // Further logic to handle the splits
}

This adjustment ensures that the separator is correctly preserved in the output when keepSeparator is true, addressing the issue you've observed.

Given this insight, I recommend applying this fix to your local copy of the LangChain.js library and testing to see if it resolves the problem. If successful, consider contributing this fix back to the LangChain.js repository to help others who might face the same issue.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@dosubot dosubot bot added the auto:bug Related to a bug, vulnerability, unexpected error with an existing feature label Apr 19, 2024
@jeasonnow
Copy link
Contributor

jeasonnow commented Apr 25, 2024

In simple terms, the work of the Splitter can be summarized in the following steps:

  1. Perform text segmentation, retaining the separating symbols if the keepSeperator flag is set, such as \n.
  2. Iterate through the segmented strings, and if the length is less than the set chunkSize, merge them sequentially.
  3. After merging, iterate through each merged string to trim it, removing any leading or trailing whitespace characters.
  4. Generate each Chunk based on the merged result, attaching a Chunk Header to each.

private joinDocs(docs: string[], separator: string): string | null {
const text = docs.join(separator).trim();
return text === "" ? null : text;
}

The cause of this issue is that when merging to generate pageContent, the method String.prototype.trim is used to filter out spaces at the beginning and end of the string. However, \n is also filtered out by trim. Therefore, when there is a split with a \n at the beginning, the \n at the beginning and end will be removed.

@guidev

@jeasonnow
Copy link
Contributor

jeasonnow commented Apr 25, 2024

However, based on the following reasons, I believe this issue may not require modification:

  1. Leading and trailing whitespace characters do not affect the semantics of the text split out.
  2. If excessive special whitespace characters (\n) appear at the beginning and end without filtering, it may generate more chunks, leading to additional token consumption.

So I closed my PR. If you have any better suggestions, we can discuss them.

@wcummings
Copy link
Contributor

I found this to be a problem, this is my workaround:

// Fixed version of RecursiveCharacterTextSplitter which respects keepSeparator
class FixedRecursiveCharacterTextSplitter extends RecursiveCharacterTextSplitter {
  async mergeSplits(splits: string[], separator: string): Promise<string[]> {
    const docs: string[] = [];
    const currentDoc: string[] = [];
    let total = 0;
    for (const d of splits) {
      const _len = await this.lengthFunction(d);
      if (
        total + _len + currentDoc.length * separator.length >
        this.chunkSize
      ) {
        if (total > this.chunkSize) {
          console.warn(
            `Created a chunk of size ${total}, +
which is longer than the specified ${this.chunkSize}`,
          );
        }
        if (currentDoc.length > 0) {
          const doc = this._joinDocs(currentDoc, separator);
          if (doc !== null) {
            docs.push(doc);
          }
          // Keep on popping if:
          // - we have a larger chunk than in the chunk overlap
          // - or if we still have any chunks and the length is long
          while (
            total > this.chunkOverlap ||
            (total + _len + currentDoc.length * separator.length >
              this.chunkSize &&
              total > 0)
          ) {
            total -= await this.lengthFunction(currentDoc[0]);
            currentDoc.shift();
          }
        }
      }
      currentDoc.push(d);
      total += _len;
    }
    const doc = this._joinDocs(currentDoc, separator);
    if (doc !== null) {
      docs.push(doc);
    }
    return docs;
  }

  // Do not trim if keepSeparator is enabled, or whitespace separators may be erased
  private _joinDocs(docs: string[], separator: string): string | null {
    const text = this.keepSeparator
      ? docs.join(separator)
      : docs.join(separator).trim();
    return text === "" ? null : text;
  }
}

If this is an acceptable solution, I'll make a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto:bug Related to a bug, vulnerability, unexpected error with an existing feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants