bug/text_splitting_separators_does_not_work #2872

Closed
muazhari opened this issue Apr 9, 2024 · 3 comments
Labels
chunking Related to element chunking.

Comments


muazhari commented Apr 9, 2024

Describe the bug
When using chunk_elements(), the default parameter value of text_splitting_separators does not have any effect.

To Reproduce

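from typing import List

from unstructured.chunking.basic import chunk_elements
from unstructured.documents.elements import Element

# `elements` is the list of Elements returned by a partitioning call.
chunked_texts: List[Element] = chunk_elements(
    elements=elements,
    include_orig_elements=True,
    max_characters=50,
    overlap=10,
)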

Expected behavior
chunk_elements() should respect the default value of text_splitting_separators.

Screenshots
image

Environment Info
OS version: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python version: 3.10.12
unstructured version: 0.13.2
unstructured-inference version: 0.7.25
pytesseract version: 0.3.10
Torch version: 2.2.0
Detectron2 is not installed
PaddleOCR version: 2.6.1.2
Libmagic version: 1:5.41-3ubuntu0.1
LibreOffice version: LibreOffice 7.3.7.2 30(Build:2)

Additional context
Add a text_splitting_separators parameter to chunk_elements() so that the separators can be customized.

muazhari added the bug (Something isn't working) label Apr 9, 2024
Collaborator

scanny commented Apr 9, 2024

@muazhari currently the text-splitting separators apply only when overlap is not used. The "base" chunk is split on a word boundary; it is the overlap prefix that produces the mid-word starts of the chunks. You can see that the end of each chunk is on a word boundary. If you turn off overlap (omit the overlap argument or set it to 0), you'll see that the element text also starts on a word boundary.
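
To make the behavior described above concrete, here is a minimal, self-contained sketch. It is a toy model and not the library's implementation: the function names `split_on_whitespace` and `add_overlap_prefix` are hypothetical, only a single space is treated as a separator, and the overlap prefix is added on top of `max_characters` rather than counted inside it.

```python
def split_on_whitespace(text: str, max_characters: int) -> list[str]:
    """Greedily split text so that each chunk ends at a space when one is available."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > max_characters:
        # Last space that still leaves the chunk within max_characters.
        cut = remaining.rfind(" ", 0, max_characters + 1)
        if cut <= 0:
            cut = max_characters  # no separator in the window: hard cut
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


def add_overlap_prefix(chunks: list[str], overlap: int) -> list[str]:
    """Prefix every chunk but the first with the last `overlap` characters of the previous base chunk."""
    if overlap <= 0:
        return list(chunks)
    out = chunks[:1]
    for prev, cur in zip(chunks, chunks[1:]):
        # The tail is cut by character count, so it may begin in the middle of a word.
        out.append(prev[-overlap:] + " " + cur)
    return out


if __name__ == "__main__":
    text = "The quick brown fox jumps over the lazy dog near the river bank"
    base = split_on_whitespace(text, max_characters=20)
    for chunk in add_overlap_prefix(base, overlap=7):
        print(repr(chunk))
```

In this toy model every chunk still ends on a word boundary, but the second chunk becomes "own fox jumps over the lazy": the mid-word start comes entirely from the character-counted prefix. With `overlap=0` the chunks are returned unchanged and both ends fall on word boundaries.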

scanny added the chunking (Related to element chunking.) label and removed the bug (Something isn't working) label Apr 9, 2024
Author

muazhari commented Apr 12, 2024

Is it possible for overlap to respect text_splitting_separators? If possible, when will this feature be implemented?
Example:
C = Character
S = Separator
Initial texts:
Text A = C1C2S1C3S2C4
Text B = C5C6S1C7S2C8
Overlapped texts with respect to one of the separators:
Text A = C1C2S1C3S2C4
Text B = C4C5C6S1C7S2C8

Formal:
[A0, An] [B0, Bn]
[A0, An] [Aoverlap_index, A to B separator, B0, Bn]
overlap_index = last separator index of A + 1
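
The rule above (start the overlap just after the last separator in A) could be sketched roughly as follows. This is an illustration of the proposal only, not an existing unstructured API: the names `overlap_start_index` and `overlap_with_separators` and the `joiner` parameter are hypothetical, and the sketch assumes that no overlap is added when A contains no separator. It also ignores any character budget for the overlap.

```python
def overlap_start_index(text: str, separators: tuple[str, ...]) -> int:
    """Return the index just past the last separator in `text`, or len(text) if none occurs."""
    ends = [text.rfind(sep) + len(sep) for sep in separators if sep and sep in text]
    # Assumption: with no separator present, the overlap is empty.
    return max(ends) if ends else len(text)


def overlap_with_separators(
    prev: str, cur: str, separators: tuple[str, ...], joiner: str = ""
) -> str:
    """Prefix `cur` with the part of `prev` that follows its last separator."""
    tail = prev[overlap_start_index(prev, separators):]
    return tail + joiner + cur if tail else cur


if __name__ == "__main__":
    # Mirrors the example: S1 is a newline, S2 is a space.
    text_a = "C1C2\nC3 C4"
    text_b = "C5C6\nC7 C8"
    print(repr(overlap_with_separators(text_a, text_b, ("\n", " "))))
```

With the example texts, the overlap taken from A is "C4", so B becomes "C4C5C6\nC7 C8". Passing a non-empty `joiner` would insert the "A to B separator" mentioned in the formal description.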

Collaborator

scanny commented Apr 12, 2024

@muazhari I've added an enhancement request issue for this here: #2886

You can elaborate on that issue if you think it needs more. Closing this issue for now since the current behavior is expected (if not always desired :).

scanny closed this as completed Apr 12, 2024