structure-aware chunking of code and markdown #441

Open
pchalasani opened this issue Apr 1, 2024 · 3 comments
Labels: enhancement (New feature or request)

Comments

@pchalasani
Contributor

The chunker in CodeParser is not structure-aware. We should use something based on tree-sitter that produces an AST of the code structure (and of markdown too). Markdown is an especially useful case because it can serve as an intermediate stage when chunking PDF docs (i.e. PDF -> markdown with headers -> structure-aware chunks).
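
To make the idea concrete, here is a minimal sketch of structure-aware chunking for code. It is only an illustration under stated assumptions: it uses Python's built-in `ast` module as a stand-in for tree-sitter (so it handles Python source only), and `chunk_python_source` is a hypothetical name, not part of CodeParser. Comments and blank lines between top-level statements are dropped in this simplified version.

```python
import ast


def chunk_python_source(source: str) -> list[str]:
    """Split Python source into chunks that follow top-level structure.

    Each top-level function or class becomes its own chunk; runs of other
    top-level statements (imports, assignments, ...) are grouped together.
    Hypothetical helper: uses the stdlib `ast` module, not tree-sitter.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[str] = []
    pending: list[str] = []  # non-definition statements not yet flushed

    for node in tree.body:
        # lineno / end_lineno are 1-based and inclusive (Python 3.8+).
        start = node.lineno
        is_def = isinstance(
            node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
        )
        if is_def and node.decorator_list:
            # Keep decorators attached to the definition they decorate.
            start = min(d.lineno for d in node.decorator_list)
        text = "\n".join(lines[start - 1 : node.end_lineno])

        if is_def:
            if pending:
                chunks.append("\n".join(pending))
                pending = []
            chunks.append(text)
        else:
            pending.append(text)

    if pending:
        chunks.append("\n".join(pending))
    return chunks
```

A tree-sitter based version would follow the same shape (walk the top-level nodes, slice the source by each node's span) while covering more languages.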

@pchalasani added the enhancement label on Apr 1, 2024
@Mohannadcse
Collaborator

Is there a specific use case?

@pchalasani
Contributor Author

Structure-aware chunking in general is good to have. E.g. in a markdown doc, a logically coherent section should not be broken up, as long as chunk size limits and overlap params are respected.
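
As a rough illustration of the markdown case, the sketch below splits a document at header lines and then greedily packs whole sections into chunks up to a size limit. Everything here is an assumption for illustration: the names `split_sections` and `chunk_markdown`, the `max_chars` parameter, and the plain regex in place of a real markdown parser. Overlap is left out for brevity, and a section larger than the limit simply falls back to fixed-size splitting.

```python
import re

# ATX-style headers only ("# Title" through "###### Title").
HEADER_RE = re.compile(r"^#{1,6}\s")


def split_sections(markdown: str) -> list[str]:
    """Split markdown into sections, each starting at a header line."""
    sections: list[list[str]] = [[]]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        if not in_fence and HEADER_RE.match(line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)
    return [
        "\n".join(sec).strip()
        for sec in sections
        if any(ln.strip() for ln in sec)
    ]


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack whole sections into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for section in split_sections(markdown):
        candidate = f"{current}\n\n{section}" if current else section
        if len(candidate) <= max_chars:
            current = candidate  # section still fits in the open chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(section) <= max_chars:
            current = section
        else:
            # Oversized section: fall back to fixed-size pieces.
            chunks.extend(
                section[i : i + max_chars]
                for i in range(0, len(section), max_chars)
            )
    if current:
        chunks.append(current)
    return chunks
```

With this approach a section is only broken up when it cannot fit in a single chunk by itself; small neighbouring sections get merged rather than cut mid-paragraph.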

@Mohannadcse
Collaborator

I can work on this issue if it's not assigned yet.


2 participants