Support incremental crawling #749

Open
xqbumu opened this issue Jan 31, 2024 · 6 comments · May be fixed by #827
Assignees
Labels
Investigation, Type: Enhancement

Comments

@xqbumu

xqbumu commented Jan 31, 2024

Please describe your feature request:

From the current logic, setting -srd in Katana saves the crawled content to a directory. However, when running it a second time, the content of that directory is cleared. I would like Katana to support incremental crawling, meaning:

  1. The directory should not be cleared on the second run;
  2. When a request has already been saved, crawling of that link should be skipped.

Describe the use case of this feature:

Replacing wget's -nc (no-clobber) option, as in:

wget -P ./output -nc -i urls.txt

Refer: https://github.com/projectdiscovery/katana/blob/main/pkg/output/output.go#L120
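For illustration, a minimal sketch of the requested behaviour; the directory layout and helper names (responsePath, shouldCrawl) are hypothetical, not Katana's actual API in the output package linked above:

```go
// Hypothetical sketch: keep previously stored responses and skip URLs that
// already have a file on disk. Names are illustrative, not Katana's real identifiers.
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// responsePath maps a URL to a deterministic file name inside the store directory.
func responsePath(storeResponseDir, rawURL string) string {
	sum := sha1.Sum([]byte(rawURL))
	return filepath.Join(storeResponseDir, hex.EncodeToString(sum[:])+".txt")
}

// shouldCrawl returns false when a response for this URL was already saved
// by a previous run, i.e. the "skip already-saved requests" idea.
func shouldCrawl(storeResponseDir, rawURL string) bool {
	_, err := os.Stat(responsePath(storeResponseDir, rawURL))
	return os.IsNotExist(err)
}

func main() {
	// Create the directory if missing, but do NOT clear it on a second run.
	dir := "./katana_responses"
	if err := os.MkdirAll(dir, 0o755); err != nil {
		panic(err)
	}
	url := "https://example.com/"
	if shouldCrawl(dir, url) {
		fmt.Println("crawl:", url)
	} else {
		fmt.Println("skip (already saved):", url)
	}
}
```

With something like this, a second run would leave previously saved responses in place and only fetch URLs that have no file yet.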

xqbumu added the Type: Enhancement label Jan 31, 2024
@dogancanbakir dogancanbakir self-assigned this Jan 31, 2024
@dogancanbakir
Member

Thanks for opening this issue. I don't remember the specifics, but if -resume is specified, previously crawled content should not be removed. I'll look into this.

@xqbumu
Author

xqbumu commented Feb 1, 2024

> Thanks for opening this issue. I don't remember the specifics, but if -resume is specified, previously crawled content should not be removed. I'll look into this.

Thank you for your reply. I have used this switch (-resume), but it only works for resuming an interrupted crawl. When I modify my urls.txt file, Katana is not able to perform incremental crawling.

@dogancanbakir
Member

@xqbumu,
Makes sense!

@Mzack9999,
Thoughts? - "Incremental Crawling" sounds good to me 💭

@Mzack9999
Member

This is certainly an interesting feature, but I'm not sure it can be fully applied to the crawling process. While it's easy to mimic by avoiding overwriting existing files, deciding when to abandon the crawl needs more thought: it can't simply be based on the existence of a file, because, for example, the crawl would end at the very beginning since the root branch already exists. Maybe a better strategy can be adopted, for example:

  • Crawl normally up to a minimum depth (2?)
  • Beyond that depth, if the crawled page is the same as the existing one (or all children of the parent node are the same above a certain threshold) => break out

What do you think?
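If helpful, a rough sketch of that depth-gated idea, assuming page bodies from a previous run are hashed and loaded into memory; every name here is illustrative rather than Katana's real code:

```go
// Illustrative sketch only: crawl unconditionally up to minDepth, and beyond
// that stop descending into pages whose content matches a previous run.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

const minDepth = 2 // always crawl at least this deep, so the existing root doesn't stop the crawl

// previousHashes would be rebuilt from the stored responses of an earlier run.
var previousHashes = map[string]string{}

func contentHash(body []byte) string {
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:])
}

// shouldDescend reports whether the crawler should keep following links below this page.
func shouldDescend(url string, body []byte, depth int) bool {
	if depth < minDepth {
		return true
	}
	old, seen := previousHashes[url]
	return !seen || old != contentHash(body) // descend only if the page is new or changed
}

func main() {
	previousHashes["https://example.com/docs"] = contentHash([]byte("unchanged page"))
	fmt.Println(shouldDescend("https://example.com/docs", []byte("unchanged page"), 3)) // false
	fmt.Println(shouldDescend("https://example.com/docs", []byte("updated page"), 3))   // true
}
```

The "all children of the parent node" variant would additionally need per-parent bookkeeping of how many child pages were unchanged before breaking out.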

@xqbumu
Author

xqbumu commented Mar 1, 2024

@Mzack9999

Thank you for your response. My initial expectation was to be able to continue crawling the remaining links after an interruption. The re-crawling strategy you describe here, however, seems to go further and enhance the ability to resume crawling.

As for the re-crawling strategy, in addition to considering link depth, it could also judge by the modification time of the crawled files, since data updates are easier to determine based on time.

The above is just my personal opinion, and I welcome your guidance.
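As a rough sketch of that modification-time idea (the file path and max-age value are made up for illustration, and the names are not Katana options):

```go
// Hypothetical sketch: re-crawl a URL only when its stored response file is
// missing or older than a configurable maximum age.
package main

import (
	"fmt"
	"os"
	"time"
)

// isStale reports whether the stored response at path is missing or older than maxAge.
func isStale(path string, maxAge time.Duration) bool {
	info, err := os.Stat(path)
	if err != nil {
		return true // never saved before, so crawl it
	}
	return time.Since(info.ModTime()) > maxAge
}

func main() {
	if isStale("./katana_responses/example.txt", 24*time.Hour) {
		fmt.Println("re-crawl: stored copy is missing or older than 24h")
	} else {
		fmt.Println("skip: stored copy is fresh")
	}
}
```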

@dogancanbakir
Member

@Mzack9999,

> My initial expectation was to be able to continue crawling the remaining links after an interruption.

Let's begin with this idea and then gradually develop it further. What do you say?

@dogancanbakir dogancanbakir linked a pull request Mar 28, 2024 that will close this issue