Greatly improve AnchorCheck performance #661
Open
pacenathan wants to merge 8 commits into linkchecker:master from pacenathan:AnchorCheckPerformance
Conversation
pacenathan force-pushed the AnchorCheckPerformance branch from ab8d616 to 4f48981
I started with a test of urlencoded anchors, on the assumption that the URL might contain a urlencoded anchor while the actual anchor in the HTML would NOT be urlencoded. I then discovered that fileurl.py was stripping the anchors from url_data, which breaks AnchorCheck. So I stopped it from doing that, and tried to fix up all the places that were assuming the URL would map to a filesystem file. The tests all pass, but I'm not 100% sure I caught all the cases, or fixed them correctly.
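As an illustration of the matching being tested (a minimal sketch, not LinkChecker's actual code; `anchor_matches` is a hypothetical helper):

```python
# Sketch: a fragment that arrives urlencoded in the URL should match the
# plain (non-encoded) anchor defined in the HTML document.
from urllib.parse import unquote

def anchor_matches(url_fragment: str, html_anchor: str) -> bool:
    # Decode the fragment taken from the URL before comparing, since the
    # id/name attribute in the HTML is not urlencoded.
    return unquote(url_fragment) == html_anchor

assert anchor_matches("a%20b", "a b")
assert not anchor_matches("a%20b", "a%20b")
```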
Also make pylint happy for the changes I've made
I added a test for file:// processing, and it showed different results depending on whether the URL anchor was quoted. I tracked that down to code in fileurl.py that was calling url_norm, and I'm pretty sure that code is unnecessary at this point. But I made a minimally invasive change, to be as safe as possible.
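The underlying hazard can be shown with the standard library alone (LinkChecker's url_norm has its own logic; this is just an illustration of why normalizing an already-quoted anchor changes the result):

```python
# Normalizing can percent-encode the fragment, so a quoted and an
# unquoted anchor that refer to the same target stop comparing equal.
from urllib.parse import quote, unquote

quoted = "a%20b"
unquoted = "a b"

print(quote(unquoted))              # 'a%20b'   -- encoding changes the raw text
print(quote(quoted))                # 'a%2520b' -- re-encoding double-escapes it
print(unquote(quoted) == unquoted)  # True      -- decoding first makes them agree
```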
Also stop using UrlBase.anchors in favor of a new method based on UrlBase.url_data[4], which is more consistent with the rest of the code. The threading issue has been there for years, but I didn't notice it until after I thought I was done, while I was doing manual testing (with threads re-enabled). The problem was storing URL-specific state (.anchors) on the AnchorCheck object itself: there is only one global AnchorCheck object, so all the threads were competing for that one single variable (self.anchors). The solution was to create a new object to hold .anchors for each processed URL, as sketched below.
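Roughly, the bug and the fix look like this (a sketch with illustrative names; `parse_anchors` and `AnchorInfo` are stand-ins, not the real classes in this PR):

```python
def parse_anchors(url_data):
    """Stand-in for the real anchor extraction."""
    return {"top", "intro"}

# Broken pattern: every checker thread overwrites the same attribute on
# the one global plugin instance, so results from different URLs collide.
class AnchorCheckBroken:
    def check(self, url_data):
        self.anchors = parse_anchors(url_data)  # shared mutable state: race

# Fixed pattern: state for each processed URL lives in its own object,
# created inside the call, so threads never share it.
class AnchorInfo:
    def __init__(self, anchors):
        self.anchors = anchors

class AnchorCheckFixed:
    def check(self, url_data):
        info = AnchorInfo(parse_anchors(url_data))  # local to this call
        return info
```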
Also removed the documentation saying AnchorCheck is disabled.

A downside to the prior AnchorCheck implementation was that it completely re-processed a given URL (i.e. a document) each time a link referenced a new (distinct) anchor in that document. That caused enormous slowdowns. This commit introduces a new cache class and three new places that use that cache, which avoids most of the reprocessing and massively improves performance. Note, however, that even under this implementation there is still substantial wasted time in e.g. opening new HTTP connections to servers and redoing processing that has already been done once. I don't see any way to remediate that without major invasive surgery.

For future reference, I did manual performance testing, with and without this change, on a test set of about 150 reasonably large HTML files generated from markdown documentation (mkdocs). These times were measured with the performance improvements from commit c3e97b9 in place. The basic config was:

```
[checking]
threads=0
maxrequestspersecond=10000
[filtering]
checkextern=0
[output]
fileoutput=html
log=none
```

Without [AnchorCheck], the times to check 478 links were:
- local files: 4 seconds
- local dev webserver: 6 seconds

With [AnchorCheck] and anchorcachesize=0 (which effectively disables this cache), the times to check 1753 links (each link with a distinct anchor counts, now) were:
- local files: 57 seconds
- local dev webserver: 68 seconds

With [AnchorCheck] and anchorcachesize=2000 (the default, and big enough for my entire dataset), the times to check 1753 links were:
- local files: 5 seconds
- local dev webserver: 16 seconds (changing threads to 5 reduces this to 11 seconds)

(Without commit b4455d5, but with 5 threads, against the local dev webserver, the time would have been 390 seconds, even with this cache!)

I have not tested against a remote / high-latency / slow web server; I expect anchor checking might still be substantially slower than non-anchor checking in that case.

Note that there is some overhead introduced here in the non-AnchorCheck case, from calling the synchronized get() and put() methods on the cache. That didn't seem to make a meaningful difference in my testing, but it's possible there are scenarios where it could matter. If so, the synchronization could be moved to private methods, and the public methods could check whether the cache size is 0 before trying to use it.

Note that there is still probably documentation work to be done for this new setting (anchorcachesize) - I updated the places I could understand to update, but I don't think I got everywhere.
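For illustration, a synchronized, size-bounded cache along these lines could look like the sketch below (illustrative only; the actual cache class in this PR may differ in API and eviction policy, and the size-0 fast path shown in put() is the optimization suggested above, not necessarily what the PR does):

```python
import threading
from collections import OrderedDict

class AnchorCache:
    """Thread-safe LRU-style cache bounded by anchorcachesize."""

    def __init__(self, maxsize=2000):
        self.maxsize = maxsize
        self._lock = threading.Lock()
        self._data = OrderedDict()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def put(self, key, value):
        if self.maxsize <= 0:
            return  # cache disabled: skip even taking the lock
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)           # mark as most recently used
            while len(self._data) > self.maxsize:
                self._data.popitem(last=False)    # evict the oldest entry
```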
pacenathan force-pushed the AnchorCheckPerformance branch from 4f48981 to fb388b5
This is meant to be based on #660, but I don't see a way to do that.
See #648 for the history of this work. This PR is a subset of that work, as requested in the comments of that PR.