Greatly improve AnchorCheck performance #661

Open

pacenathan wants to merge 8 commits into master from AnchorCheckPerformance
Conversation

pacenathan (Contributor)

This is meant to be based on #660, but I don't see a way to do that.

See #648 for the history of this work. This PR is a subset of that work, as requested in the comments of that PR.

pacenathan force-pushed the AnchorCheckPerformance branch 3 times, most recently from ab8d616 to 4f48981 on September 10, 2022 at 19:52
I started with a test of urlencoded anchors, assuming that the URL might
have a urlencoded anchor, but the actual anchor in the HTML would NOT be
urlencoded.
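
A minimal sketch of the mismatch being tested, assuming the check unquotes the URL fragment before matching it against the anchors parsed from the HTML (the helper below is hypothetical, not the test code):

    from urllib.parse import unquote, urlsplit

    def anchor_matches(url, html_anchors):
        # The URL fragment may be percent-encoded ("%C3%BCber") while the
        # id/name attribute in the HTML is not ("über"), so unquote the
        # fragment before comparing.
        return unquote(urlsplit(url).fragment) in html_anchors

    print(anchor_matches("http://example.com/page.html#%C3%BCber", {"über"}))
    # -> True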

Then I discovered that fileurl.py was stripping the anchors from url_data,
which breaks AnchorCheck. So I stopped it from doing that and
tried to fix up all the places that were assuming the URL would map to a
filesystem file. The tests all pass, but I'm not 100% sure I caught all
the cases, or fixed them correctly.
Also make pylint happy for the changes I've made
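
For context, a sketch of the distinction involved: when mapping a file:// URL to a filesystem path, the anchor must be excluded from the path, but it can still remain in url_data for AnchorCheck to use (the helper is illustrative, not the actual fileurl.py code):

    from urllib.parse import urlsplit
    from urllib.request import url2pathname

    def file_url_to_path(url):
        # The fragment is not part of the filename, so it is excluded
        # here; the caller keeps it in url_data for anchor checking.
        return url2pathname(urlsplit(url).path)

    print(file_url_to_path("file:///docs/page.html#section-2"))
    # -> /docs/page.html (on POSIX)
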
I added a test for file:// processing, and it showed different results
depending on whether the URL anchor was quoted. I tracked it down to
code in fileurl.py that was calling url_norm, and I'm pretty sure the
code is unnecessary at this point. But I made a minimally-invasive
change, to be as safe as possible.
Also stop using UrlBase.anchors in favor of a new method that is based
on UrlBase.url_data[4], which is more consistent with the rest of the
code.
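
For reference, assuming url_data follows the five-field layout of urllib.parse.urlsplit() (scheme, netloc, path, query, fragment), index 4 is the anchor:

    from urllib.parse import urlsplit

    parts = urlsplit("http://example.com/page.html#section-2")
    print(parts[4])        # "section-2" -- the fragment, i.e. the anchor
    print(parts.fragment)  # the same field, accessed by name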

The threading issue has been there for years, but I didn't notice it
until after I thought I was done, while I was doing manual testing
(with threads re-enabled).

The problem was with storing URL-specific state (.anchors) on the
AnchorCheck object itself: there is only one global AnchorCheck
object, so all the threads were competing to use that one single variable
(self.anchors).

The solution was to create a new object to hold .anchors, for each
processed URL.
Also removed the documentation saying AnchorCheck is disabled.
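
A sketch of the pattern, with illustrative names rather than the actual linkchecker classes:

    class AnchorCheckState:
        """Per-URL container for the anchors found in one document."""
        def __init__(self):
            self.anchors = []

    class AnchorCheck:
        # A single instance is shared by every worker thread, so it must
        # not carry per-URL state such as self.anchors. Each check builds
        # a fresh state object instead.
        def check_url(self, url_data):
            state = AnchorCheckState()
            # ... populate state.anchors from the parsed document ...
            return state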

A downside to the prior AnchorCheck implementation was that it caused the
system to completely re-process a given URL (i.e. a document) each time
there was a URL that referenced a new (distinct) anchor in that document.
That caused enormous slowdowns. This commit introduces a new cache class
and three new places that use that cache, which avoids most of the
reprocessing and massively improves performance.
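
A minimal sketch of such a cache, assuming a lock-synchronized, size-bounded mapping (the class name and the FIFO eviction are assumptions, not the PR's actual implementation):

    import threading
    from collections import OrderedDict

    class AnchorCache:
        def __init__(self, max_size):
            self.max_size = max_size
            self.lock = threading.Lock()
            self.data = OrderedDict()

        def get(self, key):
            with self.lock:
                return self.data.get(key)

        def put(self, key, value):
            with self.lock:
                if self.max_size <= 0:
                    return  # anchorcachesize=0 effectively disables caching
                self.data[key] = value
                while len(self.data) > self.max_size:
                    self.data.popitem(last=False)  # evict the oldest entry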

Note, however, that even under this implementation there is still
substantial wasted time in e.g. opening new HTTP connections to servers
and redoing processing that has already been done once. I don't see any
way to remediate that without major invasive surgery.

For future reference, I did manual performance testing, with and
without this change, on a test set of about 150 reasonably-large HTML
files generated from markdown documentation (mkdocs).

These times are based on the performance improvements I made in commit
c3e97b9.

The basic config was:

[checking]
threads=0
maxrequestspersecond=10000
[filtering]
checkextern=0
[output]
fileoutput=html
log=none
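
For the AnchorCheck runs the plugin was enabled as well; presumably its section looks something like this (only anchorcachesize itself is named in this PR; the placement is assumed):

    [AnchorCheck]
    anchorcachesize=2000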

Without [AnchorCheck], the times to check 478 links were:

local files: 4 seconds
local dev webserver: 6 seconds

With [AnchorCheck] and anchorcachesize=0 (which effectively disables
this cache), the times to check 1753 links (each link with a distinct
anchor now counts separately) were:

local files: 57 seconds
local dev webserver: 68 seconds

With [AnchorCheck] and anchorcachesize=2000 (the default, and big enough
for my entire dataset), the times to check 1753 links were:

local files: 5 seconds
local dev webserver: 16 seconds
(changing threads to 5 reduces the time to 11 seconds)

(Without commit b4455d5, but with 5 threads, against the local dev
webserver, the time would have been 390 seconds, even with this cache!)

I have not tested against a remote / high-latency / slow web server; I
expect anchor checking might still be substantially slower than
non-anchor-checking in that case.

Note that there is some overhead introduced here in the non-AnchorCheck
case, from calling the synchronized get() and put() methods on the cache.
That didn't seem to make a meaningful difference in my testing, but it's
possible there are scenarios where it could matter. If so, the
synchronization could be moved to private methods, and the public methods
could check whether the cache size is 0 before trying to use it, as
sketched below.
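
Building on the cache sketch above, that optimization might look like this (illustrative, not the PR's code):

    class AnchorCache:
        def get(self, key):
            # Early out when anchorcachesize=0: a disabled cache never
            # acquires the lock at all.
            if self.max_size == 0:
                return None
            return self._get_locked(key)

        def _get_locked(self, key):
            with self.lock:
                return self.data.get(key)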

Note that there is probably still documentation work to be done for this
new setting (anchorcachesize) - I updated the places I could find, but I
don't think I got everywhere.