Greatly improve AnchorCheck performance #661

Open

pacenathan wants to merge 8 commits into master from AnchorCheckPerformance
Conversation

pacenathan (Contributor)

This is meant to be based on #660, but I don't see a way to do that.

See #648 for the history of this work. This PR is a subset of that work, as requested in the comments of that PR.

pacenathan force-pushed the AnchorCheckPerformance branch 3 times, most recently from ab8d616 to 4f48981 on September 10, 2022 at 19:52
I started with a test of urlencoded anchors, assuming that the URL might
have a urlencoded anchor, but the actual anchor in the HTML would NOT be
urlencoded.
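
A minimal sketch of the mismatch being tested, assuming the check unquotes the URL fragment before matching it against the anchors parsed from the HTML (the helper below is hypothetical, not the test code):

    from urllib.parse import unquote, urlsplit

    def anchor_matches(url, html_anchors):
        # The URL fragment may be percent-encoded ("%C3%BCber") while the
        # id/name attribute in the HTML is not ("über"), so unquote the
        # fragment before comparing.
        return unquote(urlsplit(url).fragment) in html_anchors

    print(anchor_matches("http://example.com/page.html#%C3%BCber", {"über"}))
    # -> True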

Then I discovered that fileurl.py was stripping the anchors from url_data,
which breaks AnchorCheck. So I stopped it from doing that and
tried to fix up all the places that were assuming the URL would map to a
filesystem file. The tests all pass, but I'm not 100% sure I caught all
the cases, or fixed them correctly.
Also make pylint happy for the changes I've made
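
For context, a sketch of the distinction involved: when mapping a file:// URL to a filesystem path, the anchor must be excluded from the path, but it can still remain in url_data for AnchorCheck to use (the helper is illustrative, not the actual fileurl.py code):

    from urllib.parse import urlsplit
    from urllib.request import url2pathname

    def file_url_to_path(url):
        # The fragment is not part of the filename, so it is excluded
        # here; the caller keeps it in url_data for anchor checking.
        return url2pathname(urlsplit(url).path)

    print(file_url_to_path("file:///docs/page.html#section-2"))
    # -> /docs/page.html (on POSIX)
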
I added a test for file:// processing, and it showed different results
depending on whether the URL anchor was quoted. I tracked it down to
code in fileurl.py that was calling url_norm, and I'm pretty sure the
code is unnecessary at this point. But I made a minimally-invasive
change, to be as safe as possible.
Also stop using UrlBase.anchors in favor of a new method that is based
on UrlBase.url_data[4], which is more consistent with the rest of the
code.
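
For reference, assuming url_data follows the five-field layout of urllib.parse.urlsplit() (scheme, netloc, path, query, fragment), index 4 is the anchor:

    from urllib.parse import urlsplit

    parts = urlsplit("http://example.com/page.html#section-2")
    print(parts[4])        # "section-2" -- the fragment, i.e. the anchor
    print(parts.fragment)  # the same field, accessed by name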

The threading issue has been there for years, but I didn't notice it
until after I thought I was done, while I was doing manual testing
(with threads re-enabled).

The problem was with storing URL-specific state (.anchors) on the
AnchorCheck object itself: there is only one global AnchorCheck
object, so all the threads were competing to use that one single variable
(self.anchors).

The solution was to create a new object to hold .anchors, for each
processed URL.
Also removed the documentation saying AnchorCheck is disabled.
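
A sketch of the pattern, with illustrative names rather than the actual linkchecker classes:

    class AnchorCheckState:
        """Per-URL container for the anchors found in one document."""
        def __init__(self):
            self.anchors = []

    class AnchorCheck:
        # A single instance is shared by every worker thread, so it must
        # not carry per-URL state such as self.anchors. Each check builds
        # a fresh state object instead.
        def check_url(self, url_data):
            state = AnchorCheckState()
            # ... populate state.anchors from the parsed document ...
            return state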

A downside to the prior AnchorCheck implementation was that it caused the
system to completely re-process a given URL (i.e. a document) each time
there was a URL that referenced a new (distinct) anchor in that document.
That caused enormous slowdowns. This commit introduces a new cache class
and three new places that use that cache, which avoids most of the
reprocessing and massively improves performance.
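
A minimal sketch of such a cache, assuming a lock-synchronized, size-bounded mapping (the class name and the FIFO eviction are assumptions, not the PR's actual implementation):

    import threading
    from collections import OrderedDict

    class AnchorCache:
        def __init__(self, max_size):
            self.max_size = max_size
            self.lock = threading.Lock()
            self.data = OrderedDict()

        def get(self, key):
            with self.lock:
                return self.data.get(key)

        def put(self, key, value):
            with self.lock:
                if self.max_size <= 0:
                    return  # anchorcachesize=0 effectively disables caching
                self.data[key] = value
                while len(self.data) > self.max_size:
                    self.data.popitem(last=False)  # evict the oldest entry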

Note, however, that even under this implementation there is still
substantial wasted time in e.g. opening new HTTP connections to servers
and redoing processing that has already been done once. I don't see any
way to remediate that without major invasive surgery.

For future reference, I did manual performance testing, with and
without this change, on a test set of about 150 reasonably-large HTML
files generated from markdown documentation (mkdocs).

These times are based on the performance improvements I made in commit
c3e97b9.

The basic config was:

[checking]
threads=0
maxrequestspersecond=10000
[filtering]
checkextern=0
[output]
fileoutput=html
log=none
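
For the AnchorCheck runs the plugin was enabled as well; presumably its section looks something like this (only anchorcachesize itself is named in this PR; the placement is assumed):

    [AnchorCheck]
    anchorcachesize=2000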

Without [AnchorCheck], the times to check 478 links were:

local files: 4 seconds
local dev webserver: 6 seconds

With [AnchorCheck] and anchorcachesize=0 (which effectively disables
this cache), the times to check 1753 links (each link with a distinct
anchor now counts separately) were:

local files: 57 seconds
local dev webserver: 68 seconds

With [AnchorCheck] and anchorcachesize=2000 (the default, and big enough
for my entire dataset), the times to check 1753 links were:

local files: 5 seconds
local dev webserver: 16 seconds
(changing threads to 5 reduces the time to 11 seconds)

(Without commit b4455d5, but with 5 threads, against the local dev
webserver, the time would have been 390 seconds, even with this cache!)

I have not tested against a remote / high-latency / slow web server; I
expect anchor checking might still be substantially slower than
non-anchor-checking in that case.

Note that there is some overhead introduced here in the non-AnchorCheck
case, from calling the synchronized get() and put() methods on the cache.
That didn't seem to make a meaningful difference in my testing, but it's
possible there are scenarios where it could matter. If so, the
synchronization could be moved to private methods, and the public methods
could check whether the cache size is 0 before trying to use it, as
sketched below.
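
Building on the cache sketch above, that optimization might look like this (illustrative, not the PR's code):

    class AnchorCache:
        def get(self, key):
            # Early out when anchorcachesize=0: a disabled cache never
            # acquires the lock at all.
            if self.max_size == 0:
                return None
            return self._get_locked(key)

        def _get_locked(self, key):
            with self.lock:
                return self.data.get(key)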

Note that there is probably still documentation work to be done for this
new setting (anchorcachesize) - I updated the places I could find, but I
don't think I got everywhere.