-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skips links marked as known broken #370
base: master
Are you sure you want to change the base?
Conversation
Skips links that have a class value of "broken_link". Should add this as a config option to allow any class name to be excluded from checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a nice feature, but to be complete it needs documentation and ideally a unit test. (And it shouldn't break existing unit tests.)
linkcheck/htmlutil/linkparse.py
Outdated
@@ -175,6 +175,10 @@ def start_element (self, tag, attrs, element_text, lineno, column): | |||
log.debug(LOG_CHECK, "line %d col %d", lineno, column) | |||
if tag == "base" and not self.base_ref: | |||
self.base_ref = attrs.get("href", u'') | |||
if tag =="a" and attrs.get_true('class', u''): | |||
if ("broken_link" in attrs.get('class')): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this should be configurable? Some sites might use a different class name.
I think we should .split()
the class attribute so substring matching wouldn't cause accidental matches for unrelated classes like <a class="unbroken_linkage">
or something.
Hi I'm afraid I'm definitely not a Python pro, so thanks for the guidance.
I'll make some changes, add the config option, and document. I'll need to get some help on designing the unit test. |
I've made some changes, including adding switches / config options for a list of classes to ignore. I'm not sure the best way to pull that definition of class names into linkparse.py, so that's still statically coded as a variable definition in linkparse.py. Suggestions welcome on how best to pull config["ignoreclass"] in to linkparse.py. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused: given all the indentation errors I see in the diff, why are tests succeeding?
Is GitHub broken and showing me the diff incorrectly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is approaching the finish line very nicely! Only a few of small tweaks remain
I think this latest set of changes fixes up everything. the --ignore-classes switch takes a comma separated list of classes, which are excluded if any of them are found on the anchor link. I'll work out how to create and run some tests. I'm using this on a production box each night, so can keep an eye on it too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would LGTM this as-is, if it weren't for the broken tests.
@@ -156,10 +156,11 @@ class LinkFinder (TagFinder): | |||
"""Find HTML links, and apply them to the callback function with the | |||
format (url, lineno, column, name, codebase).""" | |||
|
|||
def __init__ (self, callback, tags): | |||
def __init__ (self, callback, tags, ignore_classes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a new required positional parameter. You've updated one call site, but two more remain:
tests/test_linkparser.py: h = linkparse.LinkFinder(self._test_one_url(url), linkparse.LinkTags)
tests/test_linkparser.py: h = linkparse.LinkFinder(callback, linkparse.LinkTags)
One possibility: make this a keyword argument with the default value of None
, then you won't have to touch unrelated unit tests.
OTOH by looking at the existing tests for LinkFinder you may discover how to add a new test for this feature. Or maybe an integration test could be better.
It would be nice to have one integration test. AFAIU these are mostly pairs of files in tests/checker/data/: an HTML file and a .result file. I would suggest copying an existing test and modifying the HTML to add a class attribute and modifying the .results file to exclude the link that should be omitted. Next, the test itself that uses the files -- check out tests/checker/test_http.py. There's a test class, and a test method, and it calls self.file_test() and passes the filename. So if you e.g. copy http.html to http_ignoreclass.html, you can then add an invocation to self.file_test, and then I think you can pass the ignoreclasses=... config option in the I'm not sure what stage of config parsing this exists in, and whether you should store a list or a comma-separated string in Running the tests should be simple: (I hope I'm not misleading you with thes suggestions: I'm not one of the original developers of linkchecker and my familiarity with its internal workings and the test suite extends only as much as I've needed to figure out a couple of bug fixes in the past. I'm only doing code reviews because somebody has to, and the original devs are MIA.) |
I said one, because zero is bad (maybe all units work correctly but aren't hooked up together right), and more than one is also bad (integration tests are slow, and one bug tends to break many integration tests). I haven't said how many unit tests I want. The right number is somewhere between zero and however much is needed to achieve 100% coverage for all the changed lines of code. (https://pypi.org/project/diff-cover/ is a nice tool for this, we should maybe hook it up.) It depends on how much time you have and how much you enjoy writing unit tests ;) |
the CI build fails here now and conflicts need to be resolved, sorry... also pay attention to flake8 warnings in the CI logs: they are not enforced yet, but they are worth fixing. |
Update: I'm working on some tests now, but is turning out to be more work than I anticipated, and needs more code refactoring. |
Skips anchor links that have a class value of "broken_link".
Should add this as a config option to allow any class name to be excluded from checking.