Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skips links marked as known broken #370

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

nickpiggott
Copy link

Skips anchor links that have a class value of "broken_link".

Should add this as a config option to allow any class name to be excluded from checking.

Skips links that have a class value of "broken_link".

Should add this as a config option to allow any class name to be excluded from checking.
Copy link
Contributor

@mgedmin mgedmin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a nice feature, but to be complete it needs documentation and ideally a unit test. (And it shouldn't break existing unit tests.)

linkcheck/htmlutil/linkparse.py Outdated Show resolved Hide resolved
@@ -175,6 +175,10 @@ def start_element (self, tag, attrs, element_text, lineno, column):
log.debug(LOG_CHECK, "line %d col %d", lineno, column)
if tag == "base" and not self.base_ref:
self.base_ref = attrs.get("href", u'')
if tag =="a" and attrs.get_true('class', u''):
if ("broken_link" in attrs.get('class')):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be configurable? Some sites might use a different class name.

I think we should .split() the class attribute so substring matching wouldn't cause accidental matches for unrelated classes like <a class="unbroken_linkage"> or something.

@nickpiggott
Copy link
Author

Hi

I'm afraid I'm definitely not a Python pro, so thanks for the guidance.

  • Yes, it should be configurable on the command line / config file. I'll add an option
  • Yes, it should split(). That was a crude hammer to see if it was working as expected and broadly the right approach

I'll make some changes, add the config option, and document. I'll need to get some help on designing the unit test.

@nickpiggott
Copy link
Author

I've made some changes, including adding switches / config options for a list of classes to ignore.

I'm not sure the best way to pull that definition of class names into linkparse.py, so that's still statically coded as a variable definition in linkparse.py. Suggestions welcome on how best to pull config["ignoreclass"] in to linkparse.py.

Copy link
Contributor

@mgedmin mgedmin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused: given all the indentation errors I see in the diff, why are tests succeeding?

Is GitHub broken and showing me the diff incorrectly?

linkcheck/configuration/confparse.py Outdated Show resolved Hide resolved
linkcheck/configuration/__init__.py Outdated Show resolved Hide resolved
linkcheck/htmlutil/linkparse.py Outdated Show resolved Hide resolved
linkcheck/htmlutil/linkparse.py Outdated Show resolved Hide resolved
linkcheck/configuration/confparse.py Outdated Show resolved Hide resolved
linkcheck/htmlutil/linkparse.py Outdated Show resolved Hide resolved
linkcheck/htmlutil/linkparse.py Outdated Show resolved Hide resolved
Copy link
Contributor

@mgedmin mgedmin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is approaching the finish line very nicely! Only a few of small tweaks remain

linkcheck/htmlutil/linkparse.py Outdated Show resolved Hide resolved
linkchecker Outdated Show resolved Hide resolved
linkchecker Outdated Show resolved Hide resolved
linkcheck/configuration/confparse.py Outdated Show resolved Hide resolved
@nickpiggott
Copy link
Author

I think this latest set of changes fixes up everything. the --ignore-classes switch takes a comma separated list of classes, which are excluded if any of them are found on the anchor link.

I'll work out how to create and run some tests. I'm using this on a production box each night, so can keep an eye on it too.

Copy link
Contributor

@mgedmin mgedmin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would LGTM this as-is, if it weren't for the broken tests.

@@ -156,10 +156,11 @@ class LinkFinder (TagFinder):
"""Find HTML links, and apply them to the callback function with the
format (url, lineno, column, name, codebase)."""

def __init__ (self, callback, tags):
def __init__ (self, callback, tags, ignore_classes):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new required positional parameter. You've updated one call site, but two more remain:

tests/test_linkparser.py:        h = linkparse.LinkFinder(self._test_one_url(url), linkparse.LinkTags)
tests/test_linkparser.py:        h = linkparse.LinkFinder(callback, linkparse.LinkTags)

One possibility: make this a keyword argument with the default value of None, then you won't have to touch unrelated unit tests.

OTOH by looking at the existing tests for LinkFinder you may discover how to add a new test for this feature. Or maybe an integration test could be better.

@mgedmin
Copy link
Contributor

mgedmin commented Apr 24, 2020

I'll work out how to create and run some tests. I'm using this on a production box each night, so can keep an eye on it too.

It would be nice to have one integration test. AFAIU these are mostly pairs of files in tests/checker/data/: an HTML file and a .result file. I would suggest copying an existing test and modifying the HTML to add a class attribute and modifying the .results file to exclude the link that should be omitted. Next, the test itself that uses the files -- check out tests/checker/test_http.py. There's a test class, and a test method, and it calls self.file_test() and passes the filename. So if you e.g. copy http.html to http_ignoreclass.html, you can then add an invocation to self.file_test, and then I think you can pass the ignoreclasses=... config option in the confargs.

I'm not sure what stage of config parsing this exists in, and whether you should store a list or a comma-separated string in confargs. I would guess a list.

Running the tests should be simple: tox (if you've got it installed) should take care of installing all the dependencies in its local virtualenvs etc. For rapidly iterating on one particular test I like tox -e py37 -- tests/checker/test_http.py::TestClass::test_name.

(I hope I'm not misleading you with thes suggestions: I'm not one of the original developers of linkchecker and my familiarity with its internal workings and the test suite extends only as much as I've needed to figure out a couple of bug fixes in the past. I'm only doing code reviews because somebody has to, and the original devs are MIA.)

@mgedmin
Copy link
Contributor

mgedmin commented Apr 24, 2020

It would be nice to have one integration test.

I said one, because zero is bad (maybe all units work correctly but aren't hooked up together right), and more than one is also bad (integration tests are slow, and one bug tends to break many integration tests).

I haven't said how many unit tests I want. The right number is somewhere between zero and however much is needed to achieve 100% coverage for all the changed lines of code. (https://pypi.org/project/diff-cover/ is a nice tool for this, we should maybe hook it up.) It depends on how much time you have and how much you enjoy writing unit tests ;)

@anarcat
Copy link
Contributor

anarcat commented May 22, 2020

the CI build fails here now and conflicts need to be resolved, sorry... also pay attention to flake8 warnings in the CI logs: they are not enforced yet, but they are worth fixing.

@nickpiggott
Copy link
Author

Update: I'm working on some tests now, but is turning out to be more work than I anticipated, and needs more code refactoring.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants