Skips links marked as known broken #370

nickpiggott · 2020-04-21T12:48:22Z

Skips anchor links that have a class value of "broken_link".

Should add this as a config option to allow any class name to be excluded from checking.

mgedmin

This seems like a nice feature, but to be complete it needs documentation and ideally a unit test. (And it shouldn't break existing unit tests.)

mgedmin · 2020-04-21T19:55:51Z

linkcheck/htmlutil/linkparse.py

@@ -175,6 +175,10 @@ def start_element (self, tag, attrs, element_text, lineno, column):
        log.debug(LOG_CHECK, "line %d col %d", lineno, column)
        if tag == "base" and not self.base_ref:
            self.base_ref = attrs.get("href", u'')
+        if tag =="a" and attrs.get_true('class', u''):
+           if ("broken_link" in attrs.get('class')):


I wonder if this should be configurable? Some sites might use a different class name.

I think we should .split() the class attribute so substring matching wouldn't cause accidental matches for unrelated classes like <a class="unbroken_linkage"> or something.
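
To illustrate the concern, here is a minimal sketch comparing substring matching with matching on whitespace-separated class tokens. The function names are hypothetical and are not part of linkchecker; the class attribute is assumed to arrive as a plain string.

```python
def has_class_substring(class_attr, name):
    """Naive check: true whenever `name` appears anywhere in the attribute."""
    return name in class_attr


def has_class_token(class_attr, name):
    """Token check: true only when `name` is one of the whitespace-separated classes."""
    return name in class_attr.split()


print(has_class_substring("unbroken_linkage", "broken_link"))  # True (accidental match)
print(has_class_token("unbroken_linkage", "broken_link"))      # False
print(has_class_token("external broken_link", "broken_link"))  # True
```

Splitting on whitespace follows how HTML treats the class attribute: as a space-separated list of names.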

nickpiggott · 2020-04-21T20:08:15Z

Hi

I'm afraid I'm definitely not a Python pro, so thanks for the guidance.

Yes, it should be configurable on the command line / config file. I'll add an option.
Yes, it should split(). That was a crude hammer to see whether it was working as expected and whether it was broadly the right approach.

I'll make some changes, add the config option, and document. I'll need to get some help on designing the unit test.

nickpiggott · 2020-04-22T14:34:40Z

I've made some changes, including adding switches / config options for a list of classes to ignore.

I'm not sure of the best way to pull that definition of class names into linkparse.py, so it is still statically coded as a variable definition in linkparse.py. Suggestions welcome on how best to pull config["ignoreclass"] into linkparse.py.

mgedmin

I'm confused: given all the indentation errors I see in the diff, why are tests succeeding?

Is GitHub broken and showing me the diff incorrectly?

mgedmin

This is approaching the finish line very nicely! Only a few small tweaks remain.

nickpiggott · 2020-04-24T14:39:58Z

I think this latest set of changes fixes up everything. The --ignore-classes switch takes a comma-separated list of classes; an anchor link is excluded from checking if any of those classes is found on it.
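
The behaviour described above could be sketched roughly as follows. This is not the code from the PR: the helper names are hypothetical, and trimming whitespace around each comma-separated entry is an assumption.

```python
def parse_ignore_classes(value):
    """Turn a comma-separated option value into a set of class names."""
    # Assumption: surrounding whitespace and empty entries are dropped.
    return {name.strip() for name in value.split(",") if name.strip()}


def should_skip(class_attr, ignore_classes):
    """Return True if the element carries at least one ignored class."""
    if not class_attr or not ignore_classes:
        return False
    # Compare whole class tokens, not substrings.
    return not ignore_classes.isdisjoint(class_attr.split())


ignored = parse_ignore_classes("broken_link, archived")
print(should_skip("nav broken_link", ignored))  # True
print(should_skip("nav external", ignored))     # False
print(should_skip(None, ignored))               # False
```

Keeping the parsed names in a set makes the "any of them" check a single set operation.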

I'll work out how to create and run some tests. I'm using this on a production box each night, so can keep an eye on it too.

mgedmin

I would LGTM this as-is, if it weren't for the broken tests.

mgedmin · 2020-04-24T17:02:37Z

linkcheck/htmlutil/linkparse.py

@@ -156,10 +156,11 @@ class LinkFinder (TagFinder):
    """Find HTML links, and apply them to the callback function with the
    format (url, lineno, column, name, codebase)."""

-    def __init__ (self, callback, tags):
+    def __init__ (self, callback, tags, ignore_classes):


This is a new required positional parameter. You've updated one call site, but two more remain:

tests/test_linkparser.py: h = linkparse.LinkFinder(self._test_one_url(url), linkparse.LinkTags)
tests/test_linkparser.py: h = linkparse.LinkFinder(callback, linkparse.LinkTags)

One possibility: make this a keyword argument with the default value of None, then you won't have to touch unrelated unit tests.

OTOH by looking at the existing tests for LinkFinder you may discover how to add a new test for this feature. Or maybe an integration test could be better.
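
A minimal sketch of the keyword-argument suggestion, using a stand-in class rather than the real LinkFinder. The class name, the is_ignored helper, and the plain-dict attrs are all assumptions made for illustration.

```python
class LinkFinderSketch:
    """Stand-in for LinkFinder; only the constructor signature matters here."""

    def __init__(self, callback, tags, ignore_classes=None):
        self.callback = callback
        self.tags = tags
        # Normalize None to an empty set so later checks need no special case.
        self.ignore_classes = set(ignore_classes or ())

    def is_ignored(self, attrs):
        """True if the element's class attribute names an ignored class."""
        classes = (attrs.get("class") or "").split()
        return not self.ignore_classes.isdisjoint(classes)


# Existing two-argument call sites keep working unchanged.
old_style = LinkFinderSketch(print, {"a": ["href"]})
# New call sites opt in with a keyword argument.
new_style = LinkFinderSketch(print, {"a": ["href"]}, ignore_classes=["broken_link"])

print(old_style.is_ignored({"class": "broken_link"}))  # False
print(new_style.is_ignored({"class": "broken_link"}))  # True
```

With a default of None, callers that never pass ignore_classes get the old behaviour, so the two test call sites listed above would not need to change.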

mgedmin · 2020-04-24T17:16:36Z

I'll work out how to create and run some tests. I'm using this on a production box each night, so can keep an eye on it too.

It would be nice to have one integration test. AFAIU these are mostly pairs of files in tests/checker/data/: an HTML file and a .result file. I would suggest copying an existing test, modifying the HTML to add a class attribute, and modifying the .result file to exclude the link that should be omitted. Next, the test itself that uses the files -- check out tests/checker/test_http.py. There's a test class and a test method, and it calls self.file_test() and passes the filename. So if you e.g. copy http.html to http_ignoreclass.html, you can then add an invocation of self.file_test, and then I think you can pass the ignoreclasses=... config option in the confargs.

I'm not sure what stage of config parsing this exists in, and whether you should store a list or a comma-separated string in confargs. I would guess a list.

Running the tests should be simple: tox (if you've got it installed) should take care of installing all the dependencies in its local virtualenvs etc. For rapidly iterating on one particular test I like tox -e py37 -- tests/checker/test_http.py::TestClass::test_name.

(I hope I'm not misleading you with these suggestions: I'm not one of the original developers of linkchecker and my familiarity with its internal workings and the test suite extends only as much as I've needed to figure out a couple of bug fixes in the past. I'm only doing code reviews because somebody has to, and the original devs are MIA.)

mgedmin · 2020-04-24T17:19:14Z

It would be nice to have one integration test.

I said one, because zero is bad (maybe all units work correctly but aren't hooked up together right), and more than one is also bad (integration tests are slow, and one bug tends to break many integration tests).

I haven't said how many unit tests I want. The right number is somewhere between zero and however much is needed to achieve 100% coverage for all the changed lines of code. (https://pypi.org/project/diff-cover/ is a nice tool for this, we should maybe hook it up.) It depends on how much time you have and how much you enjoy writing unit tests ;)

anarcat · 2020-05-22T13:06:06Z

the CI build fails here now and conflicts need to be resolved, sorry... also pay attention to flake8 warnings in the CI logs: they are not enforced yet, but they are worth fixing.

nickpiggott · 2020-05-22T13:14:32Z

Update: I'm working on some tests now, but it is turning out to be more work than I anticipated, and it needs more code refactoring.

Skips links marked as known broken

7e97929

nickpiggott mentioned this pull request Apr 21, 2020

Feature Question: Excluding links marked in CSS as broken #369

Open

mgedmin requested changes Apr 21, 2020

nickpiggott added 4 commits April 22, 2020 13:23

Building ignoreclass as a config option

5f120a3

Update linkparse.py

4370b7b

Fixes for tab/spaces

3ebbd54

Enable list of classes to be checked

d7fe45d

mgedmin reviewed Apr 22, 2020

mgedmin reviewed Apr 22, 2020

nickpiggott added 3 commits April 24, 2020 12:39

Update confparse.py

b2c9cad

Code updates to support ignore_classes as a config option

1271fe0

Incorrect entry

55fcb3a

mgedmin reviewed Apr 24, 2020

nickpiggott added 4 commits April 24, 2020 14:43

Fixed spacing and references to command line switches

30008ca

Associated ignore_classes to self

c8887b0

Fix transfer of classes to method

c91edb0

Corrected reference to args

ab0a35d

mgedmin reviewed Apr 24, 2020
