Parser is unable to capture attrs that have nested quote marks of the same type #19

paw-lu · 2020-09-16T16:53:45Z

Describe the bug
Came across this issue in the wild. If there is a ">" character in an attribute, the parser misinterprets it as the end of the tag, and the parsed text includes part of the attribute value.

To Reproduce
Code to reproduce the behaviour:

>>> import gazpacho
>>> html = '<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>'
>>> soup = gazpacho.Soup(html)
>>> soup.find("div").text
'2"}">text'

Expected behavior

>>> import gazpacho
>>> html = '<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>'
>>> soup = gazpacho.Soup(html)
>>> soup.find("div").text
'text'

Environment:

OS: macOS
Version: 10.15.6

Was just recommended this library and am a huge fan of the api you came up with, thanks a lot for this project!


maxhumber · 2020-09-16T17:03:57Z

Yikes! That's some pretty nasty HTML.

I'm actually surprised that .find() even picks it up!

Unsurprisingly, bs4 also fails with that snippet:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
soup.find("div").text
# '2"}">text'

Let me think on this! I'm planning to improve void tag handling in the coming weeks, could probably bunch this in with that work.

paw-lu · 2020-09-16T21:49:30Z

Yeah, I was surprised to see it in action!

Seeing as bs4 also fails on this, this seems to be an exotic edge case. Totally understood if we leave this as won't fix.

Either way thanks for the response, and thanks for the library!

maxhumber · 2020-10-01T18:25:37Z

@paw-lu I wasn't able to get this in the 1.0 release... but I'm still thinking about it.

After some digging it turns out the extra > isn't the problem. Check it:

from html.parser import HTMLParser

class OverrideParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(attrs)
        super().handle_starttag(tag, attrs)

html = """<div tooltip-content="{'id': '7', 'graph': '1->2'}">text</div>"""
parser = OverrideParser()
parser.feed(html)

So long as the quote marks are nested properly it'll return:

[('tooltip-content', "{'id': '7', 'graph': '1->2'}")]

So, I wonder, how can we capture and parse your double/malformed "quotes"?
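For anyone who controls how the HTML is generated, one possible way to sidestep the problem is to escape the inner quotes before writing the attribute. This is only a minimal sketch using the standard library; it does not repair markup that is already malformed, and `CollectingParser`, `raw_value`, and `safe_html` are hypothetical names used for illustration:

```python
from html import escape
from html.parser import HTMLParser


class CollectingParser(HTMLParser):
    """Hypothetical helper: records start-tag attributes and text nodes."""

    def __init__(self):
        super().__init__()
        self.seen_attrs = []
        self.seen_text = []

    def handle_starttag(self, tag, attrs):
        self.seen_attrs.extend(attrs)

    def handle_data(self, data):
        self.seen_text.append(data)


raw_value = '{"id": "7", "graph": "1->2"}'
# escape(..., quote=True) replaces " with &quot; and > with &gt;,
# so the attribute value no longer contains a bare double quote.
safe_html = f'<div tooltip-content="{escape(raw_value, quote=True)}">text</div>'

parser = CollectingParser()
parser.feed(safe_html)
parser.close()

print(parser.seen_attrs)
print("".join(parser.seen_text))
```

The parser unescapes the entities in attribute values, so the original JSON string should come back intact. The open question of handling HTML that already contains unescaped nested quotes remains.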

maxhumber added the hacktoberfest and help wanted labels Sep 22, 2020

paw-lu changed the title ~~Parser gets misinterprets ">" in attribute as a closing tag~~ Parser misinterprets ">" in attribute as a closing tag Sep 25, 2020

maxhumber changed the title ~~Parser misinterprets ">" in attribute as a closing tag~~ Parser is unable to capture attrs that have nested quote marks of the same type Oct 2, 2020

maxhumber removed the hacktoberfest and help wanted labels Apr 24, 2021
