Parser is unable to capture attrs that have nested quote marks of the same type #19

Open
paw-lu opened this issue Sep 16, 2020 · 3 comments

Comments

@paw-lu

paw-lu commented Sep 16, 2020

Describe the bug
Came across this issue in the wild. If there is a ">" character in an attribute, the parser misinterprets it as the end of the opening tag, and the parsed text then includes part of the attribute value.

To Reproduce
Code to reproduce the behaviour:

>>> import gazpacho
>>> html = '<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>'
>>> soup = gazpacho.Soup(html)
>>> soup.find("div").text
'2"}">text'

Expected behavior

>>> import gazpacho
>>> html = '<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>'
>>> soup = gazpacho.Soup(html)
>>> soup.find("div").text
'text'

Environment:

  • OS: macOS
  • Version: 10.15.6

Was just recommended this library and am a huge fan of the api you came up with, thanks a lot for this project!

@maxhumber
Owner

Yikes! That's some pretty nasty HTML.

I'm actually surprised that .find() even picks it up!

Unsurprisingly, bs4 also fails with that snippet:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
soup.find("div").text
# '2"}">text'

Let me think on this! I'm planning to improve void tag handling in the coming weeks, could probably bunch this in with that work.

@paw-lu
Author

paw-lu commented Sep 16, 2020

Yeah, I was surprised to see it in action!

Seeing as bs4 also fails on this, this seems to be an exotic edge case. Totally understood if we leave this as won't fix.

Either way thanks for the response, and thanks for the library!

@maxhumber added the hacktoberfest and help wanted labels Sep 22, 2020
@paw-lu changed the title from 'Parser gets misinterprets ">" in attribute as a closing tag' to 'Parser misinterprets ">" in attribute as a closing tag' Sep 25, 2020
@maxhumber
Owner

maxhumber commented Oct 1, 2020

@paw-lu I wasn't able to get this in the 1.0 release... but I'm still thinking about it.

After some digging it turns out the extra > isn't the problem. Check it:

from html.parser import HTMLParser

class OverrideParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(attrs)
        super().handle_starttag(tag, attrs)

html = """<div tooltip-content="{'id': '7', 'graph': '1->2'}">text</div>"""
parser = OverrideParser()
parser.feed(html)

So long as the quote marks are nested properly it'll return:

[('tooltip-content', "{'id': '7', 'graph': '1->2'}")]
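
For anyone who wants to poke at this, here is a minimal self-contained sketch of the same check. It relies only on the standard library's html.parser; the AttrCollector class and collect_attrs function are hypothetical helper names made up for this example, not part of gazpacho. Alongside the single-quoted variant above it also tries a variant where the inner double quotes are written as &quot; entities, which the parser unescapes in attribute values.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records (tag, attrs) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


def collect_attrs(html):
    """Feed the markup to a fresh parser and return the recorded start tags."""
    parser = AttrCollector()
    parser.feed(html)
    parser.close()
    return parser.seen


# Inner quotes of a different type than the outer ones (the case shown above).
single_quoted = """<div tooltip-content="{'id': '7', 'graph': '1->2'}">text</div>"""

# Inner double quotes written as entities, so they no longer clash with the outer pair.
entity_escaped = (
    '<div tooltip-content="{&quot;id&quot;: &quot;7&quot;, '
    '&quot;graph&quot;: &quot;1->2&quot;}">text</div>'
)

print(collect_attrs(single_quoted))
print(collect_attrs(entity_escaped))
```

In both variants the ">" inside the value sits within a properly delimited attribute, so it is not treated as the end of the tag.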

So, I wonder, how can we capture and parse your double/malformed "quotes"?
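
One possible direction, sketched here purely as an assumption and not as anything gazpacho does: pre-process the markup and escape the inner quotes before it reaches the parser. The heuristic below only targets attribute values that look like a JSON object (they start with "{" and end with "}" right before the closing quote). The names JSON_ATTR, escape_json_attrs and Collector are hypothetical, and the pattern will miss or mangle values that do not fit that shape.

```python
import json
import re
from html.parser import HTMLParser

# Assumption: the malformed value starts with "{" and its closing '}"' is
# followed by whitespace, "/" or ">". Anything else is left untouched.
JSON_ATTR = re.compile(r'([\w-]+)="(\{.*?\})"(?=[\s/>])', re.DOTALL)


def escape_json_attrs(html):
    """Rewrite inner double quotes of JSON-looking attribute values as &quot;."""

    def fix(match):
        name, value = match.groups()
        escaped = value.replace('"', "&quot;")
        return f'{name}="{escaped}"'

    return JSON_ATTR.sub(fix, html)


class Collector(HTMLParser):
    """Hypothetical helper: records start-tag attrs and text nodes."""

    def __init__(self):
        super().__init__()
        self.seen = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))

    def handle_data(self, data):
        self.text.append(data)


html = '<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>'
repaired = escape_json_attrs(html)

parser = Collector()
parser.feed(repaired)
parser.close()

tag, attrs = parser.seen[0]
# The parser unescapes &quot; again, so the value is valid JSON at this point.
payload = json.loads(dict(attrs)["tooltip-content"])
print(payload)
print("".join(parser.text))
```

The trade-off is that this guesses where the attribute ends, which is exactly the information the malformed markup fails to provide, so it can only ever be a best-effort repair for one known shape of value.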

@maxhumber changed the title from 'Parser misinterprets ">" in attribute as a closing tag' to 'Parser is unable to capture attrs that have nested quote marks of the same type' Oct 2, 2020
@maxhumber removed the hacktoberfest and help wanted labels Apr 24, 2021