Parser is unable to capture attrs that have nested quote marks of the same type #19
Comments
Yikes! That's some pretty nasty HTML. I'm actually surprised that Unsurprisingly, bs4 also fails with that snippet:
Let me think on this! I'm planning to improve void tag handling in the coming weeks, and could probably bundle this in with that work.
Yeah, I was surprised to see it in action! Seeing as bs4 also fails on this, this seems to be an exotic edge case. Totally understood if we leave this as Either way thanks for the response, and thanks for the library!
@paw-lu I wasn't able to get this in the 1.0 release... but I'm still thinking about it. After some digging it turns out the extra

from html.parser import HTMLParser

class OverrideParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(attrs)
        super().handle_starttag(tag, attrs)

html = """<div tooltip-content="{'id': '7', 'graph': '1->2'}">text</div>"""
parser = OverrideParser()
parser.feed(html)

So long as the quote marks are nested properly it'll return:

[('tooltip-content', "{'id': '7', 'graph': '1->2'}")]

So, I wonder, how can we capture and parse your double/malformed
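To make the contrast concrete, here is a rough sketch that runs the standard-library `html.parser` over the properly nested snippet from the comment above and over a malformed variant. The `AttrCollector` class, the `parse` helper, and the `malformed` string are all made up for illustration; the malformed string is only a guess at what "nested quote marks of the same type" might look like, not the markup from the original report.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: record start-tag attributes and text data."""

    def __init__(self):
        super().__init__()
        self.attrs = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Keep the tag name next to the attribute list the parser produced.
        self.attrs.append((tag, attrs))

    def handle_data(self, data):
        self.text.append(data)


def parse(html):
    """Feed one HTML string and return (start tags, concatenated text)."""
    parser = AttrCollector()
    parser.feed(html)
    parser.close()
    return parser.attrs, "".join(parser.text)


# Properly nested: single quotes inside a double-quoted value.
well_formed = """<div tooltip-content="{'id': '7', 'graph': '1->2'}">text</div>"""
# Assumed malformed variant: double quotes nested inside double quotes.
malformed = """<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>"""

good_attrs, good_text = parse(well_formed)
bad_attrs, bad_text = parse(malformed)

print(good_attrs, repr(good_text))
print(bad_attrs, repr(bad_text))
```

For the well-formed input the attribute comes back whole and the text is just `text`. For the malformed input the quoted value ends at the first inner double quote, so part of the attribute should leak into the text; the exact attribute list may differ between Python versions.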
Describe the bug
Came across this issue in the wild. If there is a ">" character in an attribute, the parser will misinterpret it as the end of the tag, and the parsed text will include some strings from the attributes.
To Reproduce
Code to reproduce the behaviour:
Expected behavior
Environment:
Was just recommended this library and am a huge fan of the API you came up with. Thanks a lot for this project!
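The failure mode described in this report (a ">" inside an attribute being read as the end of the tag) can be illustrated with two toy regular expressions. This is only a sketch: it is not the reporter's reproduction and not how this library or `html.parser` is implemented, and `NAIVE_TAG`, `QUOTE_AWARE_TAG`, and `strip_tags` are hypothetical names.

```python
import re

# Naive: a tag ends at the first ">" no matter where it appears.
NAIVE_TAG = re.compile(r"<[^>]*>")
# Quote-aware: a ">" inside a quoted attribute value does not end the tag.
QUOTE_AWARE_TAG = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")


def strip_tags(html, pattern):
    """Remove everything the pattern treats as a tag and return the rest."""
    return pattern.sub("", html)


html = """<div tooltip-content="{'id': '7', 'graph': '1->2'}">text</div>"""

naive_text = strip_tags(html, NAIVE_TAG)
aware_text = strip_tags(html, QUOTE_AWARE_TAG)

print(repr(naive_text))  # the tail of the attribute leaks into the text
print(repr(aware_text))  # only the element's text remains
```

The quote-aware pattern still depends on the quotes being balanced, which is why same-type nested quotes are the harder case.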