New model addition: MarkupLM #692

Open
2 tasks done
pogzyb opened this issue Apr 10, 2024 · 4 comments
Labels
new model Request a new model

Comments

@pogzyb

pogzyb commented Apr 10, 2024

Model description

MarkupLM is essentially BERT, but applied to HTML pages instead of raw text documents. It seems like there could be a lot of interesting uses for this type of model in the browser.

Prerequisites

  • The model is supported in Transformers (i.e., listed here)
  • The model can be exported to ONNX with Optimum (i.e., listed here)

Additional information

I think the most difficult part of the implementation will be MarkupLM's preprocessing. Specifically, MarkupLM uses a combination of a "feature extractor" and a "tokenizer". The "feature extractor" extracts nodes and xpaths from HTML strings. These nodes and xpaths are then fed to the "tokenizer" to produce xpath tag and subscript sequences. The Python implementation uses BeautifulSoup, so the JavaScript implementation might need a third-party HTML parsing library if DOMParser doesn't cut it.
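To make the "feature extractor" step concrete, here is a minimal sketch that collects text nodes and their xpaths from an HTML string using only Python's standard-library `html.parser`. It is not the Transformers implementation (which uses BeautifulSoup). The class name `NodeXPathExtractor` and the helper `extract_nodes_and_xpaths` are hypothetical, the sketch assumes well-formed HTML, it always writes an explicit 1-based index on every step, and it skips `script`/`style` contents as a simplifying assumption.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Assumption: text inside these elements is not useful as a node.
SKIPPED_TAGS = {"script", "style"}


class NodeXPathExtractor(HTMLParser):
    """Hypothetical helper: collects (text node, xpath) pairs while parsing."""

    def __init__(self):
        super().__init__()
        self.stack = []           # (tag, 1-based index) for each open element
        self.child_counts = [{}]  # per-depth counters of same-tag siblings
        self.nodes = []
        self.xpaths = []

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, counts[tag]))
        self.child_counts.append({})

    def handle_endtag(self, tag):
        # Only pop on a matching close tag (assumes well-formed HTML).
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        if self.stack[-1][0] in SKIPPED_TAGS:
            return
        self.nodes.append(text)
        self.xpaths.append("".join(f"/{t}[{i}]" for t, i in self.stack))


def extract_nodes_and_xpaths(html):
    parser = NodeXPathExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes, parser.xpaths


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><div><p>First</p><p>Second</p></div></body></html>"
    for node, xpath in zip(*extract_nodes_and_xpaths(page)):
        print(xpath, "->", node)
```

A JavaScript port would presumably walk the tree returned by DOMParser in the same way, keeping a per-parent counter of same-tag siblings.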

In short, the model needs 2 additional xpath inputs ('xpath_tags_seq' and 'xpath_subs_seq') on top of the usual BERT inputs, so the full set is: 'input_ids', 'token_type_ids', 'attention_mask', 'xpath_tags_seq', 'xpath_subs_seq'
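As a rough illustration of how an xpath could be turned into the two extra inputs, the sketch below maps each xpath step to a tag id and a subscript, then truncates and pads both lists to a fixed depth. The tag vocabulary, the depth, and the padding/unknown ids are all made-up illustrative values, not the ones used by the real tokenizer, and `xpath_to_sequences` is a hypothetical helper name.

```python
import re

# All constants below are illustrative assumptions, not the real model's values.
MAX_DEPTH = 4                                           # fixed length of each sequence
TAG_VOCAB = {"html": 0, "body": 1, "div": 2, "p": 3}    # hypothetical tag vocabulary
UNK_TAG_ID = len(TAG_VOCAB)                             # id for tags outside the vocabulary
PAD_TAG_ID = UNK_TAG_ID + 1                             # id used to pad the tag sequence
MAX_SUBSCRIPT = 1000                                    # subscripts are clamped to this value
PAD_SUB_ID = MAX_SUBSCRIPT + 1                          # id used to pad the subscript sequence

# One xpath step: a tag name with an optional numeric subscript, e.g. "div[2]".
STEP_RE = re.compile(r"^([^\[\]]+)(?:\[(\d+)\])?$")


def xpath_to_sequences(xpath):
    """Return (xpath_tags_seq, xpath_subs_seq) for one xpath string."""
    trimmed = xpath.strip("/")
    steps = trimmed.split("/") if trimmed else []

    tags, subs = [], []
    for step in steps:
        match = STEP_RE.match(step)
        if match is None:
            raise ValueError(f"unsupported xpath step: {step!r}")
        name, index = match.group(1), match.group(2)
        tags.append(TAG_VOCAB.get(name, UNK_TAG_ID))
        # A step without a subscript is encoded as 0.
        subs.append(min(int(index), MAX_SUBSCRIPT) if index else 0)

    # Truncate deep paths, then pad shallow ones to exactly MAX_DEPTH entries.
    tags, subs = tags[:MAX_DEPTH], subs[:MAX_DEPTH]
    padding = MAX_DEPTH - len(tags)
    return tags + [PAD_TAG_ID] * padding, subs + [PAD_SUB_ID] * padding


if __name__ == "__main__":
    print(xpath_to_sequences("/html/body/div[2]"))
```

In the actual model, every token would carry one such pair of sequences (the xpath of the node the token came from), alongside 'input_ids', 'token_type_ids', and 'attention_mask'.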

Your contribution

I added huggingface/optimum#1784 in optimum, but I'm not much of a JavaScript developer. I'd be happy to try either implementing the preprocessing or the pipeline, but I would need some guidance/regular reviews.

@pogzyb pogzyb added the new model Request a new model label Apr 10, 2024
@xenova
Owner

xenova commented Apr 10, 2024

Hi there! 👋 This does sound pretty interesting! I would imagine the built-in document parser should be sufficient. I'd be happy to review if you (or another community member) would like to open a PR!

@jonathanpv
Contributor

jonathanpv commented Apr 11, 2024

It's not entirely obvious to me from the Hugging Face docs what this model does, but if it's able to produce a mapping like

xpath -> goal / feature

then we could support local agentic solutions, or at least writing out instructions; running them would be a different story

For example:
xpath
/some/div/here -> this div is a button that will submit an order
/some/other/div/here -> this div handles file uploading

benefit:
local-first private solution; could be a quick "accessibility vibe check": if an AI can figure it out, your user can too?

That's just one app idea, here's another:

browser LLM powered site cloner:
/some/div/here -> this div is a button that will submit an order
/some/other/div/here -> this div handles file uploading

then pass those divs as CSS selectors or XPath selectors(?) to select a div to clone / translate to Next.js components using GPT-4 or some Hugging Face model that excels at front-end coding

benefit:
token efficiency vs cloud solution, local-first approach

here are the app ideas I have that may leverage this (unsure tbh what the model does out of the box):

  • chat with html
  • html to selenium code, e.g. "given this html, write selenium to book a flight"
  • html ui cloner app (outlined above)

curious what others think

@pogzyb
Author

pogzyb commented Apr 11, 2024

@xenova - sounds good! I'll try to take a crack at it, and if any community members would like to help or offer their advice, that'd be appreciated.

@jonathanpv - my main focus with the model has been to fine-tune it for cybersecurity related tasks. Here's a first draft of a fine-tuned model I trained: pogzyb/markuplm-phish. From my experience, a fine-tuned MarkupLM performed better than a fine-tuned BERT on phish/malicious website classification. The final goal of my project is to create a browser extension with the added benefit that the user's data stays local to their machine like you pointed out.

I think the html to selenium code generation is a good one!
Another idea I was thinking about was automatic page re-orientation, like how browsers offer "reader mode" for some websites. The model/app could optimize user experience on "clunky" pages (move text around, resize images, summarize or hide irrelevant sections/nodes/paragraphs).
Even tutorial use-cases similar to what "WalkMe" does could be supported.

@jonathanpv
Contributor

> Here's a first draft of a fine-tuned model I trained: pogzyb/markuplm-phish. From my experience, a fine-tuned MarkupLM performed better than a fine-tuned BERT on phish/malicious website classification. The final goal of my project is to create a browser extension with the added benefit that the user's data stays local to their machine like you pointed out.

oh wow nice!

> I think the html to selenium code generation is a good one!

yep, I wonder if that's all that's needed for an agent

> Another idea I was thinking about was automatic page re-orientation, like how browsers offer "reader mode" for some websites. The model/app could optimize user experience on "clunky" pages (move text around, resize images, summarize or hide irrelevant sections/nodes/paragraphs). Even tutorial use-cases similar to what "WalkMe" does could be supported.

oh wow, reader mode would be a great feature, that's a good idea
