Possible to to try to extract main article from a page? #86

Open
vzeazy opened this issue Mar 26, 2023 · 1 comment

Comments

@vzeazy

vzeazy commented Mar 26, 2023

Given the examples, struggle to understand how this may be utilized to extract the main article from a page. In this case, the sample would be the article content itself. Would be great if it could use several samples from other websites and then develop a generalized pattern for additional pages. Guessing this my be out of scope for this project.

@entrptaher

The following worked for me:

from autoscraper import AutoScraper

scraper = AutoScraper()

# Sample values are copied verbatim from the training page (typos included),
# because they have to match the text that actually appears in the HTML.
wanted_dict = {
    "title": ["Possible to to try to extract main article from a page?"],
    "meta": ["vzeazy"],
    "content": ['Given the examples, struggle to understand how this may be utilized to extract the main article from a page. In this case, the sample would be the article content itself. Would be great if it could use several samples from other websites and then develop a generalized pattern for additional pages. Guessing this my be out of scope for this project.']
}

# Learn extraction rules from a saved copy of the training page.
with open('sample/train.html', 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()
result = scraper.build(html=source_code, wanted_dict=wanted_dict)
scraper.save('github')

# Apply the learned rules to a different saved page.
with open('sample/test.html', 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()
result = scraper.get_result_exact(html=source_code)
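
For anyone wondering how a sample value can turn into a reusable rule, and how samples from several differently structured pages could be combined, here is a toy sketch that uses only the Python standard library. It is not AutoScraper's actual implementation, just an illustration of the general idea: record where each sample text sits in the tag tree of a training page, then collect whatever text sits at the same place in a new page. All names here (`PathTextParser`, `learn_rules`, `extract`) and the sample HTML strings are made up for the example.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextParser(HTMLParser):
    """Collects (path, text) pairs; a path is a tuple of (tag, class) from the root."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append((tuple(self.stack), text))


def text_by_path(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.items


def learn_rules(samples):
    """samples: iterable of (html, wanted_dict) pairs. Returns {alias: set of paths}."""
    rules = {}
    for html, wanted in samples:
        for path, text in text_by_path(html):
            for alias, values in wanted.items():
                if text in values:
                    rules.setdefault(alias, set()).add(path)
    return rules


def extract(html, rules):
    """Return every text node whose path matches a learned rule, grouped by alias."""
    result = {alias: [] for alias in rules}
    for path, text in text_by_path(html):
        for alias, paths in rules.items():
            if path in paths:
                result[alias].append(text)
    return result


# Two training pages with different layouts, each with its own sample values.
train_a = ('<html><body><h1 class="title">First post</h1><div class="nav">Home</div>'
           '<div class="body"><p>Hello world.</p></div></body></html>')
train_b = ('<html><body><article><h2>Second post</h2>'
           '<p class="text">Another layout.</p></article></body></html>')

rules = learn_rules([
    (train_a, {"title": ["First post"], "content": ["Hello world."]}),
    (train_b, {"title": ["Second post"], "content": ["Another layout."]}),
])

# An unseen page that shares the first layout.
test_page = ('<html><body><h1 class="title">Unseen post</h1><div class="nav">Home</div>'
             '<div class="body"><p>New text.</p></div></body></html>')
result = extract(test_page, rules)
print(result)  # {'title': ['Unseen post'], 'content': ['New text.']}
```

Rules learned this way only carry over to pages whose markup matches one of the training layouts, which is why a general "main article" extractor would need samples from each kind of site it is expected to handle.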
