Possible to to try to extract main article from a page? #86

vzeazy · 2023-03-26T23:32:32Z

Given the examples, I struggle to understand how this could be used to extract the main article from a page. In this case, the sample would be the article content itself. It would be great if it could take several samples from other websites and then derive a generalized pattern for additional pages. I'm guessing this may be out of scope for this project.

entrptaher · 2023-05-01T06:36:50Z

The following worked for me:

from autoscraper import AutoScraper

# Values must match the text in the saved page verbatim, original typos included.
wanted_dict = {
    "title": ["Possible to to try to extract main article from a page?"],
    "meta": ["vzeazy"],
    "content": ['Given the examples, struggle to understand how this may be utilized to extract the main article from a page. In this case, the sample would be the article content itself. Would be great if it could use several samples from other websites and then develop a generalized pattern for additional pages. Guessing this my be out of scope for this project.']
}

scraper = AutoScraper()

# Learn extraction rules from the saved training page, then persist them.
with open('sample/train.html', 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()
result = scraper.build(html=source_code, wanted_dict=wanted_dict)
scraper.save('github')

# Apply the learned rules to a different saved page.
with open('sample/test.html', 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()
result = scraper.get_result_exact(html=source_code)
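
On the "several samples from other websites" part of the question: a minimal sketch, assuming each training page is saved locally and paired with its own wanted_dict. The helper name `train_on_pages` and the file names in the usage comments are hypothetical. It relies on the `update=True` argument of `build`, which adds newly learned rules to the existing ones instead of replacing them. Whether the accumulated rules carry over to sites with a different layout is not guaranteed, since the rules are tied to page structure.

```python
from pathlib import Path


def train_on_pages(scraper, samples):
    """Accumulate extraction rules from several saved pages.

    scraper: an AutoScraper instance (anything with a compatible build method).
    samples: iterable of (html_path, wanted_dict) pairs; this layout is an
             assumption made for the sketch, not part of the library.
    """
    for html_path, wanted_dict in samples:
        html = Path(html_path).read_text(encoding="utf-8")
        # update=True keeps rules learned from earlier pages instead of resetting them.
        scraper.build(html=html, wanted_dict=wanted_dict, update=True)
    return scraper


# Usage sketch (paths and sample values are placeholders):
# from autoscraper import AutoScraper
# scraper = train_on_pages(AutoScraper(), [
#     ("sample/site_a.html", {"title": ["..."], "content": ["..."]}),
#     ("sample/site_b.html", {"title": ["..."], "content": ["..."]}),
# ])
# result = scraper.get_result_exact(html=new_page_html, group_by_alias=True)
```

Passing `group_by_alias=True` to `get_result_exact` returns the matches keyed by the aliases used in wanted_dict ("title", "content", and so on), which makes it easier to check which rules fired on an unseen page.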

irthomasthomas mentioned this issue Sep 7, 2023

Scraping irthomasthomas/undecidability#30

Open
