Irrelevant content getting scraped #538

Open
kushagrasharma-13 opened this issue Mar 3, 2024 · 0 comments

kushagrasharma-13 commented Mar 3, 2024

The web content scraped from the URL provided in "01-defining-data-science" includes irrelevant material such as navigation links, random articles, and references. This noise causes errors when extracting insights and building the word cloud.

I would like a solution that keeps only the necessary, relevant content for further processing.

We could use BeautifulSoup instead of HTMLParser and use its features to extract only the relevant content.
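A minimal sketch of what that could look like, not a tested patch for the lesson notebook. It assumes the useful text lives in `<p>` elements and that navigation, footers, scripts, and citation markers can be dropped by tag name. The sample HTML, the `NOISE_TAGS` list, and the `extract_relevant_text` helper are all hypothetical and would need adjusting for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the real lesson fetches its HTML from a URL.
SAMPLE_HTML = """
<html><body>
  <nav><a href="/">Main page</a> <a href="/random">Random article</a></nav>
  <div id="content">
    <h1>Data science</h1>
    <p>Data science is an interdisciplinary field.<sup class="reference">[1]</sup></p>
    <p>It uses statistics and computing.</p>
    <ol class="references"><li>Some cited work</li></ol>
  </div>
  <footer>Privacy policy</footer>
  <script>var tracking = true;</script>
</body></html>
"""

# Assumed list of noisy tags; adjust it for the page being scraped.
NOISE_TAGS = ["nav", "header", "footer", "aside", "script", "style", "sup"]


def extract_relevant_text(html: str) -> str:
    """Return paragraph text only, skipping navigation and similar noise."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove noisy elements one at a time. Searching again after each removal
    # avoids touching a tag that was already destroyed with its parent.
    tag = soup.find(NOISE_TAGS)
    while tag is not None:
        tag.decompose()
        tag = soup.find(NOISE_TAGS)
    # Keep only paragraph text, which leaves out reference lists and menus.
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return "\n".join(text for text in paragraphs if text)


text = extract_relevant_text(SAMPLE_HTML)
print(text)
```

On the sample above, the navigation links, reference list, footer, and script contents are left out, and only the two paragraph sentences remain. Filtering by `<p>` is a blunt heuristic; restricting the search to the page's main content container may work better if one can be identified.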

Irrelevant content: [screenshot: irrelevant]
Relevant content: [screenshot: relevant]
