Plenty of unstructured data is freely available on the internet. Some of it becomes more useful when combined with other structured or unstructured data the organization already holds.
The project aims to automate the gathering of unstructured finance data (raw HTML) using the Python library BeautifulSoup, transform it into structured JSON, and save it as CSV.
- Automate the process of gathering unstructured data in the form of raw HTML.
- Learn to scrape financial news about specific companies listed on the stock market.
- Use the BeautifulSoup4 Python library for web scraping: installation, exception handling, advanced HTML parsing.
- Learn how to traverse a single domain to fetch data from many HTML pages.
- Process the gathered (scraped) data, transform it into structured JSON, and save it as CSV.
pip install --upgrade pip
pip install -r requirements.txt
Create a sub-directory named 'content' in the project directory to hold the CSV files.
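If you would rather not create the directory by hand, the saving step can create it on demand. The following is a minimal sketch using only the standard library. The function name `save_records`, the `stem` parameter, and the record keys in the usage note are hypothetical and not taken from the project code.

```python
import csv
import json
from pathlib import Path


def save_records(records, out_dir="content", stem="news"):
    """Write a list of dicts to <out_dir>/<stem>.json and <out_dir>/<stem>.csv."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create 'content' if it is missing

    # Structured JSON copy of the scraped records
    json_path = out / f"{stem}.json"
    json_path.write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # CSV header is the union of keys across all records, in first-seen order
    fieldnames = list(dict.fromkeys(key for rec in records for key in rec))
    csv_path = out / f"{stem}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # keys missing from a record become empty cells

    return json_path, csv_path
```

For example, `save_records([{"company": "ACME", "headline": "Q1 results"}])` would write `content/news.json` and `content/news.csv`.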
- Identify the target website
- Collect the URLs of the pages you want to extract data from
- Make a request to these URLs to get the HTML of the page
- Use locators to find the data in the HTML
- Save the data in a JSON or CSV file or some other structured format
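The request and locator steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the `news-item` CSS class, the function names, and the record keys are hypothetical, and the standard-library `urlopen` is used for fetching because the HTTP client is not specified here. Real pages will need their own locators.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Return the raw page bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        print(f"Skipping {url}: {exc}")
        return None


def parse_headlines(html):
    """Extract headline records from one page.

    Assumes (hypothetically) that each news entry looks like
    <div class="news-item"><a href="...">Headline</a></div>.
    """
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.find_all("div", class_="news-item"):
        link = item.find("a")
        if link is None:
            continue  # entry without a link: nothing to record
        href = link.get("href")
        title = link.get_text(strip=True)
        if href and title:
            records.append({"headline": title, "url": href})
    return records


def scrape(urls):
    """Fetch every URL and collect the records from pages that loaded."""
    records = []
    for url in urls:
        html = fetch_html(url)
        if html is not None:
            records.extend(parse_headlines(html))
    return records
```

Keeping fetching and parsing in separate functions means a failed request skips one page instead of stopping the whole run, and the parser can be tried on a saved HTML file without touching the network.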