Skip to content

MLArtist/WebScraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

38 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

WebScraper

WebScraper is a Python-based web scraping tool designed to crawl websites efficiently while implementing sophisticated techniques to evade website security mechanisms and prevent blocking. Whether you require data extraction for research, analysis, or any other purpose, WebScraper streamlines the web scraping process, making it both effective and user-friendly.

Features

WebScraper offers several essential features to enhance your web scraping experience:

  • Request Throttling: Avoid overwhelming target websites by intelligently throttling your requests, ensuring a respectful and non-disruptive scraping process.

  • Random Time Intervals: Implement randomized time intervals between requests to mimic human browsing behavior, reducing the likelihood of triggering website security measures.

  • User-Agent Rotation: Automatically switch User-Agents for each request to make your scraping activities appear more like legitimate user interactions.

  • IP Rotation via Proxy Server: Enable IP rotation through a proxy server to further disguise your scraping activities, making it challenging for websites to detect and block your access.

These features collectively enhance the reliability and stealthiness of your web scraping tasks, enabling you to gather data with minimal disruption and increased success rates.

Getting Started

Follow these instructions to get a copy of WebScraper up and running on your local machine.

Prerequisites

Make sure you have the following prerequisites installed:

  • Python 3.x
  • Pip (Python package manager)

Installation

Direct Installation

  1. Clone this repository to your local machine:

    git clone https://github.com/MLArtist/WebScraper.git
  2. Navigate to the project directory:

    cd WebScraper
  3. Install the required Python dependencies:

    pip install -r requirements.txt

Now, you're ready to start using WebScraper!

Docker Installation

If you prefer to use Docker for installation, follow these steps:

  1. Make sure you have Docker installed on your system.

  2. build the WebScraper Docker image:

    docker build -t webscraper -f Dockerfile .

Now, you can use WebScraper in a Docker container!

Usage

Direct Usage

To start scraping, use the following command:

cd webscraper
python -m webscraper URL

Replace URL with the URL of the website you want to scrape.

If you wish to start the crawling process from a supplied address and clear any previously scraped data, you can use the following command:

cd webscraper
python -m webscraper URL --start_afresh true

Replace URL with the URL of the website you want to scrape.

Via Docker

To run WebScraper using Docker, execute the following command:

docker run -d -v $(pwd):/app -w /app/webscraper webscraper \
python -m webscraper URL

Replace URL with the URL of the website you want to scrape.

Output

WebScraper generates output in the form of JSON files, which are stored in the /data/ directory. Each JSON file contains the raw HTML content of a webpage in the following format:

{
  "url": "URL of the webpage",
  "content": "Raw HTML content of the webpage associated with the URL"
}

Acknowledgments

  • Special thanks to the open-source community for providing valuable libraries and tools that made this project possible.

Note

Web scraping should always be done responsibly and in compliance with the website's terms of service and legal regulations. Before using WebScraper, make sure you have the necessary permissions to scrape data from the targeted website.

Additionally, keep in mind that scraping large amounts of data or scraping too frequently from a website can put strain on the site's resources and may result in IP bans or legal action. Please use WebScraper responsibly and ethically.

The maintainers of this repository are not responsible for any misuse or legal consequences arising from the use of WebScraper. Users are encouraged to familiarize themselves with web scraping best practices and legal guidelines before using this tool.