News Aggregator

A web app that aggregates news articles from multiple sources. It uses BeautifulSoup4 for web scraping, Django for the web application, Django REST framework for the API, and Elasticsearch for search.

Installation

Install Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html
Install Redis: https://redis.io/docs/getting-started/installation/

Clone this repository

  git clone https://github.com/gibran-abdillah/news-aggregator

Create your virtual environment and go to the project directory

virtualenv env/
source env/bin/activate

cd news-aggregator
pip3 install -r requirements.txt

Setup the project

Add a crontab job that runs the scraper every 30 minutes (replace the paths with your own):

*/30 * * * * . /path/to/env/bin/activate && cd /path/to/news-aggregator && python3 manage.py scrape > /path/to/log 2>&1
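
To make the schedule concrete: the first crontab field is the minute, and */30 selects every 30th minute of the hour. The sketch below expands such a step field. Note that expand_step_field is a hypothetical helper written for illustration (it only understands "*" and "*/N"); it is not part of this project or of cron itself.

```python
def expand_step_field(field, low, high):
    """Expand a cron field written as '*' or '*/N' into the values it matches.

    Hypothetical helper for illustration only; real cron fields also allow
    lists and ranges, which this sketch rejects.
    """
    if field == "*":
        return list(range(low, high + 1))
    base, _, step = field.partition("/")
    if base != "*" or not step.isdigit() or int(step) == 0:
        raise ValueError(f"unsupported cron field: {field!r}")
    # Start at the lowest value and take every N-th value after it.
    return list(range(low, high + 1, int(step)))


# The minute field runs from 0 to 59.
print(expand_step_field("*/30", 0, 59))  # prints [0, 30]
```

So the scrape command runs at minute 0 and minute 30 of every hour.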

Migrate your database and build the search index

python3 manage.py makemigrations
python3 manage.py migrate

python3 manage.py search_index --rebuild
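
If you prefer to run these steps in one go, the commands above could be wrapped in a small Python helper that stops at the first failure. This is only a sketch: SETUP_STEPS, build_commands, and run_setup are hypothetical names that do not exist in the project, and it assumes you call it from the project root with the virtual environment active.

```python
import subprocess
import sys

# The management commands from this section, in the order they should run.
SETUP_STEPS = [
    ["makemigrations"],
    ["migrate"],
    ["search_index", "--rebuild"],
]


def build_commands(python=sys.executable):
    """Return the full command line for each setup step."""
    return [[python, "manage.py", *step] for step in SETUP_STEPS]


def run_setup(runner=subprocess.run):
    """Run each step in order; check=True raises on a non-zero exit code."""
    for command in build_commands():
        runner(command, check=True)
```

Calling run_setup() runs the migrations first and rebuilds the search index last, so the index is built against an up-to-date database schema.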

Run the app!

After everything is set up, run the Django app as usual:

python3 manage.py runserver

Alternatively, use Gunicorn to run the WSGI app:

gunicorn newsaggregator.wsgi

Finally, you can browse the API at api/.
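
The API can also be queried from a script. The sketch below requests the api/ root as JSON using only the standard library. It assumes the app is running via runserver on Django's default address (http://127.0.0.1:8000/). fetch_api_root is a hypothetical helper name, and the shape of the returned JSON depends on the project's API, so it is not shown here.

```python
import json
import urllib.request
from urllib.parse import urljoin

# Assumption: the app is served by `manage.py runserver` on its default address.
BASE_URL = "http://127.0.0.1:8000/"


def fetch_api_root(base_url=BASE_URL, opener=urllib.request.urlopen):
    """GET <base_url>api/ and return the decoded JSON body."""
    request = urllib.request.Request(
        urljoin(base_url, "api/"),
        # Ask for JSON explicitly instead of an HTML page meant for browsers.
        headers={"Accept": "application/json"},
    )
    with opener(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, print(fetch_api_root()) shows whatever the API root returns. The opener parameter exists only so the function can be exercised without a live server.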

Customize Scraper

You can also create your own scraper. You only need to set the title, content, and date attributes, which describe the HTML elements that hold each field.

Still unclear? Consider this example HTML from an article page:

<p class="date">13 Apr 2023</p>
<h1 class="title">This is Example title of the news article</h1>
<div class='detail-in' id='isi'>
    <p>Lorem ipsum dolor sit amet</p>
    <p>Azaret metrio zintos!</p>
</div>

All you need to do is inherit from the `Spider` class in utils/core/base.py and set the attributes.

Example, in utils/modules/tempo.py:

from utils.core.base import Spider

class TempoSpider(Spider):
    def __init__(self):
        # Sites this spider scrapes.
        self.base_url = [
            'https://www.tempo.co',
            'https://nasional.tempo.co',
            'https://gaya.tempo.co',
            'https://dunia.tempo.co',
        ]

        super().__init__(self.base_url)

        # Matches <h1 class="title">
        self.title_attr = {
            "name": "h1",
            "attrs": {
                "class": "title"
            }
        }

        # Matches <div class="detail-in" id="isi">
        self.content_attr = {
            "name": "div",
            "attrs": {
                "class": "detail-in",
                "id": "isi"
            }
        }

        # Matches <p class="date">
        self.date_attr = {
            "name": "p",
            "attrs": {
                "class": "date"
            }
        }
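
To see how these attribute dictionaries relate to the example HTML, here is a dependency-free sketch that matches each one against the sample page. It uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra packages. FirstMatchText and extract are hypothetical names, and this is not how the project's Spider class is necessarily implemented.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """
<p class="date">13 Apr 2023</p>
<h1 class="title">This is Example title of the news article</h1>
<div class='detail-in' id='isi'>
    <p>Lorem ipsum dolor sit amet</p>
    <p>Azaret metrio zintos!</p>
</div>
"""

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class FirstMatchText(HTMLParser):
    """Collect the text inside the first element matching a tag name and attributes."""

    def __init__(self, name, attrs):
        super().__init__()
        self.name = name
        self.wanted = attrs
        self.depth = 0      # nesting depth inside the matched element (0 = outside)
        self.found = False  # True once the first matching element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.found or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # a nested tag inside the match
        elif tag == self.name and all(
            dict(attrs).get(key) == value for key, value in self.wanted.items()
        ):
            self.depth = 1   # entered the matching element

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.found = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract(html, selector):
    """Return the text of the first element described by a name/attrs dictionary."""
    parser = FirstMatchText(selector["name"], selector["attrs"])
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


title_attr = {"name": "h1", "attrs": {"class": "title"}}
content_attr = {"name": "div", "attrs": {"class": "detail-in", "id": "isi"}}
date_attr = {"name": "p", "attrs": {"class": "date"}}

print(extract(SAMPLE_HTML, title_attr))    # This is Example title of the news article
print(extract(SAMPLE_HTML, content_attr))  # Lorem ipsum dolor sit amet Azaret metrio zintos!
print(extract(SAMPLE_HTML, date_attr))     # 13 Apr 2023
```

When writing a scraper for a new site, inspect an article page and pick a tag name and attributes that uniquely identify the title, content, and date elements in the same way.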
