A news aggregator web app using BeautifulSoup4, Django, Django REST framework, Elasticsearch, and periodic tasks for automated updates.

gibran-abdillah/news-aggregator

News Aggregator

A web app that aggregates news articles from multiple sources. It uses BeautifulSoup4 for web scraping, Django for the web application, Django REST framework for the API, and Elasticsearch for search.

Installation

Set up the project

  • Add a crontab job that runs the scraper every 30 minutes and writes its output to a log file:
    */30 * * * * source yourenv/activate && cd news-aggregator && python3 manage.py scrape > /path/to/log 2>&1
  • Migrate your database and build the search index:
    python3 manage.py makemigrations
    python3 manage.py migrate
    
    python3 manage.py search_index --rebuild
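
For reference, the `*/30` in the minute field of the crontab entry above runs the scraper twice an hour, at minute 0 and minute 30. The sketch below only illustrates how such a step field expands. `expand_step_field` is a hypothetical helper, not part of this project or of cron, and it handles only the `*` and `*/N` forms.

```python
def expand_step_field(field, low, high):
    """Expand a cron field of the form '*' or '*/N' into the values it matches.

    Illustration only: real cron also accepts lists, ranges, and names.
    """
    if field == "*":
        return list(range(low, high + 1))
    base, _, step = field.partition("/")
    if base != "*" or not step.isdigit() or int(step) == 0:
        raise ValueError(f"unsupported cron field: {field!r}")
    # '*/N' matches every N-th value, counting from the lowest allowed value.
    return list(range(low, high + 1, int(step)))


minutes = expand_step_field("*/30", 0, 59)
print(minutes)  # [0, 30]
```

The remaining four fields of the entry are all `*`, so the job runs at those two minutes of every hour, every day.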

Run the app!

After everything is set up, run the Django app as usual:

python3 manage.py runserver

Or use gunicorn to run the WSGI app:

gunicorn newsaggregator.wsgi

Finally, you can browse the API at api/.
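
If you would rather query the API from a script than from the browser, something like the following may work. It is a minimal sketch that assumes the app is running locally on the default `runserver` address (`127.0.0.1:8000`) and that the endpoint can return JSON. `api_url` and `fetch_json` are hypothetical helpers, and the shape of the response is not documented here, so the code only decodes whatever comes back.

```python
import json
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Assumption: the development server is listening on runserver's default address.
BASE_URL = "http://127.0.0.1:8000/"


def api_url(path="api/"):
    """Build the full URL for a path under the app's base URL."""
    return urljoin(BASE_URL, path)


def fetch_json(url, timeout=10.0):
    """GET the URL, asking for JSON, and return the decoded body."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (requires the server to be running):
#   data = fetch_json(api_url())
#   print(json.dumps(data, indent=2))
```

The `Accept` header is set explicitly so that the server is asked for JSON rather than the HTML browsable view.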

Customize Scraper

You can also create your own scraper. You only need to set the title, content, and date attributes so that they match the markup of the target site.

For example, suppose an article page contains this markup:

<p class="date">13 Apr 2023</p>
<h1 class="title">This is Example title of the news article</h1>
<div class='detail-in' id='isi'>
    <p>Lorem ipsum dolor sit amet</p>
    <p>Azaret metrio zintos!</p>
</div>

All you need to do is inherit from the `Spider` class in utils/core/base.py and set those attributes.

Here is an example, from utils/modules/tempo.py:

from utils.core.base import Spider

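class TempoSpider(Spider):
    def __init__(self):
        # Sites this spider scrapes.
        self.base_url = [
            'https://www.tempo.co',
            'https://nasional.tempo.co',
            'https://gaya.tempo.co',
            'https://dunia.tempo.co',
        ]

        super().__init__(self.base_url)

        # Tag and attributes of the element that holds the article title.
        self.title_attr = {
            "name": "h1",
            "attrs": {
                "class": "title"
            }
        }

        # Tag and attributes of the element that holds the article body.
        self.content_attr = {
            "name": "div",
            "attrs": {
                "class": "detail-in",
                "id": "isi"
            }
        }

        # Tag and attributes of the element that holds the publication date.
        self.date_attr = {
            "name": "p",
            "attrs": {
                "class": "date"
            }
        }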
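
To see how these attribute dictionaries relate to the example markup, the sketch below passes each one to BeautifulSoup's `find()` as keyword arguments. This is an assumption about how such dictionaries can be used, not a copy of what the `Spider` base class does internally. The `extract` function and the `HTML` constant are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# The example article markup from this README.
HTML = """
<p class="date">13 Apr 2023</p>
<h1 class="title">This is Example title of the news article</h1>
<div class='detail-in' id='isi'>
    <p>Lorem ipsum dolor sit amet</p>
    <p>Azaret metrio zintos!</p>
</div>
"""

# Same shape as the attributes set in TempoSpider.
title_attr = {"name": "h1", "attrs": {"class": "title"}}
content_attr = {"name": "div", "attrs": {"class": "detail-in", "id": "isi"}}
date_attr = {"name": "p", "attrs": {"class": "date"}}


def extract(html, selectors):
    """Return the text of the first element matching each selector dict."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for field, selector in selectors.items():
        # find(name=..., attrs=...) returns the first match, or None.
        tag = soup.find(**selector)
        result[field] = tag.get_text(" ", strip=True) if tag else None
    return result


article = extract(
    HTML,
    {"title": title_attr, "content": content_attr, "date": date_attr},
)
print(article["title"])  # This is Example title of the news article
print(article["date"])   # 13 Apr 2023
```

When writing a scraper for another site, inspecting an article page and trying the three dictionaries this way is a quick check that they select the intended elements.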
        
