R-I-S-Khan/Article-Classifier-using-Apache-Spark

This project was a joint effort by:

Redwan Ibne Seraj Khan 
Alex Barganier         

First, place the repository in the home directory of the Hadoop virtual machine.

For part 1, the script follows the tutorial found at https://6chaoran.wordpress.com/2016/08/13/__trashed/

cd spark
./bin/spark-submit [your directory]/titanic_spark.py


For part 2,
Spark and Python were used to complete the different tasks of the lab. We collected 1540 articles related to politics, sports, entertainment, and business using the New York Times API. We then filtered the collected data and divided it into training and testing sets. The code used for gathering and filtering the data is in the 'gather_data' folder, and the data set we used is in the 'data' folder. After collecting the data, we extracted the “words”, or “features”, that characterize each category. We treated the top 50 words as the features.
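The idea of picking the top words as features can be illustrated with a minimal plain-Python sketch. It is not the project's Spark code: the stop-word list, the `tokenize` and `top_words` helpers, the toy articles, and the choice to rank words by raw frequency within each category are all assumptions made for illustration, since the ranking criterion is not stated here.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the project's actual list (if any) is not given.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def top_words(articles, n=50):
    """Return up to n of the most frequent words across a list of article strings."""
    counts = Counter()
    for text in articles:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]


# Toy input standing in for the collected articles (made up for illustration).
sample = {
    "sports": ["The team won the match", "The coach praised the team"],
    "politics": ["The senate passed the bill", "The senate debated the vote"],
}

# Assumption: words are ranked separately for each category.
features = {category: top_words(texts, n=3) for category, texts in sample.items()}
```

With real data, `n=50` would match the feature count described above; the small `n` here only keeps the toy example readable.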

Afterwards, we used Naive Bayes and Random Forest to train a model on our dataset, and used the model to predict the categories of the articles in our test folders. We achieved an accuracy of approximately 60%. We then collected articles from the Washington Post and checked whether our model could classify them correctly. The model was most accurate when predicting political articles. Details of the various outputs and of how to run the code can be found inside the 'videos' folder.
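To compare how well a model does on each category, the overall accuracy can be broken down per label. The following is a minimal plain-Python sketch under stated assumptions: the `accuracy_by_category` helper and the toy labels are hypothetical and are not taken from the project's scripts or its results.

```python
from collections import defaultdict


def accuracy_by_category(labels, predictions):
    """Return (overall accuracy, {category: accuracy}) for paired label lists."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    total = defaultdict(int)    # number of articles per true category
    correct = defaultdict(int)  # number of correct predictions per true category
    for actual, predicted in zip(labels, predictions):
        total[actual] += 1
        if actual == predicted:
            correct[actual] += 1
    overall = sum(correct.values()) / len(labels)
    per_category = {category: correct[category] / total[category] for category in total}
    return overall, per_category


# Made-up labels for illustration only; these are not the project's results.
true_labels = ["politics", "politics", "sports", "sports", "business"]
predicted_labels = ["politics", "politics", "sports", "business", "sports"]

overall, per_category = accuracy_by_category(true_labels, predicted_labels)
```

A breakdown like `per_category` makes it easy to see which category a model handles best, which is the kind of comparison described above for political articles.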


To run the code, set up Hadoop in a virtual machine.
Place this repo in the home directory.

Change the paths specified inside the scripts if necessary.

Then go to the Spark directory and run the command below to get the feature engineering output and the machine learning model accuracy.

./bin/spark-submit [your directory]/tf_idf.py 

About

Used Machine Learning through Apache Spark to build a Newspaper Article Classifier
