
Crawling news and information websites and predicting the likelihood of their articles going viral.


Vishal1999-33/Online-News-Popularity


About

The main objective of this project was to crawl news and information websites and then predict the probability of their articles becoming popular. For this purpose, a pretrained Random Forest Regression model was used. The dataset was taken from the UCI Machine Learning Repository. It contains 61 attributes (58 predictive attributes, 2 non-predictive, 1 goal field). They are as follows:

  1. url: URL of the article (non-predictive)
  2. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
  3. n_tokens_title: Number of words in the title
  4. n_tokens_content: Number of words in the content
  5. n_unique_tokens: Rate of unique words in the content
  6. n_non_stop_words: Rate of non-stop words in the content
  7. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
  8. num_hrefs: Number of links
  9. num_self_hrefs: Number of links to other articles published by Mashable
  10. num_imgs: Number of images
  11. num_videos: Number of videos
  12. average_token_length: Average length of the words in the content
  13. num_keywords: Number of keywords in the metadata
  14. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
  15. data_channel_is_entertainment: Is data channel 'Entertainment'?
  16. data_channel_is_bus: Is data channel 'Business'?
  17. data_channel_is_socmed: Is data channel 'Social Media'?
  18. data_channel_is_tech: Is data channel 'Tech'?
  19. data_channel_is_world: Is data channel 'World'?
  20. kw_min_min: Worst keyword (min. shares)
  21. kw_max_min: Worst keyword (max. shares)
  22. kw_avg_min: Worst keyword (avg. shares)
  23. kw_min_max: Best keyword (min. shares)
  24. kw_max_max: Best keyword (max. shares)
  25. kw_avg_max: Best keyword (avg. shares)
  26. kw_min_avg: Avg. keyword (min. shares)
  27. kw_max_avg: Avg. keyword (max. shares)
  28. kw_avg_avg: Avg. keyword (avg. shares)
  29. self_reference_min_shares: Min. shares of referenced articles in Mashable
  30. self_reference_max_shares: Max. shares of referenced articles in Mashable
  31. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
  32. weekday_is_monday: Was the article published on a Monday?
  33. weekday_is_tuesday: Was the article published on a Tuesday?
  34. weekday_is_wednesday: Was the article published on a Wednesday?
  35. weekday_is_thursday: Was the article published on a Thursday?
  36. weekday_is_friday: Was the article published on a Friday?
  37. weekday_is_saturday: Was the article published on a Saturday?
  38. weekday_is_sunday: Was the article published on a Sunday?
  39. is_weekend: Was the article published on the weekend?
  40. LDA_00: Closeness to LDA topic 0
  41. LDA_01: Closeness to LDA topic 1
  42. LDA_02: Closeness to LDA topic 2
  43. LDA_03: Closeness to LDA topic 3
  44. LDA_04: Closeness to LDA topic 4
  45. global_subjectivity: Text subjectivity
  46. global_sentiment_polarity: Text sentiment polarity
  47. global_rate_positive_words: Rate of positive words in the content
  48. global_rate_negative_words: Rate of negative words in the content
  49. rate_positive_words: Rate of positive words among non-neutral tokens
  50. rate_negative_words: Rate of negative words among non-neutral tokens
  51. avg_positive_polarity: Avg. polarity of positive words
  52. min_positive_polarity: Min. polarity of positive words
  53. max_positive_polarity: Max. polarity of positive words
  54. avg_negative_polarity: Avg. polarity of negative words
  55. min_negative_polarity: Min. polarity of negative words
  56. max_negative_polarity: Max. polarity of negative words
  57. title_subjectivity: Title subjectivity
  58. title_sentiment_polarity: Title polarity
  59. abs_title_subjectivity: Absolute subjectivity level
  60. abs_title_sentiment_polarity: Absolute polarity level
  61. shares: Number of shares (target)

To fit our data to this model, we have to compute all of the above-mentioned attributes for each crawled article. The details of those calculations are discussed in the code; a condensed sketch of the idea is shown below.
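As a rough illustration, the minimal Python sketch below computes a handful of the simpler attributes from the list above and feeds them to a pickled Random Forest regressor. The model file name rf_model.pkl is hypothetical, and only four of the 58 predictive attributes are computed here for brevity; the real pipeline must assemble all of them, in the same column order the model was trained on.

```python
import pickle
import numpy as np

def basic_text_features(title, content):
    """Compute a few of the simpler predictive attributes for one article."""
    title_tokens = title.split()
    content_tokens = content.split()
    n_content = len(content_tokens)
    return {
        "n_tokens_title": len(title_tokens),
        "n_tokens_content": n_content,
        # rate of unique words in the content
        "n_unique_tokens": len(set(content_tokens)) / max(n_content, 1),
        # average length of the words in the content
        "average_token_length": sum(len(t) for t in content_tokens) / max(n_content, 1),
    }

# Load the pretrained Random Forest regressor (file name is hypothetical).
with open("rf_model.pkl", "rb") as f:
    model = pickle.load(f)

features = basic_text_features("Some crawled headline", "full article text goes here")
# Placeholder: a real feature vector holds all 58 predictive attributes,
# ordered exactly as in the training data.
X = np.array([list(features.values())])
predicted_shares = model.predict(X)
print(predicted_shares)
```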

What is Latent Dirichlet Allocation (LDA)?

For the calculation of attributes such as "LDA_00", "LDA_01", "LDA_02", "LDA_03", and "LDA_04", we have to build an LDA model. In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model and belongs to the machine learning toolbox and, in a wider sense, to the artificial intelligence toolbox. Here, the LDA algorithm is first applied to identify the five most relevant topics and then to measure the closeness of each article to those topics.
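As a minimal sketch of this step (using scikit-learn's LatentDirichletAllocation; the actual library used in the repository's code may differ), the following fits a five-topic model on the crawled articles and reads off each article's closeness to topics 0-4:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

articles = [
    "apple unveils new iphone with faster chip",
    "stock markets rally as tech earnings beat forecasts",
    "new study links exercise to better sleep",
    # ... the full crawled corpus goes here
]

# Bag-of-words representation of the corpus
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(articles)

# Fit an LDA model with five topics, matching LDA_00 ... LDA_04
lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(counts)

# Each row is an article's closeness (topic-membership probability)
# to topics 0-4; these become the LDA_00 ... LDA_04 attributes.
closeness = lda.transform(counts)
print(closeness[0])  # e.g. [0.04 0.04 0.83 0.04 0.04]
```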

What are non-neutral tokens?

Non-neutral tokens are tokens that are more likely to carry one of the following POS tags: nouns, adjectives, adverbs, and interjections. In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

For the full list of possible POS tags in NLTK, see the following:

https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk/38264311#38264311
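A small sketch of how such tokens can be extracted with NLTK's tokenizer and Perceptron tagger follows; mapping the four categories above to the Penn Treebank tag prefixes NN, JJ, RB, and UH is an assumption:

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Penn Treebank tag prefixes for nouns, adjectives, adverbs, interjections
NON_NEUTRAL_PREFIXES = ("NN", "JJ", "RB", "UH")

def non_neutral_tokens(text):
    """Return the tokens whose POS tags mark them as non-neutral."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word for word, tag in tagged if tag.startswith(NON_NEUTRAL_PREFIXES)]

tokens = non_neutral_tokens("The surprisingly good movie thrilled audiences everywhere")
print(tokens)  # ['surprisingly', 'good', 'movie', 'audiences', 'everywhere']
```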