qiaoliuhub/AutoTag

Autotagging Overflow

This pipeline automatically tags Stack Overflow questions and answers. A model is trained in a batch job with Spark MLlib, using TF-IDF features. The trained model then tags new questions as they arrive from Kafka (latency?). Finally, the data is persisted in Cassandra to support the front end.

DEMO

Sample datasets

Tag

Pipeline

Explore the data set

Feature extraction for each question or answer

Use TF-IDF to form a vector for each question or answer:

  1. TF (term frequency) is how often a word appears in a document
  2. IDF (inverse document frequency) measures whether a word is common or rare across the whole collection of documents
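
As a rough illustration of the two definitions above, here is a minimal, dependency-free sketch of TF-IDF. It is not taken from this repository, which uses Spark MLlib. The function name `tf_idf` is hypothetical, TF is assumed to be the relative frequency of a term in a document, and IDF is assumed to use the smoothed form `log((N + 1) / (df + 1))`, where `N` is the number of documents and `df` is the number of documents containing the term.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists, e.g. [["python", "list"], ["java"]].
    Assumptions (not from the repository): TF is relative frequency,
    IDF is log((N + 1) / (df + 1)).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log((n_docs + 1) / (df[term] + 1))
            for term, count in counts.items()
        })
    return vectors
```

With this weighting, a word that occurs in every document gets a weight of zero, while a word that is frequent in one document but rare elsewhere gets a high weight, which is what makes it useful for telling tags apart.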

Batch process (model_generation.py)

  1. Extract text features using TF-IDF
  2. Train a Naive Bayes classifier to do multiclass classification
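
The classification step can be sketched in plain Python as a multinomial Naive Bayes over tokens. This is not the contents of `model_generation.py`, which relies on Spark MLlib. The names `train_naive_bayes` and `predict` are hypothetical, the features here are raw term counts rather than TF-IDF weights to keep the example short, and Laplace smoothing with `alpha=1.0` is an assumption.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples, alpha=1.0):
    """Fit a multinomial Naive Bayes model.

    examples is a list of (tokens, tag) pairs. Returns a tuple
    (log_priors, log_likelihoods, vocab). alpha is the smoothing
    constant (assumed value, not taken from the repository).
    """
    tag_counts = Counter(tag for _, tag in examples)
    term_counts = defaultdict(Counter)
    for tokens, tag in examples:
        term_counts[tag].update(tokens)
    vocab = {term for tokens, _ in examples for term in tokens}

    # log P(tag): fraction of training examples carrying each tag.
    log_priors = {
        tag: math.log(n / len(examples)) for tag, n in tag_counts.items()
    }
    # log P(term | tag), smoothed so unseen pairs keep a non-zero probability.
    log_likelihoods = {}
    for tag in tag_counts:
        total = sum(term_counts[tag].values()) + alpha * len(vocab)
        log_likelihoods[tag] = {
            term: math.log((term_counts[tag][term] + alpha) / total)
            for term in vocab
        }
    return log_priors, log_likelihoods, vocab


def predict(model, tokens):
    """Return the most probable tag; tokens outside the vocabulary are ignored."""
    log_priors, log_likelihoods, vocab = model
    scores = {
        tag: log_priors[tag]
        + sum(log_likelihoods[tag][t] for t in tokens if t in vocab)
        for tag in log_priors
    }
    return max(scores, key=scores.get)
```

A usage example with made-up training data:

```python
model = train_naive_bayes([
    (["python", "pandas", "dataframe"], "python"),
    (["java", "spring", "bean"], "java"),
    (["python", "list", "sort"], "python"),
    (["sql", "join", "table"], "sql"),
])
tag = predict(model, ["pandas", "list"])
```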
