Autotagging Overflow

The goal of this pipeline is to autotag Stack Overflow questions and answers. A model is trained in batch on TF-IDF features with Spark MLlib. The trained model is then used to autotag new questions as they arrive from Kafka (latency?). Finally, the data is persisted in Cassandra to support the front end.

DEMO

Sample datasets

Tag

Pipeline

Explore the data set

Feature extraction for each question or answer

Use TF-IDF to form a vector for each question or answer:

TF (term frequency) is how often a word appears in a document.
IDF (inverse document frequency) measures whether a word is common or rare across the whole collection of documents.
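
The two definitions above can be illustrated with a small, framework-free sketch. It is not the repository's Spark code: the function name `tf_idf` and the toy documents are made up for illustration, TF is taken as a relative frequency (Spark's `HashingTF` uses raw counts), and the IDF uses the smoothed form log((N + 1) / (df + 1)) that Spark MLlib documents.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document (hypothetical helper)."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        vector = {}
        for term, count in counts.items():
            # TF: share of the document's tokens that are this term.
            tf = count / len(tokens)
            # Smoothed IDF: terms found in every document get weight 0.
            idf = math.log((n_docs + 1) / (doc_freq[term] + 1))
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors


# Toy documents, invented for this example.
docs = [
    "how to join two dataframes in spark".split(),
    "how to parse json in python".split(),
    "spark streaming with kafka".split(),
]
vectors = tf_idf(docs)
```

In the first document, "join" occurs in only one of the three documents, so it gets a higher weight than "how", which occurs in two.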

Batch process (model_generation.py)

Extract text features using TF-IDF
Train a Naive Bayes classifier for multiclass classification
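
As a rough illustration of the classification step, here is a minimal multinomial Naive Bayes sketch in plain Python. It is an assumption-laden stand-in, not the contents of model_generation.py: the helper names `train_naive_bayes` and `predict_tag` and the training samples are hypothetical, and it works on raw word counts rather than TF-IDF vectors to stay short. The actual batch job uses Spark MLlib.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples, smoothing=1.0):
    """Count tags and per-tag words from (tokens, tag) pairs (hypothetical helper)."""
    tag_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, tag in samples:
        tag_counts[tag] += 1
        term_counts[tag].update(tokens)
        vocabulary.update(tokens)
    return {
        "tag_counts": tag_counts,
        "term_counts": term_counts,
        "vocabulary": vocabulary,
        "smoothing": smoothing,
    }


def predict_tag(model, tokens):
    """Return the tag with the highest log posterior for the given tokens."""
    total = sum(model["tag_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_tag, best_score = None, float("-inf")
    for tag, n_tag in model["tag_counts"].items():
        counts = model["term_counts"][tag]
        denom = sum(counts.values()) + model["smoothing"] * vocab_size
        score = math.log(n_tag / total)  # log prior of the tag
        for term in tokens:
            if term in model["vocabulary"]:  # words never seen in training are skipped
                # Laplace-smoothed log likelihood of the word given the tag.
                score += math.log((counts[term] + model["smoothing"]) / denom)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Toy training data, invented for this example.
samples = [
    ("spark dataframe join".split(), "apache-spark"),
    ("spark rdd map".split(), "apache-spark"),
    ("python list comprehension".split(), "python"),
    ("python dict keys".split(), "python"),
    ("kafka topic partition".split(), "apache-kafka"),
]
model = train_naive_bayes(samples)
predicted = predict_tag(model, "join two spark dataframe".split())
```

Each tag is one class, so the same scoring loop handles any number of tags, which is what makes this a multiclass classifier.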

