Multilabel Book Genre Classifier 📖📚

Welcome to the Multilabel Book Genre Classifier repository! This project covers a full text classification pipeline: data collection, model training, deployment, and web interfaces that predict book genres from an input description.

Introduction

The main objective of this project is to develop a text classification model that can assign any of 222 distinct book genres to a book based on its description. The keys in deployment/genre_types_encoded.json provide insight into the top 10 genres for each book.

Data Collection and Preprocessing

The primary dataset was sourced from the Smashwords website. Data collection involved three steps:

Book URL Scraping: Book URLs were gathered with the scraper/smashbook_title_urls.py script. The URLs and their corresponding book titles were stored in scraper/smashbook_urls_1001.csv and scraper/smashbook_urls_2001.csv.
Book Details Scraping: Using these URLs, book descriptions and genres were extracted with the scraper/smashbook_details.py script and saved in scraper/smashbook_details1.csv and scraper/smashbook_details_2.csv.
Data Integration: The two CSV files, smashbook_details1.csv and smashbook_details_2.csv, were merged with the Python script scraper/joining.py. The merged data was then cleaned so that each book title carries its full set of genres (multilabel), and the result was stored in scraper/book-details.csv.
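The integration step above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the contents of scraper/joining.py: the column names (title, description, genre), the one-genre-per-row layout, and the helper name merge_book_details are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def merge_book_details(csv_texts):
    """Merge several CSV exports and collect the set of genres seen per title.

    Assumes (hypothetically) that each row has 'title', 'description' and
    'genre' columns, with one genre per row.
    """
    genres_by_title = defaultdict(set)
    description_by_title = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            title = row["title"].strip()
            genre = row["genre"].strip()
            if not title or not genre:
                continue  # skip incomplete rows
            genres_by_title[title].add(genre)  # duplicates collapse in the set
            description_by_title.setdefault(title, row["description"].strip())
    return [
        {"title": t, "description": description_by_title[t], "genres": sorted(g)}
        for t, g in genres_by_title.items()
    ]


# Two small in-memory "files" standing in for the scraped CSVs.
part1 = (
    "title,description,genre\n"
    "Dune,Desert planet epic,Science Fiction\n"
    "Dune,Desert planet epic,Adventure\n"
)
part2 = (
    "title,description,genre\n"
    "Emma,A matchmaker's mishaps,Romance\n"
    "Dune,Desert planet epic,Science Fiction\n"
)
merged = merge_book_details([part1, part2])
```

Each entry in merged holds one title with a deduplicated, sorted list of genres, which is the shape a multilabel classifier needs for its targets.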

In total, details for over 27K books were collected (data file size: 44.5 MB).

Model Training

The model was trained by fine-tuning distilroberta-base from the HuggingFace Transformers library, using the Fastai and Blurr frameworks. A detailed account of this process is provided in the notebooks folder.

Model Compression and ONNX Inference

The trained model had a large memory footprint. To reduce it, ONNX quantization was applied, which brought the model size down to 78.8 MB. The process is documented in the notebooks folder.
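To give a feel for why quantization shrinks a model, here is a small pure-Python sketch of 8-bit affine (min/max) quantization of a weight list. It is only an illustration of the arithmetic, not the ONNX tooling used in the notebooks; the function names and the sample weights are made up for the example. Storing each weight in 8 bits instead of a 32-bit float is what yields roughly a 4x reduction for the quantized tensors.

```python
def quantize_uint8(weights):
    """Map floats onto integers 0..255 using the min/max range of the input."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all weights are equal
    quantized = [round((w - lo) / scale) for w in weights]
    return quantized, scale, lo


def dequantize(quantized, scale, lo):
    """Recover approximate float values from the 8-bit codes."""
    return [v * scale + lo for v in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # illustrative values only
codes, scale, offset = quantize_uint8(weights)
restored = dequantize(codes, scale, offset)

# The round trip loses at most about half a quantization step per weight.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The trade-off is a small rounding error per weight (bounded by about half of scale) in exchange for the smaller storage size.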

Model Performance and Evaluation

Accuracy: around 99%
F1 Score (Micro): 0.6853
F1 Score (Macro): 0.5252
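The gap between accuracy and the F1 scores is typical of multilabel problems with many labels. Assuming the accuracy is computed per label (an assumption; the metric definition is not stated here), most of the label slots for any book are correctly predicted as "absent", which inflates accuracy. The sketch below computes all three metrics on a tiny made-up example to show the effect; the function name and the data are illustrative only.

```python
def multilabel_scores(y_true, y_pred):
    """Return (element-wise accuracy, micro F1, macro F1) for multi-hot rows."""
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    correct = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t == p:
                correct += 1
            if t == 1 and p == 1:
                tp[j] += 1
            elif t == 0 and p == 1:
                fp[j] += 1
            elif t == 1 and p == 0:
                fn[j] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0  # label never seen: count as 0

    accuracy = correct / (len(y_true) * n_labels)
    micro = f1(sum(tp), sum(fp), sum(fn))  # pool counts over all labels
    macro = sum(f1(a, b, c) for a, b, c in zip(tp, fp, fn)) / n_labels  # mean of per-label F1
    return accuracy, micro, macro


# Four books, five genres; one true genre is missed and two genres never occur.
y_true = [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 1, 0, 0]]
accuracy, micro_f1, macro_f1 = multilabel_scores(y_true, y_pred)
```

Here a single missed genre leaves accuracy at 0.95 while micro F1 drops to about 0.86 and macro F1 to 0.4, because macro F1 weights rare and unseen labels equally with frequent ones.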

Model Deployment

The compressed model is served through a Gradio app on HuggingFace Spaces. The implementation can be found in the models folder, and the app can be accessed directly via this HuggingFace Spaces link.
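A multilabel model returns one score per genre, so the app has to turn those scores into a list of predicted genres. A minimal sketch of that decoding step is shown below, assuming the model emits raw logits and that a fixed 0.5 threshold is used; the function name, the threshold, and the three sample labels are assumptions for illustration, not taken from the deployed app.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decode_genres(logits, labels, threshold=0.5):
    """Return (label, probability) pairs above the threshold, most likely first."""
    scored = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    picked = [pair for pair in scored if pair[1] >= threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


labels = ["Fantasy", "Romance", "Science Fiction"]  # hypothetical label subset
logits = [2.0, -1.5, 0.3]  # made-up model outputs
predicted = decode_genres(logits, labels)
predicted_names = [name for name, _ in predicted]
```

Because each genre is scored independently with a sigmoid rather than a softmax, a book can receive several genres at once or none at all.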

Web Deployment

A Flask application was developed that lets users enter a book description and receive the predicted genres as output. The live application can be accessed through this link.

Acknowledgments

Heartfelt gratitude is extended to Mohammad Sabik Irbaz and MasterCourse Bangladesh for their pivotal contributions in steering this capstone project. Their expertise, guidance, and unwavering support were crucial in shaping my skills and ensuring the successful completion of this repository. I am sincerely appreciative of their mentorship throughout this transformative journey.
