NGB Living ChatBot

NGB ChatBot - webapp

Prototype:

Code: ngb_chatbot.py
Demo: Video

Run locally:

git clone https://github.com/LateNight01s/ngbliving_chatbot
cd ./ngbliving_chatbot
docker build -t chatbot ./backend/
docker run --rm -p 8080:8000 chatbot:latest
Go to http://localhost:8080

About ChatBots

Types

  • Retrieval-based approach (goal-oriented, narrow, predefined-responses)
  • Generative model (chit-chat, general, commonsense)

Generative models cannot yet pass the Turing test and require large amounts of training data. The state-of-the-art (SOTA) generative models are very large (GPT-3 with 175B parameters, Meena with 2.6B) and are general-purpose rather than domain-specific.

Retrieval-based models are goal-oriented and require domain-specific data. Several approaches are available, e.g., similarity functions over TF-IDF vectors, dual-encoder LSTMs, classifier models, and knowledge graphs.

For a website like NGB Living, which offers services to customers, a hybrid approach that combines both options would work best.
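One way to picture such a hybrid is a retrieval step with a confidence threshold and a generative fallback. The sketch below is only an illustration under stated assumptions, not the repository's implementation: the `FAQ` entries, the `generative_fallback` placeholder, and the `threshold` value are hypothetical, and plain Jaccard word overlap stands in for a real similarity model.

```python
import re

# Hypothetical FAQ entries; a real deployment would load domain data instead.
FAQ = {
    "what are your check in timings": "(predefined answer about check-in timings)",
    "how do i book a room": "(predefined answer about booking)",
}

def tokens(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def generative_fallback(query):
    """Placeholder for a generative model; here it only returns a stock reply."""
    return "Sorry, I am not sure about that. Could you rephrase?"

def reply(query, threshold=0.5):
    """Answer from the FAQ when a question matches well enough, else fall back."""
    q = tokens(query)
    best_question = max(FAQ, key=lambda question: jaccard(q, tokens(question)))
    if jaccard(q, tokens(best_question)) >= threshold:
        return FAQ[best_question]
    return generative_fallback(query)

print(reply("How do I book a room?"))
print(reply("Tell me a joke"))
```

The first query matches a stored question and gets its predefined answer; the second scores below the threshold and is handed to the fallback.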

Knowledge Graph

A knowledge graph (KG), as the name suggests, is graph-structured data: entities are the nodes, and each relationship between two entities is an edge.

Example of a KG

triple: (Leonard Nimoy, played, Spock), (Spock, character in, Star Trek)
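The triples above can be stored and queried with very little code. The following is a minimal sketch, assuming an in-memory adjacency list; the helper names `build_graph` and `objects` are illustrative and do not come from the project.

```python
from collections import defaultdict

# Triples taken from the example above: (subject, relation, object).
TRIPLES = [
    ("Leonard Nimoy", "played", "Spock"),
    ("Spock", "character in", "Star Trek"),
]

def build_graph(triples):
    """Index triples as subject -> list of (relation, object) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph

def objects(graph, subject, relation):
    """Return every object linked to `subject` by `relation`."""
    return [obj for rel, obj in graph.get(subject, []) if rel == relation]

graph = build_graph(TRIPLES)

# Two-hop question: which show features the character Leonard Nimoy played?
shows = [
    show
    for character in objects(graph, "Leonard Nimoy", "played")
    for show in objects(graph, character, "character in")
]
print(shows)  # ['Star Trek']
```

Following two edges in a row is what lets a chatbot answer a question that no single triple states directly.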

A KG can be constructed from unstructured text using NLP methods such as Named Entity Recognition (NER), keyword extraction, and sentence segmentation.

KGs are widely used in NLP-based systems such as intelligent chatbots, cognitive search systems, and QA applications. The Google Knowledge Graph is the knowledge base that Google uses to enhance its search algorithm, which helps explain why Google Assistant seems so intelligent.

Bag of Words

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

In this approach, we look at the histogram of the words within the text, i.e., each word's count is treated as a feature.

Bag of Words
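As a rough illustration of the word-count features described above, the sketch below builds a vocabulary from two made-up sentences and turns each sentence into a count vector. The simple regex tokenizer and the helper names are assumptions chosen for the example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase word token occurs in `text`."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def vectorize(text, vocabulary):
    """Turn `text` into a count vector ordered by `vocabulary`."""
    counts = bag_of_words(text)
    return [counts[word] for word in vocabulary]

docs = ["the cat sat on the mat", "the dog sat"]

# The vocabulary is the sorted set of all words seen in any document.
vocabulary = sorted(set().union(*(bag_of_words(d) for d in docs)))

print(vocabulary)                      # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectorize(docs[0], vocabulary))  # [1, 0, 1, 1, 1, 2]
print(vectorize(docs[1], vocabulary))  # [0, 1, 0, 0, 1, 1]
```

Word order is discarded: only the counts survive, which is why the model is called a "bag".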

Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF is a numerical statistic used in information retrieval that reflects how important a word is to a document within a collection or corpus.

Formula

TF-IDF Formula
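TF-IDF has several variants, and the formula image is not reproduced here, so the sketch below assumes one common definition: term frequency is a term's count divided by the document length, and inverse document frequency is log(N / df), where N is the number of documents and df is the number of documents containing the term. The toy corpus is made up for illustration.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document in `corpus`."""
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the dog barked".split(),
]
scores = tf_idf(corpus)

# Show the terms of the first document from most to least important.
for term, weight in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

With this variant, "the" appears in every document and so receives a weight of zero, while "cat", which occurs in only one document, outranks "sat", which occurs in two.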

Cosine Similarity

Cosine similarity is a metric that measures how similar two documents are irrespective of their size. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space.

Formula

Cosine Similarity Formula

Example

Cosine Similarity Example
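The definition translates directly into code: divide the dot product of the two vectors by the product of their lengths. This is a minimal sketch with made-up count vectors; returning 0.0 for an all-zero vector is a convention chosen here, not part of the definition.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same direction, ten times the counts
other_doc = [0, 0, 3]    # shares no terms with the others

print(cosine_similarity(short_doc, long_doc))   # approximately 1.0
print(cosine_similarity(short_doc, other_doc))  # 0.0
```

The first pair scores approximately 1.0 even though one vector is ten times longer, which is the "irrespective of size" property described above.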

Coreference Resolution (coref)

In linguistics, coreference (sometimes written co-reference) occurs when two or more expressions in a text refer to the same person or thing, i.e., they have the same referent. For example, in "Bill said he would come", the proper noun "Bill" and the pronoun "he" refer to the same person, namely Bill.

Coreference Resolution Example

Word Embedding (Word2Vec, GloVe)

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from the vocabulary are mapped to vectors of real numbers.

Word Embedding Example

SentenceBERT

Sentence-BERT (SBERT), presented in Reimers & Gurevych (2019), adapts the BERT architecture with siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.

The SBERT architecture in training on a classification objective (left) and inference (right)
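At inference time, SBERT pools BERT's per-token outputs into one fixed-size vector per sentence (mean pooling is the paper's default) and compares sentence vectors with cosine similarity. The sketch below mimics only that last step: the 3-dimensional token vectors are made-up stand-ins for real BERT outputs, and no model is loaded.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors element-wise into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Made-up token vectors standing in for BERT outputs (one row per token).
sentence_a = [[1.0, 0.0, 1.0], [3.0, 2.0, 1.0]]
sentence_b = [[2.0, 1.0, 1.0]]
sentence_c = [[0.0, -1.0, 0.0], [0.0, -3.0, 2.0]]

emb_a = mean_pool(sentence_a)
emb_b = mean_pool(sentence_b)
emb_c = mean_pool(sentence_c)

print(emb_a)                 # [2.0, 1.0, 1.0]
print(cosine(emb_a, emb_b))  # close to 1.0
print(cosine(emb_a, emb_c))  # negative
```

Pooling is what makes sentences with different token counts comparable: each one ends up as a vector of the same size.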

References