Utshav-paudel/300DaysOFMachineLearning-DeepLearning-LLM


300 DAYS OF MACHINE LEARNING, DEEP LEARNING AND LARGE LANGUAGE MODELS

machine learning image

Books and Resources Status of Completion
1. Machine Learning Specialization
2. Hands-On Machine Learning with Scikit-Learn and TensorFlow
3. Intro to DeepLearning
4. Deep Learning Specialization
5. LLM from Scratch
6. Hugging Face NLP course
7. LLM course 🏊
8. Efficiently Serving LLMs 🏊
9. Langchain docs 🏊
Projects Completed
1. Medical Insurance Price Prediction
2. Iris Flower Classification
3. California Housing Price Prediction
4. Collaborative filtering: Book Recommender Webapp
5. CNN: Bird Species Classification
6. CNN Transfer Learning: Messy-or-CleanRoom-Detection
7. Data Augmentation
8. YOLO From Scratch
9. U-NET From Scratch
10. LLM From Scratch
11. Shakespeare text generation
12. Neural Style Transfer Webapp
13. Langchain: Petname Generator
14. Langchain: YouTube Assistant
15. Mistarl7B-Question-Answer-on-your-data
16. Company_recommender_LLM
17. Local LLM: DocBot
18. Fine Tuning Mistral7B on Google Colab
19. Multiclass Image Classification: Inception V3

Topics Learned in each day

Days Topics Covered Resources
Day1 Supervised learning, regression, classification Machine Learning Specialization
Day2 Univariate Linear regression, Cost function Machine Learning Specialization
Day3 Gradient descent Machine Learning Specialization
Day4 Learning rate Machine Learning Specialization
Day5 Multiple linear regression, Vectorization Machine Learning Specialization
Day6 Feature scaling, Choosing correct learning rate Machine Learning Specialization
Day7 Feature engineering, Polynomial regression Machine Learning Specialization
Day8 Classification, Logistic regression Machine Learning Specialization
Day9 Sigmoid function, Decision boundary Machine Learning Specialization
Day10 Gradient descent in Logistic regression, Cost function in Logistic regression Machine Learning Specialization
Day11 Gradient descent in logistic regression Implementation Machine Learning Specialization
Day12 Underfitting, Overfitting, Addressing overfitting, Plotting overfitting, Regularization implementation Machine Learning Specialization
Day13 Neural Network Introduction , Why neural network? Machine Learning Specialization
Day14 Neural Network notation, forward propagation, Neuron Layer implementation Machine Learning Specialization
Day15 Neural network implementation for digit classification, Classification of AI Machine Learning Specialization
Day16 Vectorization in Neural Network , Neural network of Handwritten Binary Digit Classification Machine Learning Specialization
Day17 Model Training Steps, Activation Function , Implementation of ReLU Machine Learning Specialization
Day18 Multi Class classification, soft max regression, cost for softmax regression Machine Learning Specialization
Day19 Improved Implementation of softmax/logistic regression in neural network,multilabel classification, Advanced optimization, Additional layer types Machine Learning Specialization
Day20 Backpropagation, Implementation of Backpropagation, Debugging a learning algorithm, Model selection and Machine learning diagnostic Machine Learning Specialization
Day21 Bias/Variance, choosing regularization parameter Machine Learning Specialization
Day22 Diagnosing Bias and Variance, Labs on Diagnosing Bias and Variance, Choosing regularization parameter Machine Learning Specialization
Day23 Iterative loop of ML Development, Error analysis, Transfer Learning Machine Learning Specialization
Day24 Full cycle of Machine learning projects, Precision and Recall, Trading off precision and recall, Lab on Full Machine Learning Cycle Machine Learning Specialization
Day25 Decision Tree, Decision Tree Learning Machine Learning Specialization
Day26 Measuring Impurity, Information Gain, Decision Tree Learning, Recursive Splitting Machine Learning Specialization
Day27 One hot encoding, Splitting for continuous variable, Regression Tree Machine Learning Specialization
Day28 Tree ensemble , Random Forest Algorithm, XG boost, when to use decision tree Machine Learning Specialization
Day29 Unsupervised Learning, K means clustering Algorithm, cost function for k means clustering,Labs on K means Clustering Machine Learning Specialization
Day30 Anomaly detection, Anomaly detection vs supervised Learning use case Machine Learning Specialization
Day31 Recommender system, Content based Recommendation, Collaborative Filtering Recommender Systems Machine Learning Specialization
Day32 Normalization, Limitation of Collaborative Filtering, Lab Collaborative filtering recommender system, Content based recommendation for large items Machine Learning Specialization
Day33 Tensorflow implementation of Collaborative Filtering, Dimensionality Reduction, PCA Machine Learning Specialization
Day34 Step by step calculation of PCA, Implementation of PCA Machine Learning Specialization
Day35 Reinforcement Learning, Markov Decision Process Machine Learning Specialization
Day36 State Action Value Function, Bellman Function, Random stochastic environment,State Action value function Implementation Machine Learning Specialization
Day37 Discrete State and Continuous State, Refinement of reinforcement learning by minibatches and soft updates Machine Learning Specialization
Day38 Building a Book Recommender System using Collaborative Filtering Machine Learning Specialization
Day39 California Housing Price Prediction : Batch learning vs online learning, Cost for Linear regression (RMSE and MAE) use case, fetching and loading of data with EDA Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day40 California Housing Price Prediction (Continued) : Created test data; split the data with train-test split and with a stratified split so that the test set keeps the same category proportions as the full dataset. Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day41 California Housing Price Prediction (Continued) : data visualization, EDA Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day42 California Housing Price Prediction (Continued) : feature engineering, using simple imputer, handling categorical data by encoding Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day43 California Housing Price Prediction (Continued) : Feature Scaling and Bucket Binning Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day44 California Housing Price Prediction (Continued) : Data preprocessing pipeline development Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day45 California Housing Price Prediction (Continued) : Selection, training and evaluation of model Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day46 Binary Classification, measuring accuracy using Confusion matrix and ROC curve Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day47 Multiclass classification, Multilabel classification, Multioutput classification, Classification Implementation Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day48 Linear Regression, Gradient descent, Stochastic Gradient descent and SGD regressor, Implementation Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day49 Polynomial Regression, Learning curve, overfitting , underfitting and its solution Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day50 Ridge Regression and its Implementation with SGD Regressor Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day51 Lasso Regression, elastic net regression and early stopping Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day52 Revision on logistic regression and softmax regression, logloss, Implementing logistic regression Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day53 SVM, kernel function and kernel trick Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day54 Polynomial kernel and RBF kernel Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day55 Support Vector Machine and its classes Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day56 Decision Tree and regularization in decision tree and its implementation Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day57 Decision Tree for regression , hyperparameter tuning and its implementation. Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day58 Ensemble Learning and Voting classifier Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day59 Bootstrap Aggregation Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day60 Random patches and random subspaces Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day61 Random Forest Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day62 Boosting Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day63 Histogram-based gradient boosting and stacking Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day64 Dimensionality Reduction and PCA Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day65 Locally linear embedding and K-means Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day66 Supervised Learning in Neural Network Deep Learning Specialization
Day67 Image Classifier using sequential API Hands-On Machine Learning with Scikit-Learn and TensorFlow
Day68 Vectorization and Broadcasting in python Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day1

1. Supervised learning

Learns from being given the right answers.
Supervised machine learning learns from labeled data. The model is first trained on examples that contain both the input and the correct output, and is later given test data to make predictions on. Some algorithms used in supervised learning, with their uses, are:

  • Regression : House price prediction.
  • Classification : Breast cancer detection.

2. Unsupervised learning

Learns by finding patterns in unlabelled data. Unlike supervised learning, unsupervised learning is not given labelled data; the algorithm works by finding structure in the data on its own.
Some algorithms used in unsupervised learning, with their uses, are:

  • Clustering: Grouping similar data points together, e.g. grouping customers, grouping news articles, DNA microarray analysis.
  • Anomaly detection: Finding unusual data points, e.g. fraud detection, quality checks.
  • Dimensionality reduction: Compressing data using fewer numbers, e.g. image processing.
  • 📚Resources
    course:Machine Learning Specialization

Day2

Univariate Linear regression

Univariate linear regression has one dependent variable and one independent variable. Using the independent variable, also known as the input or feature, we predict the output. First we fit the model on a training set, and later we use the trained model to predict the output for new inputs.

Cost function

A cost function measures how well a machine learning model performs by quantifying the difference between predicted and actual outputs.
The lower the value of the cost function, the better the model fits the data.
cost function
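
As a minimal sketch of the idea, the squared-error cost for univariate linear regression can be computed as below. The 1/(2m) scaling, the function name `compute_cost` and the toy data are assumptions made for illustration; they are not taken from these notes.

```python
def compute_cost(x, y, w, b):
    """Squared-error cost J(w, b) = (1 / (2m)) * sum((w*x_i + b - y_i)^2)."""
    m = len(x)
    total = 0.0
    for x_i, y_i in zip(x, y):
        prediction = w * x_i + b          # model output for one example
        total += (prediction - y_i) ** 2  # squared difference from the target
    return total / (2 * m)


# Toy data generated by y = 2x (illustrative only)
x_train = [1.0, 2.0, 3.0]
y_train = [2.0, 4.0, 6.0]

perfect_cost = compute_cost(x_train, y_train, w=2.0, b=0.0)  # model matches the data
poor_cost = compute_cost(x_train, y_train, w=0.0, b=0.0)     # model always predicts 0
```

The parameters that match the data give the lower cost, which is what "lower is better" means in practice.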

Day3

Gradient descent

gradient descent img
Gradient descent is an algorithm for finding the values of parameters w and b that minimize the cost function J. The update rule is shown in the image below.
gradient descent equation img
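
A rough sketch of gradient descent for univariate linear regression with a squared-error cost is shown below. The function name `gradient_descent`, the default `alpha` and `iterations`, and the toy data are all assumptions chosen for illustration.

```python
def gradient_descent(x, y, alpha=0.1, iterations=2000):
    """Fit y ~ w*x + b by repeatedly stepping against the gradient of the cost."""
    m = len(x)
    w, b = 0.0, 0.0
    for _ in range(iterations):
        # Partial derivatives of J(w, b) = (1 / (2m)) * sum((w*x + b - y)^2)
        dj_dw = sum((w * x_i + b - y_i) * x_i for x_i, y_i in zip(x, y)) / m
        dj_db = sum((w * x_i + b - y_i) for x_i, y_i in zip(x, y)) / m
        # Update both parameters simultaneously
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b


# Toy data generated by y = 2x + 1 (illustrative only)
w_fit, b_fit = gradient_descent([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
```

On this toy data the fitted parameters should end up close to w = 2 and b = 1.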

Day4

Learning rate

The learning rate alpha in gradient descent must be chosen carefully.

  • If the learning rate is too small, gradient descent may be very slow and take a long time.

  • If the learning rate is too large, gradient descent may overshoot and never reach the minimum, i.e. it fails to converge or even diverges.
    learning rate

  • 📚Resources
    course:Machine Learning Specialization
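
The effect of the learning rate described above can be seen on a toy cost J(w) = w^2, whose minimum is at w = 0. This is only an illustrative sketch; the function name `run_gradient_descent` and the three alpha values are assumptions.

```python
def run_gradient_descent(alpha, steps=20, w_start=1.0):
    """Run a few gradient descent steps on J(w) = w^2 and return the final w."""
    w = w_start
    for _ in range(steps):
        gradient = 2 * w          # dJ/dw for J(w) = w^2
        w = w - alpha * gradient
    return w


w_small = run_gradient_descent(alpha=0.001)  # too small: barely moves from 1.0
w_good = run_gradient_descent(alpha=0.1)     # reasonable: ends close to 0
w_large = run_gradient_descent(alpha=1.1)    # too large: overshoots and diverges
```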

Day5

Multiple linear regression

Multiple linear regression is a machine learning model that uses multiple variables, called features, to predict the output.
multiple linear regression model

Vectorization

In multiple linear regression the calculation is done using vectorization, which performs the arithmetic operations in parallel and speeds them up.
vectorization img
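
A small sketch comparing a loop with a vectorized prediction, assuming NumPy is available. The weights, bias and feature values are made up for illustration.

```python
import numpy as np

# Hypothetical parameters and one example with three features
w = np.array([1.0, 2.5, -3.3])
b = 4.0
x = np.array([10.0, 20.0, 30.0])

# Loop version: one multiplication and addition at a time
f_loop = b
for j in range(len(w)):
    f_loop += w[j] * x[j]

# Vectorized version: a single dot product
f_vectorized = np.dot(w, x) + b
```

Both versions compute the same prediction; the vectorized one hands the whole operation to NumPy, which is typically much faster on large vectors.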

Day6

Feature scaling

When features have very different ranges (some very large, some very small), gradient descent can take a long time to converge, so the features are rescaled to a similar range. This is called feature scaling. Some popular feature scaling techniques are:

  • mean normalization
  • Z score normalization
    feature scaling type
    Feature scaling visual representation
    feature scaling representation
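
The two techniques listed above can be sketched as follows. The helper names and the sample values are hypothetical, and the z-score version here uses the population standard deviation, which is an assumption.

```python
import statistics


def mean_normalize(values):
    """Mean normalization: (x - mean) / (max - min)."""
    mean = statistics.fmean(values)
    value_range = max(values) - min(values)
    return [(v - mean) / value_range for v in values]


def z_score_normalize(values):
    """Z-score normalization: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / std for v in values]


house_sizes = [300.0, 1000.0, 2000.0]  # illustrative feature values
mean_scaled = mean_normalize(house_sizes)
z_scaled = z_score_normalize(house_sizes)
```

After mean normalization the values are centred on 0 and span a range of 1; after z-score normalization they have mean 0 and standard deviation 1.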

Choosing correct learning rate

First we make sure gradient descent is working by checking on the learning curve that the cost decreases with every iteration. Then we choose the learning rate by starting with a small value and increasing it gradually.

Day7

Feature engineering

Feature engineering means designing new features by transforming or combining the original features, which can be very important in predicting the output.
For example, suppose we have to predict the price of a swimming pool and we have its length, breadth and height as features. We can use feature engineering to create a new feature, volume, which is very important in predicting the price of the pool.

Polynomial regression

Polynomial regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The polynomial regression equation is given below:
y = b0 + b1*x1 + b2*x1^2 + b3*x1^3 + ... + bn*x1^n
It is used for non-linear datasets.
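
As a quick illustration, a degree-2 polynomial can be fitted with NumPy. Note that `np.polyfit` solves a least-squares problem directly rather than running gradient descent; it is used here only as a convenient sketch, and the data is synthetic.

```python
import numpy as np

# Synthetic non-linear data generated by y = 1 + x^2 (illustrative only)
x = np.arange(0, 10, dtype=float)
y = 1.0 + x ** 2

# Fit a degree-2 polynomial; coefficients come back highest power first
coefficients = np.polyfit(x, y, deg=2)

# Use the fitted polynomial to predict at a new input
prediction = np.polyval(coefficients, 12.0)
```

A straight line could not follow this data, while the fitted quadratic recovers the curve that generated it.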

Day8

Classification

Classification is a type of supervised learning in machine learning, where the goal is to predict the class label of an input data point. For example, we may want to classify emails as spam or not spam, or classify images as cats or dogs.

Logistic regression

Logistic regression is a type of algorithm used for classification problems. It works by estimating the probability of an input data point belonging to a particular class. For example, it may estimate the probability that an email is spam or not spam, or the probability that an image is a cat or a dog.

To estimate these probabilities, logistic regression uses a mathematical function called the logistic function, which maps the input data to the probability space. The logistic regression algorithm then learns the relationships between the input features and the target class by adjusting weights, or coefficients, assigned to each input feature. These weights are adjusted to maximize the probability of the correct classification.

In the end, logistic regression outputs the predicted class for each input data point, based on the estimated probabilities. This can be useful for a wide range of classification tasks, from predicting diseases to detecting fraud.
logistic reg img

Day9

Sigmoid function

The sigmoid function is a mathematical function that maps any input value to a value between 0 and 1. It is commonly used in logistic regression to model the probability of a binary outcome. The sigmoid function has an S-shaped curve and is defined as follows:

σ(z) = 1 / (1 + e^(-z))

where z is the input value to the function. The output of the sigmoid function, σ(z), is a value between 0 and 1, with a midpoint at z=0.

The sigmoid function has several important properties that make it useful in logistic regression. First, it is always positive and ranges between 0 and 1, which makes it suitable for modeling probabilities. Second, it is differentiable, which means that it can be used in optimization algorithms such as gradient descent. Finally, it has a simple derivative that can be expressed in terms of the function itself:

d/dz σ(z) = σ(z) * (1 - σ(z))

This derivative is used in logistic regression to update the model coefficients during the optimization process.
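
The two formulas above translate directly into code. This is a minimal sketch using only the standard library; the function names are chosen for illustration.

```python
import math


def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """d/dz sigma(z) = sigma(z) * (1 - sigma(z))"""
    s = sigmoid(z)
    return s * (1.0 - s)


midpoint = sigmoid(0.0)                # value at the midpoint z = 0
slope_at_midpoint = sigmoid_derivative(0.0)
```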

Decision boundary

The decision boundary is the line that separates the region where y=0 from the region where y=1. It is created by our hypothesis function. In logistic regression, the decision boundary is the line (or hyperplane in higher dimensions) that separates the different classes of the target variable. The decision boundary is determined by the logistic regression model, which uses the input variables to predict the probability of belonging to a certain class.
decision boundary image

Day10

Gradient descent in Logistic regression

In logistic regression the prediction Ŷ is a nonlinear function of the inputs (Ŷ = 1 / (1 + e^(-z))). If we plug it into the MSE cost used for linear regression, the resulting cost function is non-convex, as shown:
loss function image

  • With a non-convex cost, gradient descent can get stuck in a local minimum instead of reaching the global minimum.

  • Another reason is that in classification problems the target values are 0 or 1, so (Ŷ - Y)^2 always lies between 0 and 1. This makes it hard to keep track of the errors and requires storing high-precision floating-point numbers.

The cost function used in Logistic Regression is Log Loss.
log loss image
Cost function for logistic regression
cost function image
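
A minimal sketch of the log loss, averaged over all examples, is given below. The function name `log_loss` and the example probabilities are assumptions for illustration; the code assumes every predicted probability is strictly between 0 and 1.

```python
import math


def log_loss(y_true, y_prob):
    """Average of -y*log(p) - (1 - y)*log(1 - p) over all examples."""
    m = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / m


labels = [1, 0]
confident_and_right = log_loss(labels, [0.9, 0.1])  # small loss
confident_and_wrong = log_loss(labels, [0.1, 0.9])  # large loss
```

Confident wrong predictions are penalized far more heavily than confident correct ones.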

Day11

Gradient Descent in logistic regression

Gradient descent in logistic regression is an iterative optimisation algorithm used to find the minimum of the cost function. It works by tweaking parameters w and b iteratively, taking steps proportional to the negative of the gradient at the current point.
The update rule looks the same as in linear regression, but the model function is different: the prediction is the sigmoid of w*x + b rather than w*x + b itself.
gradient descent in logistic regression
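
A rough sketch of gradient descent for logistic regression with a single feature follows. The function name `fit_logistic`, the hyperparameters and the toy "hours studied vs. passed" data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, alpha=0.1, iterations=5000):
    """Minimize the log loss for the model f(x) = sigmoid(w*x + b)."""
    m = len(x)
    w, b = 0.0, 0.0
    for _ in range(iterations):
        # Same update form as linear regression, but f(x) is a sigmoid
        errors = [sigmoid(w * x_i + b) - y_i for x_i, y_i in zip(x, y)]
        dj_dw = sum(e * x_i for e, x_i in zip(errors, x)) / m
        dj_db = sum(errors) / m
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b


# Toy data: hours studied and whether the student passed (illustrative only)
hours = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 1, 1]
w_fit, b_fit = fit_logistic(hours, passed)

prob_low = sigmoid(w_fit * 1.0 + b_fit)   # estimated probability for 1 hour
prob_high = sigmoid(w_fit * 3.5 + b_fit)  # estimated probability for 3.5 hours
```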

Day12

Underfitting

It is a situation in which the model does not fit the training set well. It happens when the model has high bias.

Overfitting

It is a situation in which the model fits the training set extremely well but fails to generalize to new examples. It is also known as high variance.

Addressing overfitting

  • Collect more training examples
  • Select which features to include or exclude
  • Reduce the size of the parameters, i.e. "Regularization".
  • overfitting

Regularization

Regularization is a technique that shrinks the parameters to prevent overfitting. It adds a penalty term controlled by lambda: a value that is too large results in underfitting, and a value that is too small results in overfitting.
regularization term
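
As a sketch, a squared-error cost with a regularization penalty on w could look like this. The (lambda / (2m)) * w^2 form and the name `regularized_cost` are assumptions for illustration; the bias b is left unpenalized here.

```python
def regularized_cost(x, y, w, b, lambda_):
    """Squared-error cost plus a penalty that grows with the size of w."""
    m = len(x)
    squared_error = sum((w * x_i + b - y_i) ** 2 for x_i, y_i in zip(x, y)) / (2 * m)
    penalty = (lambda_ / (2 * m)) * w ** 2  # larger lambda punishes large w more
    return squared_error + penalty


x_train = [1.0, 2.0]
y_train = [2.0, 4.0]

cost_without_penalty = regularized_cost(x_train, y_train, w=2.0, b=0.0, lambda_=0.0)
cost_with_penalty = regularized_cost(x_train, y_train, w=2.0, b=0.0, lambda_=1.0)
```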

Day13

Neural network

A neural network is a computer algorithm that tries to mimic the brain. It is made of an input layer that takes the input data, hidden layers that do the computation, and an output layer that produces the output.
neural network image
Why neural network ?
Neural networks are useful because they can perform better than traditional algorithms such as linear regression and logistic regression: a network combines many simple units across its layers, which lets it make better predictions.
why neural network

Day14

Neural network notation

In neural network notation:

  • The neuron is represented by a subscript.
  • The neural network layer is represented by a superscript.

Forward propagation in neural network

Forward propagation refers to the calculation and storage of intermediate values as the input data is fed in the forward direction through the network to generate an output. The hidden layers accept the data from the input layer, process it using their activation function, and pass it to the successive layers or the output layer. Data flows only in the forward direction, because a circular flow of data would never generate an output. A network configured this way is known as a feed-forward network.
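
A minimal sketch of forward propagation through two dense layers, assuming NumPy and sigmoid activations. The layer sizes, random weights and input values are made up for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def dense(a_in, W, b):
    """One layer: weighted sum of the inputs followed by the activation."""
    return sigmoid(a_in @ W + b)


def forward(x, layers):
    """Pass the input through each layer in order, front to back."""
    a = x
    for W, b in layers:
        a = dense(a, W, b)
    return a


rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(2, 3)), np.zeros(3)),  # hidden layer: 2 inputs -> 3 units
    (rng.normal(size=(3, 1)), np.zeros(1)),  # output layer: 3 units -> 1 output
]
x = np.array([0.5, -1.2])
output = forward(x, layers)
```

Each layer only consumes the activations of the layer before it, which is what makes the network feed-forward.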

Day15

Neural network implementation in tensorflow

A neural network can be implemented easily in TensorFlow.

AGI

AI is mainly classified into two types: ANI (artificial narrow intelligence) and AGI (artificial general intelligence).

  • AGI: An AGI is a hypothetical intelligent agent that can learn to accomplish any intellectual task that human beings or other animals can perform. It is also defined as an autonomous system that surpasses human capabilities in the majority of economically valuable tasks.

  • 📚Resources
    course:Machine Learning Specialization

Day16

Vectorization in neural network

In a neural network, vectorization helps to perform calculations simultaneously and saves a lot of time. It can be implemented as:
vectorization in neural network

Neural network implementation in code

neural network representation
code

Day17

Model Training steps

Model training is simplified in 3 steps :

  1. Specify how to compute the output given input x and parameters w, b (define the model)
  • linear regression (y = w*x + b)
  • logistic regression (y = 1/(1 + np.exp(-z)))
  2. Specify loss and cost
  • Mean squared error
  • Logistic loss (BinaryCrossentropy): loss = -y*np.log(f_x) - (1-y)*np.log(1-f_x)
    Note: Cost is the average of the loss over all training examples.
  3. Train on data to minimize cost (gradient descent)
    w = w - alpha * dj_w
    b = b - alpha * dj_b

Activation function

There are different activation functions for different purposes. Some of the most commonly used are:

  • Linear activation function (activation='linear')
    Used for regression problems where the result can be negative or positive.
  • ReLU (activation='relu')
    Used for regression problems where the result must never be negative; it is also faster to compute than the sigmoid function.
  • Sigmoid function (activation='sigmoid')
    Used for classification problems where the result must be on/off; it is slower than ReLU.
    NOTE: For hidden layers we choose ReLU as the activation, and for the output layer we choose the activation according to our problem. With sigmoid in the hidden layers the neural network becomes very slow, so ReLU is the better choice there.
    activation function
  • ReLU implementation
  • 📚Resources
    course:Machine Learning Specialization
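
The three activation functions listed above can be sketched in plain Python. These are illustrative scalar definitions, not the TensorFlow implementations.

```python
import math


def linear(z):
    """Linear activation: passes the input through unchanged."""
    return z


def relu(z):
    """ReLU: zero for negative inputs, the input itself otherwise."""
    return max(0.0, z)


def sigmoid(z):
    """Sigmoid: squashes any input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


inputs = [-3.0, 0.0, 2.5]
relu_outputs = [relu(z) for z in inputs]
linear_outputs = [linear(z) for z in inputs]
```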

Day18

Multiclass classification

Target y can take on more than two possible values. In this case of multiclass classification we use Softmax regression.

Softmax regression

Softmax regression is the generalization of logistic regression to multiple classes.
Its output is calculated as:
Softmax regression

Cost for softmax regression

The cost for softmax regression is also known as cross-entropy loss. It is obtained as:
Cost for softmax regression
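
A minimal sketch of softmax and its cross-entropy loss for a single example is given below. The function names and the example scores are assumptions for illustration.

```python
import math


def softmax(z):
    """Turn a list of scores into probabilities that sum to 1."""
    exps = [math.exp(z_j) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(probabilities, true_class):
    """Loss is -log of the probability assigned to the true class."""
    return -math.log(probabilities[true_class])


probabilities = softmax([1.0, 2.0, 3.0])
loss_if_class_2 = cross_entropy_loss(probabilities, true_class=2)  # most likely class
loss_if_class_0 = cross_entropy_loss(probabilities, true_class=0)  # least likely class
```

The loss is small when the true class receives a high probability and large when it receives a low one.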

Day19

Improved Implementation of softmax/logistic regression in neural network

The straightforward implementation of softmax suffers from numerical round-off error. For a more numerically accurate implementation of softmax regression, we use a linear activation in the output layer and pass from_logits=True to the loss in model.compile().
You can get more insight by looking at image below:
Numerical accurate implementation of softmax
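
The idea behind working from logits can be sketched in plain Python: instead of computing the softmax probabilities and then taking a log, the loss is computed directly from the raw scores using the log-sum-exp trick. This is an illustrative sketch only, not TensorFlow's actual implementation; the function name is hypothetical.

```python
import math


def cross_entropy_from_logits(logits, true_class):
    """Compute -log(softmax(logits)[true_class]) without forming the softmax."""
    largest = max(logits)
    # log(sum(exp(z))) computed stably by factoring out the largest logit
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[true_class]


loss_small = cross_entropy_from_logits([2.0, 1.0, 0.1], true_class=0)

# With logits this large, math.exp(1000.0) would raise OverflowError in a
# naive softmax; the shifted version handles them without trouble.
loss_large = cross_entropy_from_logits([1000.0, 0.0, -1000.0], true_class=0)
```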

MultiLabel classification

Multilabel classification is a type of classification problem in machine learning where each instance can be assigned to multiple classes or labels simultaneously. In other words, instead of predicting a single class for an instance, the goal is to predict a set of labels that are applicable to that instance.
Here is the difference between multiclass and multilabel classification:
multi label classification

Advanced optimization

The Adam algorithm is used for advanced optimization in neural networks.

  • If the learning rate is too small, Adam increases it automatically.
  • If the learning rate is too large, Adam decreases it automatically.
    Note: Adam stands for Adaptive Moment Estimation. It is used as:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

Additional Layer types:

Some of layer types of neural network are :

  • Dense Layer (Fully Connected Layer): A dense layer is a basic layer where each neuron is connected to every neuron in the previous layer. It is characterized by its weight matrix, bias vector, and activation function. Dense layers are commonly used in feedforward neural networks and can learn complex patterns and relationships in the data.

  • Convolutional Layer: Convolutional layers are primarily used in convolutional neural networks (CNNs) for analyzing grid-like data, such as images. These layers perform convolutions, applying filters to the input data, and capturing local patterns and features. Convolutional layers are effective in image recognition, object detection, and other computer vision tasks.
    Convolutional layers are faster to compute and need less training data than dense layers.

  • 📚Resources
    course:Machine Learning Specialization

Day20

Back Propagation

Backpropagation, or backward propagation of errors, is an algorithm used in machine learning to adjust the parameters of a neural network by calculating the gradients of a loss function with respect to the network's weights and biases. It propagates the error from the output layer to the input layer, allowing the network to learn and improve its predictions.

Debugging a learning algorithm

When predictions have a large error, we can debug our learning algorithm as follows:

  • Get more training examples.
  • Try a smaller set of features.
  • Try getting additional features.
  • Try adding polynomial features.
  • Try decreasing or increasing the regularization parameter lambda.

Evaluating a chosen model:

You can evaluate a model by splitting the data into train and test sets and calculating the cost for both the training set and the test set.

Model selection:

The most effective way of selecting a model is by

  • Splitting the data into train / cross validation / test sets.
  • Calculating the error on the cross validation set for each model and selecting the model with the lowest cross validation error.
  • Calculating the error on the test set for the selected model.
  • Model selection using train/cv/test

Machine learning diagnostic

A diagnostic is a test that you can run to gain insight into what is or isn't working with a learning algorithm, and to get guidance on improving its performance. An ML model can be diagnosed by looking at its bias and variance:
A model with high bias or high variance is not doing well.
how to diagnose high bias and variance

Day21

Bias/Variance

  • High bias: When there is a large difference between the baseline performance and the training error, the model has high bias, which indicates underfitting.
  • High variance: When there is a large difference between the training error and the cross validation error, the model has high variance, which indicates overfitting.
  • High bias and variance: When both gaps are large (baseline to training error, and training error to cross validation error), the model has both high bias and high variance.
    image show bias and variance
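
The three cases above can be turned into a small helper. This is only a sketch: the function name `diagnose`, the `gap_threshold` of 0.02 and the example error values are all hypothetical, and what counts as a "large" gap depends on the problem.

```python
def diagnose(baseline_error, train_error, cv_error, gap_threshold=0.02):
    """Label a model from the gaps between baseline, training and CV error."""
    high_bias = (train_error - baseline_error) > gap_threshold
    high_variance = (cv_error - train_error) > gap_threshold
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "ok"


# Illustrative error rates (as fractions)
case_variance = diagnose(baseline_error=0.106, train_error=0.108, cv_error=0.148)
case_bias = diagnose(baseline_error=0.106, train_error=0.150, cv_error=0.155)
case_both = diagnose(baseline_error=0.106, train_error=0.150, cv_error=0.197)
```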

Choosing regularization parameter

To choose a good regularization parameter:

  • Try a range of regularization values, compute the cross validation error for each, and pick the value with the smallest cross validation error.
  • NOTE: The right model has neither high variance nor high bias.
  • 📚Resources
    course:Machine Learning Specialization

Day22

Diagnosing bias and variance

If your algorithm has high bias:

  • Try getting additional features.
  • Try increasing the polynomial degree.
  • Try decreasing the regularization term.

If your algorithm has high variance:

  • Try getting more training examples.
  • Try a smaller set of features.
  • Try increasing the regularization term.

Bias and Variance in neural network:

  • If your neural network has high bias, try increasing the size of the neural network.
  • If your neural network has high variance, try increasing the size of the training set.
    Note: A larger neural network will usually perform well as long as an appropriate regularization term is chosen.
    bias and variance in neural net
  • Diagnosing bias and variance
  • 📚Resources
    course:Machine Learning Specialization

Day23

Iterative loop of ML Development

ML development revolves around the following steps:

  • Choosing the architecture (model, data, etc.)
  • Training the model
  • Diagnostics (bias, variance and error analysis)
    iterative loop img

Error analysis

It is the process of isolating, observing and diagnosing erroneous ML predictions to understand where the model performs well and where it performs poorly.

Adding more data

Adding more data usually helps to make better predictions. Data can be added in the following ways:

  1. Data augmentation: Modifying an existing training example to create a new training example, e.g. data augmentation by adding distortion.
  2. Data synthesis: Using artificial data inputs to create a new training example. It is mostly used for computer vision applications.

Transfer learning

Transfer learning is a machine learning method where a model developed for a task is reused as the starting point for a model on a second task.
It is a popular approach in deep learning, where pre-trained models are used as the starting point for computer vision and natural language processing tasks. This saves the vast compute and time resources required to develop neural network models for these problems, and gives a large jump in performance on related problems.

  • Download neural network parameters pretrained on large dataset with same input type (e.g: images,audio,text) as your application (or train your own).
  • Further train(fine tune) the network on your own data.
  • 📚Resources
    course:Machine Learning Specialization

Day24

Full cycle of machine learning project

A machine learning project is an iterative process, as shown below:
ml lifecycle

Deployment

MLOps focuses on making ML models usable at large scale. Deployment is basically done as follows:
ml deployment

Ethics, bias and fairness of machine learning

While developing a machine learning application we have to guard against bias and harmful uses such as:
1. Deepfakes
2. Generating fake content for commercial and political purposes
3. Biased ML models in loan approval and job selection.

Precision/recall

  • precision: Of all positive predictions, the fraction that are actually positive.
  • recall: Of all actual positive cases, the fraction that are predicted positive.
    precision recall img

Trading off precision and recall

  • When the threshold is higher, precision becomes higher and recall becomes lower.
  • When the threshold is lower, precision becomes lower and recall becomes higher.
    NOTE: To pick a good balance between precision and recall we use the F1 score, which is the harmonic mean of precision and recall.
  • Summary of Advanced learning algorithms
  • 📚Resources
    course:Machine Learning Specialization
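
Precision, recall and the F1 score described above can be computed from true and predicted labels as in this sketch. The function name and the example labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels where 1 is positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here there are 3 true positives, 1 false positive and 1 false negative, so precision, recall and F1 are all 0.75.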

Day25

Decision Tree

A decision tree is a type of supervised machine learning model that categorizes or makes predictions based on how a sequence of questions is answered. Being supervised, the model is trained and tested on data that contains the desired categorization.
decision tree image

Decision tree learning

  1. How to choose which feature to split on at each node?
  • Maximize purity
  2. When to stop splitting?
  • When a node is 100% one class
  • When splitting a node would exceed the maximum depth
  • When improvements in purity score are below a threshold
  • When the number of examples in a node is below a threshold.
  • 📚Resources
    course:Machine Learning Specialization

Day26

Measuring purity

In a decision tree, entropy is the measure of impurity and helps to find the purity of classes. Lower impurity means higher purity.
entropy

Information gain

We can calculate the information gain by subtracting the weighted average entropy of the resulting subsets from the entropy of the original node. The formula for information gain is:
Information Gain = Entropy(node) - Σ((subset_size/total_size) * Entropy(subset))

  • By maximizing information gain, decision trees aim to find the attribute that provides the most useful and informative splits, leading to more accurate classification.
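
Entropy and the information gain formula above can be sketched for binary labels as follows. The function names and the example splits are assumptions for illustration.

```python
import math


def entropy(labels):
    """Entropy of a list of 0/1 labels; 0 when the node is pure."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)  # fraction of positive examples
    if p1 == 0.0 or p1 == 1.0:
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)


def information_gain(parent, left, right):
    """Entropy(node) minus the weighted average entropy of the two subsets."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))


node = [1, 1, 0, 0]
gain_perfect_split = information_gain(node, [1, 1], [0, 0])  # both branches pure
gain_useless_split = information_gain(node, [1, 0], [1, 0])  # branches as mixed as the node
```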

Decision tree learning

  • Start with all examples at the root node
  • Calculate information gain for all possible features, and pick the one with the highest information gain
  • Split dataset according to selected feature, and create left and right branches of the tree
  • Keep repeating the splitting process until a stopping criterion is met:
    1. When a node is 100% one class
    2. When splitting a node will result in the tree exceeding a maximum depth
    3. When information gain from additional splits is less than a threshold
    4. When the number of examples in a node is below a threshold
      dc tree

Recursive splitting

Recursive splitting refers to the iterative process in decision tree construction where a dataset is divided into smaller subsets based on specific conditions. It involves recursively selecting attributes to split on and creating branches that further partition the data until a stopping criterion is met, resulting in a tree-like structure.

Day27

One hot encoding

If a categorical feature can take on k values, creating k binary features (0 or 1 values) from it is called one-hot encoding.
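A minimal sketch of the idea, in plain Python. The ear-shape values are an assumed toy feature, and `one_hot` is a hypothetical helper; in practice a library encoder would normally be used.

```python
def one_hot(values):
    """Turn a list of categorical values into k binary features per example."""
    categories = sorted(set(values))              # the k distinct values
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1                         # exactly one feature is "hot"
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot(["pointy", "floppy", "oval", "pointy"])
print(categories)
print(encoded)
```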

Splitting on a continuous variable

For a continuous variable we try different thresholds, choose the one with the highest information gain, and split on that threshold.

Regression tree

It is a decision tree used to predict continuous variables.

  • To select the best split for a regression tree we compute the variance reduction; the split with the largest variance reduction is considered the best.
  • Note: Variance reduction = variance at the node being split - weighted average variance of the resulting child nodes
  • Regression tree
  • 📚Resources
    course:Machine Learning Specialization
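The two ideas from this day, choosing a threshold for a continuous feature and scoring a regression split by variance reduction, can be combined in one small sketch. The data is made up, and trying midpoints between sorted feature values is one common way to generate candidate thresholds, assumed here for illustration.

```python
def variance(values):
    """Population variance; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """Variance of the node minus the weighted average variance of its children."""
    n = len(parent)
    weighted = len(left) / n * variance(left) + len(right) / n * variance(right)
    return variance(parent) - weighted

def best_threshold(x, y):
    """Try midpoints between sorted unique x values; keep the best split."""
    xs = sorted(set(x))
    best = (None, 0.0)
    for a, b in zip(xs, xs[1:]):
        t = (a + b) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        vr = variance_reduction(y, left, right)
        if vr > best[1]:
            best = (t, vr)
    return best

# Toy data: one continuous feature x and a continuous target y
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]
threshold, reduction = best_threshold(x, y)
print(threshold, round(reduction, 3))
```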

Day28

Tree ensemble

A single decision tree is very sensitive to small changes in the data, so many decision trees are combined to build a more robust system; this is called a tree ensemble. The prediction of a tree ensemble is obtained by a majority vote of its trees.

Random forest algorithm

A random forest algorithm is a machine learning technique that combines the predictions of multiple decision trees to make more accurate and robust predictions. It works by creating an ensemble of decision trees, where each tree is trained on a random subset of the data and uses a random subset of features. The final prediction is then made by averaging or voting the predictions of all the trees in the forest. The random forest algorithm is effective at handling complex datasets, handling missing values, and avoiding overfitting.

  • Two terms explain this algorithm: bootstrapping and aggregation. Their combination is called bagging.
  • Bootstrap : Selecting a subset of training examples by sampling with replacement, so that the same training example can be repeated, is called bootstrapping.
  • Aggregation : Combining the results of the trees in the ensemble, for example by taking the majority vote, is called aggregation.

XGBoost

In boosting, when we pick training examples for the next tree, we make it more likely to pick the examples that the previous trees misclassified, instead of picking all examples with equal probability. It is implemented as :
xg boost img

When to use decision tree

  • Decision trees work well on structured (tabular) data but are not recommended for unstructured data like audio, video and images
  • They are usually faster to train than neural networks
  • Small decision trees may be human interpretable.
  • 📚Resources
    course:Machine Learning Specialization

Day29

Unsupervised learning

Machine learning algorithms that find patterns in unlabelled data.

K-means Clustering Algorithm

K-means clustering is an unsupervised machine learning algorithm used for partitioning a dataset into K distinct non-overlapping clusters. Each data point in the dataset is assigned to the cluster with the nearest mean (centroid). The algorithm aims to minimize the within-cluster variance, also known as the "inertia."

Here's a step-by-step overview of the k-means clustering algorithm:

  • Initialization: Randomly select K data points from the dataset as the initial cluster centroids.

  • Assignment: Assign each data point to the nearest centroid. This is done by calculating the Euclidean distance (or other distance metrics) between each data point and each centroid, and assigning the data point to the cluster with the closest centroid.

  • Update: Recalculate the centroids of each cluster by taking the mean of all the data points assigned to that cluster.

  • Repeat: Repeat the assignment and update steps until convergence or until a maximum number of iterations is reached. Convergence occurs when the centroids no longer move significantly between iterations.

  • Final Clusters: Once convergence is achieved, the algorithm outputs the final cluster assignments, where each data point belongs to one of the K clusters.
    k means cluster
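The steps above can be sketched with NumPy as follows. This is a minimal illustration on six made-up 2-D points, not a production implementation; the function names, the seed and the empty-cluster handling are assumptions.

```python
import numpy as np

def assign(X, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        labels = assign(X, centroids)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid in this sketch)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    labels = assign(X, centroids)
    # Sum of squared distances between each point and its assigned centroid
    distortion = float(np.sum((X - centroids[labels]) ** 2))
    return centroids, labels, distortion

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels, J = kmeans(X, k=2)
print(labels, round(J, 3))
```

The returned `distortion` is the sum of squared distances to the assigned centroids, the quantity that k-means tries to minimize.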

Cost function for k-means clustering(distortion)

The cost function for k-means clustering is commonly referred to as the "inertia" or "within-cluster sum of squares." It measures the sum of squared distances between each data point and its assigned centroid within each cluster. The goal of k-means clustering is to minimize this cost function.

Mathematically, the cost function (J) for k-means clustering can be defined as:

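J = Σⱼ Σ_{xᵢ ∈ Cⱼ} ||xᵢ - μⱼ||²

where Cⱼ is the set of data points assigned to cluster j and μⱼ is the centroid of that cluster, so each point contributes only its distance to its own centroid.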

Day30

Anomaly detection

Anomaly Detection is the technique of identifying rare events or observations which can raise suspicions by being statistically different from the rest of the observations. Such “anomalous” behaviour typically translates to some kind of a problem like a credit card fraud, failing machine in a server, a cyber attack, etc.

  • To apply anomaly detection we model the data with a Gaussian (normal) distribution and flag examples that have a very low probability under it
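A minimal single-feature sketch of this idea: fit a Gaussian to normal examples and flag a new value when its density falls below a threshold. The sensor readings and the value of `epsilon` are invented; in practice the threshold would be tuned on a labelled validation set.

```python
import math

def fit_gaussian(values):
    """Estimate mean and variance from (mostly normal) training examples."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

def gaussian_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def is_anomaly(x, mu, var, epsilon):
    """Flag x when its density under the fitted Gaussian is below epsilon."""
    return gaussian_pdf(x, mu, var) < epsilon

normal_readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]   # toy data
mu, var = fit_gaussian(normal_readings)
epsilon = 0.01                                          # assumed threshold
print(is_anomaly(10.05, mu, var, epsilon), is_anomaly(12.0, mu, var, epsilon))
```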

Anomaly detection vs Supervised learning

Anomaly detection

We use anomaly detection when there is a very small number of positive (anomalous) examples and a large number of negative examples. With so few positive examples it is hard for a model to learn what anomalies look like from the training set, e.g. fraud detection.

Supervised learning

We use supervised learning when a large number of both positive and negative examples is available. Since there are enough positive examples to train on, supervised learning is effective, e.g. email spam detection.

Day31

Recommender system

Recommender systems are a type of machine learning algorithm used to suggest items to users based on their preferences or behavior. These systems are widely used in various applications like e-commerce, movie streaming platforms, music apps, and more.

1.Content based Recommendation

In content-based recommender systems, the algorithm recommends items based on the similarity between the content/features of the items and the user's preferences. The similarity is typically computed using techniques such as cosine similarity or Euclidean distance. Here's an overview of the mathematical steps involved:

  • a. Feature Representation: Each item and user is represented as a feature vector. Let's denote an item's feature vector as x and a user's preference vector as p. These vectors consist of numerical values that represent the attributes or characteristics of items or users.

  • b. Similarity Measure: The similarity between two feature vectors, x and p, can be computed using cosine similarity. The cosine similarity between x and p is defined as:

similarity(x, p) = (x · p) / (||x|| * ||p||)

where (x · p) represents the dot product of vectors x and p, and ||x|| and ||p|| denote their respective Euclidean norms.

  • c. Recommendation: To recommend items to a user, the system calculates the similarity between the user's preference vector and the feature vectors of all items. The system then ranks the items based on their similarity scores and recommends the top-rated or most similar items.
  • 📚Resources
    course:Machine Learning Specialization
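The content-based steps above (feature vectors, cosine similarity, ranking) can be sketched as follows. The item names, feature values and the `recommend` helper are hypothetical and only meant to show the flow.

```python
import math

def cosine_similarity(x, p):
    """(x · p) / (||x|| * ||p||), returning 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(x, p))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_p = math.sqrt(sum(b * b for b in p))
    if norm_x == 0 or norm_p == 0:
        return 0.0
    return dot / (norm_x * norm_p)

def recommend(user_pref, items, top_n=2):
    """Rank items by similarity to the user's preference vector."""
    scored = [(name, cosine_similarity(vec, user_pref)) for name, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy item feature vectors, e.g. (action, romance, comedy)
items = {
    "movie_a": [1.0, 0.0, 0.2],
    "movie_b": [0.0, 1.0, 0.1],
    "movie_c": [0.9, 0.1, 0.0],
}
user_pref = [1.0, 0.0, 0.0]
top = recommend(user_pref, items)
print(top)
```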

Collaborative Filtering Recommender Systems:

Collaborative filtering recommender systems make recommendations based on the preferences or behavior of other similar users or items. Let's explore the two main approaches: user-based and item-based collaborative filtering.

  • a. User-based Collaborative Filtering:

In user-based collaborative filtering, the algorithm finds similar users based on their past interactions or ratings and recommends items that the similar users have liked. Here are the mathematical steps involved:

i. User Similarity: The similarity between two users, u and v, can be computed using techniques such as cosine similarity or Pearson correlation. The similarity score measures the likeness of their past interactions.

ii. Prediction: To predict a user's preference for a particular item, the system combines the ratings of similar users. The predicted rating, denoted as r_hat(u, i), for user u and item i is calculated as a weighted average of the ratings of similar users:

    r_hat(u, i) = ∑ (sim(u, v) * r(v, i)) / ∑ |sim(u, v)|

where sim(u, v) represents the similarity between users u and v, r(v, i) denotes the rating of user v for item i, and both summations run over all similar users v.

iii. Recommendation: The system recommends items with the highest predicted ratings to the active user.

  • b. Item-based Collaborative Filtering:

In item-based collaborative filtering, the algorithm identifies similar items based on the past interactions or ratings of users. It then recommends items that are similar to the ones the user has already liked. Here's a summary of the mathematical steps involved:

i. Item Similarity: The similarity between two items, i and j, can be computed using techniques such as cosine similarity or Pearson correlation. The similarity score measures the likeness of user preferences for the items.

ii. Prediction: To predict a user's preference for a particular item, the system considers the user's past ratings for similar items. The predicted rating, denoted as r_hat(u, i), for user u and item i is calculated as a weighted average of the user's ratings for similar items:

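  r_hat(u, i) = ∑ (sim(i, j) * r(u, j)) / ∑ |sim(i, j)|

where sim(i, j) represents the similarity between items i and j, r(u, j) denotes the rating of user u for item j, and both summations run over all items j similar to i.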
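The weighted-average prediction used in both the user-based and item-based variants can be sketched like this. The neighbour ids, similarity scores and ratings are invented, and normalizing by the sum of absolute similarities is the usual convention assumed here.

```python
def predict_rating(similarities, ratings):
    """Similarity-weighted average of the neighbours' ratings.

    similarities: {neighbour_id: similarity to the target user (or item)}
    ratings:      {neighbour_id: known rating}; neighbours without a rating are skipped
    """
    numerator = 0.0
    denominator = 0.0
    for neighbour, sim in similarities.items():
        if neighbour in ratings:
            numerator += sim * ratings[neighbour]
            denominator += abs(sim)
    return numerator / denominator if denominator else None

# User-based example: similarity of users v1..v3 to user u, and their ratings of item i
sim_to_u = {"v1": 0.9, "v2": 0.5, "v3": 0.1}
ratings_for_item = {"v1": 5.0, "v2": 3.0}        # v3 has not rated the item
r_hat = predict_rating(sim_to_u, ratings_for_item)
print(round(r_hat, 3))
```

For the item-based variant the same function applies with item ids as keys: similarities of other items to item i, and the user's own ratings of those items.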

Day32

Normalization

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information.

Limitation of collaborative filtering

How to

  • rank new items that few users have rated?
  • show something reasonable to new users who have rated few items?

How to use side information about items or users:
  • Item: Genre, movie stars, studio, ...
  • User: Demographics (age, gender, location), expressed preferences, ...

Note: This limitation of collaborative filtering can be addressed by content-based filtering

Content-based recommendation

It uses both item (content) features and user features. A neural network creates a vector for the item and a vector for the user, and their dot product gives the prediction.

Content-based recommendation for a large number of items

When a website or app has a large catalogue to recommend from, such as thousands or millions of items, recommendation is carried out in two steps:

1.Retrieval

From the large catalogue, retrieval selects a list of plausible candidates for further ranking, e.g.:
For movie recommendation:
1.For each of 10 movies watched by the user, retrieve similar movies.
2.For the user's 3 most viewed genres, find the top 10 movies.
3.Find the top 20 movies in the country.
Finally, combine the retrieved items into one list and remove duplicates and items already watched or purchased.

2.Ranking

Apply the model to the retrieved items and display them to the user in ranked order.
Note: Retrieving more items results in better recommendations but takes more time to analyse. Try it offline to find a suitable number of retrieved items that gives relevant recommendations.

Day33

Dimensionality reduction

Dimensionality reduction is a technique used to reduce the number of features in a dataset while retaining as much of the important information as possible. In other words, it is a process of transforming high-dimensional data into a lower-dimensional space that still preserves the essence of the original data

  • Feature selection : Feature selection involves selecting a subset of the original features that are most relevant to the problem at hand. The goal is to reduce the dimensionality of the dataset while retaining the most important features. There are several methods for feature selection, including filter methods, wrapper methods, and embedded methods. Filter methods rank the features based on their relevance to the target variable, wrapper methods use the model performance as the criteria for selecting features, and embedded methods combine feature selection with the model training process.

  • Feature Extraction: Feature extraction involves creating new features by combining or transforming the original features. The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space. There are several methods for feature extraction, including principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE). PCA is a popular technique that projects the original features onto a lower-dimensional space while preserving as much of the variance as possible.

Principal Component analysis(PCA)

Principal component analysis, or PCA, is a statistical procedure that allows you to summarize the information content in large data tables by means of a smaller set of “summary indices” that can be more easily visualized and analyzed.

  • It works on the condition that while the data in a higher dimensional space is mapped to data in a lower dimension space, the variance of the data in the lower dimensional space should be maximum.
  • 📚Resources
    course:Machine Learning Specialization

Day34

Step by step calculation of PCA

PCA is used to reduce higher-dimensional data to lower dimensions without losing its essence. PCA can be calculated in the following steps:

  • Mean-centre the data
  • Find the covariance matrix
  • Find its eigenvalues and eigenvectors
  • The eigenvector with the largest eigenvalue points in the direction of highest variance and is selected first.
  • PCA
    After applying PCA to the handwritten digit data with 784 features, around 250 components were enough to explain nearly 90% of the variance.
    optimal solution of PCA
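The PCA steps listed above can be sketched with NumPy as follows. The 3-feature synthetic data is an assumption chosen so that almost all variance lies along one direction; it is not the 784-feature digit data mentioned above.

```python
import numpy as np

def pca(X, n_components):
    X_centered = X - X.mean(axis=0)                # 1. mean-centre the data
    cov = np.cov(X_centered, rowvar=False)         # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # 3. eigenvalues / eigenvectors
    order = np.argsort(eigvals)[::-1]              # 4. sort by largest eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return X_centered @ components, components, explained_ratio

# Synthetic data: the second feature is roughly twice the first, the third is noise
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + 0.01 * rng.normal(size=200),
    0.01 * rng.normal(size=200),
])
Z, components, ratio = pca(X, n_components=1)
print(Z.shape, ratio)
```

Summing `explained_ratio` over the first components is one way to choose how many components are needed to reach a target such as 90% of the variance.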

2D plot of 784 features

2d plot of 784 features

3D plot of 784 features

3d plot of 784 features

Day35

Reinforcement Learning

Reinforcement learning is a machine learning training method based on rewarding desired behaviors and/or punishing undesired ones. In general, a reinforcement learning agent is able to perceive and interpret its environment, take actions and learn through trial and error. Its reward is calculated as:
reward png

Markov Decision Process(MDP)

It states that the future depends only on the current state, not on how the agent got there.
MDP

Day36

State action value function

The state-action value function is represented by Q(s,a).
Q(s,a) = Return if you

  • start in state s
  • take action a (once)
  • then behave optimally after that
    Note: Behaving optimally means taking, in each state, the action that gives the maximum Q(s,a).
    Implementation of state action value function: State action value function img

Bellman equation

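The Bellman equation explains the return in two parts: the first is the immediate reward, and the second is the discounted return from behaving optimally starting from the next state s'.

Random (stochastic) environment

Due to randomness and uncertainty in the environment, the same action does not always lead to the same next state, which makes reinforcement learning harder. To handle this we calculate the expected return (i.e. the average return) in place of the return.
It is calculated as : Q*(s, a) = E[R(s, a, s') + γ max(Q*(s', a'))] = ∑ P(s'|s, a) [R(s, a, s') + γ max(Q*(s', a'))], where the sum runs over the possible next states s' and the max is taken over the actions a'.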

Day37

Discrete state and continuous state

  • Discrete state : The possible states are distinct and countable. Discrete state spaces are common in many RL problems, such as board games, puzzle-solving, and decision-making tasks with a finite number of states.
  • Continuous state : The state can take any value in a continuous range. Continuous state spaces are encountered in various RL applications, including robotics, control systems, and real-world scenarios where states are represented by continuous measurements.

Differences between discrete and continuous state

Discrete state spaces can often be represented using tabular methods, where the agent maintains a value function or a Q-table to learn and update action values for each state. On the other hand, dealing with continuous state spaces often requires function approximation techniques, such as using neural networks, to approximate the value function or policy. Continuous state spaces also pose challenges for exploration strategies, as the agent needs to explore a potentially infinite space effectively.

Exploitation(greedy) VS Exploration

This is also known as the epsilon-greedy policy

  • with probability 0.95, pick the action a that maximizes Q(s,a) - Greedy (Exploitation)
  • with probability 0.05, pick an action a randomly. (Exploration)
  • This means epsilon = 0.05
  • Start with a high epsilon and decrease it gradually
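A small sketch of the epsilon-greedy rule above. The Q-values are made up, and the exponential decay schedule with its `start`, `end` and `decay` values is only one assumed way to "start high and decrease gradually".

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation

def decayed_epsilon(step, start=1.0, end=0.05, decay=0.995):
    """Start with a high epsilon and shrink it towards a floor value."""
    return max(end, start * decay ** step)

q_values = [0.1, 0.9, 0.3]          # toy Q(s, a) for three actions in one state
rng = random.Random(0)
picks = [epsilon_greedy(q_values, 0.05, rng) for _ in range(10_000)]
greedy_fraction = picks.count(1) / len(picks)
print(round(greedy_fraction, 3), decayed_epsilon(0), decayed_epsilon(10_000))
```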

Refinement of reinforcement learning by mini-batches and soft update

Mini batches

When we have a large number of training examples, iterative procedures such as gradient descent, including the training of the neural network used in reinforcement learning, become slow. So we split the training examples into smaller subsets called mini-batches and use one mini-batch per step.

Soft Update

Soft update in reinforcement learning refers to a technique used to refine the learning algorithm by updating the parameters of a target network gradually. This process involves interpolating between the parameters of the target network and the parameters of the online network.
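The interpolation can be sketched on plain lists of parameters. The name `tau` for the mixing rate and its value are assumptions; real networks would hold arrays or tensors instead of floats.

```python
def soft_update(target_params, online_params, tau=0.01):
    """Move each target parameter a small step (tau) towards the online one."""
    return [(1 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

target = [0.0, 1.0]      # toy target-network parameters
online = [1.0, 1.0]      # toy online-network parameters
updated = soft_update(target, online, tau=0.1)
print(updated)
```

With a small `tau` the target network changes slowly, which is the point of the technique: the targets used for training stay more stable between updates.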

Day38

Building a book recommender system using collaborative filtering.

Collaborative filtering recommender systems make recommendations based on the preferences or behavior of other similar users or items. In this book recommendation system I calculated similarity scores from the Euclidean distance between rating vectors, and recommendations were made from the nearest items. Most collaborative recommender systems work like this.

To avoid cold start in collaborative filtering:

  • Only books with more than 50 ratings were included.
  • Only users who have rated more than 200 books were included. Here is a snippet of the project; I hope you get some insight from reading it
    book recommender system

Day39

California Housing Price prediction

Started my project on California housing prices and cleared up my concepts of data pipelines, batch learning and online learning. I performed the fetching and loading of the data, with EDA to gain some insight into it.
NOTE

  • Batch learning vs Online learning : Batch learning is commonly used when the dataset can fit into memory and when the computational resources are sufficient to process the entire dataset at once. It is often used in offline scenarios where there is no need for real-time or incremental learning.
    Online learning is particularly useful when dealing with streaming data or when the dataset is too large to fit into memory. It allows for real-time learning and adaptation as new data arrives. Online learning algorithms often have lower memory requirements and can adapt to concept drift, which is the phenomenon where the underlying data distribution changes over time.
  • Cost for linear regression : Most of the time the error of a linear regression model is measured with the Root Mean Square Error (RMSE), but when the data has lots of outliers we may prefer the Mean Absolute Error (MAE), which is less sensitive to them. calforinaday1 Eda result
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day40

California Housing Price Prediction continued..

  • Today I created the test set and split the data with a random train-test split, and also with a stratified split so that the training and test sets keep the same category proportions as the full dataset.
    Creating test data testdata Creating Training and test data with random sampling and stratification sampling stratification

Different methods of sampling :

1.Random sampling 2.Systematic sampling 3.Stratified sampling

  • Stratified sampling: Stratified sampling can be useful in train-test split when dealing with imbalanced datasets. In this case, the dataset may have significantly different proportions of classes or subgroups. By using stratified sampling, we can ensure that the training and test sets maintain the same distribution of classes or subgroups as the original dataset. This helps to prevent bias and provides a more accurate evaluation of the model's performance.
    sampling in train test split

Day41

California Housing Price Prediction

  • Visualization with geographical data : I plotted the geographical data with respect to population density and housing price to gain a better understanding of the data
  • Correlation: I also plotted scatterplots and looked at the correlation coefficient of the median house price with respect to different features, and found that it is highly correlated with median_income. There were also some straight lines forming in the middle of the data, which need to be filtered out before training for better performance.
    correlation fig Here is code hope you gain some insight from it :
    visualization correlation code
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day42

California Housing Price Prediction

Experimenting with attribute combinations

  • I created some new feature combinations like rooms per house, bedroom ratio and population per house. Rooms per house did better than the raw features, and bedroom ratio had a fairly high negative correlation with price, indicating that the lower the bedroom ratio, the higher the price.
  • I also prepared the data for machine learning algorithms by separating the features and the target, and cleaned the data by replacing missing values with the median, as it is less destructive. code for data cleaning

Use of Simple Imputer

  • The benefit of using SimpleImputer is that it stores the median value of each feature: this makes it possible to impute missing values not only on the training set, but also on the validation set, the test set, and any new data fed to the model. photo of simple imputer use

Handling of text and categorical data

  • Text and categorical data can be handled with an ordinal encoder or one-hot encoding. With an ordinal encoder, the model assumes that nearby values are more similar than distant values, which is not the case for ocean_proximity, so we use one-hot encoding. handling text and categorical data

  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day43

Feature Scaling

  • Feature scaling is one of the most important transformations you need to apply to your data. Without feature scaling most models will be biased toward the features with larger ranges. Two ways of feature scaling are : 1.Min-max scaling 2.Standardization.
  • Never use fit() or fit_transform() on anything other than the training set. feature scaling image
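The two scaling methods, and the rule of fitting only on the training set, can be sketched with plain NumPy. The helper names and the tiny arrays are assumptions for illustration; this mirrors the idea behind library scalers, not their implementation.

```python
import numpy as np

def fit_standardizer(X_train):
    """Learn per-feature mean and standard deviation from the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)      # avoid dividing by zero
    return mean, std

def fit_min_max(X_train):
    """Learn per-feature minimum and range from the training set only."""
    lo = X_train.min(axis=0)
    span = X_train.max(axis=0) - lo
    span = np.where(span == 0, 1.0, span)
    return lo, span

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[4.0, 150.0]])

mean, std = fit_standardizer(X_train)
X_train_std = (X_train - mean) / std        # standardization
X_test_std = (X_test - mean) / std          # reuse training statistics

lo, span = fit_min_max(X_train)
X_test_mm = (X_test - lo) / span            # min-max scaling with training statistics
print(X_test_std, X_test_mm)
```

Note that the scaled test value can fall outside the range 0 to 1 with min-max scaling, because the statistics come from the training set.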

Bucketing/Binning

The transformation of numeric features into categorical features, using a set of thresholds, is called bucketing (or binning)

Day44

Column Transformer

  • Column transformer is a versatile tool in machine learning that allows for the application of different preprocessing steps to different columns or subsets of columns in a dataset. It simplifies the preprocessing workflow, enhances reproducibility, and improves the efficiency of feature engineering in machine learning tasks.

Pipeline

  • Pipeline refers to a sequence of data processing steps that are applied in a specific order. It combines multiple operations, such as data preprocessing, feature engineering, and model training, into a single cohesive workflow. It makes it easier to apply the same preprocessing to the training and test sets.
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day45

Select and train a model

  • I trained some models: LinearRegression, DecisionTreeRegressor and RandomForestRegressor. The RMSE was very high for LinearRegression, which indicated underfitting. The RMSE was 0 on the training set for DecisionTreeRegressor, which indicated heavy overfitting. The RMSE was comparatively low for RandomForestRegressor, so I found that RandomForestRegressor can be a good choice of model.

Evaluation with cross-validation and fine-tuning the model

  • Cross-validation also showed that the random forest was a good choice despite some overfitting. After tuning RandomForestRegressor with GridSearchCV I found good hyperparameters; the model performed better than before and the RMSE was reduced. last day code
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day46

Classification

  • Today I dived deep into classification and revised some topics using MNIST: binary classifiers, measuring accuracy using cross-validation, confusion matrices, precision and recall, and the precision/recall trade-off.
  • Measuring performance is more complex in classification than in regression. Some of the methods are: cross-validation, confusion matrices, precision and recall, and the ROC curve. The code below shows how to measure classification performance with these methods on the MNIST dataset; I hope you gain some insight:
    da46_measuring_accuracy
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day47

  • Today I revised my classification concepts: multiclass classification, error analysis, multilabel classification and multioutput classification.
  • Multiclass classifiers can handle more than two classes. Some algorithms that support multiclass classification natively are LogisticRegression, RandomForestClassifier and GaussianNB. It can also be done by combining multiple binary classifiers such as SGDClassifier and SVC.
    1.One Vs All (OVA/OVR): In One Vs All classification there are N classifiers for N classes, one per class. The class whose classifier outputs the highest score is selected.
    2.One Vs One (OVO): In One Vs One classification N classes have N*(N-1)/2 classifiers, one for every pair of classes.
  • Also performed error analysis using confusion matrix
  • Implemented Multilabel classification : A classification system that outputs multiple binary tags is called a multilabel classification system.
  • Multioutput Classification : It is a generalization of multilabel classification where each label can be multiclass (i.e. it can have more than two possible values). Below is the implementation of what I have learned today. classification Classification implemetation code
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day48

  • Today I dived deep into training models and what happens under the hood of scikit-learn models, with the help of the simplest linear regression model, and learned about minimizing a model's cost function using gradient descent.
  • Gradient descent is used to find the parameters that minimize the cost function. Tip: when using gradient descent, all features should have a similar scale (e.g. use StandardScaler), or it will take much longer to converge. Batch gradient descent uses the whole batch of training data at each step, so it is terribly slow on large datasets.
  • To solve this problem of batch gradient descent we use stochastic gradient descent. As the name suggests, its steps are random: stochastic gradient descent picks a random instance in the training set at every step and computes the gradients based on that single instance rather than the whole data. But since it is random, the steps may never settle at the optimal minimum; this can be addressed by gradually decreasing the learning rate, a process similar to simulated annealing. To use stochastic gradient descent with linear regression we can use SGDRegressor.
  • Mini-batch GD computes the gradients on small random sets of instances called mini-batches. The main advantage of mini-batch GD over stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs. gd code Implementation of code
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow
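The three gradient descent variants from this day differ only in how many instances are used per step, which a single NumPy sketch can show. The synthetic data (true intercept 4, slope 3), the learning rate and the other hyperparameters are assumptions for illustration, not values from the book.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, n_epochs=200, batch_size=8, seed=0):
    """Linear regression trained with mini-batch gradient descent on the MSE cost."""
    rng = np.random.default_rng(seed)
    m = len(X)
    Xb = np.c_[np.ones(m), X]               # add the bias column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(m)          # shuffle once per epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            # Gradient of the MSE computed on the current mini-batch only
            grad = 2 / len(idx) * Xb[idx].T @ (Xb[idx] @ theta - y[idx])
            theta -= lr * grad
    return theta

# Synthetic data with features already on a small scale
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(100, 1))
y = 4 + 3 * X[:, 0] + 0.01 * rng.normal(size=100)

theta = minibatch_gd(X, y)
print(theta)
```

Setting `batch_size=1` turns this into stochastic gradient descent, and `batch_size=len(X)` into batch gradient descent.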

Day49

  • Today I revised my concepts of polynomial regression, learning curves, and the conditions of underfitting and overfitting and their solutions.
  • A polynomial regression model is used for non-linear data. We use cross-validation to check whether the model is underfitting or overfitting; this can also be checked with the help of learning curves. If the model is underfitting, it is better to add new features or use a different model, since adding more data does not help with underfitting. If the model is overfitting, it can be addressed by regularization. polynomial
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day50

  • Today I learned about Ridge regression, also known as L2 regularization, which is used to reduce overfitting. It can also be used with SGDRegressor directly by setting penalty to l2, and don't forget to divide alpha by m (the number of training instances). ridgregression
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day51

  • Today I learned about Lasso regression, also known as L1 regularization, and elastic net regression. Lasso regression is useful for high-dimensional data: as the regularization parameter lambda increases, it sets the coefficients of the less important features to zero, which acts as feature selection. Ridge regression cannot do this.
  • Elastic Net regression : This regularization term is the middle ground between Ridge and Lasso. It is calculated as a weighted sum of the ridge and lasso regularization terms. Elastic net is preferred when several features are strongly correlated or when the number of features is greater than the number of training instances.
  • Early stopping : A very different way to regularize iterative learning algorithms such as gradient descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping.

regularization

Day52

  • Today I revised my concepts of logistic regression and softmax regression. Logistic regression is used for classification problems and has a cost function called log loss, whereas softmax regression is used for multiclass classification. I spent some time using logistic regression to classify iris flowers based on petal length as virginica or not, observed its decision boundary, and used softmax regression to do the same. You can gain some insight from the implementation of what I learned : classfication
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day53

  • Support Vector Machine: SVM is a popular machine learning model that does linear or non-linear classification, regression and even novelty detection, and it performs well with small and medium datasets. The core idea of SVM is :
    1.Start with lower dimension data
    2.Move data in higher dimension
    3.Find a support vector classifier that separates the higher dimension data.
  • Kernel function : It is a function that transforms data from a lower dimension to a higher dimension in order to find support vector classifiers. Some of them are : linear kernel, polynomial kernel, Radial Basis Function (RBF) kernel and sigmoid kernel.
  • Kernel trick : Calculating the high-dimensional relationships without actually transforming the data to the higher dimension is called the kernel trick. It reduces the mathematical computation. Implementation of SVM
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day54

  • Polynomial Kernel : The polynomial kernel is a kernel function that calculates the similarity between two data points in a feature space using a polynomial function. It is defined as: K(x, y) = (α * x^T y + c)^d formulas
  • RBF Kernel : The RBF kernel, also known as the Gaussian kernel, is a popular kernel function that measures the similarity between data points based on their radial distance in a feature space. It is defined as: K(x, y) = exp(-γ * ||x - y||^2)
  • Both the polynomial kernel and the RBF kernel leverage the kernel trick, which is a method used in machine learning to implicitly transform data into a higher-dimensional feature space without explicitly calculating the transformed features. The kernel trick allows algorithms to efficiently operate in this higher-dimensional space by only computing the kernel function values between data points. code
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow
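
The two kernel formulas above can be checked numerically. This is a minimal NumPy sketch, not the code from the linked image: the function names, the sample vectors and the explicit feature map `phi_degree2` (valid only for 2-D inputs with α = 1, c = 0, d = 2) are assumptions made for illustration.

```python
import numpy as np

def polynomial_kernel(x, y, alpha=1.0, c=1.0, d=2):
    """K(x, y) = (alpha * x^T y + c)^d"""
    return (alpha * np.dot(x, y) + c) ** d

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def phi_degree2(x):
    """Explicit 3-D feature map matching the 2-D polynomial kernel with alpha=1, c=0, d=2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# Kernel trick: the similarity is computed in the original 2-D space ...
k_trick = polynomial_kernel(a, b, alpha=1.0, c=0.0, d=2)
# ... and equals the dot product after explicitly mapping both points to 3-D.
k_explicit = np.dot(phi_degree2(a), phi_degree2(b))

print(k_trick, k_explicit)      # both values agree
print(rbf_kernel(a, b))         # similarity shrinks as the points move apart
```

Both numbers match, which is the point of the kernel trick: the higher-dimensional dot product is obtained without ever building the higher-dimensional vectors.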

Day55

  • Today I continued my learning on SVM. SVM classes like LinearSVC and SGDClassifier do not support the kernel trick, but it is supported by the SVC class. SVM can also be used to solve regression problems: the width of the margin is controlled by the parameter epsilon, and increasing epsilon widens the margin.
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day56

  • Today I spent my time revising the concept of decision trees from the book and implementing it. Decision trees can perform both classification and regression tasks and are the fundamental unit of random forest, which is one of the most important algorithms in machine learning. A decision tree uses entropy or Gini impurity to measure how mixed the classes in a node are. Gini impurity is the default in the decision tree classifier and works well for large datasets because it is slightly faster to compute than entropy, while entropy tends to produce slightly more balanced trees.
  • Regularization : A decision tree can be regularized by reducing its freedom through parameters such as max_leaf_nodes: maximum number of leaf nodes, min_samples_split: minimum number of samples a node must have before it can be split, min_samples_leaf: minimum number of samples a leaf node must have to be created, min_weight_fraction_leaf: same as min_samples_leaf but expressed as a fraction of the total number of weighted instances.
  • Increasing the min_* parameters and reducing the max_* parameters regularizes the decision tree. code graph
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow
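
To make the two impurity measures concrete, here is a small sketch that computes them for the labels in a single node. The helper names and the toy label arrays are assumptions for illustration; scikit-learn computes these internally when `criterion` is set to "gini" or "entropy".

```python
import numpy as np

def class_proportions(labels):
    """Fraction of instances belonging to each class present in the node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini_impurity(labels):
    """Gini = 1 - sum(p_k^2); 0 means the node is pure."""
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy = -sum(p_k * log2(p_k)); 0 means the node is pure."""
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))

pure_node = np.array([1, 1, 1, 1])
mixed_node = np.array([0, 0, 1, 1])

print(gini_impurity(pure_node), entropy(pure_node))
print(gini_impurity(mixed_node), entropy(mixed_node))
```

A perfectly mixed two-class node gives the maximum value for both measures, and a pure node gives zero for both, which is why either can be used to score candidate splits.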

Day57

  • Decision Tree for regression: A decision tree for regression is similar to a classification tree. It recursively creates splits, but instead of minimizing impurity each split is chosen to reduce the variance (MSE) of the target values, and each leaf predicts the average target value of its training instances.
  • Also learned to reduce max_depth and increase min_samples_split and min_samples_leaf to overcome overfitting in regression decision trees.
  • Sensitivity to axis rotation : A decision tree does well when the split is perpendicular to an axis, but when the ideal split is not perpendicular to an axis the model may not generalize well, so we scale the data and use PCA to rotate it into a form the tree handles better.
  • Decision trees have high variance: small changes to the hyperparameters or the data can produce very different models. This is solved by averaging the predictions of many decision trees, known as ensembling, and such an ensemble of trees is a random forest, which is one of the most popular and widely used ML algorithms. The implementation is given below, hope you get some insight. regression decision tree
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day58

  • Ensemble learning: Today I revised my concept of ensemble learning and I have to say that this is the beauty of machine learning. I enjoyed topics like the wisdom of the crowd, which holds the major concept of ensemble learning (wisdom of the crowd = if we aggregate the decisions of a large crowd, we get a result that is close to the actual result), and learned about voting classifiers, bagging, boosting and stacking.
  • Voting Classifier: A voting classifier trains different models and either predicts the class with the majority of votes, called hard voting, or averages the class probabilities predicted by each model and picks the class with the highest average, called soft voting. For soft voting we have to set the voting hyperparameter to 'soft', and for SVC we have to set the probability hyperparameter to True. The implementation of my learning is given below, hope you get some insight from it.
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow
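
The difference between hard and soft voting can be sketched without any trained models. The probabilities below are made-up numbers for a binary problem (an assumption for illustration, not output from the notebook), and `hard_vote` and `soft_vote` are hypothetical helper names.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 from three classifiers on four instances.
probas = np.array([
    [0.90, 0.40, 0.45, 0.10],   # classifier A
    [0.60, 0.45, 0.40, 0.20],   # classifier B
    [0.30, 0.95, 0.90, 0.30],   # classifier C
])

def hard_vote(probas, threshold=0.5):
    """Each classifier casts one vote per instance; the majority wins."""
    votes = (probas >= threshold).astype(int)
    return (votes.sum(axis=0) > probas.shape[0] / 2).astype(int)

def soft_vote(probas, threshold=0.5):
    """Average the probabilities first, then threshold the average."""
    return (probas.mean(axis=0) >= threshold).astype(int)

print(hard_vote(probas))
print(soft_vote(probas))
```

On the second and third instances two classifiers vote "no" with low confidence while one votes "yes" with high confidence, so hard voting and soft voting disagree. Soft voting gives more weight to confident predictions.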

Day59

  • Bootstrap Aggregation: In this method of ensemble learning, different random samples are drawn from the training data with replacement (known as bootstrapping; sampling without replacement is called pasting), and a different model is fed each sample. The results of all models are averaged for regression and combined by voting for classification, which is known as aggregation. The instances missed by a sample are known as oob (out-of-bag) instances. Implementation of bagging:
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow
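
A quick way to see bootstrapping and out-of-bag instances is to sample indices directly. This is a sketch under assumed values (1000 instances, a fixed seed), not the bagging code from the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances = 1000                      # assumed training set size
indices = np.arange(n_instances)

# Bootstrap sample: draw n indices WITH replacement, so some repeat and some are missed.
bootstrap_idx = rng.choice(indices, size=n_instances, replace=True)

# Out-of-bag instances: those never drawn for this predictor.
oob_idx = np.setdiff1d(indices, bootstrap_idx)
oob_fraction = len(oob_idx) / n_instances

print(f"unique instances in the sample: {len(np.unique(bootstrap_idx))}")
print(f"out-of-bag fraction: {oob_fraction:.2f}")
```

For large training sets roughly 37% of the instances end up out of bag for each predictor, which is why they can serve as a free validation set.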

Day60

  • Random patches : Random patches means sampling both features and training instances.
  • Random subspaces : Random subspaces means sampling features but using all training instances. These techniques are used for high-dimensional data like images. day60
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day61

Random Forest

  • Today I revised my concept of random forest, which is generally an ensemble of decision trees based on the bagging concept. It can be used by importing RandomForestClassifier or RandomForestRegressor according to the requirement. Random forests are also very good at measuring feature importance and tell how much each feature matters, which is handy for feature selection; the scores can be accessed through the feature_importances_ attribute.
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day62

Boosting

  • Boosting basically trains the predictors sequentially, each trying to correct its predecessor. Some of the popular boosting methods are AdaBoost and gradient boosting.
  • AdaBoost : AdaBoost focuses more on the training instances that the predecessor underfit by increasing their weights before training the next predictor. If AdaBoost overfits, it can be regularized by reducing the number of estimators (boosting stages).
  • Gradient boosting : It also works sequentially, but instead of updating the training instance weights it fits each new predictor to the residual errors of the previous predictor. Implementation of boosting is shown below, hope you gain some insight. boosting
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

    Day63

    Histogram-based gradient boosting

    • HistGradientBoosting is faster than GBRT and has two useful features: it supports missing values and categorical features. It is implemented as : hgb

    Stacking

    • In this ensemble method, different models make predictions from the training instances, and the aggregation of these predictions is done by another final model known as the meta learner or blender. The predictions of the base predictors become the features for the blender, and the targets of the training instances are reused as the blender's targets. stacking
  • Implementation of ensemble learning
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day64

Dimensionality reduction

I have already mentioned and studied dimensionality reduction using PCA; today I am going to revise and implement it. Dimensionality reduction is the process of reducing the dimension of data without losing its essence, and it is done to speed up training and sometimes to reduce noise. Two approaches to dimensionality reduction are projection and manifold learning.

PCA

It is the most popular technique for dimensionality reduction. It reduces dimension by choosing the hyperplane that lies closest to the data and projecting the data onto it while preserving as much variance as possible. Instead of choosing the number of components by hand, you can set n_components to the ratio of variance to preserve, typically 0.95 (95 %). You can gain some insight from the code below. pca
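
What the 95 % variance setting does can be sketched with plain NumPy and SVD. The synthetic data below (200 points in 5-D that mostly vary along 2 directions) is an assumption for illustration; with scikit-learn the same selection is done by passing a float such as 0.95 as `n_components`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 5 features generated from only 2 underlying directions plus small noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# PCA via SVD of the centered data: rows of Vt are the principal components.
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

explained_variance_ratio = s ** 2 / np.sum(s ** 2)
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative variance reaches 95 %.
n_components = int(np.argmax(cumulative >= 0.95)) + 1
X_reduced = X_centered @ Vt[:n_components].T

print(n_components, X_reduced.shape)
```

Because the data really lives near a 2-D plane, at most two components are needed to keep 95 % of the variance, so the five original features are reduced to one or two.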

Day65

LLE(Locally linear embedding)

  • It is a non-linear dimensionality reduction technique based on manifold learning, unlike PCA, which is based on projection. LLE

K-means

  • It is an unsupervised learning algorithm that groups similar data into clusters, and it has many use cases like image segmentation, anomaly detection, customer segmentation, data analysis and so on. I have already discussed its theory; the practical implementation is given below, hope you gain some insights.
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day66

Supervised learning in Neural network

  • Today I got started with a new course, the Deep Learning Specialization, and learned about supervised learning using neural networks. I learned to implement logistic regression, which is a classification algorithm, as a neural network. The performance of traditional algorithms plateaus after a certain amount of data, but with more data you can achieve higher performance by increasing the neural network size.
  • Also learned about backpropagation, which is the algorithm used to compute the gradients for gradient descent in a neural network, so that the weights and biases are updated in such a way that the cost function is minimized.
  • 📚Resources Deep Learning Specialization |
    Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day67

Image classifier using sequential api

  • Today I learned to use the Sequential API of Keras to create a model that classifies clothing images from the Fashion MNIST dataset. The model has an input layer that takes an image and flattens it into a 1D array, two hidden layers with ReLU activation, and an output layer with 10 units (one per class) and softmax activation for multiclass classification. I compiled, trained and evaluated the model and found around 88% accuracy, then checked the first 3 instances, which were predicted correctly. Here is the code, hope you gain some insight from it. day67 done
  • 📚Resources Deep Learning Specialization |
    Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day68

  • Today I revised the concepts of vectorization and broadcasting in Python, and why to avoid rank 1 arrays in neural networks. Vectorization is done with the help of NumPy and makes computation faster by replacing explicit Python loops with array operations that run in optimized C code. Broadcasting helps to perform arithmetic operations between arrays of different shapes. I also learned to avoid rank 1 arrays in neural networks and instead use vectors with explicit rows and columns, because unexpected broadcasting may happen with rank 1 arrays. These were some basics to clear before diving into deep learning, hope you get some insight from here. basic
  • 📚Resources Deep Learning Specialization |
    Hands-On Machine Learning with Scikit-Learn and TensorFlow
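
The rank 1 array pitfall and the benefit of vectorization from Day68 can be reproduced in a few lines. This is a small illustrative sketch; the array sizes and variable names are arbitrary assumptions.

```python
import numpy as np

a = np.arange(5)                # rank 1 array, shape (5,)
col = a.reshape(5, 1)           # explicit column vector, shape (5, 1)

# Transposing a rank 1 array does nothing, so "a times a-transpose" is a scalar.
rank1_transpose_shape = a.T.shape       # still (5,)
inner = np.dot(a, a.T)                  # scalar: 0 + 1 + 4 + 9 + 16
outer = np.dot(col, col.T)              # (5, 5) outer product

# Mixing a rank 1 array with a column vector silently broadcasts to a matrix.
surprise = a + col                      # shape (5, 5), not (5,) or (5, 1)

# Vectorization: one NumPy call replaces an explicit Python loop.
x = np.arange(1000, dtype=float)
loop_total = 0.0
for value in x:
    loop_total += value * value
vectorized_total = np.dot(x, x)

print(rank1_transpose_shape, inner, outer.shape, surprise.shape)
print(loop_total == vectorized_total)
```

Reshaping to `(n, 1)` or `(1, n)` up front makes the intended orientation explicit, so a shape mismatch shows up early instead of being hidden by broadcasting.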

Day69

  • Today I learned to use the functional API to create a complex model. I used the Keras API to build a deep neural network architecture: it contained an input layer, a normalization layer to normalize the input, two hidden layers with 30 neurons each and ReLU activation, a concatenation layer to concatenate the second hidden layer's output with the normalized input, and finally an output layer to give the prediction. code

Day70

  • Today I gained an overview of different types of neural networks (NN) like the Artificial Neural Network (ANN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), autoencoders and Generative Adversarial Network (GAN). I also learned in detail about the perceptron, which is a mathematical model of a neuron, and implemented it as a binary classifier. perceptron implementation

Day71

Deep learning foundation

Perceptron trick

  • Today I learned about the simple perceptron trick, in which a random data point is selected each time and the classifier is moved if the point is misclassified. But it cannot quantify how good the classifier is and may sometimes fail to converge, so we use different loss functions in the perceptron.

Loss function in Perceptron

  • Different loss functions can be used, such as perceptron loss and hinge loss (both have the ReLU-like form max(0, ·)). To minimize the loss function in a neural network we use gradient descent, as done previously in machine learning.

Gradient descent

  • In a neural network, gradient descent is not an easy task because of local minima, and we have to choose the correct learning rate to converge. So, besides plain SGD, we use optimizers with adaptive learning rates that adjust the learning rate during training, such as Adam and RMSprop.

Backpropagation

  • Backpropagation computes the rate of change of the loss with respect to each weight using the chain rule of calculus. Gradient descent then uses these gradients to update the weights and minimize the loss, which is how the neural network is optimized.

Batching and regularization

Day72

Recurrent neural network (RNNS)

  • RNNs are special neural networks that take sequential data like text or audio as input and apply a recurrence relation to process the sequence. RNNs use the same weight matrices at every time step and carry a hidden state forward, which allows them to maintain and utilize information from previous time steps while processing the current input. Rnnsimage

Encoding

  • Encoding refers to the process of transforming an input sequence into a fixed-length vector representation or a hidden state that captures the information from the entire sequence. encoding

Embedding

  • Embedding refers to the process of representing categorical or discrete inputs, such as words in natural language processing (NLP), as continuous-valued vectors. Embeddings help capture the semantic relationships and similarities between different input categories.

Simple RNN example :

Day73

Backpropagation through time (BPTT)

  • In an RNN, backpropagation involves computing the loss at each time step of the sequence and then propagating the gradients backwards through time, from the last time step to the first, to update the shared weights. This is called BPTT.

Gradient Issues

  • Computing gradients in RNNs can be challenging because the gradient with respect to the initial state involves many repeated multiplications by the weight matrix, one for each time step.
  • If the weight matrices are large (many values > 1), exploding gradients happen, which is solved by:
    • Gradient clipping : One common solution to address exploding gradients is gradient clipping, which involves setting a threshold for the gradients. If the gradients exceed the threshold, they are scaled down to maintain their magnitude within an acceptable range.
  • If the weight matrices are small (many values < 1), vanishing gradients happen, which is solved by :
    • Activation function : Using an activation function like ReLU makes the derivative 1 whenever x > 0, so it helps to prevent vanishing gradients.
    • Weight initialization : Initializing the weights as identity matrices and the biases to 0 helps prevent the vanishing gradient problem.
    • Network architecture : Controlling which information to add and which to remove using gates in each recurrent unit, i.e. the use of LSTM and GRU.
  • 📚Resources Introduction to Deeplearning
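
The repeated-multiplication problem and the clipping fix can be sketched with scalars and a small vector. The factor values, the 50 time steps and the `clip_by_norm` helper are assumptions for illustration; clipping by the gradient norm is one common variant of gradient clipping.

```python
import numpy as np

# Repeatedly multiplying by the same factor, as happens across RNN time steps.
steps = 50
exploding = 1.5 ** steps    # factor > 1: the product blows up
vanishing = 0.5 ** steps    # factor < 1: the product shrinks toward 0

def clip_by_norm(grad, max_norm):
    """Scale the gradient down so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])                 # L2 norm is 5
clipped = clip_by_norm(g, max_norm=1.0)  # same direction, norm 1
small = clip_by_norm(np.array([0.1, 0.2]), max_norm=1.0)  # already small: unchanged

print(exploding, vanishing)
print(clipped, np.linalg.norm(clipped))
```

Clipping keeps the direction of the gradient and only limits its length, so the update still points downhill but cannot take a huge step.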

Day74

LSTM

  • Long Short-Term Memory (LSTM) uses gates that selectively add or remove information within each recurrent unit. This architecture mitigates the vanishing gradient problem. It uses gates to control the flow of information by
    • Forget : Get rid of irrelevant information.
    • Store : Store relevant information from the current input.
    • Update : Selectively update the cell state.
    • Output : Return a filtered version of the cell state.

Self Attention

  • Self attention holds the core concept behind modern transformer-based models. Its main idea is attending to the most important parts of the input. It captures long-range dependencies and allows parallelization. self attention image
  • 📚Resources Introduction to Deeplearning
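
One common way to implement the idea is scaled dot-product self attention with query, key and value projections. The sketch below assumes that formulation; the random inputs, the dimensions and the weight matrices are placeholders, not trained values.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Every position attends to every other position in one matrix product."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # similarity of each query to each key
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8               # assumed toy sizes
X = rng.normal(size=(seq_len, d_model))       # one embedding per sequence position
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape, weights.shape)
```

All positions are processed together with matrix products rather than step by step, which is where the parallelization mentioned above comes from, and any position can attend directly to any other regardless of distance.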

Day75

Convolutional Neural Network(CNN)

CNNs are deep learning algorithms that take images and videos as input and perform further analysis and processing on them, trying to mimic the human visual system.

  • Convolutional layers: In a CNN we use filters (small feature patches) to scan images, which is called convolution. Convolution is the element-wise multiplication of an image patch and a filter followed by summing the results, which produces a feature map. E.g. for the character 8 we can use filters for circles; for different features of an image we use different filters.
  • Activation functions: After forming the feature map we apply an activation like ReLU to introduce non-linearity into the network.
  • Pooling layers: After the activation function we use pooling to reduce the spatial dimensions, which makes computation faster and makes the network more robust to variations in the input.
  • Flattening : After downsampling the feature map we convert it into a 1-D vector and feed it to a fully connected layer for classification.
  • Training : The 1-D data is fed to a fully connected neural network and the model is trained for prediction and classification. While training, a large sample of labelled data is used and the model learns through backpropagation. To build a more robust system we use image augmentation to generate different variations of the training images by scaling, rotating and distorting them. cnn
  • 📚Resources Introduction to Deeplearning
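
The convolution, activation, pooling and flattening steps above can be traced by hand on a tiny image. This is a sketch with a made-up 6x6 image and a hand-written vertical edge filter (both assumptions for illustration); a real CNN learns its filters during training.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image: multiply element-wise and sum each patch."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool2d(x, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                               # left half dark, right half bright
kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])    # responds to dark-to-bright vertical edges

feature_map = conv2d(image, kernel)   # 5x5, non-zero only along the edge
activated = relu(feature_map)
pooled = max_pool2d(activated)        # 2x2 after 2x2 pooling with stride 2
flat = pooled.flatten()               # 1-D vector for the fully connected layer

print(feature_map.shape, pooled.shape, flat)
```

The feature map lights up only where the filter's pattern (a vertical edge) appears, and pooling keeps that response while shrinking the map.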

Day76

  • Today I dived deeper into each step of a CNN, learned more about the parameters of each component and implemented them to create a simple CNN.
    • First I created feature maps using Conv2D with 32 filters of size 2x2 and ReLU for non-linearity, and downsampled them using MaxPool2D with a 2x2 pool and a stride (i.e. step of movement) of 2.
    • In the second convolution I used 64 filters of size 3x3 with ReLU for non-linearity and downsampled using a 2x2 pool with stride 2.
    • In the last step I flattened the feature maps into a 1D array and used a fully connected layer with softmax activation to make the classification. Hope you get more understanding by looking at the code and architecture of this simple CNN implementation. cnn_arch cnn
  • 📚Resources Introduction to Deeplearning

Day77

R-CNN

  • A CNN classifier can only classify and localize one object per image (image localization). To detect several objects, the sliding window approach was introduced, but it is computationally heavy because it creates millions of windows in a normal image. As a solution, Region-based CNN, i.e. R-CNN, was introduced, which proposes a limited number of regions using an external algorithm called selective search and performs convolution only on these regions.
  • To ensure accurate localization and avoid redundant bounding box proposals, R-CNN applies bounding box regression and a technique called non-maximum suppression (NMS). r-cnn
  • 📚Resources Introduction to Deeplearning
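
Non-maximum suppression can be sketched in a few lines: keep the highest scoring box, drop every remaining box that overlaps it too much, and repeat. The boxes, scores and the 0.5 IoU threshold below are made-up values for illustration, and boxes are assumed to be in `[x1, y1, x2, y2]` format.

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union between one box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive suppression, best score first."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]   # drop near-duplicates of the best box
    return keep

boxes = np.array([
    [10, 10, 50, 50],       # detection of object 1
    [12, 12, 52, 52],       # near-duplicate of the first box
    [100, 100, 150, 150],   # detection of a different object
], dtype=float)
scores = np.array([0.9, 0.8, 0.7])

kept = non_max_suppression(boxes, scores)
print(kept)
```

The second box overlaps the first almost completely, so only the higher scoring one is kept, while the third box is far away and survives.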

Day78

Fast R-CNN and Faster R-CNN

  • Fast R-CNN is also an object detection algorithm and is faster than R-CNN. In R-CNN, every proposed region (a few thousand per image) was sent through the CNN separately, but in Fast R-CNN the whole input image is sent through the CNN once, the regions proposed by the selective search algorithm are taken from the resulting feature map, Region of Interest (RoI) pooling is applied to them and then the prediction is performed.
  • In Faster R-CNN the image is sent through the CNN as in Fast R-CNN, but the selective search algorithm is not used. Instead a Region Proposal Network (RPN) proposes the regions, which is faster than selective search.

YOLO algorithm

  • You Only Look Once : In this algorithm we take an image and split it into an SxS grid, and within each grid cell we take m bounding boxes. For each bounding box, the network outputs class probabilities and offset values for the box. The bounding boxes with a class probability above a threshold value are selected and used to locate the objects within the image. yolo

Day79

Bird Species Classification using CNN

  • Today I worked on bird species classification using a CNN, which takes a bird image as input, converts it to an array, normalizes it and passes it to the CNN, which classifies it into one of 6 classes. My model got an accuracy of around 89% on test data. The model architecture and a snippet are shown below:

Day80

Generative modeling

  • Generative modeling takes training samples drawn from some distribution as input and learns to generate new samples similar to that distribution.

Auto encoders

  • Autoencoders simply compress higher-dimensional data into a lower-dimensional latent space with an encoder and reconstruct the data with a decoder. They can provide lower-dimensional representations and denoised data.

Variational Auto encoders

  • VAEs are autoencoders with a probabilistic twist on traditional autoencoders. VAEs learn a mean and standard deviation that define a Gaussian distribution over the latent space, sample from it and decode the sample to generate new data. vaes
  • 📚Resources Introduction to Deeplearning

Day81

Prior on the latent distribution

  • A prior on the latent distribution is our assumption about how the latent variables are distributed. A common choice is a Gaussian distribution with mean = 0 and variance = 1, because it encourages encodings to be spread evenly over the latent space and penalizes clustering of the data in isolated regions, i.e. it avoids memorization of the data.

Regularization and normal prior

  • In VAEs regularization is done to obtain continuity (points that are close in latent space are similar after decoding) and completeness (sampling from the latent space gives meaningful content after decoding). regularization and normalprior

Reparametrization

  • In VAEs reparameterization is done to allow backpropagation through the sampling step, so that the VAE can be trained end to end with gradient descent.
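
The reparameterization trick rewrites the random sample as a deterministic function of the mean, the standard deviation and an independent noise term: z = μ + σ · ε with ε drawn from N(0, 1). The sketch below only demonstrates the sampling identity with NumPy; the example μ and log-variance values are assumptions, and an actual VAE would compute them with the encoder inside an autodiff framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps: the randomness lives in eps, not in mu or sigma."""
    eps = rng.standard_normal(mu.shape)
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * eps

# Hypothetical encoder outputs for one input with a 2-D latent space.
mu = np.array([1.0, -2.0])
log_var = np.log(np.array([0.25, 4.0]))   # sigma = 0.5 and 2.0

samples = np.stack([reparameterize(mu, log_var, rng) for _ in range(20000)])
print(samples.mean(axis=0), samples.std(axis=0))
```

The samples have the requested mean and standard deviation, but z now depends on μ and σ through plain arithmetic, so gradients can flow back to the encoder.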

Latent perturbation and entanglement

  • Latent perturbation means keeping the other variables fixed and increasing or decreasing a single latent variable. Latent entanglement refers to the interdependencies or correlations among the latent variables in a generative model. In some cases, the latent variables may be entangled, meaning that changing one variable can have an impact on multiple aspects or features of the generated output. peturbationn vaes
  • 📚Resources Introduction to Deeplearning

Day82

Generative Adversarial Network (GAN)

gan

  • A GAN is made up of two models, i.e. a Generator and a Discriminator, which play adversarial roles against each other, where
    • The Generator produces synthetic data samples from noise that resemble real data, and the Discriminator takes real data and data produced by the generator as input and differentiates between fake and real data. The Generator and Discriminator work in a competitive manner: the generator tries to produce good samples that can fool the discriminator, and the discriminator improves itself at classifying real and fake images. They keep improving until the data produced by the generator becomes as good as the real data.
  • After the training process you can simply use the generator to generate new images that have never been seen before.
  • 📚Resources Introduction to Deeplearning

Day83

Challenges for robust deeplearning

  • Bias (skewed data) and uncertainty (i.e. the model does not know the answer) are challenging parts of developing robust deep learning models.
  • Algorithmic bias : There can be selection bias in the data, i.e. some groups may be overrepresented while others are underrepresented; model bias due to a lack of proper benchmarks and metrics; deployment bias due to changes in the data distribution over time; evaluation bias due to not accounting for subgroups; and interpretation bias due to human error.
  • Class Imbalance: Class imbalance is another major problem that creates bias in deep learning models and prevents them from being robust. For example, in a medical diagnosis task, the majority class could be healthy patients, while the minority class could be patients with a rare disease. In fraud detection, the majority class could be non-fraudulent transactions, while the minority class could be fraudulent transactions. This can be solved by: class imbalance solution
  • 📚Resources Introduction to Deeplearning

Day84

Debiasing VAES

  • Debiasing with VAEs improves performance, and it is done automatically by increasing the sampling probability of data from sparse regions of the latent distribution and undersampling data from dense regions. Debiasing in vaes

Uncertainty

  • Uncertainty is the lack of confidence or ambiguity in the predictions made by a model due to noise or incomplete data.
  • Aleatoric Uncertainty : It is the uncertainty in the data itself due to noise or randomness, and it cannot be reduced with more data or model improvement.
  • Epistemic Uncertainty : It is the uncertainty in the model due to incomplete data, and it can be reduced with more data and model refinement. uncertainity type
  • 📚Resources Introduction to Deeplearning

Day85

Auto encoders

  • Autoencoders are artificial neural networks capable of learning dense representations of the input data, called latent representations, without any supervision. I created a stacked autoencoder using Keras on the Fashion MNIST dataset that reconstructs images from latent representations. Hope you gain some insight from this code about autoencoders : auto encoders
  • Reconstructed image by autoencoder result

Day86

Unsupervised pretraining using stacked autoencoders

  • In the real world we often find data that is mostly unlabeled, with only a small portion labeled, because labeling data is an expensive and time-consuming process. In such a case we can train an autoencoder on the full dataset in phase 1 and, in phase 2, train the classification model on the labeled data while reusing the autoencoder's lower layers and their learned parameters.

Tying weights

When we have a symmetrical autoencoder we can tie the weights of the decoder layers to those of the encoder layers. This halves the number of weights in the model, speeding up training and limiting the risk of overfitting. TIE

Day87

  • Today I learned topics like training one autoencoder at a time, convolutional autoencoders and denoising autoencoders from the book Hands-On Machine Learning and implemented them, hope you get some insight from the short notes and code snippets.
  • Training one autoencoder at a time: We train the first autoencoder on the data, then encode the training set with it and train a second autoencoder on this compressed data. After that we stack the hidden layers of the autoencoders first and then their output layers in reverse order, forming a new stacked autoencoder.
  • Convolutional autoencoders : These autoencoders perform well on images. The encoder reduces the spatial dimensionality of the image and increases the depth, i.e. the number of feature maps. Hope you get some insight from my convolutional autoencoder model. convoae
  • Denoising autoencoders : An autoencoder can be used to recover a clean image from a noisy one, i.e. to reconstruct the full image by denoising.
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day88

  • Today I learned about sparse autoencoders from the book and implemented one, hope you gain some insights reading this.
  • Sparse autoencoders : A sparse autoencoder is a type of neural network that enforces a sparsity constraint on the activations of hidden layer neurons. It encourages most neurons to be inactive, resulting in sparse representations of the input data. By learning meaningful and efficient features, it aids in dimensionality reduction and can be beneficial for various tasks that require feature learning and reconstruction. sparisity autoencoders
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day89

  • Today I learned about variational autoencoders (VAEs) and their implementation on the MNIST dataset from the book. VAEs are autoencoders with a probabilistic twist on traditional autoencoders: they learn a mean and standard deviation that define a Gaussian distribution over the latent space, and decode samples from the latent space to generate new data. The implementation is shown below, hope you gain some insight from it.
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day90

  • Today I implemented a Generative Adversarial Network (GAN) from the book using the Fashion MNIST dataset to generate new images of clothes. This model uses a generator to create fake images and a discriminator to distinguish between real and fake images. The GAN is then trained using a custom training loop that alternates between training the discriminator and the generator. The goal is to train the generator to generate realistic images that can fool the discriminator. In this way new realistic images were generated. Below is the code of the GAN.
  • 📚Resources Hands-On Machine Learning with Scikit-Learn and TensorFlow

Day91

  • Today I revised basic concepts like the sigmoid function, sigmoid derivative, image to vector conversion, normalizing rows, softmax function, vectorization, L1 loss and L2 loss from the Deep Learning Specialization course. Also started to work on an SMS spam classifier project using ensemble learning. day91_rev
  • 📚Resources Deep Learning Specialization
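
The building blocks listed for Day91 are short enough to write out in NumPy. This is a sketch using the standard definitions; the function names and example inputs are assumptions for illustration and are not copied from the course notebook.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """d/dx sigmoid(x) = s * (1 - s)."""
    s = sigmoid(x)
    return s * (1 - s)

def image_to_vector(image):
    """Reshape a (height, width, channels) image into a column vector."""
    return image.reshape(-1, 1)

def normalize_rows(x):
    """Divide each row by its L2 norm so every row has unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def softmax(x):
    """Row-wise softmax; subtracting the row max keeps exp() stable."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def l1_loss(y_hat, y):
    return np.sum(np.abs(y - y_hat))

def l2_loss(y_hat, y):
    return np.sum((y - y_hat) ** 2)

y_hat = np.array([0.9, 0.2, 0.1])
y = np.array([1.0, 0.0, 0.0])
print(sigmoid(0.0), sigmoid_derivative(0.0))
print(normalize_rows(np.array([[3.0, 4.0]])))
print(l1_loss(y_hat, y), l2_loss(y_hat, y))
```

Each function works on whole arrays at once, which is the vectorization mentioned in the same list.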

Day92

  • Today I revisited my concept of shallow neural networks and implemented logistic regression as a neural network from scratch to create a cat classification model that identifies whether an image is a cat or not. I spent time creating each function that makes up logistic regression, like the sigmoid function, initializing weights, learning weights with gradient descent, prediction and the final model. Below is the snippet of code, hope you gain some insights : p1 p2
  • 📚Resources Deep Learning Specialization

Day93

  • Today I revisited my concept of optimizers in deep learning. Basically an optimizer's main objective is to decrease the loss function so that we get higher accuracy. Optimization is done with gradient descent variants like Batch GD, Stochastic GD and Mini-Batch GD. But these have some problems, like finding a good learning rate, scheduling the learning rate, having a single learning rate for all dimensions, local minima, and saddle points where the slope becomes zero without reaching the optimal minimum.
  • EWMA (Exponentially Weighted Moving Average) : It is basically a weighted average where the weight of past data decreases over time compared to the latest data, and it is mostly used with time series data. EWMA is controlled by a parameter alpha. If we increase alpha it gives more weight to previous data and the graph becomes smoother; if we decrease alpha the previous data are weighted less and we get a noisier graph. A value of 0.9 is commonly used for alpha.
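
The effect of alpha is easy to see on a noisy series. The sketch below assumes the update v_t = alpha * v_(t-1) + (1 - alpha) * x_t, which matches the description above (a larger alpha gives more weight to the past); the noisy data is synthetic.

```python
import numpy as np

def ewma(values, alpha=0.9):
    """v_t = alpha * v_(t-1) + (1 - alpha) * x_t, starting from v_0 = 0."""
    v = 0.0
    out = []
    for x in values:
        v = alpha * v + (1 - alpha) * x
        out.append(v)
    return np.array(out)

rng = np.random.default_rng(0)
noisy = 10 + rng.normal(scale=2.0, size=500)   # synthetic signal around 10

smooth_high = ewma(noisy, alpha=0.9)   # remembers roughly the last 10 points
smooth_low = ewma(noisy, alpha=0.1)    # follows the latest point closely

# Skip the warm-up period, where the average is still climbing from 0.
print(smooth_high[100:].std(), smooth_low[100:].std())
```

With alpha = 0.9 the curve barely moves from point to point, while with alpha = 0.1 it tracks almost every bump in the raw data.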

Day94

  • Today I learned about stochastic gradient descent with momentum (SGD momentum), which is used to tackle non-convex loss surfaces and help reach the global minimum. It is an extension of SGD that adds a term called velocity, which is based on an EWMA (Exponentially Weighted Moving Average) of the gradients, meaning each step accounts for previous gradients while the latest gradient has more weight. It is controlled by a parameter alpha: if alpha is 0 it acts like normal SGD, and increasing it increases the velocity. A value of 0.9 is commonly used. SGD momentum is a lot faster than normal SGD. The problem with SGD momentum is that it oscillates around the minimum after reaching it, which increases the time to settle on the optimal solution.

SGD momentum

  • Nesterov Accelerated Gradient (NAG) : This optimizer reduces the oscillation problem of SGD momentum by damping the oscillation, and it becomes faster than SGD momentum. It basically updates the weights in two steps: first a momentum step, and second a gradient step computed at the look-ahead position reached by the momentum step, which reduces oscillation. Nag
  • Adaptive Gradient (AdaGrad) : AdaGrad is an optimizer that uses an adaptive learning rate and is used when features have different scales or are sparse (mostly 0 values), which makes the slope change differently in different directions, so gradient descent first moves mostly along the direction with the largest gradient. To reduce this, the learning rate of each parameter is divided by a term based on its accumulated past squared gradients, which makes gradient descent head more directly toward the global minimum. But because this dividing term keeps growing, the learning rate shrinks so much that gradient descent stops near the global minimum without reaching it. Due to this flaw AdaGrad is not used with complex neural networks, only with traditional algorithms, but its intuition is necessary to learn the Adam and RMSprop optimizers.

Day95

  • Today I concluded my study of optimizers by building a deeper intuition for the most widely used optimizer in deep learning, Adam, along with one of the best optimizers that can compete with it, RMSprop.
  • RMSprop : This solves AdaGrad's problem of stalling before the minimum. Instead of summing all past squared gradients, it keeps an exponentially weighted moving average of them, so recent gradients carry more weight than older ones and the effective learning rate does not shrink toward zero.
  • Adam optimizer : It combines momentum with RMSprop-style adaptive scaling: the step is the learning rate multiplied by a moving average of the gradients (momentum) and divided by the square root of a moving average of the squared gradients. This lets it move downhill quickly like momentum while damping oscillation near the minimum like the adaptive learning rate. ADAgrad
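
To make the Adam update concrete, here is a small sketch for a single scalar parameter. The helper name `adam_minimize`, the toy loss (w - 3)^2, and the hyperparameter values are assumptions chosen for illustration; framework implementations apply the same idea per parameter.

```python
import math


def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Scalar Adam: momentum (m) plus RMSprop-style scaling (v)."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # moving average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


grad = lambda w: 2 * (w - 3)   # gradient of (w - 3)^2
w_adam = adam_minimize(grad, w0=0.0)
print(w_adam)
```

Dropping the `m` average (using `g` directly) gives an RMSprop-like update, and dropping the division by `sqrt(v_hat)` gives momentum, which is one way to see Adam as a combination of the two.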

Day96

  • Today I dove deep into regularization of neural networks and got a better idea of L1 and L2 regularization, how regularization reduces overfitting, dropout regularization, data augmentation, early stopping, and orthogonalization from the course Deep Learning Specialization.
  • Regularization : It is used to reduce overfitting. The most popular form is L2 regularization, which penalizes large weights and so reduces the effect of some hidden units. It makes the model behave more linearly and smooths out the complex curves that lead to overfitting. L2 does not drive hidden units completely to zero; it only penalizes large weights.
  • Dropout regularization, in contrast, knocks out hidden units at random on each training pass, so a smaller network is trained each time, which reduces overfitting. Its downside is that the cost function becomes noisy, which makes it harder to verify that the cost is decreasing, so dropout should be applied in a way that does not hide this. Dropout is used mostly in computer vision.
  • Overfitting can also be reduced with more data, which data augmentation can provide.
  • Early stopping, that is, stopping at the point where the dev set cost stops decreasing, also reduces overfitting, but it prevents gradient descent from reaching the optimum of the cost function. In other words it breaks orthogonalization, the practice of handling one task at a time without disturbing the other, here optimizing the cost and regularizing.
  • 📚Resources Deep Learning Specialization
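
The two regularizers discussed above can be sketched in a few lines of NumPy. The function names, the lambda / (2m) scaling of the penalty, and the use of inverted dropout are assumptions for this example rather than the author's implementation.

```python
import numpy as np


def l2_penalty(weights, lambd, m):
    """L2 term added to the cost: (lambda / 2m) * sum of squared weights."""
    return (lambd / (2 * m)) * sum(np.sum(W ** 2) for W in weights)


def inverted_dropout(a, keep_prob, rng):
    """Zero out units at random and rescale so the expected value is unchanged."""
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob, mask


rng = np.random.default_rng(0)
activations = np.ones((1000, 50))
dropped, mask = inverted_dropout(activations, keep_prob=0.8, rng=rng)
print(dropped.mean())                                   # close to 1.0 on average
print(l2_penalty([np.ones((2, 2))], lambd=0.5, m=4))
```

Dividing by `keep_prob` keeps the expected activation the same with and without dropout, so nothing needs to be rescaled at test time.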

Day97

  • Today I learned about normalizing input features to make gradient descent faster, vanishing and exploding gradients, and gradient checking, which verifies before training that backpropagation is computed correctly by comparing it with a numerical approximation of the gradient. I also did a practical implementation of gradient checking from the course Deep Learning Specialization. I hope you do the same; here is a short summary:
  • Normalizing input features : Normalizing brings all input features into the same range and helps gradient descent converge faster.
  • Vanishing and exploding gradients : In a deep neural network, if the weights are too large or too small, the activations and gradients can explode or vanish respectively as they pass through many layers, which makes learning difficult. This can be mitigated with Xavier (also known as Glorot) initialization, He initialization, or another suitable weight initialization.
  • Gradient checking : Compare the numerically approximated derivative with the derivative from backpropagation. If their relative difference is around 10^-7 or smaller, it is good to go; if it is much larger, debug. Gradient checking is used only for debugging, not during training. Include the regularization term in the cost when checking, and run the check with dropout turned off, then turn dropout on afterwards. Implementation of random and he initialization:
  • 📚Resources Deep Learning Specialization
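
Gradient checking as summarized above can be sketched without any framework. The toy cost function, its hand-written gradient, and the helper names below are assumptions for illustration only.

```python
import math


def numerical_grad(f, theta, eps=1e-7):
    """Two-sided finite-difference approximation of the gradient of f."""
    grad = []
    for i in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad


def relative_difference(a, b):
    """||a - b|| / (||a|| + ||b||), the quantity compared against ~1e-7."""
    num = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    den = math.sqrt(sum(x * x for x in a)) + math.sqrt(sum(y * y for y in b))
    return num / den


cost = lambda th: th[0] ** 2 + 3 * th[0] * th[1]       # toy cost
analytic = lambda th: [2 * th[0] + 3 * th[1], 3 * th[0]]  # its exact gradient
buggy = lambda th: [2 * th[0], 3 * th[0]]               # a deliberately wrong gradient

theta = [1.5, -2.0]
approx = numerical_grad(cost, theta)
diff_ok = relative_difference(approx, analytic(theta))
diff_bug = relative_difference(approx, buggy(theta))
print(diff_ok, diff_bug)
```

A correct gradient gives a tiny relative difference, while the deliberately wrong one produces a large value, which is the signal to start debugging.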

Day98

  • Today I learned about batch gradient descent, mini-batch gradient descent and how to choose the mini-batch size, bias correction, hyperparameters, and hyperparameter tuning for better results from the course Deep Learning Specialization, and also spent some time implementing regularization from the assignment.
  • Mini-batch size : When the training set is too large to process in a single batch, mini-batch gradient descent can be used. The batch size depends on the memory and compute capacity of the GPU/CPU and should be neither too large nor too small. It is chosen as follows:
    • If the training set is small, use batch gradient descent.
    • Typical mini-batch sizes are powers of two (2^n), depending on CPU and GPU memory.
    • Many researchers use the largest batch size that fits in GPU memory.
  • Bias correction is needed to handle the initial bias in an exponentially weighted moving average, so that the zero starting value does not distort the early part of the curve.
  • Hyperparameters : The most important hyperparameters are the learning rate, optimizer, activation function, number of hidden units, batch size, and learning rate decay, as well as the regularization strength (L1/L2), dropout rate, and initialization method.
  • Hyperparameters can be chosen in different ways, such as a coarse-to-fine search, random sampling, grid search, and Bayesian optimization.
  • Organizing the hyperparameter search : It is done either by babysitting one model when computational power is limited, or by training many models in parallel when enough is available.
  • 📚Resources Deep Learning Specialization
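
The effect of bias correction mentioned above is easy to see on a constant series. This is a minimal sketch; the helper name `ewma` and the sample values are made up for the example.

```python
def ewma(values, beta=0.9, bias_correction=True):
    """Exponentially weighted moving average, optionally bias corrected."""
    v, out = 0.0, []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        # Dividing by (1 - beta^t) removes the pull toward the zero start value.
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out


series = [10.0] * 5
raw = ewma(series, bias_correction=False)
corrected = ewma(series, bias_correction=True)
print(raw[0], corrected[0])
```

Without correction the first average is far below the true value because the running average starts at zero; with correction it matches the data from the first step.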

Day99

  • Today I learned about batch normalization and how it works end to end: normalizing activations, fitting batch norm into a neural network, and batch norm at test time, from the course Deep Learning Specialization, and spent some time implementing it.
  • Batch Normalization : Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. This stabilizes the learning process and can substantially reduce the number of training epochs needed to train deep networks. Its advantages are that it makes the network more stable and faster to train, has a slight regularizing effect, and makes weight initialization less important. Batch normalization introduces four parameters per feature: two learnable parameters (gamma and beta) that scale and shift the normalized values, and two non-learnable statistics (mean and standard deviation). At test time, exponentially weighted moving averages (EWMA) of the mean and standard deviation collected during training are used.
  • Here is a sample way of applying batch normalization in TensorFlow : Batch norm
  • You can also apply batch normalization after the activation function, and in some cases that may perform well.
  • 📚Resources Deep Learning Specialization
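
The training-time and test-time behaviour described above can be sketched in NumPy. The function names and the choice to track a running variance (rather than a running standard deviation) are assumptions for this example; the TensorFlow layer referenced above handles all of this internally.

```python
import numpy as np


def batchnorm_train(z, gamma, beta, running_mean, running_var,
                    momentum=0.9, eps=1e-5):
    """Normalize with mini-batch statistics and update the running averages."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mu) / np.sqrt(var + eps)
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return gamma * z_hat + beta, running_mean, running_var


def batchnorm_test(z, gamma, beta, running_mean, running_var, eps=1e-5):
    """At test time, use the running averages instead of batch statistics."""
    return gamma * (z - running_mean) / np.sqrt(running_var + eps) + beta


rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=2.0, size=(64, 3))
gamma, beta = np.ones(3), np.zeros(3)
out, run_mean, run_var = batchnorm_train(z, gamma, beta, np.zeros(3), np.ones(3))
print(out.mean(axis=0), out.std(axis=0))
```

With gamma = 1 and beta = 0 the output of each feature has mean close to 0 and standard deviation close to 1; learning gamma and beta lets the network undo or adjust that normalization where useful.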

Day100

  • Today marks my 100 days of code achievement! I've never been this consistent before, maybe because this field excites me to sit down every day and learn something new, and each new algorithm I learn feels like gaining a new superpower. Today I learned about structuring and optimizing machine learning projects using orthogonalization, single-number evaluation metrics, and optimizing and satisficing metrics from the course Deep Learning Specialization.
  • Orthogonalization : It is a system design property that ensures modifying one component of an algorithm does not create side effects in other components. For example, early stopping prevents overfitting but also affects how well the cost function is optimized, so it is not considered good in terms of orthogonalization.
  • Single-number evaluation metric : Comparing two classifiers using both precision and recall can be confusing, so a single-number metric such as the F1 score, the harmonic mean of precision and recall, can be used to tell which classifier is better.
  • Optimizing and satisficing metrics : An optimizing metric is one you want to maximize or minimize to get the best possible outcome, whereas a satisficing metric only has to meet a certain threshold or minimum requirement. Real-world projects use a combination of both. For example, for a classifier the optimizing metric may be F1 score or accuracy and the satisficing metric may be runtime.
  • 📚Resources Deep Learning Specialization
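
Here is one way the optimizing/satisficing idea could be expressed in code. The model names, their precision, recall, and runtime values, and the 100 ms limit are all hypothetical numbers invented for the example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def pick_model(models, max_runtime_ms):
    """Keep models that satisfy the runtime limit, then maximize F1."""
    feasible = [m for m in models if m["runtime_ms"] <= max_runtime_ms]
    return max(feasible, key=lambda m: f1_score(m["precision"], m["recall"]))


# Hypothetical candidates, not real measurements.
models = [
    {"name": "A", "precision": 0.95, "recall": 0.90, "runtime_ms": 80},
    {"name": "B", "precision": 0.98, "recall": 0.85, "runtime_ms": 95},
    {"name": "C", "precision": 0.99, "recall": 0.97, "runtime_ms": 1500},
]
best = pick_model(models, max_runtime_ms=100)
print(best["name"])
```

Model C has the best F1 but fails the satisficing runtime constraint, so the choice is made among the remaining models on the optimizing metric alone.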

Day101

  • Today I continued learning about structuring machine learning projects: train/dev/test distributions and sizes, when to change the dev/test sets and metrics, and comparing with human-level performance and Bayes error. I also learned to create and use a custom loss, from the course Deep Learning Specialization.
  • Managing train/dev/test distribution : The dev and test sets should come from the same distribution. Choose dev and test sets that reflect the data you expect to get in the future and consider important to do well on.
  • Size of test and dev sets : Make your test set big enough to give high confidence in the overall performance of your system.
  • Things to remember to move in the right direction while working on machine learning projects:
    • When your metric does not capture the problem you want to solve, change the metric.
    • Orthogonalization on metrics : Define a metric to evaluate the model, and worry separately about how to do well on that metric.
    • If the model does well on the dev and test sets but not in your application, change your metric or your dev/test sets.
    • E.g. : ML engineers may spend months tuning their model against a dev set and metric, and when they check it on a test set with a different distribution it may not perform well.
    • Comparing with human-level performance : Once you surpass human-level performance, accuracy tends to improve only slowly, and it can never exceed the limit set by the Bayes optimal error. We can reason about avoidable bias and variance by comparing the training error and dev/test error, using human-level error as a proxy for Bayes optimal error. This means we should not expect the model to reach 0% error.
    • Avoidable bias is the gap between Bayes error (or human-level error) and training error, and can be reduced with a bigger network, longer training, or better optimization. The gap between training error and test error is a variance problem and can be reduced with more data, regularization, or hyperparameter search. Finally, I learned to create a custom loss and built the Huber loss, which combines mean squared error (MSE) and mean absolute error (MAE). To use a custom loss after loading a saved model, we have to provide the custom loss function when loading; you can see this in my code below :
  • 📚Resources Deep Learning Specialization
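
For readers who want the Huber loss spelled out, here is a framework-free sketch. The function name and the default `delta=1.0` are assumptions; it illustrates the formula and is not the Keras custom loss referred to above.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic (MSE-like) for small errors, linear (MAE-like) for large ones."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(y_true)


print(huber_loss([0.0], [0.5]))   # small error, quadratic branch
print(huber_loss([0.0], [3.0]))   # large error, linear branch
```

The threshold `delta` decides where the loss switches from the squared branch to the absolute branch, which is what makes it less sensitive to outliers than MSE.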

Day102

  • Today I learned best practices for developing machine learning applications by working through the case study of building a bird detection system for the city of Peacetopia, along with error analysis, using misclassified dev set examples to decide which errors to fix, and some healthy habits for machine learning project development.
  • Error analysis : Let's go through the example of a cat classifier. If we build a cat classifier and find that a few dogs are also classified as cats, what should we do? Should we spend time on the dog problem? The answer: check about 100 misclassified dev set examples and count how many are dogs. The percentage of dogs among the misclassified examples is the most you can reduce your error by, as a fraction of the current error. If that gives a good reduction in error, it is worth spending time on handling dogs as well.
  • Evaluating several error categories in parallel : Look at the percentage of misclassified examples in each category and work first on the categories with the highest percentage.
  • Always remember to apply any changes you make to the dev set to the test set as well, so that both keep the same distribution.
  • Also, build your first system quickly and iterate; after that you can use error analysis to decide which direction to move in for better performance. Here is a sample of error analysis to get the idea: click error analysis
  • 📚Resources Deep Learning Specialization
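
The "ceiling" reasoning in error analysis is simple arithmetic, sketched below. The 10% error rate and the count of 5 dogs out of 100 misclassified examples are hypothetical numbers for illustration.

```python
def error_ceiling(current_error, category_count, sample_size):
    """Best-case error if every mistake in one category were fixed."""
    share = category_count / sample_size
    return current_error * (1 - share)


# Hypothetical: 10% dev error, 5 of 100 inspected mistakes are dogs.
print(error_ceiling(0.10, 5, 100))    # about 0.095, a small gain
print(error_ceiling(0.10, 50, 100))   # about 0.05, worth the effort
```

If only 5% of the mistakes are dogs, fixing all of them moves the error from 10% to roughly 9.5%, which is why counting first is cheaper than building a dog-specific fix blindly.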

Day103

  • Today I learned about training and testing on different distributions, bias and variance with mismatched distributions, and addressing data mismatch from the course Deep Learning Specialization.
  • Training and testing on different distributions : This can arise in many machine learning projects. We choose dev/test sets that represent the data distribution we will encounter in the real-world application; the dev/test sets are the data on which we want to perform well. For example, for a cat classifier we can get millions of very clear cat images from the internet, but in the application users may upload blurry images from a phone camera. So the dev/test sets should be composed of these blurry mobile images, with some of them also added to the training set.
  • In such a situation the gap between training error and dev error no longer separates bias and variance cleanly, because the data distributions differ. So we create a new set called the training-dev set, which has the same distribution as the training set but is not used for training. The gap between training error and training-dev error shows the variance problem, and the gap between training-dev error and dev/test error shows the data mismatch problem.
  • Addressing data mismatch : Data mismatch is addressed by getting more data that resembles the dev/test distribution, for example through data synthesis. During data synthesis, be careful not to synthesize from only a tiny subset of all possible examples, because the model may overfit to that subset.
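
The gaps described above can be tabulated with a small helper. The function name `diagnose` and the error values are hypothetical and only illustrate how the comparison works.

```python
def diagnose(human_error, train_error, train_dev_error, dev_error):
    """Split the total error into the three gaps discussed above."""
    return {
        "avoidable_bias": train_error - human_error,
        "variance": train_dev_error - train_error,
        "data_mismatch": dev_error - train_dev_error,
    }


# Hypothetical errors: human 0%, train 1%, training-dev 1.5%, dev 10%.
gaps = diagnose(0.00, 0.01, 0.015, 0.10)
largest = max(gaps, key=gaps.get)
print(gaps, largest)
```

In this made-up case the biggest gap is between the training-dev and dev errors, which points to data mismatch rather than bias or variance.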

Day104

  • Today I learned about transfer learning and multi-task learning and when these techniques should be used, from the course Deep Learning Specialization.
  • Transfer learning : It is a technique where knowledge learned from one task is reused to boost performance on a related task. It is mostly used when you have a lot of data for the problem you're transferring from and relatively little data for the problem you're transferring to. The two tasks should also have the same kind of input. For example, a cat classifier trained on 1 million images (the pretrained model) is reused for dog classification by fine-tuning, i.e. replacing the output layer and adding extra layers if needed.
  • Multi-task learning : It is a technique that enables a single model to learn multiple tasks at once. It is used when training on a set of tasks that could benefit from shared low-level features. Usually the amount of data is quite similar for each task. For example, an object detector that detects cars, pedestrians, bicycles, etc.
  • 📚Resources Deep Learning Specialization

Day105

  • Today I learned about the end-to-end deep learning approach and the pipeline approach and their use cases from the course Deep Learning Specialization.
  • End-to-end deep learning : It works well when there is a very large amount of data for mapping input x directly to output y. It can be useful because it lets the data speak and is not limited by human preconceptions. Less hand-designing of components is needed.
  • Pipeline approach : In this approach the problem is split into steps that focus on the most important components. For example, for an employee recognition system you first take the whole picture, crop out the face, and compare the face with the original dataset to find a match, rather than learning from the whole image. It is very useful when large amounts of data are not available, and it relies on useful hand-designed components.
  • I also implemented transfer learning to create a classifier that detects Simpsons cartoon characters using the VGG16 architecture with some fine-tuning.
  • 📚Resources Deep Learning Specialization

Day106

  • Today I completed the prediction portion of Messy or Clean Room detection using the VGG16 architecture with transfer learning, and revised some computer vision topics like edge detection, padding, and strides from the course Deep Learning Specialization.
  • Padding : In a normal convolution, pixels at the edges are used less than pixels in the central area, and the image shrinks with each convolution. To avoid this, padding adds an external border (the pad) around the image, which prevents shrinking and gives the edges more weight. There are two types: "valid" means no padding, and "same" means padding so that the output size equals the input size. Here are some insights into making predictions with a saved model that was previously trained using transfer learning. making prediction
  • 📚Resources Deep Learning Specialization
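
The effect of padding and stride on the output size follows the formula floor((n + 2p - f) / s) + 1. The helper below is a small sketch of that formula; its name and the assumption that "same" padding is used with an odd filter size and stride 1 are choices made for this example.

```python
def conv_output_size(n, f, padding="valid", stride=1):
    """Output height/width for an n x n input and an f x f filter."""
    if padding == "same":
        p = (f - 1) // 2      # assumes odd f; keeps the size when stride is 1
    elif padding == "valid":
        p = 0
    else:
        raise ValueError("padding must be 'valid' or 'same'")
    return (n + 2 * p - f) // stride + 1


print(conv_output_size(6, 3))                     # valid: 6x6 shrinks to 4x4
print(conv_output_size(6, 3, padding="same"))     # same: stays 6x6
print(conv_output_size(7, 3, stride=2))           # stride 2: 7x7 becomes 3x3
```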

Day107

  • Today I revised convolution over volume, the notation of a convolution layer, the types of layers in a CNN, a convolutional neural network example (LeNet-5), and why convolutions are used instead of fully connected layers, from the course Deep Learning Specialization. I also spent some time developing a CNN from scratch and built a zero-padding function and a forward propagation function.
  • Convolution over volume : Applying convolution to an input image with a set of filters extracts features. For a single multi-channel image, i.e. an RGB image, each filter (for example a vertical or horizontal edge detector) must have the same number of channels as the image. The output, whose dimensions are shown in the image below, has a number of channels equal to the number of filters.
  • Types of layers in a CNN : Convolution layer, pooling layer, and fully connected (dense) layer.
  • An input image is taken and filters are applied to detect its features; this is the convolution operation, and the layer formed is the convolution layer. A pooling layer is then applied to reduce the dimensions. This combination of convolution + pooling is counted as one layer, and several such layers are stacked. The final output is flattened and passed to fully connected layers, which make the required prediction.
  • Why convolution : Convolution is used instead of a plain fully connected network because of:
    • Parameter sharing : A feature detector that is useful in one part of the image is also useful in another part, which reduces the required number of parameters.
    • Sparsity of connections : Each output value depends on only a small number of inputs.
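
The parameter savings from sharing can be checked with a quick count. The layer sizes below (a 32x32x3 input, six 5x5 filters, a 28x28x6 output) are an assumed example, and the helper names are hypothetical.

```python
def conv_params(f, c_in, n_filters):
    """Each filter has f*f*c_in weights plus one bias."""
    return (f * f * c_in + 1) * n_filters


def dense_params(n_in, n_out):
    """Fully connected layer: one weight per input/output pair plus biases."""
    return n_in * n_out + n_out


conv = conv_params(f=5, c_in=3, n_filters=6)
dense = dense_params(n_in=32 * 32 * 3, n_out=28 * 28 * 6)
print(conv, dense)
```

Mapping the same input volume to the same output volume takes a few hundred parameters with convolution and roughly 14 million with a dense layer.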

Implementation of CNN From scratch

Day108

  • Today I learned about some classical CNN architectures like LeNet-5 (~60k parameters), AlexNet (~60M parameters), and VGG-16 (~138M parameters), and also about ResNets, 1x1 convolution, and the Inception architecture. I also created a pooling function for CNNs from scratch.
  • ResNet : The Residual Network (ResNet) makes it possible to train much deeper neural networks with less trouble from exploding and vanishing gradients by using skip connections. When a connection is skipped, the later activation must match the dimensions of the earlier activation, which may have been changed by pooling or convolution; when they differ, the earlier activation is projected to the matching dimensions before the two are added. With ResNets we can train deeper networks that give better performance, which was hard in practice before ResNet even though deeper networks should do better in theory.
  • 1x1 convolution : 1x1 filters are used to reduce the number of channels; the output has as many channels as the number of 1x1 filters used. This reduces computational cost and puts more emphasis on the important channels.
  • Inception : The important features of an input are sometimes distributed locally and sometimes globally, so detecting them properly may need different filter sizes. Earlier architectures picked a single filter size for each layer, but Inception applies filters of different sizes in parallel and stacks their outputs together, using 1x1 convolutions to reduce computational cost.
  • Pooling function to downsample an image
  • 📚Resources Deep Learning Specialization

Day109

  • Today I dove deeper into the intuition behind depthwise separable convolution, MobileNets, MobileNetV2 and its bottleneck block, and EfficientNet from the course Deep Learning Specialization.
  • MobileNets : The main idea behind MobileNets is low computational cost, so they can be used on mobile devices and in embedded vision applications. Instead of normal convolutions, MobileNets use depthwise separable convolutions, made of a depthwise convolution followed by a pointwise convolution, which reduces the number of parameters and computations.
  • MobileNetV2 : MobileNetV2 is the improved version of MobileNet. Its block is: input -> expansion (a pointwise convolution that increases the number of channels) -> depthwise convolution (f x f filters, one per channel of the expanded volume) -> projection (another pointwise convolution that decreases the number of channels) -> output. This middle part (expansion -> depthwise -> projection) is known as the bottleneck block. It reduces computational cost while giving good performance in hardware-limited scenarios like mobile apps and embedded systems.
  • EfficientNet : EfficientNet is another architecture that gives you the flexibility to scale the resolution, depth, and width of a ConvNet to get the best performance within your computational budget.
  • 📚Resources Deep Learning Specialization
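
The savings from depthwise separable convolution can be estimated by counting multiplications. The shapes used here (3x3 filters, a 4x4 output, 3 input channels, 5 output channels) are an assumed example, and the helper names are hypothetical.

```python
def normal_conv_cost(f, n_out, c_in, c_out):
    """Multiplications: filter size * output positions * number of filters."""
    return f * f * c_in * n_out * n_out * c_out


def depthwise_separable_cost(f, n_out, c_in, c_out):
    """Depthwise step (one f x f filter per channel) plus pointwise 1x1 step."""
    depthwise = f * f * n_out * n_out * c_in
    pointwise = c_in * n_out * n_out * c_out
    return depthwise + pointwise


normal = normal_conv_cost(f=3, n_out=4, c_in=3, c_out=5)
separable = depthwise_separable_cost(f=3, n_out=4, c_in=3, c_out=5)
print(normal, separable, separable / normal)
```

The ratio works out to 1/c_out + 1/f^2, so the saving grows as the number of output channels increases.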

Day110

  • Today I learned about using open-source implementations of different network architectures, transfer learning, data augmentation, and the state of computer vision from the course Deep Learning Specialization, and spent some time implementing image augmentation on a single image.
  • Data augmentation : A technique for generating varied examples when the dataset is limited, by applying transformations like mirroring, random cropping, rotating, shearing, and color shifting. It reduces overfitting and can improve model performance.
  • State of computer vision : If you have more data, less hand engineering is required; if you have less data, you can use more hand engineering, and transfer learning if needed. Image augmentation implementation
  • 📚Resources Deep Learning Specialization

Day111

  • Today I revisited object detection in deep learning and learned about object localization, landmark detection, sliding windows, and YOLO (IoU, non-max suppression, anchor boxes) from the course Deep Learning Specialization.
  • Object localization : Detects objects and draws a bounding box around them, i.e. finds the location of the objects.
  • Landmark detection : The process of detecting significant landmarks within the image. These landmarks are points or features of interest such as corners, edges, object keypoints, facial features, or any other salient points that help describe the structure or content of an image.
  • Sliding windows by convolution : Sliding a window over the image stride by stride and classifying each crop separately is computationally costly. Instead, the fully connected layers are implemented as convolutional layers, so predictions for all window positions are computed in a single pass and the ones containing objects are selected. This method may still not give accurate bounding boxes.
  • YOLO algorithm : This algorithm finds bounding boxes more accurately than sliding windows, and it is faster. We take an image and split it into an SxS grid, and within each grid cell we take m bounding boxes. For each bounding box, the network outputs class probabilities and offset values for the box. The bounding boxes whose class probability is above a threshold are selected and used to locate the objects within the image.
    • Intersection over Union : IoU measures the overlap between the ground truth bounding box and the predicted bounding box. IoU = Area of Overlap / Area of Union. It is used to judge how accurate a predicted box is and to compare boxes during non-max suppression.
    • Non-max suppression : It keeps the bounding box with the maximum probability and removes the other boxes that overlap it heavily, so each object is detected only once.
    • Anchor boxes : When objects with different aspect ratios appear in the image, anchor boxes are used to detect them separately.
  • 📚Resources Deep Learning Specialization
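
IoU and non-max suppression as described above fit in a short, dependency-free sketch. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function names, sample boxes, and 0.5 threshold are choices made for this example.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 2, 2), (0.1, 0, 2.1, 2), (5, 5, 6, 6)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

The second box overlaps the first almost completely and has a lower score, so it is suppressed, while the third box covers a different region and survives.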

Day112

  • Today I learned about image segmentation, transpose convolution, and the U-Net architecture, and spent some time creating CNN models for detecting whether a person is smiling and for sign language detection, using the Sequential API and the Functional API, from the course Deep Learning Specialization.
  • Image segmentation : A method that labels every single pixel instead of drawing bounding boxes, used in autonomous driving and the medical field. U-Net is one of the popular image segmentation architectures. In the contracting path it uses normal convolution and pooling; in the expanding path it uses transpose convolution with skip connections that bring in the activations carrying more detailed, low-level spatial information. Finally, a 1x1 convolution with as many filters as there are classes produces an output whose height and width (HxW) equal the input's and whose third dimension equals the number of classes to segment.
  • Functional API vs Sequential API : The Sequential API works very well for a linear topology, but for a non-linear topology you need the Functional API. I will share some code snippets that show how to use the Sequential API and the Functional API.

Sequential API used to detect smiling faces: Functional API used to detect sign language:

Day113

  • Today I dove deep into the concepts behind facial recognition systems, like one-shot learning, Siamese networks, and triplet loss, from the course Deep Learning Specialization.
  • One-shot learning : Learning to recognize a person from a single example by learning a function that scores the similarity or difference between two images. For example, for face recognition you get one image from the camera and compare it with the images in the database. You get a difference score for each, and if the lowest score is below a threshold, you are recognized. You don't need to train on thousands of images of each person in different lighting, hence the name one-shot. One-shot learning uses a network called a Siamese network, trained with a loss called triplet loss.
  • Triplet loss : A loss function for learning the similarity or difference between items. It uses groups of three items called triplets: an anchor, a positive (similar item), and a negative (dissimilar item). The loss encourages the embeddings of the anchor and the positive to be closer than the embeddings of the anchor and the negative by at least a certain margin. Implementation of triplet loss :
  • 📚Resources Deep Learning Specialization
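
The triplet loss formula, max(d(A, P) - d(A, N) + margin, 0), can be sketched on plain lists. The embeddings, the margin of 0.2, and the helper names are assumptions for illustration and not the implementation referenced above.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by the margin."""
    return max(squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin, 0.0)


anchor, positive = [0.0, 0.0], [0.1, 0.0]
easy = triplet_loss(anchor, positive, negative=[1.0, 0.0])   # negative is far away
hard = triplet_loss(anchor, positive, negative=[0.2, 0.0])   # negative is too close
print(easy, hard)
```

Only the "hard" triplet produces a non-zero loss, which is why training usually focuses on triplets where the negative is close to the anchor.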

Day114

  • Today I learned about face recognition as a binary classification problem, neural style transfer and its cost function, and what deep ConvNets learn, from the course Deep Learning Specialization, and spent some time implementing ResNet.
  • Face recognition as a binary classification problem : You have images of verified people in a database and can precompute their embeddings. A new image whose face has to be recognized is embedded as well, and the input embedding and a database embedding are fed to a sigmoid unit that computes z = Wx + b, where x is the element-wise difference of the two embeddings. It outputs 1 for the same person and 0 for different people.
  • Neural style transfer : A technique that takes a content image and a style image and combines them to generate a new image showing the content drawn in the style of the style reference.
  • At first the generated image is initialized as noise. After running gradient descent to minimize the cost function iteration by iteration, the generated image looks more and more like a rendering that combines the style and content images.
    • Style cost function : It is essentially the squared Euclidean distance between the Gram matrix of the style image and that of the generated image, where the Gram matrix holds the correlation between the activations of channels k and k' (k' being another channel). The style of a layer is defined as the correlation between activations across channels.
    • Content cost function : It is essentially the squared Euclidean distance between the activations of the content image and the activations of the generated image.

    NOTE: The weighted combination of the content cost and the style cost gives the cost function for neural style transfer.

    • Deep ConvNet learning : Shallower layers detect simpler features like edges, color contrast, and corners, and deeper layers detect complex features like water, birds' legs, people, etc.
  • I also spent some time implementing ResNet and created a function for the identity block that shows skip connections.
  • 📚Resources Deep Learning Specialization

Day115

  • Today I created ResNet-50 from scratch. First I built a convolutional block function with a skip connection whose shortcut path has a convolutional layer to equalize the dimensions of input and output. I also used the identity block built yesterday, which has a skip connection too but requires the input and output to have the same dimensions and has no convolutional layer on the shortcut path. Using these two functions I built the entire ResNet-50 model, inspired by the course Deep Learning Specialization.
  • Convolutional_block : It has a skip connection with a convolutional layer in the shortcut path to make the input and output dimensions equal.
  • ResNet-50 : Built using convolutional_block and identity_block.
  • 📚Resources Deep Learning Specialization

Day116

  • Today I implemented MobileNet-V2 using transfer learning for an image classification task and obtained around 78% accuracy in 5 epochs. Since it was a binary classification problem, I used only one neuron in the dense output layer. The main reason for using MobileNet-V2 was its tradeoff between accuracy and performance: it runs in a small memory footprint because depthwise separable convolutions reduce the number of trainable parameters. Below is the code snippet for transfer learning; I hope you get some insights from reading it. I will add a more concise explanation once I complete fine-tuning it for higher accuracy.
  • 📚Resources Deep Learning Specialization

Day117

  • Today I fine-tuned my MobileNet-V2. When fine-tuning, the first layers detect simple low-level features like edges and colors, so changing them is not that important; instead you should retrain the deeper, final layers, because they detect high-level features. I first unfroze the layers, then fine-tuned layers 120 to 155, which were the last layers, and ran 5 more epochs from where the previous training left off, which increased the accuracy from 78% to 92%. Below is the code snippet; I hope you get some insights from reading it.
  • 📚Resources Deep Learning Specialization

Day118

  • Today I dove deeper into the YOLO model for object detection and got a better understanding of the functions under the hood: filtering the boxes, applying non-max suppression to select the best bounding boxes among the confident ones, using the IoU threshold to remove overlapping bounding boxes, and overall how the model converts the image to an encoding and how the encoding is converted to bounding boxes that detect objects with scores and classes.
  • Below is a code snippet showing how the YOLO model works

Let's break down the steps:

    1. Your deep CNN YOLO model takes an image as input and gives you the encoding as output.
    2. We then take this encoding and convert it to the best bounding boxes, scores, and class names using some custom functions.

YOLO filter boxes : It returns the boxes whose score is above the threshold, out of all H_image x W_image x anchor_box boxes.

IoU : A function used in non-max suppression to remove bounding boxes whose overlap with a selected box is above the IoU threshold.

Non-max suppression : It selects the most confident bounding boxes among all the boxes.

Converting the YOLO model encoding to bounding boxes, scores, and classes

Putting it all together : Using the above functions to detect objects in an image with a pretrained model.

Day119

  • Today I implemented the U-Net architecture to perform image segmentation, which means labeling each pixel of an image. It is basically made of:
    • Encoding part : First the image is passed to conv_block(), which downsamples it using normal convolution and returns next_block and skip_connections. next_block is used by the next convolution block, which also downsamples, and skip_connections is used by the corresponding decoding block.
    • Decoding part : It takes the previous layer as its first parameter and the skip connection as its second parameter and performs transpose convolution to upsample the image. In the last convolution, the number of filters equals the number of classes to be labeled.
    • U-Net model : It uses this encoding and decoding to create the segmentation of the image.
  • 📚Resources Deep Learning Specialization

Day120

  • Today I started LLM from Scratch, learned how to create my own kernel and set up the required environment, and learned about word, subword, and character tokenizers, where:
  • A word tokenizer splits the raw text into words based on a delimiter, which creates a huge vocabulary. A character tokenizer splits the raw text into individual characters; this gives a small vocabulary, but a single character carries little context. A subword tokenizer addresses both problems by leaving frequently used short words whole and splitting rare long words into smaller meaningful pieces. I also created a character tokenizer that encodes and decodes each character. I hope you gain some insight from it :
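
A character tokenizer along the lines described above might look like the following sketch. The sample text and the names `stoi`, `itos`, `encode`, and `decode` are assumptions for this example, not necessarily the author's code.

```python
text = "hello world"

# Vocabulary: every distinct character, in sorted order.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character


def encode(s):
    """Turn a string into a list of integer ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Turn a list of integer ids back into a string."""
    return "".join(itos[i] for i in ids)


ids = encode("hello")
print(ids, decode(ids))
```

The vocabulary stays tiny (one entry per distinct character), but each token carries very little meaning, which is the tradeoff subword tokenizers try to balance.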

Day121

  • Today I explored the basic operations in PyTorch and how to use PyTorch to process sequential data. I learned to properly use the GPU with .to('cuda') when a GPU is available, and compared GPU and CPU performance in PyTorch: ptyorch_basic

Day122

  • Today I explored PyTorch functions such as sampling with torch.multinomial(), concatenation with torch.cat(), and creating upper and lower triangular matrices with torch.triu() and torch.tril() respectively. I also learned about transposing, stacking, masking, and above all the basic building blocks of neural networks, torch.nn and torch.nn.functional.
  • Below is the implementation of all these basic PyTorch functions.
  • 📚Resources LLM from Scratch

Day123

  • Today I learned about nn.Embedding and dived deep into creating a Bigram Language Model that predicts the next token based on the previous token, where the model takes vocab_size as input.
    • forward pass : It takes index and target = None as parameters and returns logits and loss.
    • generate : It takes the indices of the current context and returns them with the newly generated indices appended. You can get some insights from the code below. In the coming days I will be fine-tuning the model I created.
  • 📚Resources LLM from Scratch

Day124

  • Today I created the evaluation function and optimized my Bigram Language Model. I used AdamW as the optimizer, and after a few thousand iterations the Bigram Language Model was able to generate text in a format similar to the training data.
  • Bigram Language Model : A Bigram Language Model is a type of statistical language model that predicts the probability of a word based on the preceding word in a sequence of words. It is a simple and intuitive approach to language modeling and is often used as a baseline or for educational purposes. With this, my Bigram Language Model was complete, I hope you get some insights from reading this.
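To make the bigram idea concrete, here is a minimal count-based sketch at character level. Note that this is a swapped-in technique: my actual model learns its probabilities with nn.Embedding and AdamW, while this sketch just counts pairs. All names here are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict


def train_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def next_char_probs(counts, prev):
    """Turn the follower counts of `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {ch: c / total for ch, c in counts[prev].items()}


def generate(counts, start, length, seed=0):
    """Sample up to `length` new characters, one bigram step at a time."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        options = counts.get(out[-1])
        if not options:  # no known follower, stop early
            break
        chars, weights = zip(*options.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)


counts = train_bigram("abababac")
probs_after_a = next_char_probs(counts, "a")
```

Each generated character depends only on the one before it, which is why the output copies the local format of the training data but carries no longer context.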

Day125

  • Today I revisited some activations like sigmoid, softmax and tanh, dived deeper into the transformer architecture and learned every detail possible. I learned about
  • Masked multi-head attention, which attends only to the current and previous tokens. It does not attend to the next tokens, because the model could then simply copy the answer and memorize or overfit instead of learning to predict it, so the masking is done using a lower triangular matrix.
  • I also spent some time upgrading my Bigram Language Model to a GPT language model, created the weight initialization function, and continued the forward pass by adding the sequential decoder blocks.
  • 📚Resources LLM from Scratch
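The lower triangular masking can be illustrated without PyTorch. The sketch below is an assumption-laden toy version: it takes a T x T list of raw attention scores and applies softmax over the visible positions only, which gives the same weights as filling the upper triangle with -inf before a softmax.

```python
import math


def causal_attention_weights(scores):
    """scores[t][s] is the raw score of query position t for key position s."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Lower triangle: position t may only look at positions 0..t
        visible = scores[t][: t + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        # Future positions get exactly zero weight
        weights.append([e / z for e in exps] + [0.0] * (T - t - 1))
    return weights


weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
```

With equal scores, each row spreads its attention evenly over the tokens seen so far: [1, 0, 0], then [0.5, 0.5, 0], then one third each.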

Day126

  • Today I created the decoder block of my GPT language model, which is simply: multi-head attention -> add and normalize -> feedforward -> add and normalize. After creating the Block class, I created the feedforward class used in the decoder block, which is simply Linear -> ReLU -> Linear. Below is the code snippet of the decoder block and its feedforward portion.
  • 📚Resources LLM from Scratch

Day127

  • Today I created the multi-head attention class and the Block class. Multi-head attention provides the relation of each word with the others, that is, the dependencies between tokens or words. Most importantly, I used nn.ModuleList() to hold the heads; each head attends independently (conceptually in parallel) and their outputs are concatenated.
  • Below is the code snippet for this multi-head attention.
  • Head
  • Multi-Head attention
  • 📚Resources LLM from Scratch

Day128

  • Today I continued to work on my large language model and created a system that feeds a large corpus of text data to the LLM for training. The corpus was the openwebtext data with around 20k separate files, so I wrote a script that reads all files in .xyz format and appends them to a list.
  • From the list of .xyz files I split them into parts, each part containing a few files.
  • A loop ran over each part and updated the vocabulary for that part, i.e. the set of characters was updated.
  • Each character was stored on a separate line.
  • 📚Resources LLM from Scratch

Day129

  • Today I updated my data feeding scripts and created a system that processes training and validation data separately, which seems more effective than storing the whole corpus in a single file and splitting it during training. I also learned about authenticating admin rights when running scripts that need them, like extracting the whole set of 20k files.
  • 📚Resources LLM from Scratch

Day130

  • Today I revisited the concept of RNNs, how they work, and the different types of RNN like one-to-one, one-to-many, many-to-one and many-to-many along with their use cases. I also learned about the loss function for an RNN model, forward propagation, and backpropagation through time. I also spent some time on language models and sequence generation.
  • Language model and sequence generation : A language model takes a sentence as input and estimates its probability, predicting each word given the previous ones. Using this language model we can do sequence generation, where a word sequence is generated. Today I started Shakespearean text generation and created the data reading and encoding scripts.
  • 📚Resources Deep Learning Specialization

Day131

  • Today I learned about topics like sampling novel sequences, Gated Recurrent Unit (GRU), Long Short Term Memory (LSTM), bidirectional RNNs and deep RNNs from the course Deep Learning Specialization.
    • Sampling novel sequences uses np.random.choice on the output softmax probabilities to pick a random token and feeds it as the input of the next time step.
    • GRU is a simpler version of LSTM that uses two gates, i.e. the update gate and the relevance gate, whereas LSTM uses 3 gates: update, forget and output. Simple RNNs are not able to capture long-range dependencies, so LSTM and GRU models are used.
    • GRU is used for simpler tasks that need less computation, whereas LSTM is more powerful and computationally heavier.
    • In a bidirectional RNN the output layer can get information from both past and future states. BRNNs combined with LSTM and GRU are used in many tasks.
    • Deep RNNs, i.e. RNN, GRU or LSTM layers stacked on top of each other, are also used in complex NLP tasks where long-range dependencies need to be captured. I also spent some time creating the test and validation sets for Shakespeare text generation.

Day132

  • Today I trained an RNN model combined with GRU for Shakespeare text generation, and also completed the programming assignment of building an RNN model from scratch from the course Deep Learning Specialization.
  • Here is the code snippet of today's progress on training the RNN model for Shakespeare text generation.
  • 📚Resources Deep Learning Specialization

Day133

  • Today I continued to learn more about NLP and word embeddings. A word embedding uses a featurized representation of words in a higher dimension, which captures the relation of words with each other and can be visualized in 2D using t-SNE. Word embeddings are mostly used in the following way:
  1. Learn word embeddings from a large text corpus, i.e. 1-100B words, or download pretrained word embeddings.
  2. Transfer the embeddings to a new task with a smaller training set, i.e. around 100k words.
  3. Optional : Continue fine-tuning the word embeddings with the new data.
  • Also learned about the properties of cosine similarity and how it is used to compare word embeddings with similar features, for example man is to woman as king is to queen.
  • Embedding matrix * one-hot encoding of a word = embedding vector.
  • Also completed the week 1 first assignment of building an RNN from scratch : click
  • 📚Resources Deep Learning Specialization
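Both ideas, cosine similarity for analogies and the embedding matrix times a one-hot vector, fit in a short sketch. The 2-D embeddings below are made-up toy values (first axis roughly "gender", second roughly "royalty"), and every function name is an assumption for illustration.

```python
import math

# Toy embeddings, invented for this example only
embeddings = {
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "king": [1.0, 1.0],
    "queen": [-1.0, 1.0],
    "apple": [0.0, -1.0],
}


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Find d so that a is to b as c is to d, maximizing cos(e_b - e_a + e_c, e_d)."""
    target = [eb - ea + ec for ea, eb, ec in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


vocab = list(embeddings)
# Embedding matrix of shape (dimension, vocabulary size): one column per word
E = [[embeddings[w][d] for w in vocab] for d in range(2)]
king_vector = matvec(E, one_hot(vocab.index("king"), len(vocab)))
answer = analogy("man", "woman", "king")
```

Multiplying by the one-hot vector simply selects the column of the word, which is why real libraries use a lookup instead of an actual matrix product.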

Day134

  • Today I learned about different embedding algorithms used in language models and how they work, like continuous bag of words and Word2Vec embeddings, from the course Deep Learning Specialization.
  • Continuous bag of words simply uses the surrounding words to predict the target word, which requires more computational power because more words have to be taken as context.
  • Word2Vec/skip-gram takes one word as context and a randomly chosen nearby word as target to calculate P(target | context), but it has to calculate a softmax over a large vocabulary, so it uses negative sampling, which takes one positive sample and k negative samples (2 <= k <= 20) to reduce computation. I also spent some time reading about regular expressions from the book Speech and Language Processing.
  • 📚Resources Deep Learning Specialization

Day135

  • Today I learned about Global Vectors (GloVe), another popular embedding technique that uses a co-occurrence matrix to encode the meaning of a word relative to other words in a large text corpus. I also learned to use word embeddings with an RNN for NLP tasks like sentiment classification, and learned how embedding vectors can pick up stereotypes like man:woman is as computer_programmer:house_wife, and how to debias them by:
    • Identifying the bias direction.
    • Neutralizing non-definitional words like programmer, doctor and nurse in the gender scenario, using a linear classifier to identify the non-definitional words.
    • Equalizing pairs, i.e. making definitional words like son and daughter equidistant from non-definitional words like doctor.
    • 📚Resources Deep Learning Specialization

Day136

  • Today I worked on a character-level language model for generating names using an RNN. The steps were clipping gradients to handle exploding gradients, sampling novel sequences and optimizing, after which the final model was ready and able to generate new dinosaur names.
  • I spent some time learning about the basic model, picking the most likely sentence, and beam search from the course Deep Learning Specialization.
  • During machine translation it is not good to sample randomly as in a language model; instead we pick the most likely sentence, which is selected using an algorithm called beam search.
  • Beam search is an algorithm that uses conditional probabilities to select the most likely sentence by considering the highest-probability candidates. The number of candidates kept at each step is determined by a parameter known as the beam width.
  • 📚Resources Deep Learning Specialization
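Beam search can be sketched with a toy next-token table standing in for the RNN. Everything here (the table, the token names, the "<eos>" marker, the function names) is a hypothetical example, not the course's code.

```python
import math


def beam_search(next_log_probs, beam_width, max_len, eos="<eos>"):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [([], 0.0)]  # (tokens so far, summed log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            # Completed hypotheses are set aside and not expanded further
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])


# Toy conditional probabilities: the best first token is not on the best path
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def toy_model(seq):
    probs = TABLE.get(tuple(seq), {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy_seq, greedy_score = beam_search(toy_model, beam_width=1, max_len=5)
beam_seq, beam_score = beam_search(toy_model, beam_width=2, max_len=5)
```

With a beam width of 1 the search is greedy and commits to "a" (total probability 0.3), while a width of 2 keeps "b" alive and finds the more likely sentence "b z" (probability 0.36).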

Day137

  • Today I optimized the previously developed dinosaur name generating model, which basically uses a simple RNN with novel sequence sampling to get unique names. It did pretty well and generated cool dinosaur names.
  • I also spent some time training a Keras model that uses LSTM to handle longer dependencies and generate a poem from one given word. The poem was similar to Shakespeare's style because the model was trained on Shakespeare's poems.
  • 📚Resources Deep Learning Specialization

Day138

Day139

  • Today I spent time creating a webapp for neural style transfer, which has some options for the style and an option for the input image; the webapp loads a pretrained model to stylize the image. In neural style transfer the content image is rendered in the artistic style of the style image. I used the available pretrained VGG16 models for the different styles and created separate functions for loading the model and styling, which helped with caching and made the webapp faster. Style transfer Webapp

Day140

  • Today I learned about length normalization, the beam search algorithm and error analysis in beam search.
  • Length normalization basically divides the sum of the log probabilities by the length of the output sentence Ty. Using logs gives a more stable result because the product of probabilities becomes a sum of logs.
  • The beam search algorithm, as we already know, is used to find the most likely sentence. It has a parameter B: a large B gives better results but is slower, and a small B gives worse results but is faster, so a proper tradeoff is required.
  • Also learned about error analysis with beam search. If the probability of the human translation is greater than the probability of the algorithm's translation, then the algorithm chose a less likely sentence, so beam search is at fault. If the probability of the algorithm's translation is greater than that of the human translation, then the RNN is at fault. By looking at the fraction of each case we decide which one to work on, i.e. the RNN or beam search.
  • Also scratched my head on Retrieval Augmented Generation, which basically reduces LLM hallucination. When the LLM does not have a source to answer from, or is outdated with respect to the user query, it may give a made-up answer that can be wrong, so RAG adds a source from which relevant content is retrieved to ground the LLM's response.
  • 📚Resources Deep Learning Specialization
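A small sketch of length normalization, assuming the per-token probabilities of a candidate sentence are already known. The alpha exponent is an assumption I added: alpha = 1 divides by the full length Ty as described above, and smaller values give a softer normalization.

```python
import math


def raw_log_prob(token_probs):
    """Sum of log probabilities, i.e. the log of the product."""
    return sum(math.log(p) for p in token_probs)


def normalized_log_prob(token_probs, alpha=1.0):
    """Sum of log probabilities divided by (length ** alpha)."""
    return raw_log_prob(token_probs) / (len(token_probs) ** alpha)


short_sentence = [0.5]        # one fairly unlikely token
long_sentence = [0.9] * 8     # eight confident tokens
```

Without normalization the short sentence wins simply because it multiplies fewer probabilities; after dividing by the length, the long but confident sentence scores higher.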

Day141

  • Today I learned about the BLEU score from the course Deep Learning Specialization and spent some time implementing it.
  • The BLEU (Bilingual Evaluation Understudy) score is a single evaluation metric that checks how good a machine translation is by comparing it to reference human translations. It is mostly used in machine translation and sometimes in text generation tasks. Below I have provided a sample of the BLEU score on two predictions, where you can see how a good prediction gets a higher BLEU score.
  • 📚Resources Deep Learning Specialization
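A simplified BLEU can be sketched as follows. It assumes a single reference translation, whitespace tokenization and n-grams up to 2, so its numbers will differ from library implementations; the function names are my own.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """n-gram precision where each n-gram count is clipped by the reference count."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def bleu(candidate, reference, max_n=2):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    # Brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_avg)


reference = "the cat is on the mat".split()
good = "the cat sat on the mat".split()
bad = "a dog".split()
good_score = bleu(good, reference)
bad_score = bleu(bad, reference)
```

The clipping is what stops a degenerate output such as "the the the the the the the" from getting a perfect unigram precision: it only earns 2 out of 7.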

Day142

  • Today I dived deep into the intuition behind the attention mechanism from the course Deep Learning Specialization and also implemented a GRU to generate Shakespeare text. Since training such a model on a large corpus takes time, I tried to train it on the GPU but failed. I also revisited the concept of the attention model and how it captures longer dependencies than LSTM.
  • Below is the code snippet of Shakespeare text generation.
  • 📚Resources Deep Learning Specialization

Day143

  • Today I finally concluded my Shakespeare text generation model by training it on a local GPU. It showed some good signs, as it was able to generate text similar to Shakespeare's style. I used random sampling to generate the next token, which helps produce more creative text, but a GRU alone seems not good enough to capture long-range dependencies, as the generated output was not completely right.
  • Below is the snippet of Shakespeare text generation, I hope you have a good time training your own model and tweaking parameters to generate even better results.
  • shakespeare text generation
  • 📚Resources Deep Learning Specialization

Day144

  • Today I dived deep into the most interesting topic, the attention model using a bidirectional RNN, and got a better understanding of how the attention score is calculated by feeding in the activations of the input RNN and the state of the RNN that takes the context.
  • I also learned about speech recognition systems that use attention or, specifically, the CTC method, and learned about the trigger word detection technique with RNNs from the course Deep Learning Specialization.
  • 📚Resources Deep Learning Specialization

Day145

  • Machine learning is not always about creating notebooks, preprocessing data, running models and making predictions. A complete machine learning cycle also requires organizing your project well, from creating a virtual environment to setting things up in case you launch your own packages.
  • So today I learned to create a setup.py that turns a Python project into a complete package. By adding '-e .' to requirements.txt, running 'pip install -r requirements.txt' also triggers the setup.py installation.
  • In simple words, I learned to create packages.
  • Below is the code snippet of setup.py, I hope you have a good time creating your own packages. End to End Machine learning project
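One detail of this setup is that the '-e .' line is an instruction for pip, not a dependency, so it should be filtered out before the list is handed to setup(). A minimal sketch of such a helper is shown below; the function names and the example project name are hypothetical, and the setup() call is only shown as a comment so the snippet can run on its own.

```python
from pathlib import Path

EDITABLE_FLAG = "-e ."


def parse_requirements(lines):
    """Return package names, skipping blanks, comments and the '-e .' line."""
    cleaned = [line.strip() for line in lines]
    return [r for r in cleaned if r and not r.startswith("#") and r != EDITABLE_FLAG]


def get_requirements(path):
    """Read a requirements file and parse it."""
    return parse_requirements(Path(path).read_text().splitlines())


# In setup.py the result would typically be passed on like this:
# from setuptools import setup, find_packages
# setup(
#     name="my_project",            # hypothetical name
#     version="0.0.1",
#     packages=find_packages(),
#     install_requires=get_requirements("requirements.txt"),
# )

example = parse_requirements(["numpy", "pandas", "", "# dev tools", "-e ."])
```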

Day146

  • Today I continued designing the end-to-end project, learned more about file structure, and developed a custom error handling script based on the sys module, which provides the information needed to trace an error: the error message, the file name and the line number where it occurred.
  • Below is the code snippet of exception handling, I hope you have a good time finding and fixing those errors while running your end-to-end ML model.
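A custom exception along these lines might look like the sketch below. It assumes the exception is created inside an except block, so that sys.exc_info() still holds the active traceback; the class and function names are illustrative.

```python
import sys


def error_message_detail(error):
    """Build a message with the file name and line number of the active error."""
    _, _, tb = sys.exc_info()
    if tb is None:  # not called from inside an except block
        return f"Error: {error}"
    # Walk to the innermost frame, where the error was actually raised
    while tb.tb_next is not None:
        tb = tb.tb_next
    file_name = tb.tb_frame.f_code.co_filename
    return f"Error in script [{file_name}] at line [{tb.tb_lineno}]: {error}"


class CustomException(Exception):
    def __init__(self, error):
        super().__init__(str(error))
        self.detail = error_message_detail(error)

    def __str__(self):
        return self.detail


def safe_divide(a, b):
    try:
        return a / b
    except Exception as exc:
        # Wrap the original error so the message carries file and line details
        raise CustomException(exc) from exc
```

Calling safe_divide(1, 0) raises a CustomException whose message includes the original "division by zero" text together with the script name and line number.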

Day147

  • From today I am going to continue my 300daysofdata in the direction of an NLP specialization, embracing modern technology like LLMs. I have studied RNN, LSTM, GRU and transformers many times, but from now on this journey is going to take a more practical approach. I am going to start from text processing and will be learning the newest technologies like LangChain and vector databases, so if you are also planning to specialize in NLP let's get connected and start.
  • Topics : Tokenization, stemming and lemmatization, text preprocessing, stop words, punctuation removal and text normalization, spaCy
  • Hands on : Applied the above concepts on a project called Anime recommender system.

💡 Notes:

  1. Tokenization means breaking a sentence into words or parts.
  2. Lemmatization is the process of reducing a word to its root form or 'lemma'.
  3. Stemming is the process of reducing a word to its stem. It is similar to lemmatization, but less accurate and computationally faster.
  4. Removing stop words is the process of removing words like 'and' and 'the', which don't convey much meaning.
  5. Text normalization is the process of converting text to a more consistent form, for example converting it to lowercase.

Below is a sample code snippet that covers all these concepts, I hope you have a good time understanding the basics :
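The steps in the notes can be sketched with the standard library only. This is not the spaCy pipeline I used in the project: the stop word list is a tiny made-up sample and crude_stem is a toy suffix stripper, so treat it as an illustration of the order of the steps.

```python
import re

# Tiny illustrative stop word list, far smaller than a real one
STOP_WORDS = {"and", "the", "is", "a", "of", "to", "in"}


def normalize(text):
    """Text normalization: lowercase everything."""
    return text.lower()


def tokenize(text):
    """Tokenization: keep runs of letters, digits and apostrophes, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text)


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Toy stemmer: strip a common suffix if at least 3 characters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = remove_stop_words(tokenize(normalize(text)))
    return [crude_stem(t) for t in tokens]


result = preprocess("The cats are playing in the garden!")
```

A real lemmatizer would also map "are" to "be", which a suffix-stripping stemmer cannot do; that is the accuracy difference mentioned in note 3.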

Day148

  • Topics : Bag of words, TF-IDF, Word2Vec, i.e. text representation.
  • Hands on : Using spaCy to build a toxic comment classifier. 💡 Notes:
  • Bag of words : Represents a document as a vector where each element corresponds to the frequency of a word in the document.
  • TF-IDF : In this technique, the importance of a term is based on its frequency in the document and its rarity across the entire corpus.
  • Word2Vec : Represents words as continuous vector embeddings in a high-dimensional space.
  • 🎯 Note to take : Out of these 3 methods, Word2Vec is the most advanced; it captures semantic relations but requires a large amount of data.
  • 🧠 I also applied this text preprocessing and text representation to build a toxic comment classifier. Below is the code of the classifier using a count vectorizer. It did decently, but I found that whenever the word "hate" appeared it labeled the comment as toxic even when the comment was not toxic, so I have decided to use Word2Vec next and see the result. I hope you have a good time building your own toxic comment classifier.
  • Code snippet of toxic message classifier
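Bag of words and TF-IDF can be written out by hand to see what the vectorizers compute. This sketch uses the plain textbook formula tf * log(N / df); library implementations such as scikit-learn apply smoothing and normalization, so their numbers will differ. The documents and names are made up.

```python
import math
from collections import Counter


def bag_of_words(doc_tokens, vocabulary):
    """One count per vocabulary word, in vocabulary order."""
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocabulary]


def tf_idf(docs):
    """Return one {word: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(w for doc in docs for w in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights


docs = [["i", "hate", "spam"], ["i", "love", "you"]]
vocabulary = sorted({w for doc in docs for w in doc})
bow = bag_of_words(docs[0], vocabulary)
weights = tf_idf(docs)
```

The word "i" appears in every document, so its TF-IDF weight is zero, while "hate" gets a positive weight. Neither representation looks at word order or negation, which is one reason a comment containing "hate" can be flagged regardless of context.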

Day149

  • Topics : Word2Vec embeddings and their implementation in the toxic comment classifier, which should give better results than the previous count vectorizer representation.
  • In today's learning and coding I found that my toxic comment classifier did well on normal, easy comments, but a comment like "Hate is everywhere but I don't hate you" was identified as toxic, which it actually is not. I thought it was not capturing context words like "don't", which was true because they were removed as stop words. I then modified my stop word list, but even after that there was no change. I also used 100-dimensional Word2Vec embeddings instead of 25-dimensional ones, which did increase accuracy somewhat but still did not do well on that particular comment, which should be non-toxic. ⚠️ So I ended today's session with a challenge that I will certainly try to solve; one of my next steps will be using a sentence transformer.

Day150

  • Topics : sentence transformer and its implementation.
  • Notes : Today I learned about sentence transformers and used one to calculate the similarity between two sentences.
  • Sentence transformer : It is a model that converts a sentence into a vector representation that captures semantic information, making it useful for various NLP tasks such as similarity comparison, clustering, and information retrieval.
  • Below is the code snippet of implementing a sentence transformer, I hope you have a good time implementing sentence transformers in your own applications.

Day151

  • Topics : Implementation of sentence transformer into toxic message classifier
  • Today I used sentence transformer embeddings to represent sentences for the toxic message classifier and trained a logistic regression, which as expected showed an improvement in accuracy over the Word2Vec embeddings. I also designed my own custom stop word list to preserve more context in the sentence for some tricky comments.
  • Below is the code snippet of the toxic message classifier, I hope you get some insights from my learning.

Day152

  • Today I revisited the concept of transformers, starting from embeddings and how attention and multi-head attention provide better embeddings for words like apple, which can be the company or the fruit depending on the context of the sentence. I also played with the Hugging Face pipeline for sentiment analysis and text generation. I will be spending more time learning about the transformer block and how it is used in large language models.
  • Here is a simple code snippet of using the Hugging Face transformers pipeline for sentiment analysis and text generation.

Day153

  • Today I revised my understanding of the attention mechanism and transformer language models from the Cohere LLM University course.
  • Attention mechanism in short : Sometimes a word like apple can convey two meanings, e.g. it can be a company or a fruit. The attention mechanism distinguishes whether apple is the company or the fruit by looking at the context of the sentence.

Let's talk about the transformer language model in as short a form as possible. Its architecture is as follows:

  • input -> (Tokenization)-> (embedding) ->(positional encoding) -> (series of Transformer block { attention component and feed forward layer }) -> (softmax layer) -> output
  • Tokenizer: Turns words into tokens.
  • Embedding: Turns tokens into numbers (vectors)
  • Positional encoding: Adds order to the words in the text.
  • Transformer block: Guesses the next word. It is formed by an attention block and a feedforward block.
  • Attention: Adds context to the text.
  • Feedforward: Is a block in the transformer neural network, which guesses the next word.
  • Softmax: Turns the scores into probabilities in order to sample the next word.

Macro precision, recall and F1 score for multilabel classification

Also learned about macro precision, macro recall and macro F1 score, which are calculated by computing precision, recall and F1 score for each class and averaging them over the number of classes.
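The macro averages can be computed by hand as in the sketch below. The function name and the example labels are made up, and a class with no predictions or no true samples is given a score of 0 here, which is an assumption about how to handle that edge case.

```python
def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    # Every class counts equally, no matter how many samples it has
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


y_true = ["a", "a", "b", "c"]
y_pred = ["a", "b", "b", "c"]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
```

Because each class has the same weight, a rare class that the model handles badly pulls the macro score down as much as a common one would.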

  • After that I covered some basics of LangChain, discovering its LLMChain and PromptTemplate, and created a simple pet name generator. Here is the code snippet of it.

Day154

  • Today I launched my first LangChain application with the help of Streamlit and GitHub. I also added a field for entering an API key in the webapp so that it can generate pet names according to user input.
  • 🍃petname generator

Day155

  • Today I explored LangChain agents, which use a language model as a reasoning engine to determine which actions to take and in which order. I also learned about agent tools, which are the functions that agents invoke. I then used two tools, wikipedia for information and llm-math for the calculation part, and built a simple agent that gets the average lifespan of a tortoise and of a human and subtracts them. Setting verbose = True showed me the inner process of the agent.
  • Below is the code snippet of the agent that I built, I hope you have a good time building your own agent for your own greater purpose.

Day156

  • Today I continued my learning on LangChain and learned about the different components that perform different operations. I used the YouTube loader, which takes a YouTube URL and converts the video to a transcript.
  • After that I split the transcript into small chunks with a chunk_size of 1000 and stored them in vector format, i.e. using FAISS.
  • Then I created a function that takes the vector store of docs, finds the docs most similar to the provided query, and answers using the OpenAI text-davinci-003 model.
  • I also created a UI for the user with Streamlit, and in the end I was able to generate useful insights from a 3-hour-long podcast.
  • Below is a sample of the YouTube assistant I built using LangChain. I hope you have a good time reading my documentation; the code will be out soon on my GitHub.

Day157

  • Today I learned about AutoGen and it is so exciting. AutoGen lets multiple agents work together to solve tasks iteratively.
  • It is mostly composed of a user proxy agent and an assistant agent. The user proxy agent works on behalf of the user and gives feedback to the assistant agent, whereas the assistant agent interacts with the user proxy agent and provides answers, from writing code to anything else the user proxy desires.
  • Also learned about static and dynamic conversations. In a static conversation the agents follow a predefined topology, whereas in a dynamic conversation the agent topology can change between iterations; its types are auto reply and function calls.

Day158

  • Today I learned about creating structured data from unstructured data like a text corpus using LangChain and pydantic classes, which is one of my personal favourite uses of LangChain. I also spent some time exploring OpenAI function calling, which can be configured as automatic function call, no function call or compulsory function call according to the requirement.
  • I also implemented a score ranking system from text using pydantic. Below is a sample code snippet of using pydantic and LangChain to extract structured data from unstructured data.

Day159

  • Today I explored the Model I/O modules of LangChain and learned about output parsers, which are used to convert unstructured data into structured data. I also spent some time learning to design a response schema to get the output in JSON format, which was a success. I hope you also harness this powerful LangChain method to get structured data from raw unstructured text.

Day160

  • Today I learned about the document loader component of LangChain and explored different document loaders like text, CSV and unstructured markdown. I applied the unstructured markdown loader and converted the result to vector form using FAISS.
  • Below is the sample code of using LangChain to convert text data to vector form, which is essential in LLM applications, for example for similarity_search.

Day161

  • Today I learned to integrate SQL in LangChain, with which a natural language prompt can be converted into SQL queries. This is used in chatbots that retrieve data from a database and in other NLP tasks that need to extract data from a database.
  • Below is a simple snippet of integrating the SQL chain in LangChain, I hope you have a good time using LangChain.

Day162

  • Today I learned about Retrieval Augmented Generation (RAG), which helps prevent LLM hallucination by retrieving relevant content from the user's data. I also implemented a simple RAG where I used a LangChain document loader to read my GitHub repo and asked it to prepare a roadmap of what I have learned from day one until now, and it did a pretty good job.
  • Below is the code snippet of the RAG implementation in LangChain, I hope you also had a good time reading this documentation.

Day163

  • Today I implemented a RAG system using the Mistral7B model and built a QA bot that answers user queries from a vector store holding the vector embeddings of a PDF file.
  • Below is the code snippet of the QA bot I built using the Mistral7B model.

Day164

  • Today I learned to run an LLM locally using llama cpp. I downloaded the Mistral 7B model and loaded it locally, and also learned about different llama cpp parameters that make the model configuration more efficient, like n_gpu_layers, which sets the number of layers offloaded to the GPU, n_ctx, the token window size, and f16_kv, which when true uses half precision and saves some memory.

Day165

  • Today I revisited Hugging Face before diving deep into LLMs. I relearned the Hugging Face pipeline and broke down its steps into: text input -> AutoTokenizer (handles both tokenization and encoding) -> AutoModel (uses the transformer to produce hidden states) -> hidden states (high-dimensional output of the transformer) -> head (a task-specific model such as AutoModelForSequenceClassification) -> logits -> torch.nn.functional.softmax() -> predictions
  • I also spent some time learning to handle multiple sequences by adding an additional dimension, and learned to handle sequences of different lengths using padding.
  • Below is the code snippet and flow of the Hugging Face pipeline. I will be going deeper into Hugging Face for a few days.
  • 📚 Resources : Hugging face Nlp course
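Two steps of this flow, padding a batch and turning logits into predictions, can be illustrated in plain Python. The token ids and logits below are made-up numbers and pad_id = 0 is an assumption; a real tokenizer and model provide these values.

```python
import math


def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


input_ids, attention_mask = pad_batch([[5, 6, 7], [8]])
probs = softmax([2.0, 0.0])
```

The attention mask matters because the padded positions would otherwise change the hidden states, and therefore the logits, of the shorter sequence.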

Day166

  • Today I learned a unique concept where I chained two different chains: the first chain generates an output and the second chain evaluates the result of the first one. Overall I created a sequential chain where two different chains are chained together to make conversable agents.
  • I also continued my learning on Hugging Face, where today I learned about fine-tuning a model and fine-tuned BERT on a few sentences as a trial. I will be learning more about dataset preparation for further training of the model.
  • Below is the code snippet of fine-tuning BERT, I hope you are also spending time training models for your own greater purpose.
  • 📚 Resources : Hugging face Nlp course

Day167

  • Today I dived deep into preparing datasets for fine-tuning a model. I used the Hugging Face datasets library, tokenized the datasets, and used the map function to process the dataset in batches so that it runs smoothly in RAM. I also learned to make every batch of equal length with the dynamic padding provided by Hugging Face's DataCollatorWithPadding, which is essential for making quality datasets.
  • You can also explore similar topics in the Hugging Face docs and prepare your own datasets.

Datasets Preparation Snippet

  • I also spent some time tinkering with fine-tuning the BERT model using the Trainer API and improved the F1 score from 87% to 89%. You can check my code in the snippet below for fine-tuning BERT.
  • 📚 Resources : Hugging face Nlp course

Day168

  • Today I learned to create a full training cycle to train BERT using PyTorch by preprocessing the Hugging Face MRPC dataset, creating training and validation dataloaders, preparing batches and passing them to the model, and achieved an F1 score of ~90%.
  • Also learned to use get_scheduler to decay the learning rate linearly from its maximum, i.e. 5e-5, to 0.
  • Below is the code snippet of my training cycle to train BERT, you can also train your own model for greater use.
  • 📚 Resources : Hugging face Nlp course
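The linear decay itself is simple enough to write by hand. The function below is only a sketch of the shape of that schedule in plain Python, not the get_scheduler API; the optional warmup argument is my own addition and defaults to no warmup.

```python
def linear_decay_lr(step, total_steps, max_lr=5e-5, warmup_steps=0):
    """Learning rate at `step` for a linear schedule ending at 0."""
    if warmup_steps and step < warmup_steps:
        # Optional linear warmup from 0 up to max_lr
        return max_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return max_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 100
schedule = [linear_decay_lr(s, total_steps) for s in range(total_steps + 1)]
```

The learning rate starts at 5e-5, reaches half of that at the midpoint, and is exactly 0 at the last step.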

Day169

  • Today I spent my time learning to use Hugging Face pretrained models through the pipeline and the Auto classes, and also used a conversational pretrained model to create a simple chatbot.
  • Below is the code snippet of using a pretrained model from Hugging Face, I hope you are having a great time learning Hugging Face for NLP and deep learning tasks.

Day170

  • Today I learned to share a pretrained model on the Hugging Face model hub using the huggingface_hub API, and also learned to version control models using git lfs, which is the simplest way of controlling model versions.
  • It was also fun to run inference on a model I fine-tuned or created and to open it up for public use.
  • Below is a snippet of a simple way to push your model to the Hugging Face hub.
  • 📚 Resources : Hugging face Nlp course

Day171

  • Today I dived deep into preprocessing large datasets from the Hugging Face hub. I learned about loading your own datasets using load_dataset(), inspecting some samples of the dataset, and preprocessing the required parts using the map function. I also understood the power of batched=True in map(), which processes many examples per call and lets fast tokenizers work in parallel to speed up preprocessing.
  • Below is the code snippet where I have done some preprocessing of datasets, I hope you gain some insight from it.
  • 📚 Resources : Hugging face Nlp course

Day172

  • Today I concluded the datasets part of the Hugging Face course. I learned to tokenize with truncation so that tokens come back in chunks of constant length together with the overflowing tokens, to convert a dataset to DataFrame format, do the required preprocessing and convert it back to a Dataset, to prepare clean train, validation and test sets, and to save the dataset to the local machine.
  • Below is the code snippet of all the tinkering with datasets I did today, I hope you gain some valuable insights and new ideas from it.
  • 📚 Resources : Hugging face Nlp course

Day173

  • Today I learned about memory mapping and streaming, which are essential when working with multiple gigabytes of data. Large datasets are needed to fine-tune LLMs, but loading them is challenging, so memory mapping and streaming are used to handle them.
  • In streaming, the data is read in small pieces: one piece is processed in memory, then released, and the next piece is loaded.
    • take() returns the first n examples of the dataset. Useful for preparing small datasets like a validation set.
    • skip() skips the first n examples of the dataset. Useful for preparing larger datasets like a training set.
  • Below is the code for loading a 14 GB dataset using streaming. Hope you get some insights from the code snippet.
  • 📚 Resources : Hugging face Nlp course
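The take() / skip() split above can be sketched with plain Python iterators. This is only an illustration of the idea, assuming a hypothetical `stream_records` generator as a stand-in for a streamed dataset; it does not use the datasets library.

```python
from itertools import islice


def stream_records(n):
    """Stand-in for a streamed dataset: yields one record at a time."""
    for i in range(n):
        yield {"id": i}


def take(stream, n):
    """First n records of the stream (e.g. a small validation set)."""
    return islice(stream, n)


def skip(stream, n):
    """Everything after the first n records (e.g. the training set)."""
    return islice(stream, n, None)


validation = list(take(stream_records(10), 3))
train = list(skip(stream_records(10), 3))
print(len(validation), len(train))
```

Only one record is materialized at a time until `list()` is called, which is why this pattern works for data that does not fit in memory.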

Day174

  • Today I learned about preparing your own datasets with Hugging Face, where I spent my time using the GitHub API to scrape issue data and preparing a dataset by loading the scraped data into Dataset format.
  • Below is the code snippet of preparing your own dataset and loading it. Hope you get some ideas from the snippet.
  • 📚 Resources : Hugging face Nlp course

Day175

  • Today I learned about semantic search using FAISS, a fast and efficient way of searching where the dataset is mapped to embeddings and, using similarity search, the nearest embeddings are returned as search results.
  • I also loaded the GitHub issues data, preprocessed it, and applied cls_pooling, which takes the last hidden state of the [CLS] token as the embedding of the text.
  • Below is the code snippet of semantic search using FAISS. Hope you gain some insight reading it.
  • 📚 Resources : Hugging face Nlp course
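A minimal sketch of the nearest-embedding idea, using a brute-force cosine similarity search in pure Python as a stand-in for a FAISS index. The tiny 3-dimensional vectors and issue titles are made up for illustration; real embeddings come from a model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, corpus, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in corpus.items()]
    return sorted(scored, reverse=True)[:k]


# Toy "embeddings" (hypothetical values).
corpus = {
    "load dataset offline": [0.9, 0.1, 0.0],
    "tokenizer crashes": [0.0, 1.0, 0.1],
    "cache datasets locally": [0.8, 0.2, 0.1],
}
results = search([1.0, 0.0, 0.0], corpus, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

FAISS does the same job with specialized index structures, so it stays fast when the corpus has millions of vectors.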

Day176

  • Today I created a DataFrame for my semantic search results and sorted it by score so that the highest scores are at the top. In short, when the user enters a query, the semantic search returns the solutions that match the query and displays their comments, scores, titles and URLs.
  • Below is the code snippet of displaying the semantic search results.
  • 📚 Resources : Hugging face Nlp course

Day177

  • Today I learned about the benefits of training a new tokenizer from a pretrained tokenizer:
  • For a text corpus that is different from the corpus the language model was trained on.
  • For data in a different language, domain and so on.
  • Below is the code snippet of training a new tokenizer.
  • 📚 Resources : Hugging face Nlp course

Day178

  • Today I enjoyed understanding the token-classification pipeline by implementing what it does under the hood. By default, AutoTokenizer loads a fast tokenizer, which is very efficient on batches of sentences. I also learned how the "simple" aggregation strategy groups token classifications into words by returning mean scores and mapping each entity to its start:end span.
  • Below is the code snippet of token classification, both with the simple pipeline and with its under-the-hood implementation.
  • 📚 Resources : Hugging face Nlp course

Day179

  • Today I dove deep into what the QA pipeline does under the hood by implementing it, and learned how masking is used to exclude the question tokens so that the answer comes only from the context, and how the start and end indices of the answer in the context are computed. I also learned the theory of different tokenization techniques like Byte-Pair Encoding, WordPiece tokenization and Unigram tokenization.
  • Below is the code snippet of implementing the QA pipeline and its internals. Hope you gain some insights reading it and implementing it too.
  • 📚 Resources : Hugging face Nlp course

Day180

  • Today I dove deep into implementing every step, from loading the dataset, tokenizer and model to preparing the training arguments, to fine-tune a BERT model for NER, and uploaded the fine-tuned model to Hugging Face.
  • Also implemented dynamic padding using a collator function and designed custom evaluation functions for NER that ignore padded values.
  • I also spent some time researching different approaches to improve our RAG systems and learned about creating parent and child splits for better retrieval without missing data, and about using Cohere reranking on the retrieved results.
  • Below is the code implementation of NER using a BERT model.
  • 📚 Resources : Hugging face Nlp course

Day181

  • Today I dove deep into domain adaptation and knowledge distillation, and learned about masked language models and perplexity, which is an evaluation metric for generation tasks. I also spent time preprocessing the IMDb data, fine-tuning a masked language model on the IMDb dataset for domain adaptation, and finally running inference with it.
  • Domain adaptation : The process of fine-tuning a pretrained language model on in-domain data, which makes it better suited to the downstream task (i.e. the task you want to solve).
  • Knowledge distillation : Transferring knowledge from a large model to a small model, which makes the small model more capable while saving computational cost.
  • Masked Language Model : This model predicts the words that should fill in the blanks. It is also popular for fine-tuning on domain-specific data.
  • Perplexity : It is simply the exponential of the cross-entropy loss; lower perplexity means a better language model.
  • Below is the inference snippet; please check my GitHub link for the full fine-tuning process.
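A small worked example of perplexity as the exponential of the average cross-entropy (negative log-likelihood) over tokens. The probabilities are made-up values that a hypothetical model assigned to the correct tokens.

```python
import math


def perplexity(token_probs):
    """exp(mean negative log-likelihood) of the probabilities given to the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


confident = perplexity([0.5, 0.5, 0.5])   # model is fairly sure of each token
uncertain = perplexity([0.1, 0.1, 0.1])   # model spreads probability thinly
print(confident, uncertain)
```

A perplexity of k can be read as the model being as uncertain as if it were choosing uniformly among k tokens at each step.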

Day182

  • Today I got to see domain adaptation in action. A pretrained English-to-French translation model left some words untranslated, such as email, plugin and thread. After fine-tuning it on English-to-French data that contains translations for those words, the model adapted to the domain, and at inference the fine-tuned model translated those English words into French easily.
  • Also learned about the BLEU score, which compares a generated translation with human references using n-gram overlap; a higher BLEU score indicates a better translation.
  • Below is the inference output showing the difference made by domain adaptation for English-to-French translation; check the full code in the link below.
  • 📚 Resources : Hugging face Nlp course
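A minimal sketch of the n-gram overlap at the core of BLEU: the clipped n-gram precision of a candidate against one reference. Full BLEU additionally combines several n-gram orders with a geometric mean and applies a brevity penalty, which are left out here. The sentences are toy examples.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0


reference = "the cat sat on the mat".split()
candidate = "the cat on the mat".split()
p1 = clipped_precision(candidate, reference, 1)
p2 = clipped_precision(candidate, reference, 2)
print(p1, p2)
```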

Day183

  • Today I dove deep into the summarization task in NLP, where I spent my time creating the dataset for the model, preprocessing the data and tokenizing it to fine-tune a multilingual T5 model, which is well suited to summarization. Also learned about the evaluation metric for summarization, ROUGE, which measures the overlap of words between the generated summary and the reference summary, providing a quantitative measure of the model's effectiveness.
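The word overlap that ROUGE measures can be sketched for the simplest variant, ROUGE-1 (unigram overlap). This is a simplified illustration with whitespace tokenization and no stemming, not the official scorer; the example summaries are made up.

```python
from collections import Counter


def rouge1(generated: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 between two texts."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())          # clipped count of shared words
    precision = overlap / max(1, sum(gen.values()))
    recall = overlap / max(1, sum(ref.values()))
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the movie was great", "the movie was really great fun")
print(scores)
```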

Day184

  • Today I learned about knowledge graphs, where entities are represented by nodes and the relationships between them are represented by edges, which captures the semantic relationships in text.
  • Also created a movie knowledge graph, extracted entities, and created a Cypher function to reply with a movie's release date.
  • 🎯 Created the interface for the chatbot using Gradio.

🎯 Below is the code snippet and demo of the chatbot that searches the knowledge graph to reply with info.

Day185

  • Today I spent my time researching how knowledge graphs + RAG work together to build better LLM applications, and I started working on extracting relationships and entities to build a graph database. So at first I created a chain using MapReduceDocumentsChain().
  • MapReduceDocumentsChain() : This chain first passes each document through an LLM, then reduces them using the ReduceDocumentsChain. Useful in the same situations as ReduceDocumentsChain, but does an initial LLM call before trying to reduce the documents.
  • For a better understanding of MapReduceDocumentsChain(), check the image and code snippet below.
  • Note : I personally find MapReduceDocumentsChain() very useful for preparing a graph database. In the end I got relationship-rich summaries, going from documents of length 77k to documents of length 3k while capturing the semantic relationships.

Day186

  • Today I concluded my session on creating a knowledge graph from text data using Neo4j, with spaCy and an LLM for extracting entities and relationships. The overall steps I followed for creating the knowledge graph were:

    • First, creating summaries of the text data (or any other data) that capture the essence of the data without losing the relationships between entities.
    • Second, extracting entities and relationships using the spaCy LLM pipeline for relationship extraction.
    • Third, using ChatGPT to generate Cypher queries from the JSON file containing the entities and relationships.
    • Finally, entering the Cypher queries in the Neo4j workspace, and your knowledge graph is ready.
  • Below is the code snippet of the process of extracting relationships and entities using spaCy LLM, and the knowledge graph I created.

  • I will be using this knowledge graph to build hybrid search in the coming days, so keep watching this repo of my learning.

Day187

  • Today I was astonished reading the Alpaca paper, which reveals the recipe for preparing a dataset to fine-tune an instruction-following model, i.e. LLaMA 7B, to get a model comparable to text-davinci-003 at very low cost, i.e. the Alpaca model.
  • Simply put, LLaMA 7B was fine-tuned on 52k supervised instruction examples that were generated with text-davinci-003 from 175 human-written seed instructions. The resulting Alpaca model is much smaller than text-davinci-003, but its performance was close to it, and even better in some cases.
  • So this research showed a way to obtain a high-quality instruction-following model by combining a high-quality pretrained model, i.e. LLaMA, with high-quality instruction-following data, i.e. using an existing language model to automatically generate the instruction data.

Day188

  • Today I dove deep into reading the research paper WizardLM: Empowering Large Language Models to Follow Complex Instructions, where I got to know about Evol-Instruct. Below is the summary of what I learned; you can read it in just a few minutes.
  • Training LLMs on closed-domain instructions, i.e. a specific type of instruction, resulted in two drawbacks:
    • 1. No diversity, only common instructions.
    • 2. Instructions only ask for one task, whereas in real life human instructions contain multiple tasks.
    • To avoid this, LLMs such as OpenAI's were trained on open-domain instructions written by human annotators, which was time consuming and expensive, and the instructions were skewed towards easy and medium difficulty rather than complex ones.
    • So here comes the hero, Evol-Instruct, a novel technique that automates instruction generation using an LLM and produces more balanced open-domain instructions of various difficulty levels to improve the performance of LLMs.

    How it generates instructions.

    • It generates instructions in two ways: in-depth evolving and in-breadth evolving.
      • In in-depth evolving, more complex instructions are generated by adding constraints, deepening, concretizing, increasing reasoning steps and complicating the input.
      • In in-breadth evolving, new instructions are generated to increase diversity.

Results of evol instructions.

Day189

  • Today I dove deep into understanding the difference between open-domain instruction fine-tuning and closed-domain instruction fine-tuning.

  • Closed-domain instruction fine-tuning fails in real-world scenarios, where fine-tuning a language model on diversified open-domain instructions works well.

  • So Evol-Instruct is used to generate such diverse open-domain instruction datasets. Its pipeline consists of an instruction evolver, which uses an LLM to add variation and complexity to instructions, and an instruction eliminator, which filters out instructions that fail to evolve.

  • 📚 Resources : WizardLM: Empowering Large Language Models to Follow Complex Instructions

Day190

  • Today I spent my time learning to create an instruction dataset to fine-tune an LLM, where I found out that Weights & Biases provides great visualizations of model training and evaluation. Weights & Biases is also a great platform to observe and analyze your data.
  • While preparing the data I also learned about packing, where token IDs from several examples are put together until they fill the maximum sequence length of the LLM, which makes fine-tuning more memory efficient.
  • Completed reading the WizardLM paper, whose main theme is using an LLM to create complex instruction datasets that are more balanced than human-created ones, because human annotation is time consuming and human-created data is skewed towards simpler instructions.
  • Below is a sample code snippet of preparing the dataset and using wandb.log to inspect it. I will be fine-tuning a Llama model in the coming days. I hope you have a good time reading this.
  • 📚 Resources How to finetune LLM part 1

Day191

  • Today I dove deep into implementing packing instead of padding, and also used Weights & Biases for saving the packed data. I was happy with the features provided by wandb, like data version control and API access to your dataset. Packing appends the next sequence as soon as the current one ends to fill the window, instead of adding padding tokens that carry no value. This reduces the number of sequences, which cuts computation and makes fine-tuning more efficient.
  • Packing : Packing is a technique for LLM inputs that concatenates multiple tokenized sequences into a single input, which we can call a ‘pack’. It eliminates padding, which does not contribute to language understanding.
  • Below is the code snippet where I have implemented packing on the Alpaca dataset and saved the result as a Weights & Biases artifact.
  • 📚 Resources How to finetune LLM part 1
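A minimal sketch of packing on toy token IDs. It assumes each sequence is terminated by an end-of-sequence token (`eos_id`, here 0) and that the final block, if incomplete, is dropped; both are illustrative choices, not necessarily what the course code does.

```python
def pack(sequences, max_len, eos_id):
    """Concatenate tokenized sequences (each followed by EOS) and cut into max_len blocks."""
    stream = []
    for seq in sequences:
        stream.extend(seq + [eos_id])
    # Keep only complete blocks; the leftover tail is dropped in this sketch.
    n_full = len(stream) // max_len
    return [stream[i * max_len:(i + 1) * max_len] for i in range(n_full)]


tokenized = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packs = pack(tokenized, max_len=4, eos_id=0)
print(packs)
```

With padding, these four sequences would need four inputs of length 4 with seven padding tokens; packed, they fit in three inputs with no padding at all.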

Day192

  • Today I dove deep into understanding different ways of reducing memory usage while fine-tuning LLMs, like freezing layers, freezing embeddings and gradient checkpointing, and implemented them in code for fine-tuning the Llama 7B model.
  • If you are also interested in optimizing memory while fine-tuning LLMs, you can read about the following topics.
  • Freezing layers to save memory : Transformer-based LLMs like Llama 2 7B are a stack of 32 identical layers, so if you cannot fit the training of the whole model on your GPU, you can train only the last few layers, for example the last 8, which are closest to the predictions.
  • LoRA : You can also adapt the whole model using a parameter-efficient fine-tuning technique like LoRA.
  • Freezing embeddings : Freezing the embeddings can also decrease memory usage.
  • Gradient checkpointing : Gradient checkpointing recomputes the activations of certain layers during the backward pass instead of storing them, saving memory at the cost of additional compute time.
  • Automatic mixed precision : Using float16 or bfloat16 makes training faster since the model is trained in half precision.
  • Below are the code snippets where I implemented the techniques above, like gradient checkpointing and freezing the embeddings and model layers, to make fine-tuning possible in a GPU-efficient way.
  • 📚 Resources How to finetune LLM part 2

Day193

  • Today I dove deep into understanding QLoRA, which allows efficient fine-tuning of LLMs while preserving performance. It makes it possible to train even a 20B LLM in Google Colab. I also broke down the steps of using QLoRA with Parameter-Efficient Fine-Tuning (PEFT).
    1. Use bitsandbytes to quantize your model to 4-bit.
    2. Apply gradient checkpointing and prepare your model for k-bit training.
    3. Set up LoraConfig and use get_peft_model to get your parameter-efficient model.
    4. Now your model is ready to be fine-tuned with efficient memory usage.
  • Below is the code snippet of fine-tuning an LLM using QLoRA. Hope you have a good time fine-tuning your LLM on the setup you have available, made possible by QLoRA.

Day194

  • Today I connected the things I learned previously, like quantizing an LLM to 4-bit and using LoRA with PEFT to fine-tune the model, so I fine-tuned Llama 2 7B on Google Colab, which was awesome for me.
  • I also used the Hugging Face TrainingArguments and Trainer to make training simpler and more efficient. Below is the code snippet of fine-tuning Llama 2 7B on the Alpaca dataset.

Day195

  • Today I learned more about chat templates. Using a different format from the one the chat model was trained on causes silent performance degradation, so making sure the correct format is used is necessary for good performance, which is where chat templates come in.
  • Chat Template : Chat templates are Jinja template strings that are saved and loaded with your tokenizer, and that contain all the information needed to turn a list of chat messages into a correctly formatted input for your model.
  • Below are some chat template formats to use :

Day196

  • Today I learned about an interesting technique for handling an edge case while building RAG systems, i.e. maximal marginal relevance (MMR).
  • MMR selects examples based on a combination of which examples are most similar to the inputs, while also optimizing for diversity. It first selects the most similar example, then penalizes the remaining candidates for being similar to those already selected, which creates diversity.
  • Below is a sample example of MMR. Hope you gain some insight from it.
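A minimal pure-Python sketch of the MMR selection loop described above. The 2-dimensional vectors are made-up toy embeddings, and `lambda_mult` is the assumed name of the relevance/diversity trade-off weight (1.0 means relevance only).

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Penalty: similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.2], [0.5, -0.5]]  # docs 0 and 1 are near-duplicates
diverse = mmr(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr(query, docs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With the diversity penalty, the second pick skips the near-duplicate document in favor of a less similar one.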

Day197

  • Today I wrapped up a course by DeepLearning.AI, i.e. LangChain with your data, where I revised my LangChain concepts and got to know some valuable edge-case handling techniques like :

Some edge cases and their handling in LangChain

  • Avoiding duplicated responses and adding diversity to responses using Maximal Marginal Relevance (MMR) retrieval.
  • Getting responses only from the data specified by the user, using metadata and SelfQueryRetriever.
  • Contextual compression for getting better responses by shrinking the context to the relevant compressed data.
  • Getting responses to follow-up questions using memory.
  • 📚 Resources : Langchain with your data

Day198

  • Today I learned about running LLMs locally using LM Studio, a UI-based application that lets you run LLMs locally on your PC. It is quite handy and easy to use.
  • I downloaded and used phi-2, a 2.7B small language model, with 2-bit quantization, which was decent on simple tasks.
  • Using LM Studio you can also check the performance of your fine-tuned LLM very easily.
  • Below is the snippet comparing phi-2 at 2-bit vs 4-bit, where 4-bit did well but 2-bit performed badly, although 2-bit was more resource efficient, as expected.
  • 📚Resources: LLM course

Day199

  • Today I worked on developing a completely free local document question-answering bot, where I used Hugging Face embeddings and LlamaCpp. After that I created a RAG chain to answer from my documents, and in the end I had a working local-data QA bot.
  • Below is the code snippet for developing your own local QA bot free of cost.
  • I will be working to make it more robust in the coming days.

Day200

  • Today I wrapped up my local document bot, which can read local files and answer queries from the documents. Thanks to Mitko Vasilev for inspiring me to use local models; it feels great to own your local LLM.
  • I used the open-source model phi-2 (2.7B) with llama-cpp-python to run it locally.
  • I used Hugging Face embeddings.
  • I also used LangChain directory loaders to load files.
  • Below is the code and a sample of the DocBot I developed. I will be making it conversational in the coming days.
  • Docsbot

Day201

  • Today I started reading and implementing LLM fine-tuning from Maxime Labonne's blog posts, where I learned about the difference between Supervised Fine-Tuning (SFT) and RLHF (Reinforcement Learning from Human Feedback) in terms of how they are used:

  • SFT : Models are trained on labelled data and the weights are adjusted to minimize the loss.

  • RLHF : The model learns by receiving feedback from humans. A reward model trained on human preferences is used to optimize the agent's policy using a reinforcement learning algorithm like PPO.

  • For SFT, a dataset whose format is similar to the prompt format brings better results.

  • Below is a sample diagram showing how RLHF works.

  • 📚Resources: LLM course

Day202

  • Today I dove deep into Mistral 7B, where I learned about sliding window attention and its benefits over vanilla attention, and also started to prepare a dataset for fine-tuning Mistral 7B.
  • Sliding Window attention : Each token attends to a fixed-size window of previous tokens that slides along the sequence, instead of all previous tokens as in vanilla attention, where the attended span keeps growing with sequence length. This reduces computation and lowers inference latency.
  • The rolling buffer cache in Mistral overwrites the oldest tokens with new tokens once the cache size is exceeded.

Tips for preparing datasets for fine-tuning.

  • Removing samples whose token count exceeds the context size of the model you are fine-tuning.
  • Removing samples with similar outputs, using embeddings to measure similarity.
  • Sampling the top-k samples with the largest number of tokens.
  • Putting the dataset in the template format expected by the model you are training.
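A toy sketch of both ideas, assuming a window of 3 tokens. The mask marks which earlier positions each token may attend to, and the cache stores the entry for position i in slot i mod window, so older entries are overwritten. The class and function names are hypothetical, and the strings stand in for real key/value tensors.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j (causal, last `window` tokens)."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


class RollingBufferCache:
    """Fixed-size cache: position i is written to slot i % window."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window

    def add(self, position, kv):
        self.slots[position % self.window] = kv


mask = sliding_window_mask(seq_len=5, window=3)
cache = RollingBufferCache(window=3)
for pos in range(5):
    cache.add(pos, f"kv{pos}")  # placeholder for the key/value tensors
print(mask[4], cache.slots)
```

The cache never grows beyond the window size, no matter how long the generated sequence becomes.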

Notebooks for preparing datasets

preparing datasets to finetune LLM

  • 📚Resources: LLM course

Day203

  • Today I dove deep into fine-tuning Mistral 7B on the 1k Guanaco dataset that I prepared. I fine-tuned the Mistral 7B model on Google Colab by quantizing the model to 4-bit and using LoRA, PEFT and the SFT trainer. At last I merged the LoRA weights into the base model and pushed the fine-tuned model to the Hugging Face Hub.
  • Below is the code snippet of merging LoRA with the base model using the peft library.
  • 📚Resources: LLM course

Day204

  • Today I published Mistral 7B after fine-tuning it on the Guanaco 1k dataset in Google Colab, where I used 4-bit quantization and was able to fine-tune the model using 13.1 GB of GPU memory in 1 hr 36 min.
  • I have also published the fine-tuning notebook; you can check out Finetuning Mistral 7b.
  • I have also published the fine-tuned model on Hugging Face; you can check out Finetuned Mistral 7B model.
  • 📚Resources: LLM course

Day205

  • Today I wrapped up my fine-tuning session of Mistral 7B on Google Colab by merging the LoRA weights with the base model, and the final model was published to Hugging Face.
  • Check out all the details of the models and notebooks in this repository -> Finetuning Mistral7B
  • 📚Resources: LLM course

Day206

  • Today I learned about the KV cache technique in detail, with implementation. It is one of the ways of reducing LLM inference time, by reducing the time taken to generate each subsequent token.

  • In simple words, during token generation the past key/value tensors are returned along with the token ID. The new token ID is then passed to the model together with the past key/values, which avoids recomputing the keys and values of earlier tokens and only adds some concatenations.

  • Code :

  • 😁 Check out the decrease in subsequent token generation time for the KV cache, shown in yellow.

  • 📚Resources : Efficiently Serving LLMs
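A toy sketch of why the KV cache saves work. It only counts how many key/value projections are computed with and without a cache; `ToyLayer` and its fake projection are hypothetical stand-ins for a real attention layer, and the "generated" tokens are fixed in advance.

```python
class ToyLayer:
    """Counts how many key/value projections are computed."""

    def __init__(self):
        self.kv_calls = 0

    def project_kv(self, token_id):
        self.kv_calls += 1
        return (token_id * 0.5, token_id * 2.0)  # toy (key, value) pair


def generate_without_cache(layer, prompt, new_tokens):
    tokens = list(prompt)
    for tok in new_tokens:
        # Every step re-projects the keys/values of the whole sequence so far.
        _ = [layer.project_kv(t) for t in tokens]
        tokens.append(tok)
    return tokens


def generate_with_cache(layer, prompt, new_tokens):
    tokens = list(prompt)
    past_kv = []
    pending = list(prompt)  # tokens whose keys/values are not cached yet
    for tok in new_tokens:
        past_kv.extend(layer.project_kv(t) for t in pending)
        tokens.append(tok)
        pending = [tok]     # next step only needs the newest token
    return tokens


plain, cached = ToyLayer(), ToyLayer()
out_plain = generate_without_cache(plain, [1, 2, 3], [4, 5, 6])
out_cached = generate_with_cache(cached, [1, 2, 3], [4, 5, 6])
print(plain.kv_calls, cached.kv_calls)
```

The gap widens with sequence length: without a cache the number of projections grows quadratically, with a cache it grows linearly.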

Day207

  • Today I learned about batching tokens for higher throughput in LLM inference.
  • Batching : Batching tokens results in higher throughput but also increases latency, so a good trade-off between latency and throughput can be maintained by selecting an optimal batch size.
  • 🥳 Check out the plot of the trade-off between throughput and latency with respect to batch size. Which batch size would you select?
  • 📚Resources : Efficiently Serving LLMs

Day208

  • Today I learned about continuous batching, a method that gives low latency together with higher throughput. I also spent some time reading about composition in LangChain, like tools and customizing the default tools.
  • Continuous Batching : The core idea is to swap requests that have completed generation out of the batch and replace them with requests waiting in the queue.
  • Below is the implementation of what I learned about tools, toolkits and defining custom tools.
  • 📚Resources : Efficiently Serving LLMs
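The swapping idea behind continuous batching can be sketched with a small simulation that counts decode steps. It assumes each request is described only by the number of tokens it still has to generate and that one decode step produces one token per active request; the function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching(request_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(request_lengths), batch_size):
        steps += max(request_lengths[i:i + batch_size])
    return steps


def continuous_batching(request_lengths, batch_size):
    """Finished requests are swapped out and queued requests take their slots."""
    queue = deque(request_lengths)
    active = []
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step generates one token per active request
        active = [left - 1 for left in active if left - 1 > 0]
    return steps


lengths = [2, 8, 3, 2, 4]
static_steps = static_batching(lengths, batch_size=2)
continuous_steps = continuous_batching(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

In the static version, short requests leave their slot idle while the long request in the same batch finishes; continuous batching reuses that slot immediately.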

Day209

  • Today I dove deep into understanding the details of Parameter-Efficient Fine-Tuning (PEFT) using LoRA and QLoRA, which reduces the memory requirement by 90% while achieving performance close to full fine-tuning.

For full fine-tuning of a 7B model, we need to allocate per parameter:

  • 2 bytes for the weight
  • 2 bytes for the gradient
  • 4 + 8 bytes for the Adam optimizer states

i.e. 7B * 16 bytes = 112 GB (excluding intermediate hidden states), which makes fine-tuning less accessible, so here come PEFT methods.

  • PEFT methods reduce the number of trainable parameters of a model without degrading performance.

When using QLoRA with the Adam optimizer on a 4-bit quantized base model in mixed-precision mode, we need to allocate:

  • ~0.5 bytes for each base model weight
  • 2 bytes for the gradient of each trainable parameter
  • 4 + 8 bytes for the Adam optimizer states of each trainable parameter

i.e. 14 bytes per trainable parameter, where the number of trainable parameters is greatly reduced with QLoRA.
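The arithmetic above can be checked with a short calculation. The per-parameter byte counts are the ones listed in this entry; the 40M trainable LoRA parameters are a hypothetical figure chosen only for illustration, and activations are excluded as in the text.

```python
GB = 1e9  # decimal gigabytes, matching the 112 GB figure above


def full_finetune_bytes(n_params, weight=2, grad=2, adam=4 + 8):
    """Memory for weights, gradients and Adam states when every parameter is trained."""
    return n_params * (weight + grad + adam)


def qlora_bytes(n_base_params, n_trainable, base_weight=0.5, grad=2, adam=4 + 8):
    """4-bit frozen base weights plus gradients and Adam states for trainable params only."""
    return n_base_params * base_weight + n_trainable * (grad + adam)


full_gb = full_finetune_bytes(7e9) / GB
qlora_gb = qlora_bytes(7e9, n_trainable=40e6) / GB  # 40M trainable params is an assumption
print(f"full fine-tuning: {full_gb:.1f} GB, QLoRA: {qlora_gb:.2f} GB")
```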

Image of peft method