This project is a python recreation of the system behind Facebook marketplace, which uses AI to recommend the most relevant listings based on a personalised search query.
This project requires Pillow, Sklearn, Pytorch, Tensorboard and FastAPI.
Note that due to hardware limitations, the aim of this project is to practice Pytorch and API deployment rather than producing an accurate AI prediction model.
I downloaded the text and image dataset in .csv and .jpg format respectively. Then I performed some preliminary cleaning of the dataset.
To clean the tabular dataset I converted the prices of the products to np.float64 objects by removing the pound signs and the commas. To clean the image dataset, I made use of the Pillow library to resize all images to 128x128(px) size and standardized them to RGB mode.
Firstly I created a dataset that feeds entries to the model. In image_dataset.py
I defined a class which inherits from torch.utils.data.Dataset
, which reads all the images and matches them to their corresponding product ids and categories. Note that we must define __len__
and __getitem__
methods. The __getitem__
method finds and loads an image, and then return the image as a torch tensor together with its label.
Additionally, I added image transformations to the dataset. Each image may be horizontally flipped or slightly distorted by a certain probability. This will artifically generate more datapoints for the model to learn and hopefully improve its performance against test datasets with new, unfamiliar images.
def __getitem__(self, idx):
image_path = os.path.join(self.image_dir, f"{self.image_table.iloc[idx, 0]}.jpg")
image = Image.open(image_path)
image = self.transform(image)
label = self.image_table['category'][idx] # Category this image is in
label = self.encoder[label] # Encode string label to int
return image, label
Next, I try to build a convolutional neural network (CNN) model that predicts the category of each image. We make use of the torch.nn.Sequential
module to construct our model by stack a sequence of layers, using ReLU as the activation function. Note that we have 3 input feature (for RGB channels respectively) and 13 output features (corresponding to the 13 different categories we have).
class CNN(torch.nn.Module):
def __init__(self):
super().__init__()
self.layers = torch.nn.Sequential(
torch.nn.Conv2d(3, 8, 5, stride = 2, padding = 1),
torch.nn.ReLU(),
torch.nn.Conv2d(8, 16, 5, stride = 2),
torch.nn.ReLU(),
torch.nn.Conv2d(16, 32, 5, stride = 2),
torch.nn.ReLU(),
torch.nn.Flatten(),
torch.nn.Linear(5408, 512),
torch.nn.ReLU(),
torch.nn.Linear(512, 64),
torch.nn.ReLU(),
torch.nn.Linear(64, 13), # We have 13 categories
torch.nn.ReLU(),
torch.nn.Softmax(dim = 1)
)
Unfortunately my custom CNN model achieved an accuracy of only 13~15%, which is far from satisfactory given that blindly guessing corresponds to an accuracy of 1/13 (7.7%) for our dataset. To improve the model, we need to do a more comprehensive hyperparameter search, which requires lots of time and computing power.
Alternatively, we can use transfer learning to fine-tune an existing image classification model ResNet-50. We only need to replace the final layer so that the model returns 13 output features.
The model is then trained with our image dataset, where 70% of the images are used for training, 15% for validation and 15% for final testing. The optimizer used is SGD (stochastic gradient descent) and the loss function is calculated with torch.nn.functional.cross_entropy
. I set the initial learning rate to be 0.08, which is halved every 3 epochs by the StepLR scheduler: lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size = 3, gamma = 0.5)
.
I used Tensorboard to visualize the loss functions for each run. Below are the loss functions for the training set and the validation set over 32 epochs.
The final accuracy of the model is 48.5%. This is in the context of using 128x128(px) sized images because of hardware limitations. A possible extension to the project is to create a text comprehension model on the description of each product, and then integrate it with the vision model in order to further improve the accuracy.
I created an API serving our model with FastAPI. FastAPI automatically generates a doc page which can be accessed through serverURL/docs
(for our case http://127.0.0.1:8080/docs
), where all get and post methods are listed. In the /predict/image
post method, I made use of FastAPI's UploadFile
class so that users can upload an image from their local device. The method will return a prediction of the most likely category of the image, together with a dictionary containing the probabilities for each category.
Finally, I built the API into a docker image and pushed it to Dockerhub.
-
If you have a Nvidia GPU on your device, I highly recommend installing CUDA to get your GPU working on the training of the vision model and speed the learning process up. See Pytorch website for a full guide.
-
In
train.py
, you can also continue training the model from a checkpoint. Note that thestate_dict
of BOTH the model and the optimizer must be loaded.