
Web Crawler Detection

Introduction

A hybrid machine learning approach for detecting web crawlers (anomaly detection). MLflow is used to serve the model as a web service. This is the final project of the Rahnema College machine learning internship. The data used to fit the model came from the Sanjagh server logs. To prevent malicious use in the future, we do not publish that data, but you can fit the model on the generated server log instead.

Built With

Getting Started

The project is built around sklearn.pipeline.Pipeline(): LogTransformer() preprocesses the log data, PCAEstimator() scores requests using PCA, and finally RuleBasedEstimator() produces the output using hand-written rules (a rough sketch follows below). The rest of this section walks through the steps to run the project on a local machine.
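As a sketch only, the three stages might be assembled like this; the class names come from the repository, but the assembly and any constructor arguments are assumptions:

from sklearn.pipeline import Pipeline
# LogTransformer, PCAEstimator, RuleBasedEstimator are defined in this repository

# Hypothetical assembly of the project's pipeline
pipeline = Pipeline([
    ('preprocess', LogTransformer()),   # parse and transform raw log records
    ('pca', PCAEstimator()),            # score requests by PCA reconstruction error
    ('rules', RuleBasedEstimator()),    # apply hand-written rules on top of the score
])

pipeline.fit(train_logs)
predictions = pipeline.predict(test_logs)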

Prerequisites

To install the requirements using pip, run this command:

$ pip install -r requirements.txt

Running the Code

To fit the PCA model, run this command:

(to fit the autoencoder, replace pca.py with autoencoder.py)

$ python pca.py

To run the model's API on localhost (if you don't have the required modules and would like MLflow to create a conda environment, remove the --no-conda argument):

$ mlflow models serve -m mlruns/0/MODEL_RUN_ID/artifacts/model/ -p 8000 --no-conda

Or you can use the pretrained model fitted on the real data:

$ mlflow models serve -m mlruns/0/2ec85ec6bfa74757835225e334311a3e/artifacts/model/ -p 8000 --no-conda

Send Request

You can use other formats to send the data; see also Deploy MLflow Models.

$ curl --location --request POST '127.0.0.1:8000/invocations' \
	--header 'Content-Type: text/csv' \
	--data-binary '@./datasets/test.csv'

Project Description

Anomaly detection is one of the most popular machine learning techniques. In this project, we identify abnormal behaviors in a system by analyzing logs collected in real time from an enterprise's log aggregation systems. This is the server log format:

IP [TIME] [Method Path] StatusCode ResponseLength [[UserAgent]] ResponseTime

and this is a generated sample:

42.236.10.125 [2020-12-19T15:23:10.0+0100] [GET / http://baidu.com/] 200 10479 [["Mozilla/5.0 (Linux; U; Android 8.1.0; zh-CN; EML-AL00 Build/HUAWEIEML-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.108"]] 32
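For illustration, a line in this format could be parsed with a regular expression like the following; the pattern is an assumption inferred from the sample line, not the project's actual parser:

import re

# Hypothetical parser for the log format above; field boundaries are
# inferred from the generated sample, not taken from the repository.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'\[(?P<method>\S+) (?P<path>[^\]]*)\] '
    r'(?P<status>\d{3}) (?P<length>\d+) '
    r'\[\["?(?P<user_agent>.*?)"?\]\] '
    r'(?P<response_time>\d+)$'
)

def parse_line(line):
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None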

Preprocessing

Data preprocessing has two main parts: feature extraction and feature transformation.

Feature Extraction

tehran_traffic_statistics: Based on Tehran-IX packets-per-second statistics at different hours of the day, gives a weight to each request. Requests at low-traffic hours like 6 a.m. are more likely to come from crawlers.

url_depth: Depth of the requested URL.

is_spider: Extracted from the user agent; specifies whether the request comes from a bot.

is_phone: Extracted from the user agent; specifies whether the request comes from a phone or a PC.
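A minimal sketch of how such features could be derived from a parsed record; the record schema and the keyword lists are assumptions, not the project's exact logic:

from urllib.parse import urlparse

def extract_features(record):
    # record: dict parsed from one log line (assumed keys: 'path', 'user_agent')
    user_agent = record['user_agent'].lower()
    segments = [p for p in urlparse(record['path']).path.split('/') if p]
    return {
        'url_depth': len(segments),  # number of non-empty path segments
        # crude substring checks; the project's detection logic may differ
        'is_spider': int(any(k in user_agent for k in ('bot', 'spider', 'crawl'))),
        'is_phone': int(any(k in user_agent for k in ('android', 'iphone', 'mobile'))),
    }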

Feature Transformation

In feature transformation we used three different methods: bucketing, normalization and one-hot encoding.

response_length: Bucketing on a log scale using np.geomspace() into ['zero', 'small', 'medium', 'big'] scales.

requested_file_types: Bucketing into ['img', 'code', 'renderable', 'app', 'video', 'font', 'endpoint'] data types.

status_code: Bucketing into ['is_1xx', 'is_2xx', 'is_3xx', 'is_4xx', 'is_5xx'] status codes.

method: One-hot encoding of request methods.

time_weight: Normalized extracted tehran_traffic_statistics.
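For example, the log-scale bucketing of response_length might look like this; the bucket bounds are assumptions, not the values used in the project:

import numpy as np

# Log-scale bucket edges from 1 byte to 1 MB; the actual bounds are assumed
edges = np.geomspace(1.0, 1e6, num=4)   # [1, 1e2, 1e4, 1e6]

def bucket_response_length(length):
    if length == 0:
        return 'zero'
    # map the value into one of the three log-scale bins
    idx = np.searchsorted(edges, length, side='right') - 1
    return ['small', 'medium', 'big'][min(max(idx, 0), 2)]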

Learning Algorithm

PCA

PCA is an unsupervised machine learning algorithm for dimensionality reduction. Using PCA, you can project the data into a lower-dimensional space and then reconstruct it. Since anomalies show the largest reconstruction error, abnormalities can be found from the error between the original data and the reconstructed data. Here is a reconstructed sample and the calculated error.
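A minimal sketch of reconstruction-error scoring with scikit-learn; the number of components, the threshold quantile, and the stand-in data are assumptions:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 20)   # stand-in for the transformed feature matrix

pca = PCA(n_components=5)
X_hat = pca.inverse_transform(pca.fit_transform(X))  # project, then reconstruct

errors = np.sum((X - X_hat) ** 2, axis=1)   # per-request reconstruction error
threshold = np.quantile(errors, 0.99)       # assumed cutoff
is_anomaly = errors > threshold             # worst-reconstructed requests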

Isolation Forest

Isolation forest works on the principle of the decision tree algorithm. Anomalies require fewer random partitions to isolate than normal data points, so anomalies are points with shorter paths in the tree.
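In scikit-learn this looks like the following sketch; the contamination rate and stand-in data are assumptions:

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.rand(1000, 20)   # stand-in for the transformed feature matrix

# contamination is an assumed anomaly rate, not a value from the project
forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
labels = forest.fit_predict(X)    # -1 for anomalies, +1 for normal points
scores = forest.score_samples(X)  # lower scores are more anomalous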

Autoencoder

Let's go deeper. An autoencoder is a special type of neural network that learns to copy its input to its output. It does not require a target variable like the conventional y, so it is categorized as unsupervised learning.
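A minimal Keras sketch of this idea; the layer sizes, training settings, and stand-in data are assumptions, not the architecture in autoencoder.py:

import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 20).astype('float32')   # stand-in for the features

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(8, activation='relu'),    # encoder: compress
    keras.layers.Dense(20),                      # decoder: reconstruct
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, X, epochs=20, batch_size=64)        # input is also the target

# As with PCA, reconstruction error serves as the anomaly score
errors = np.sum((X - model.predict(X)) ** 2, axis=1)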

We also plotted ROC curves for the models to select the best model and an appropriate threshold.

We chose PCA for the final model, but can we follow some rules to get a better model? Let's use a hybrid approach.

Hybrid Approach

In the hybrid approach we combined PCA with a rule-based model. We added two rules: flag requests whose user agents say they are spiders, and flag requests from known malicious IPs. Evaluation shows that the model performs well.
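A sketch of how the rules could override the PCA decision; the blocklist, function name, and record schema are hypothetical:

MALICIOUS_IPS = {'203.0.113.7'}   # hypothetical blocklist of known bad IPs

def hybrid_predict(record, pca_says_anomaly):
    if record['is_spider']:             # user agent admits to being a bot
        return True
    if record['ip'] in MALICIOUS_IPS:   # request from a known malicious IP
        return True
    return pca_says_anomaly             # otherwise trust the PCA decision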

API

The API was implemented using MLflow, an open source platform for managing the ML lifecycle. You can see an API example in this video. Requests were sent using Postman.

Contact

Our Team:

Special thanks to our supportive and knowledgeable mentor Tadeh Alexani - @tadeha and the Rahnema College team.
