
Censored Planet Processes and Models for Machine Learning


Overview

This repository contains the code written to support the research for my thesis, Finding Latent Features in Internet Censorship Data. The thesis was further refined and subsequently published as Detecting Network-based Internet Censorship via Latent Feature Representation Learning; a preprint is available at https://arxiv.org/abs/2209.05152.

The machine learning models are built with PyTorch extended by PyTorch Lightning. Logging was set up to use Comet. If you wish to use a different logger, it can easily be swapped into your instance of this code.
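
Swapping loggers is a one-line change where the Lightning Trainer is constructed. The sketch below is only illustrative and assumes you build the Trainer yourself; the processor scripts in this repository construct their own Trainer instances.

```python
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

# This project logs to Comet via pytorch_lightning.loggers.CometLogger, which
# requires an API key. Any other Lightning logger can be dropped in instead:
logger = TensorBoardLogger(save_dir="lightning_logs")
trainer = Trainer(logger=logger, max_epochs=10)
```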

Process Diagrams

Building Datasets

The Censored Planet data needs to be transformed into datasets that can be used with our models. I built my base dataset by ingesting one large CP_Quack-echo-YYYY-MM-DD-HH-MM-SS.tar file at a time to accommodate the speed and stability of my computing environment. My data was taken from 5 days in the summer of 2021. I believe that the structure of their data has since changed, so you will likely need to refactor cp_flatten.py if you are using newer data.

```mermaid
flowchart TD
    A[/Quack tar file/] --> B(cp_flatten_processor.py)
    B --> C[/Pickled Dictionary<br>Stored at indexed path/]
    C -- iterate --> B
    B --> D[/Create or update<br>metadata.pyc/]
    D --> E(Single tar file processed)
```
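
Conceptually, the flattening pass reads one archive, flattens each response into a dictionary, and pickles it into a sharded tree. The sketch below is illustrative only: flatten_quack_record is a hypothetical stand-in for the logic in cp_flatten.py, and the sharding arithmetic is inferred from the directory tree shown further down, not taken from the repository's code.

```python
import pickle
import tarfile
from pathlib import Path

def flatten_archive(tar_path: str, output_root: str, start_index: int = 0) -> int:
    """Illustrative only: pickle one flattened record per file into a sharded tree."""
    out = Path(output_root)
    index = start_index
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            if not member.isfile():
                continue
            raw = archive.extractfile(member).read()
            record = flatten_quack_record(raw)  # hypothetical stand-in for cp_flatten.py
            # Shard so no leaf directory holds more than 1,000 files:
            # index 202000 lands at <output_root>/2/2/202000.pyc.
            shard = out / str(index // 100_000) / str((index // 1_000) % 100)
            shard.mkdir(parents=True, exist_ok=True)
            with open(shard / f"{index}.pyc", "wb") as handle:
                pickle.dump(record, handle)
            index += 1
    return index
```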

The flattened and vectorized data is stored as pickled dictionaries using an indexed directory structure under the specified output directory:

```mermaid
flowchart TD
    A[Dataset dir] --- 0
    A --- 1
    A --- 2
    A --- B[...]
    A --- m
    A --- i[/metadata.pyc/]
    2 --- 2-0[0]
    2 --- 2-1[1]
    2 --- 2-2[2]
    2 --- 2-c[...]
    2 --- 2-99[99]
    2-2 --- 220[/202000.pyc/]
    2-2 --- 221[/202001.pyc/]
    2-2 --- 222[/202002.pyc/]
    2-2 --- 22c[/.../]
    2-2 --- 229[/202999.pyc/]
```

These dictionary files are used in the remainder of the project via QuackIterableDataset found in cp_dataset.py. This iterable dataset is managed using QuackTokenizedDataModule.
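
For orientation only, the sketch below shows how an iterable dataset over the pickled tree can be structured. QuackIterableDataset in cp_dataset.py is the real implementation; the glob pattern here is a simplification of its indexing and is an assumption.

```python
import pickle
from pathlib import Path
from typing import Iterator

from torch.utils.data import IterableDataset

class PickledTreeDataset(IterableDataset):
    """Illustrative reader for the indexed .pyc tree; see cp_dataset.QuackIterableDataset."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def __iter__(self) -> Iterator[dict]:
        # Walk the sharded directories in sorted order and yield each pickled dictionary.
        for path in sorted(self.root.glob("*/*/*.pyc")):
            with open(path, "rb") as handle:
                yield pickle.load(handle)
```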

For the image based model, this data is accessed via QuackTokenizedDataModule and stored in two new datasets by cp_image_reprocessor.py, using a similar directory tree in which each leaf directory stores a PNG image file and a pickle file of the encoded pixels and metadata. The first image dataset is balanced between censored and uncensored records for training the replacement classifier layer in DenseNet. The second set contains all the undetermined records.
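
Replacing the classifier layer of a pretrained DenseNet is a standard torchvision pattern; a minimal sketch follows. The repository's QuackDenseNet wraps its own version of this, so the choice of densenet121, the frozen feature extractor, and the single-logit head here are assumptions rather than the project's exact configuration.

```python
import torch.nn as nn
from torchvision import models

# Load a DenseNet pretrained on ImageNet and freeze its feature extractor.
densenet = models.densenet121(pretrained=True)
for parameter in densenet.parameters():
    parameter.requires_grad = False

# Replace the final classifier with a single-logit head for censored/uncensored.
densenet.classifier = nn.Linear(densenet.classifier.in_features, 1)
```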

Building Embeddings

The flattened and tokenized data is used to train the autoencoder:

```mermaid
flowchart TD
    A[QuackIterableDataset] --> B[QuackTokenizedDataModule]
    B --> C(ae_processor.py)
    C --iterate--> B
    C --> D[trained QuackAutoEncoder]
```

The trained autoencoder model is captured and used as an additional input to ae_processor.py to process the data into two sets of embeddings. One set is labeled and balanced between censored and uncensored for training the classifier. The second set contains embeddings of the undetermined records.

```mermaid
flowchart TD
    J[/trained QuackAutoEncoder/] --> M
    K[QuackIterableDataset] --> L[QuackTokenizedDataModule]
    L --> M
    M(ae_processor.py) --> N[AutoencoderWriter]
    N --> O[/.pyc file in indexed directory/]
```
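
AutoencoderWriter follows PyTorch Lightning's prediction-writer callback pattern. The sketch below shows that general pattern under the assumption of a BasePredictionWriter subclass; the real writer's sharding and file naming follow the indexed layout described above, and the per-batch file name here is a simplification.

```python
import pickle
from pathlib import Path

from pytorch_lightning.callbacks import BasePredictionWriter

class EmbeddingWriter(BasePredictionWriter):
    """Illustrative writer; AutoencoderWriter in this repository plays this role."""

    def __init__(self, output_dir: str) -> None:
        super().__init__(write_interval="batch")
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)

    def write_on_batch_end(
        self, trainer, pl_module, prediction, batch_indices, batch, batch_idx, dataloader_idx
    ) -> None:
        # Persist each batch of embeddings as a pickled file.
        with open(self.output_dir / f"batch_{batch_idx}.pyc", "wb") as handle:
            pickle.dump(prediction, handle)
```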

These two datasets of embeddings are managed with QuackLatentDataModule.

Classification

Classification as censored or uncensored is the core task of this work. There are two classification processes built in this repository. latent_processor.py both trains a QuackLatentClassifier using a set of labeled embeddings and uses the trained QuackLatentClassifier to classify undetermined embeddings as either censored or uncensored. dn_processor.py both trains a QuackDenseNet using a labeled set of image data and then uses the trained QuackDenseNet to classify undetermined image data as either censored or uncensored.
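
QuackLatentClassifier itself lives in this repository; purely as orientation, the sketch below shows the general shape of a Lightning binary classifier over fixed-size embeddings. The embedding size, layer widths, and optimizer choice are assumptions, not the repository's actual hyperparameters.

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl

class LatentBinaryClassifier(pl.LightningModule):
    """Illustrative binary classifier over embeddings (cf. QuackLatentClassifier)."""

    def __init__(self, embedding_size: int = 128, learning_rate: float = 1e-3) -> None:
        super().__init__()
        self.learning_rate = learning_rate
        self.head = nn.Sequential(
            nn.Linear(embedding_size, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )
        self.loss = nn.BCEWithLogitsLoss()

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.head(embeddings)

    def training_step(self, batch, batch_idx):
        embeddings, labels = batch
        logits = self(embeddings).squeeze(-1)
        return self.loss(logits, labels.float())

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.learning_rate)
```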

Job Scripts

Our data was processed on the CUNY HPCC, which uses SLURM to manage jobs. Figuring out how to configure for SLURM was a challenge. An additional challenge was that PyTorch no longer supported the older GPUs we had available, so we needed to train in parallel on CPU. I eventually solved parallel processing on that architecture by using the Ray parallel plugin, and these job scripts also contain the setup for that plugin. I've left them here because I had trouble finding examples. Your computing environment is almost certainly different, and that will cause further changes in your instance of this code.
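
For reference, wiring the Ray plugin into a Lightning Trainer for CPU-only parallel training looks roughly like the sketch below. This assumes the ray_lightning package as it existed around that time; later releases renamed RayPlugin to RayStrategy and changed the Trainer wiring, so match the call to the versions pinned in your own environment.

```python
import pytorch_lightning as pl
from ray_lightning import RayPlugin

# Worker counts here are placeholders; size them to your allocation.
plugin = RayPlugin(num_workers=4, num_cpus_per_worker=8, use_gpu=False)
trainer = pl.Trainer(plugins=[plugin], max_epochs=10)
```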

Documentation

This documentation is presented in Markdown generated from the docstrings within each Python module. It may be found in the docs directory here in the repository.