aneesh-aparajit/picturebook.ai

picturebook.ai

Todo

  1. Extract Data (✅)
  2. Train GPT2
  3. Build an API for GPT2 and Diffusers (✅ for Diffusers; the GPT2 part is still left).

Process:

Data Extraction

The corpus is drawn from the following sources:

  1. For English, we extracted the data from the Project Gutenberg website.
    • We used the dataset by mateibejan to obtain the txt files.
    • We kept only a subset of the books listed in that dataset.
  2. For Tamil, we extracted the data from Siruvarmalar and added the Oscar/unshuffled_deduplicated_ta dataset to enlarge the corpus and for pretraining.
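
Raw Project Gutenberg txt files usually wrap the book text in a license header and footer, so some cleaning is likely needed before the English text can be used for training. The snippet below is a minimal sketch of one way to do that, not the code used in this repository. It assumes the files carry the usual `*** START OF ...` and `*** END OF ...` marker lines; the function names `strip_gutenberg_boilerplate` and `build_corpus` and the `min_chars` parameter are hypothetical.

```python
import re

# Assumption: each book carries the usual Project Gutenberg marker lines, e.g.
# "*** START OF THE PROJECT GUTENBERG EBOOK <TITLE> ***" (older files say "THIS").
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END markers.

    If a marker is missing, fall back to the start or end of the file.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


def build_corpus(raw_books, min_chars=1):
    """Clean every book and drop those shorter than min_chars after cleaning."""
    cleaned = (strip_gutenberg_boilerplate(book) for book in raw_books)
    return [book for book in cleaned if len(book) >= min_chars]
```

In practice `raw_books` could be the contents of the txt files read from disk, and `min_chars` could be raised to discard near-empty files.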