picturebook.ai

Todo

Extract Data (✅)
Train GPT2
Build an API for GPT2 and Diffusers (✅ for Diffusers; the GPT2 part is still pending).

Process involved in this:

Data Extraction

The dataset is drawn from the following sources:

For English, we extracted the data from the Project Gutenberg website.
- We used the dataset by mateibejan to extract the txt files.
- We took a subset of the books listed in that dataset.
For Tamil, we extracted the data from Siruvarmalar and used the Oscar/unshuffled_deduplicated_ta dataset to enlarge the corpus and for pretraining.
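
The English extraction step above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: it assumes the txt files still carry the usual Project Gutenberg `*** START OF ... ***` and `*** END OF ... ***` marker lines, and the names `strip_gutenberg_boilerplate`, `build_corpus`, `books`, and `keep_ids` are hypothetical.

```python
import re
from typing import Dict, List, Set

# Marker lines that Project Gutenberg plain-text files commonly use to
# separate the book body from the license header and footer (assumed here).
_START = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
_END = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END markers.

    If a marker is missing, fall back to the start or end of the file.
    """
    start = _START.search(text)
    end = _END.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    if finish < begin:
        # Markers in an unexpected order: keep everything after START.
        finish = len(text)
    return text[begin:finish].strip()


def build_corpus(books: Dict[str, str], keep_ids: Set[str]) -> List[str]:
    """Clean only the chosen subset of books (hypothetical id -> raw text map)."""
    return [
        strip_gutenberg_boilerplate(books[book_id])
        for book_id in sorted(keep_ids)
        if book_id in books
    ]
```

Sorting the ids keeps the corpus order reproducible between runs. The Tamil sources would need their own cleaning rules, which are not sketched here.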


aneesh-aparajit/picturebook.ai
