gpt4all-datalake

An open-source datalake to ingest, organize and efficiently store all data contributions made to gpt4all.

Hosted version: https://api.gpt4all.io

Architecture

The core datalake architecture is a simple HTTP API (written in FastAPI) that ingests JSON in a fixed schema, performs integrity checks, and stores the result. Incoming JSON is transformed into storage-efficient Arrow/Parquet files and written to a target filesystem.
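The flow is easiest to see as code. Below is a minimal sketch, assuming a hypothetical /v1/ingest route and illustrative field names rather than the project's actual input data model:

```python
# Minimal sketch of the ingest flow described above. The route, field names,
# and on-disk layout are illustrative assumptions, not the actual
# gpt4all-datalake schema (see the input data model in this repository).
import os
from datetime import datetime, timezone
from typing import Optional
from uuid import uuid4

import pyarrow as pa
import pyarrow.parquet as pq
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()


class Contribution(BaseModel):
    prompt: str
    response: str
    submitter_id: Optional[str] = None  # optional attribution identifier


@app.post("/v1/ingest")
def ingest(item: Contribution) -> dict:
    # Integrity check: reject obviously empty submissions.
    if not item.prompt.strip() or not item.response.strip():
        raise HTTPException(status_code=422, detail="empty prompt or response")

    # Convert the validated record to an Arrow table and write it as a
    # Parquet file inside a per-day subdirectory. A production ingest path
    # would batch many records per file rather than writing one at a time.
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    out_dir = os.path.join("data", day)
    os.makedirs(out_dir, exist_ok=True)
    table = pa.Table.from_pylist([item.model_dump()])  # pydantic v2
    pq.write_table(table, os.path.join(out_dir, f"{uuid4().hex}.parquet"))
    return {"status": "accepted"}
```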

Data formats

  • Data is stored on disk / S3 as Parquet files in subdirectories organized by day. These Parquet files have a standardized schema, allowing for easy manipulation in any programming language (see the reading sketch after this list).
  • The input data model can be found here.
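A snapshot laid out this way can be consumed as a single dataset with pyarrow; the local path below is an assumption, and the columns depend on the published schema:

```python
# Minimal sketch of reading a day-partitioned snapshot with pyarrow.
# The "data/" path is an assumption; point it at a downloaded snapshot.
import pyarrow.dataset as ds

# Each day's Parquet files live in their own subdirectory; pyarrow can
# treat the whole directory tree as one logical dataset.
dataset = ds.dataset("data/", format="parquet")
table = dataset.to_table()

print(table.schema)                     # the standardized schema shared by all files
print(table.num_rows, "contributions")
```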

Open sourcing the data

Nomic AI will provide automatic snapshots of this raw Parquet data. You will be able to interact with the snapshots:

  • In their raw exported form.
  • In automatic Atlas maps over its raw, cleaned, and curated forms.
  • Through downloads where the data has been curated, de-duplicated and cleaned for LLM training/finetuning.

Data Privacy

By sending data to the GPT4All-Datalake, you agree to the following:

Data sent to this datalake will be used to train open-source large language models and released to the public. There is no expectation of privacy for any data entering this datalake. You can, however, expect attribution. If you attach a unique identifier that associates you as the data contributor, Nomic will retain that identifier in any LLM training runs it conducts, and you will receive credit and public attribution if Nomic releases a model trained on your submitted data. You can also submit data anonymously.
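As an illustration only (the actual submission schema is defined by the input data model linked above), an attributed contribution might carry an identifier field while an anonymous one simply omits it:

```python
# Hypothetical payloads; the field names are assumptions, not the published schema.
attributed = {
    "prompt": "Explain Parquet in one sentence.",
    "response": "Parquet is a columnar storage format.",
    "submitter_id": "github:your-handle",  # retained for credit in released models
}

anonymous = {
    "prompt": "Explain Parquet in one sentence.",
    "response": "Parquet is a columnar storage format.",
    # no identifier: the contribution is stored without attribution
}
```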

Where does the gpt4all-datalake run?

While open-sourced under an Apache-2.0 license, this datalake runs on infrastructure managed and paid for by Nomic AI. You are welcome to run this datalake on your own infrastructure! We just ask that you also release the underlying data that gets sent into it under the same attribution terms.

Development

  1. Clone the repository.
  2. Run make testenv to build all docker images and launch the HTTP server.
  3. Go to http://localhost/docs to view the API documentation (a scripted smoke test against the local server is sketched after this list).
  4. You can run the unit tests with make test. Any edits made to the FastAPI app will hot reload.
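As a sketch, a local smoke test might look like the following; the endpoint path and payload fields are assumptions, so check http://localhost/docs for the routes FastAPI actually exposes:

```python
# Hypothetical smoke test against a locally running dev instance started
# with `make testenv`. The /v1/ingest route and payload fields are
# assumptions; the real routes are listed at http://localhost/docs.
import requests

payload = {
    "prompt": "What is a datalake?",
    "response": "A centralized store for raw, schema-on-read data.",
    "submitter_id": None,  # set an identifier here if you want attribution
}

resp = requests.post("http://localhost/v1/ingest", json=payload, timeout=10)
print(resp.status_code, resp.json())
```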