What is FakeLake?

FakeLake is a command-line tool that generates fake data from a YAML schema. It can generate millions of rows in seconds and is an order of magnitude faster than popular Python generators such as Mimesis and Faker (see the benchmark).

FakeLake is actively developed and maintained by SOMA in Paris 🦊.

flowchart TD

subgraph Z["How it works"]
direction LR
  Y[YAML file description] --> F
  F[FakeLake] --> O[Output file in CSV, Parquet, ...]
end

Any feedback is welcome!

Features

Very fast
Easy to use
Small memory footprint
Small binary size
Robust / no unsafe code
No dependencies
Cross-platform (Windows, Linux, Mac OS X)
MIT license

Built with

Rust

Benchmark

Benchmark of FakeLake, Mimesis and Faker:

Goal: Generate 1 million rows with one column: random string (length 10)
Specs: Windows, AMD Ryzen 5 7530U, 8 GB RAM, SSD

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`fakelake generate bench\fakelake_input.yaml`	252.8 ± 3.3	249.0	260.0	1.00
`python bench\mimesis_bench.py`	3374.9 ± 21.3	3353.0	3426.2	13.35 ± 0.19
`python bench\faker_bench.py`	13552.7 ± 340.5	13336.4	14446.4	53.62 ± 1.52

Run the benchmark yourself with scripts/benchmark.sh
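
The Relative column is each mean divided by the FakeLake mean. As a quick sanity check, the Python sketch below recomputes those ratios, and the implied rows per second, from the rounded means in the table. The dictionary keys and function names are labels chosen here for illustration. Last-digit differences from the table are expected, presumably because the benchmark tool works from unrounded timings.

```python
# Mean wall-clock times in milliseconds, copied from the benchmark table.
mean_ms = {"fakelake": 252.8, "mimesis": 3374.9, "faker": 13552.7}
ROWS = 1_000_000  # each run generates 1 million rows


def relative_to_fakelake(means):
    """Return each mean divided by the FakeLake mean."""
    base = means["fakelake"]
    return {name: ms / base for name, ms in means.items()}


def rows_per_second(ms, rows=ROWS):
    """Convert a run time in milliseconds into a throughput in rows per second."""
    return rows / (ms / 1000.0)


ratios = relative_to_fakelake(mean_ms)
for name, ratio in ratios.items():
    throughput = rows_per_second(mean_ms[name])
    print(f"{name}: {ratio:.2f}x the FakeLake time, {throughput:,.0f} rows/s")
```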

Installation

Simple way: with precompiled binaries

Download the latest release from the project's releases page, then extract and run it:

$ tar -xvf Fakelake_<version>_<target>.tar.gz
$ ./fakelake --help

From source

$ git clone
$ cd fakelake
$ cargo build --release
$ ./target/release/fakelake --help

How to use it

Generate from one or multiple files

$ fakelake generate tests/parquet_all_options.yaml
$ fakelake generate tests/parquet_all_options.yaml tests/csv_all_options.yaml
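
When generation is driven from a script, the same command line can be assembled programmatically. The sketch below is a hypothetical Python helper, not part of FakeLake: it only builds the argument list shown above (`fakelake generate` followed by one or more YAML files) and assumes the `fakelake` binary is on the PATH when `generate` is actually called.

```python
import subprocess


def build_generate_command(config_paths, executable="fakelake"):
    """Build the argv for `fakelake generate <file> [<file> ...]`."""
    paths = [str(p) for p in config_paths]
    if not paths:
        raise ValueError("at least one configuration file is required")
    return [executable, "generate", *paths]


def generate(config_paths):
    """Run FakeLake on the given files (assumes `fakelake` is installed and on the PATH)."""
    subprocess.run(build_generate_command(config_paths), check=True)


cmd = build_generate_command(
    ["tests/parquet_all_options.yaml", "tests/csv_all_options.yaml"]
)
print(" ".join(cmd))
```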

The configuration file lists the columns to generate. Each column has a name, a provider (which defines the column's behavior), and optional settings. A separate info section defines the output.

columns:
  - name: id
    provider: Increment.integer
    start: 42
    presence: 0.8

  - name: company_email
    provider: Person.email
    domain: soma-smart.com

  - name: created
    provider: Random.Date.date
    format: "%Y-%m-%d"
    after: 2000-02-15
    before: 2020-07-17

  - name: name
    provider: Random.String.alphanumeric

info:
  output_name: all_options
  output_format: parquet
  rows: 1_234_567

Providers

Provider names follow the pattern "Category.<optional sub-category>.provider".
A few examples:

Person.email
Increment.integer
Random.String.alphanumeric
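
To make the naming rule concrete, here is a minimal Python sketch that splits a provider name into its parts. It is an illustration of the pattern only, not FakeLake's actual parser, and `ProviderName` and `parse_provider` are hypothetical names.

```python
from typing import NamedTuple, Optional


class ProviderName(NamedTuple):
    category: str
    sub_category: Optional[str]  # None when the name has only two parts
    provider: str


def parse_provider(name: str) -> ProviderName:
    """Split "Category.<optional sub-category>.provider" into its parts."""
    parts = name.split(".")
    if any(part == "" for part in parts):
        raise ValueError(f"empty segment in provider name: {name!r}")
    if len(parts) == 2:
        return ProviderName(parts[0], None, parts[1])
    if len(parts) == 3:
        return ProviderName(parts[0], parts[1], parts[2])
    raise ValueError(f"expected 2 or 3 dot-separated parts: {name!r}")


for example in ("Person.email", "Increment.integer", "Random.String.alphanumeric"):
    print(example, "->", parse_provider(example))
```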

Options

There are two types of options:

Options linked to the provider (such as format, after and before for Random.Date.date)
Options linked to the column (such as presence)
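
As an illustration of a column-level option, the Python sketch below simulates what `presence: 0.8` could mean. It assumes that presence is the probability that a row holds a value and that the remaining rows are left empty; this reading is an assumption, not a description of FakeLake's implementation. The helper `apply_presence` is hypothetical, and the ids are assumed to increase by 1 from `start: 42`.

```python
import random


def apply_presence(values, presence, seed=0):
    """Keep each value with probability `presence`, otherwise None (assumed semantics)."""
    if not 0.0 <= presence <= 1.0:
        raise ValueError("presence must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [value if rng.random() < presence else None for value in values]


# Stand-in for an Increment.integer column with start: 42 (step of 1 is assumed).
ids = list(range(42, 42 + 10_000))
column = apply_presence(ids, presence=0.8)

missing = sum(value is None for value in column)
print(f"{missing / len(column):.1%} of rows are missing")
```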

Generation Details

The info section has three optional fields:

output_name: To specify the location and name of the output
output_format: To specify the generated format (Parquet and CSV are supported for now)
rows: To specify the number of rows to generate

Contributing

Contributions are welcome! Feel free to submit pull requests.

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

Distributed under the MIT License. See LICENSE.txt for more information.
