Data Practitioner

Built with ❤︎ by Anantha Raju C and contributors

This GitHub project is a data engineering and analytics pipeline that handles the end-to-end process of extracting, transforming, and loading data from a MySQL source into ClickHouse. The pipeline is orchestrated with Dagster, a data orchestrator that provides a unified workflow for managing data pipelines.

The combination of Dagster, ClickHouse, DBT Core, and MySQL ensures a well-structured and maintainable architecture for end-to-end data processing.

Key Components:

  1. Dagster: The core orchestrator that manages the workflow of the entire data pipeline. Dagster allows for the definition, scheduling, and monitoring of data workflows, ensuring reliability and scalability.

  2. ClickHouse: A columnar database used as the data warehouse for efficient storage and retrieval of large volumes of data. ClickHouse is optimized for analytical queries, making it suitable for data analytics and reporting.

  3. DBT Core: The data transformation layer that leverages the popular DBT (Data Build Tool) framework. DBT Core facilitates the transformation of raw data into a structured and meaningful format for analytics and reporting.

  4. MySQL: The source database. Raw data is extracted from MySQL in the initial phase of the pipeline.

Workflow:

  1. Data Extraction: Raw data is extracted from the MySQL databases that serve as source systems. These may include several operational databases.

  2. Loading: The extracted raw data is loaded into ClickHouse, the designated data warehouse, where it is stored efficiently for analytical queries and reporting.

  3. Transformation: DBT Core processes and transforms the raw data into a clean, structured format suitable for analytics. Transformations may include aggregations, joins, and other operations to derive insights.

  4. Analytics: Once the data is in ClickHouse, analysts and data scientists can perform analytics and generate insights using SQL queries or other analytical tools.
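
The four workflow steps can be sketched in plain Python. This is only a minimal illustration under loud assumptions, not the project's code: two in-memory SQLite databases stand in for MySQL and ClickHouse, and every table name (`orders`, `raw_orders`, `orders_by_customer`) and function name is hypothetical. In the project itself, Dagster orchestrates the steps and DBT Core performs the transformation.

```python
import sqlite3


def extract(source: sqlite3.Connection) -> list:
    """Step 1: read raw rows from the source system (SQLite stands in for MySQL)."""
    return source.execute("SELECT id, customer, amount FROM orders").fetchall()


def load(warehouse: sqlite3.Connection, rows: list) -> None:
    """Step 2: write the rows, untransformed, into a raw warehouse table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders (id INTEGER, customer TEXT, amount REAL)"
    )
    warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform(warehouse: sqlite3.Connection) -> None:
    """Step 3: derive a clean, aggregated table from the raw table."""
    warehouse.execute("DROP TABLE IF EXISTS orders_by_customer")
    warehouse.execute(
        """
        CREATE TABLE orders_by_customer AS
        SELECT customer, COUNT(*) AS order_count, SUM(amount) AS total_amount
        FROM raw_orders
        GROUP BY customer
        """
    )


def analyze(warehouse: sqlite3.Connection) -> list:
    """Step 4: run an analytical query against the transformed table."""
    return warehouse.execute(
        "SELECT customer, order_count, total_amount "
        "FROM orders_by_customer ORDER BY customer"
    ).fetchall()


def run_pipeline() -> list:
    """Run the four steps in order against hypothetical sample data."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
    source.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "alice", 10.0), (2, "bob", 5.0), (3, "alice", 2.5)],
    )
    warehouse = sqlite3.connect(":memory:")  # stand-in for ClickHouse

    load(warehouse, extract(source))
    transform(warehouse)
    return analyze(warehouse)


if __name__ == "__main__":
    for row in run_pipeline():
        print(row)
```

Keeping each step as a separate function mirrors how an orchestrator can schedule, retry, and monitor the steps independently.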

How to Use:

Detailed documentation and instructions on setting up and configuring the pipeline are available in the project repository. Users can follow the guidelines to adapt the pipeline to their specific data sources and analytics requirements.

Reporting Issues/Suggesting Improvements

This project uses GitHub's integrated issue tracking system to record bugs and feature requests. If you want to raise an issue, please follow the recommendations below:

  • Before you log a bug, please search the issue tracker to see if someone has already reported the problem.
  • If the issue doesn't already exist, create a new issue.
  • Please provide as much information as possible with the issue report.
  • If you need to paste code or include a stack trace, use Markdown ``` fences before and after your text.

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

Kindly refer to CONTRIBUTING.md for important details on the pull request process.

  1. In the top-right corner of this page, click Fork.

  2. Clone a copy of your fork to your local machine, replacing YOUR-USERNAME with your GitHub username.

    git clone https://github.com/YOUR-USERNAME/DataPractitioner.git

  3. Create a branch:

    git checkout -b <my-new-feature-or-fix>

  4. Make necessary changes and commit those changes:

    git add .

    git commit -m "new feature or fix"

  5. Push changes, replacing <add-your-branch-name> with the name of the branch you created in step 3:

    git push origin <add-your-branch-name>

  6. Submit your changes for review. Go to your repository on GitHub, where you'll see a Compare & pull request button. Click that button, then submit the pull request.

That's it! Soon I'll be merging your changes into the master branch of this project. You will get a notification email once the changes have been merged. Thank you for your contribution.

Kindly follow Conventional Commits to create an explicit commit history. Prefix each commit message with one of the following types:

build : Changes that affect the build system or external dependencies (example scopes: gulp, broccoli, npm)
ci : Changes to our CI configuration files and scripts (example scopes: Travis, Circle, BrowserStack, SauceLabs)
docs : Documentation only changes
feat : A new feature
fix : A bug fix
perf : A code change that improves performance
refactor : A code change that neither fixes a bug nor adds a feature
style : Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
test : Adding missing tests or correcting existing tests
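
A commit message can be checked against the types listed above before it is pushed. The following is a hedged sketch, not a tool shipped with this project: the names `COMMIT_TYPES`, `COMMIT_PATTERN`, and `is_conventional` are hypothetical, and the pattern covers only the basic Conventional Commits header shape of `type`, an optional `(scope)`, an optional `!`, then `: ` and a description.

```python
import re

# The nine commit types listed in this README.
COMMIT_TYPES = (
    "build", "ci", "docs", "feat", "fix", "perf", "refactor", "style", "test",
)

# Header shape: type, optional "(scope)", optional "!", then ": " and a description.
COMMIT_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def is_conventional(message: str) -> bool:
    """Return True if the first line of the message matches the header shape."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    return COMMIT_PATTERN.match(first_line) is not None


if __name__ == "__main__":
    for candidate in ("feat: add ClickHouse loader", "updated stuff"):
        print(candidate, "->", is_conventional(candidate))
```

A check like this could be called from a local `commit-msg` Git hook, although whether this repository enforces the convention automatically is not stated here.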

License

Distributed under the MIT License. See LICENSE.md for more information.

The End

In the end, I hope you enjoy the application and find it useful, as I did while developing it to learn.

If you would like to enhance it, please:

  • Open PRs,

  • Give feedback,

  • Add new suggestions, and

  • Finally, give it a 🌟.

Happy Coding ... 🙂

Contact

Anantha Raju C - @anantharajuc - arcswdev@gmail.com

Project Link: https://github.com/AnanthaRajuC/DataPractitioner