Skip to content

Data Construct-Populate-Access-Manage - Open source data warehouse solution.

License

Notifications You must be signed in to change notification settings

OrionExplorer/dcpam

Repository files navigation

DCPAM - Data Warehouse Solution

CodeQL C/C++ CI Build Status Build status Codacy Badge License: GPL v2

Copyright © 2020 - 2021 Marcin Kelar

Data Construct-Populate-Access-Manage

  • DCPAM goal is to deliver full range of Data Warehouse possibilities without need to include or hire more engineers for this task.
  • DCPAM architecture is highly flexible and provides unlimited scaling possibilities.
  • DCPAM is multiplatform solution: on-premise (Linux/Windows), cloud (Private, Microsoft Azure, Amazon Web Services, Google Cloud Platform) and hybrid deployments are possible.
  • DCPAM is hardware- and database-agnostic with component-based architecture.

Supported databases

PostgreSQL MySQL MariaDB Microsoft SQL Server Oracle Database SQLite3 IBM Db2 MongoDB ODBC

Other supported sources

XLSX MSACCESS XML ODS CSV TSV PSV JSON API

Deployment

Linux Windows 10 Cloud Docker Microsoft Azure Google Cloud Platform Amazon Web Services

Table of contents

Business Value

  • Open source, cost-effective.

  • DCPAM helps to create single central repository of integrated company data - this provides a single integrated view of an organisation.

  • All informations are always up to date - Managers can respond rapidly to ongoing changes in the business environment to make data-driven decisions.

  • Data structures are designed in a uniform way - much less effort is needed to prepare and access requested informations.

  • No wide range of advanced technical knowledge is needed to make DCPAM work - installation is straightforward and system configuration can be handled by any person with SQL background and analytic insight of company data.

  • Process gigabytes of data within minutes - benefit of parallel execution.

  • Flexible deployment - use your own infrastructure or get the benefits from cloud platforms (Private, Microsoft Azure, Amazon Web Services, Google Cloud Platform). DCPAM Data Warehouse Solution is hardware- and database-agnostic with component-based architecture.

Main Overview

Company data in DCPAM

DCPAM is responsible for copying data from one or more sources into a destination system. That process consists of three steps: Extraction, Transformation and Load, with Transform and Load in various configurations: Extract-Transform-Load (ETL) or Extract-Load-Transform (ELT).

Data in the Warehouse can be accessed directly at the database level with any system for analytics, such as Power BI, Tableau, Redash and others. For complex architectures with more than one DCPAM Database node, DCPAM Warehouse Data Server is dedicated data access point. For more details, please refer to DCPAM WDS documentation.

All of the DCPAM Data Warehouse operations are monitored by DCPAM Live Component State. See the documentation to read more on this subject.

Data Warehouse with DCPAM

As DCPAM is extremely modular and highly scalabe, it can serve both as Data Warehouse and number of dedicated Data Marts.

What DCPAM covers in terms of Data Warehousing?

  • JSON-based source systems configuration.
  • SQL and JSON-based configuration of the ETL processes:
    • Data extraction:
      • Inserted data
      • Deleted data
      • Modified data
    • Staging Area:
      • optional
      • placed locally in DCPAM Database
      • placed in external database
    • Data transformation:
      • optional
      • handled locally (in relation to the Staging Area)
      • handled remotely (in relation to the Staging Area)
    • Data load from:
      • local Staging Area
      • remote Staging Area
      • results of the Extract subprocess
  • Parallel execution:
    • By design:
      • Each ETL process runs in separate thread.
    • By running multiple instances of DCPAM:
      • On the same server
      • On many disparate servers
  • SQL and JSON-based preconfigured queries for data analysis.
  • Data Warehouse and Data Marts:
    • One or many instances of DCPAM can work as Data Warehouse (extracting and processing data from disparate sources).
    • In the same time different DCPAM instances can use Data Warehouse to feed Data Marts with specific business-oriented data.
  • Data sources:
  • Data Warehouse monitoring:
    • DCPAM ETL
    • DCPAM RDP
    • DCPAM WDS
    • DCPAM LCS
  • Access DCPAM Database with any system for analytics (Power BI, Tableau, Redash etc.).

Elements yet to be covered by DCPAM

  • DCPAM Admin web application:
    • Manage data sources.
    • Configure ETL processes.
    • Manage DCPAM BI users.
  • DCPAM Monitoring (web application for DCPAM LCS).
  • DCPAM LCS Notifications
  • Data sources:
    • LDAP
    • Cassandra

Other

  • Choose Data Warehouse DBMS, sufficient hardware and disk space.
  • Consider Data Warehouse tables schema:
    • Snowflake schema [1]
    • Star schema [2]
    • Galaxy schema
    • Fact constellation [3]
  • Project data structures:
    • Staging Area
    • Target tables
    • Indexes
    • Views
  • ...and much more.

Technology

DCPAM Data Warehouse Solution is hardware- and database-agnostic with component-based architecture.

DCPAM Components

DCPAM Data Warehouse Solution consists of a number of integrated components. Each component has unique log file, and DCPAM ETL - due to parallel execution - can create many log files.

DCPAM ETL - Extract-Transform-Load / Extract-Load-Transform

DCPAM ETL is the main ETL/ELT engine. Each running instance can handle many ETL/ELT processes simultaneously. Moreover, many instances of DCPAM ETL can work within single Data Warehouse.

DCPAM WDS - Warehouse Data Server

DCPAM WDS is dedicated Massively Parallel Processing service for querying predefined and custom business data with in-memory caching system. DCPAM WDS cluster encapsulates every database into a single data access point. This component can be used without entire DCPAM Data Warehouse Solution and operate as a stand-alone solution.

DCPAM RDP - Remote Data Processor

DCPAM RDP is used by DCPAM ETL to execute transform scripts/applications that must be run on separate machines when performance impact is significant. DCPAM ETL communicates with these remote scripts/applications through DCPAM RDP.

DCPAM Architecture overview

DCPAM workflow overview

DCPAM Transform process architecture overview Example of DCPAM Extract-Transform-Load / Extract-Load-Transform process scalability:

  • Multiple DCPAM ETL engine nodes can be run within single Data Warehouse.
  • Each DCPAM ETL instance can trigger unlimitend number of local and remote (through DCPAM RDP) data transformation scripts/applications.
  • Each DCPAM ETL instance can use dedicated Staging Area node (local or remote).
  • Many DCPAM Database nodes can be encapsulated into single data access point by DCPAM WDS.
  • DCPAM ETL, Staging Area, DCPAM Database and DCPAM WDS can run on single server as well!

Following diagrams represents different configurations to deploy both Data Warehouse and Data Marts.

DCPAM overview

Data Warehouse with DCPAM

DCPAM overview

Data Marts with DCPAM Data Warehouse

DCPAM LCS - Live Component State

DCPAM LCS is the central repository of informations about all the DCPAM Components current state. It is the most important DCPAM Component from an administrative point of view.

DCPAM Construct

DCPAM Construct is going to be the main system administration web application.

DCPAM Monitoring

DCPAM Monitoring is going to be web application for DCPAM LCS data presentation.

Data sources

DCPAM development is still in progress with following data sources available:

Databases

Data source Support
PostgreSQL native
MySQL 8 native
MariaDB/MySQL 5 native
SQL Server/Azure SQL Database native via ODBC*
Oracle Database native
SQLite3 native
IBM Db2 ODBC
Microsoft Access ODBC
MongoDB native

* SQL Server/Azure SQL Database: ODBC is the primary native data access API for applications written in C and C++ for SQL Server.

Please note that DCPAM provides support for every ODBC-compliant data source.

Flat Files

Data source Support
CSV native
TSV native
PSV native
JSON native
XLS script
XLSX script
XLSM script
XLSB script
ODF script
ODS script
XML script

API

DCPAM can access files and other resources from local or remote locations. The latter are fetched via HTTP/HTTPS protocol and Battery HTTP Server is recommended. Parameters such as URL, method, headers and additional payload are supported. Files are loaded to temporary tables in DCPAM or external database to make SQL operations possible for this kind of data.

DCPAM Database

DCPAM is designed to be as most customizable as it needs to be. Therefore every database listed above as available data source can also be used as DCPAM target database.

Docker images

Each DCPAM Component is provided with Dockerfile to build Docker image:

  • DCPAM ETL: dcpam-etl.dockerfile
  • DCPAM RDP: dcpam-rdp.dockerfile
  • DCPAM WDS: dcpam-wds.dockerfile
  • DCPAM LCS: dcpam-lcs.dockerfile
Currently under active development
  • DCPAM WDS: cache TTL.
  • DCPAM ETL: LDAP support.

Professional Services for DCPAM

Under development.

Roadmap

Year Quarter Feature Status
2020 Q3 ✅ Transform process Done (2020-08-07)
2020 Q3 ✅ DCPAM WDS Done (2020-08-17)
2020 Q3 ✅ DCPAM auth keys Done (2020-08-19)
2020 Q3 ✅ Data source: CSV/TSV/PSV Done (2020-08-30)
2020 Q3 ✅ Remote files Done (2020-09-01)
2020 Q3 ✅ Data source: JSON Done (2020-09-03)
2020 Q3 ✅ DCPAM LCS Done (2020-09-15)
2020 Q3 ✅ Docker images for all the DCPAM Components Done (2020-09-20)
2020 Q3 ✅ HTTPS support Done (2020-09-22)
2020 Q3 ✅ Data source: externally preprocessed Done (2020-09-24)
2020 Q3 ✅ Data source: XML Done (2020-09-24)
2020 Q3 ✅ DCPAM ETL connectivity tests Done (2020-09-25)
2021 Q1 ✅ Data source: APIs Done (2021-02-01)
2021 Q1 ✅ DCPAM WDS true Multi Parallel Processing Done (2021-02-02)
2021 Q1 ✅ DCPAM WDS: horizontal scaling Done (2021-02-11)
2021 Q1 ✅ DCPAM WDS: Advanced Query Cache Done (2021-02-14)
2021 Q1 ✅ Data source: MongoDB Done (2021-02-27)
2021 Q1 Data source: LDAP In progress
2021 Q1 DCPAM WDS: cache TTL In progress
2021 Q1 DCPAM ETL: report workflow failures to DCPAM LCS Not implemented
2021 Q1 Read complex SQL queries from files Not implemented
2021 Q2 DCPAM Monitoring Not implemented
2021 Q4 DCPAM Construct (admin web application) Not implemented
2022 Q1 DCPAM AI Platform Not implemented

This software uses: