Learning DE Roadmap #49

Open
JPHaus opened this issue Dec 24, 2023 · 5 comments
Labels
documentation Improvements or additions to documentation good first issue Good for newcomers help wanted Extra attention is needed

Comments

@JPHaus
Collaborator

JPHaus commented Dec 24, 2023

A frequent request in the community is a structured roadmap for learning Data Engineering, and it's about time we start addressing it. We currently have a getting started guide, but it's not detailed enough and was meant to be improved on anyway.

It can be a complex question to answer, but we can simplify it by adding a few constraints. Since the majority of folks asking are new to DE or trying to transition into it, we should focus on skills for junior/entry-level and mid-level roles. While there aren't many junior roles at the moment, it can still be useful to make the distinction for foundational skills. To make it as general as possible, I believe we should exclude tools/requirements that only apply to FAANG-like companies, since they are more niche and FAANG companies have often developed their own internal tooling to solve their unique problems. Finally, the focus should be on core concepts instead of tooling. While we can mention specific tools, we should avoid directly recommending them and instead point learners to pages that list the currently popular tools, to keep this resource as evergreen as possible (example: workflow orchestration popular tools).

While I don't believe a diagram is a requirement, I do think it could be helpful if we can get it to render nicely in Mermaid, because we can then make it interactive and link to other notes in the wiki like we do with other diagrams. The canvas feature is not yet supported by Obsidian Publish, so we would probably use a Mermaid flowchart for now.
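
As a rough illustration of the interactive flowchart idea, here is a minimal Python sketch that generates Mermaid flowchart source with one clickable node per roadmap step. The step names, the slugs, and the `https://example.com/wiki` base URL are placeholders chosen for illustration only, not the actual wiki structure, and `to_mermaid` is a hypothetical helper.

```python
def to_mermaid(steps, base_url):
    """Build Mermaid flowchart source for a linear roadmap.

    steps: list of (label, slug) pairs, in learning order (hypothetical data).
    base_url: placeholder root URL that each node links to.
    """
    lines = ["flowchart TD"]
    # One node per step, with the label shown inside the box.
    for i, (label, _slug) in enumerate(steps):
        lines.append(f'    n{i}["{label}"]')
    # Chain the nodes in order.
    for i in range(len(steps) - 1):
        lines.append(f"    n{i} --> n{i + 1}")
    # Mermaid's click directive makes a node open a URL.
    for i, (_label, slug) in enumerate(steps):
        lines.append(f'    click n{i} "{base_url}/{slug}"')
    return "\n".join(lines)


# Placeholder steps; the real roadmap content is still under discussion.
EXAMPLE_STEPS = [
    ("SQL", "sql"),
    ("Programming Language", "programming-language"),
    ("Workflow Orchestration", "workflow-orchestration"),
]

if __name__ == "__main__":
    print(to_mermaid(EXAMPLE_STEPS, "https://example.com/wiki"))
```

The printed text can be pasted into a Mermaid code block in a note. Whether the click links work depends on how the publishing platform renders Mermaid, so that part would need to be checked.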

Existing popular roadmap shared in the community:

For V1, please share any thoughts/ideas/constructive criticism on the structure and core concepts. I'll start a new branch after Christmas and start something we can work from.

JPHaus added the documentation, help wanted, and good first issue labels Dec 24, 2023
JPHaus self-assigned this Dec 24, 2023
@gr8web

gr8web commented Dec 25, 2023

https://awesomedataengineering.com/

@oguzhangur96

Here are a couple of ideas:

  • Anything that is not a core data concept should come after the core concepts. Ex. Anything related to testing should be way after SQL. You should not be writing integration or even unit tests if you can't build a working pipeline.
  • Core concepts can be defined as: Concepts that can be found in most junior/mid-level data engineers' day-to-day tasks
    1. SQL
    2. Programming Language
    3. Storage (S3 / hard drive)
    4. OLAP / OLTP Database
    5. Orchestrator (I would even put this as three; it is the core of batch processing, which is most of the industry)
    6. Basic data modelling (data types, partitions, maybe primary, foreign, unique keys and indices)
    7. Git
    So no Streaming, NoSQL, MapReduce, Hadoop etc.
  • After the core concepts, more advanced and less commonly used ones can be covered, such as:
    1. Distributed Computing, Hadoop, MapReduce
    2. Spark
    3. NoSQL (Just to understand more of distributed computing and to pull / push data)
    4. Docker and basic infrastructure and networking
    5. Advanced data modelling (snowflake schema, star schema, One Big Table)
    6. Streaming (I believe this is way too advanced for beginners)
    7. Anything that I missed or more related to general software engineering (Tests, CI/CD etc.)
    8. Cloud and advanced infrastructure (maybe Kubernetes)

@sdairs

sdairs commented Dec 26, 2023

Like anything else, streaming has basic and advanced levels; it's not a niche skill, nor is it gated to senior engineers. IMO it's a huge disservice for someone at the junior level not to have basic familiarity with where the industry is heading.

@JPHaus
Collaborator Author

JPHaus commented Feb 5, 2024

I just created a page to start playing around with. I figured we'd start with entry level -> mid level roles since that's what most people are looking for. I'll add more to this later but feel free to make edits. https://github.com/data-engineering-community/data-engineering-wiki/tree/49-roadmap
image

@gr8web

gr8web commented Feb 6, 2024

> I just created a page to start playing around with. I figured we'd start with entry level -> mid level roles since that's what most people are looking for. I'll add more to this later but feel free to make edits. https://github.com/data-engineering-community/data-engineering-wiki/tree/49-roadmap image

In my opinion, databases should be part of computer science.
"Computer resources" would probably just be called "Operating Systems".
So computer science fundamentals would be something like:

Operating Systems -> Networks -> Databases, or Operating Systems -> Databases -> Networks

But generally, I'm not sure the whole thing isn't just reinventing the wheel in the end.

4 participants