From Data Analyst to Data Engineer: My 12-Month Self-Study Roadmap



Part of the reason I started this journey is that data engineering is currently one of the most popular and highest-paying professions. I’m not going to pretend that wasn’t a factor.

But that’s not all.

I’ve been learning data analysis for a while now: SQL, Power BI, Python (Pandas, NumPy, a little Polars), data cleaning, EDA. What can I say, I’ve been deep in the weeds with it, and I really enjoy it. But somewhere along the line, I started wanting to know what happens before the data hits my desk. How does it get there? Who builds that pipeline? What does the infrastructure behind all this actually look like?

That curiosity planted a seed.

AI has since made many of the things I do faster and easier. That’s great. But it also got me thinking. If AI can handle analytics, what’s my advantage? What should I build and understand to get even deeper? I’m working as an IT systems analyst at a startup, and while I enjoy my work, I’ve found that I’m not challenging myself as much as I would like. I was ready for more.

The final push was a video by Data With Baraa in which he laid out a complete data engineering roadmap. Seeing it structured and broken down made the whole thing feel real and doable. So here I am.

I am learning data engineering in public. And this article is the beginning of that journey.

One disclaimer: I am not affiliated with Data With Baraa. I’m just sharing my personal journey. I hope this helps.

Why data engineering in particular?

I think this question needs a real answer, so I’d like to spend a little time here.

In data analysis, I learned how to work with data after it arrives: clean it, explore it, visualize it, and draw insights from it. That skill set is really valuable. However, the more I learned, the more I kept hitting the same wall. The data I was working with had already been shaped and moved by others. Someone built a pipeline to get it to me. Someone decided how it would be stored, how it would be structured, and how often it would be updated.

I wanted to be that person.

Data engineering is upstream from analytics. It’s about building the systems that make analysis possible in the first place: data pipelines, storage architecture, workflow orchestration, and large-scale data processing. These are the foundations on which everything else is built. And to be honest, that kind of infrastructure work appeals to me in a way that pure analysis no longer does.

There are also practical considerations. Data engineering roles consistently rank among the highest-paying jobs in the data industry. As AI tools automate more of the analytics layer, I expect the demand for people who can build and maintain reliable data infrastructure to keep growing. I’d rather build the pipeline than just use one.

And one more thing. The startup I work for doesn’t use any of the tools I’m about to learn. This means that all the time I spend on this is completely voluntary. There is no team to learn from and no work project to apply it to. It’s just me, the internet, and what I can build. It’s a challenge I chose intentionally.

Why am I doing this in public?

Writing about what I learn is something I already deeply believe in. I have to actually understand something before I can explain it. It keeps me accountable. And over time, it builds something that a resume alone never could.

But I also want to be honest about my fears, because I think that’s part of what it means to do this publicly.

I have Shiny Object Syndrome. I explored graphic design, animation, writing, marketing, and IT before coming to data. There’s always something new and exciting catching my attention. Data engineering could easily be replaced by the next fancy thing in my feed if I’m not intentional about it.

Consistency is another issue. I work a 9-5 job and rarely touch the tools I’m about to learn. There’s no natural reinforcement at work and no colleagues to bounce Airflow questions off. I’m building this entirely on my own time, outside my work responsibilities.

And then there’s balance. The target is 3 to 4 hours a day. Some days it will feel easy. Other days it will feel impossible.

Publishing this journey is my accountability system. If I go quiet, you’ll know I slipped. And I’d rather not slip.

What I’m starting with

It helps that I’m not starting from scratch. I already have beginner-to-intermediate SQL knowledge, basic Python fundamentals, and hands-on experience with Pandas from my data analysis work. That gives me a foundation to build on rather than starting from zero.

The full learning stack is below. I’ll work through it in roughly this order.

1. SQL: Deeper than analysis

I know SQL. However, analytical SQL and engineering SQL are different. I’ll be going deeper into optimizing queries, creating indexes, working with very large datasets, and writing SQL that is built with performance in mind, not just exploration. If you’ve only ever used SQL to retrieve and filter data, there’s a whole other layer underneath that’s worth understanding.

Why it’s first: Almost everything in data engineering ultimately touches SQL. Getting solid here before adding more complex tools will make the rest easier.
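To make the indexing point concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The orders table, its columns, and the index name are all made up for illustration, and the exact plan text varies between SQLite versions, so treat this as a way to poke at the idea rather than a benchmark.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as a list of strings."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return [row[-1] for row in rows]


# Hypothetical table, populated with synthetic rows purely for the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

sql = "SELECT SUM(amount) FROM orders WHERE customer_id = ?"

# Without an index on customer_id, SQLite has to scan the whole table.
plan_before = query_plan(conn, sql, (42,))

# With the index, it can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The query and its result are identical in both cases. Only the plan changes, from a full scan to an index search, and that difference is what I mean by the layer underneath.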

2. Python: From exploratory to production-ready

The basics are done: Pandas, NumPy, and some Polars. However, the Python I’ve written mostly lives in notebooks. It’s exploratory, messy, and not built to last. The goal now is to write cleaner, more structured, reusable code: functions, modules, error handling, scripts. This is the kind of Python that can actually go into a pipeline.

Why it’s important: Python is the glue that holds modern data engineering stacks together. Airflow pipelines are defined in Python. PySpark is Spark’s Python API. Feeling comfortable here is non-negotiable.
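As a rough picture of the difference between notebook code and pipeline code, here is a small sketch. The Sale record, the parse_sales function, and the sample CSV text are hypothetical names I made up for illustration. The point is the shape: a typed record, one function with a clear job, and bad rows that get logged and counted instead of crashing the run.

```python
import csv
import io
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class Sale:
    """One validated row of sales data (hypothetical schema)."""
    order_id: int
    amount: float


def parse_sales(lines):
    """Parse CSV lines into Sale records, skipping rows that fail validation.

    Returns a tuple of (valid records, number of skipped rows).
    """
    good, skipped = [], 0
    for row in csv.DictReader(lines):
        try:
            good.append(Sale(int(row["order_id"]), float(row["amount"])))
        except (KeyError, TypeError, ValueError) as exc:
            # Log and count the bad row rather than stopping the whole load.
            skipped += 1
            logger.warning("Skipping bad row %r: %s", row, exc)
    return good, skipped


# Sample input with one deliberately broken row.
raw = "order_id,amount\n1,19.99\n2,not_a_number\n3,5\n"
sales, skipped = parse_sales(io.StringIO(raw))
print(sales, skipped)
```

Whether to skip bad rows or fail fast depends on the pipeline. Skipping is just the choice I made for this sketch.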

3. Git and GitHub: Proper version control

I’ll be honest. My Git knowledge at this point is just “copy the command and hope it works.” That has to change. Version control is fundamental to working like an engineer and not just an analyst. I’ll be learning how to properly manage my code across branches, pull requests, and projects.

Why it’s important: Every project I build from now on will live on GitHub. It’s a portfolio, it’s a discipline, and it’s how teams actually work.

4. Apache Spark and PySpark: Big Data Processing

This is where things get really exciting. Apache Spark is one of the most widely used engines for processing large-scale data. PySpark is its Python API. This means I can work with large-scale, distributed data using a somewhat familiar language.

Moving from Pandas to Spark is a mindset change. Pandas runs on a single machine. Spark is built to run across clusters. Learning this distributed way of thinking is one of the skills that separates data engineers from analysts.

Why it’s important: If you want to work with big data in a production environment, Spark is hard to avoid. It shows up in job descriptions all the time and is at the core of the Databricks ecosystem I’m building toward.
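To illustrate the mindset shift, here is a toy sketch in plain Python. It is not Spark code and does not use the PySpark API. It only imitates the split, aggregate, and combine pattern that distributed engines rely on, with lists standing in for partitions on different workers. All function names and the sample rows are made up for illustration.

```python
from collections import Counter


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into partitions, standing in for data spread across workers."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def partial_totals(partition):
    """Aggregate one partition on its own, as a single worker would."""
    totals = Counter()
    for key, amount in partition:
        totals[key] += amount
    return totals


def combine(partials):
    """Merge the per-partition results into one final answer."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values for matching keys
    return merged


# Hypothetical (customer, amount) rows.
rows = [("a", 10), ("b", 5), ("a", 7), ("c", 1), ("b", 2)]
partitions = split_into_partitions(rows, num_partitions=3)
result = combine(partial_totals(p) for p in partitions)
print(dict(result))
```

No single step ever needs to see all the rows at once. That is the habit of thought I’m trying to build before touching a real cluster.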

5. Apache Airflow: Orchestrating data pipelines

Data pipelines don’t run by themselves. Something is needed to schedule them, monitor them, and handle failures gracefully. That’s where workflow orchestration tools come in. I chose Airflow.

I considered several options here. Databricks Workflows is ideal if you’re already deep in the Databricks ecosystem. Azure Data Factory suits Azure-heavy environments. However, Airflow is free, open source, cloud-agnostic, and widely used across industries. It also teaches core orchestration concepts in a way that carries over to other tools. Starting with Airflow felt like the right decision, especially since I’m trying to keep costs low.

Why it’s important: Orchestration is what turns a collection of scripts into an actual pipeline. Understanding Airflow means understanding how production data workflows are managed.
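Here is a toy sketch of what an orchestrator does at its core: run tasks in dependency order and retry the ones that fail. It uses Python’s standard-library graphlib module (Python 3.9+), not Airflow’s API, and every name in it (run_pipeline, the three task functions, the simulated failure) is hypothetical. Real orchestrators add scheduling, logging, alerting, and much more on top.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order, retrying each failed task up to max_retries times.

    tasks maps a task name to a callable; dependencies maps a task name to
    the set of task names that must finish before it. Returns the execution
    order and the number of attempts each task needed.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    attempts = {}
    for name in order:
        for attempt in range(1, max_retries + 2):
            attempts[name] = attempt
            try:
                tasks[name]()
                break  # task succeeded, move on to the next one
            except Exception:
                if attempt == max_retries + 1:
                    raise  # out of retries: fail the whole pipeline


    return order, attempts


# Hypothetical tasks; transform fails once to simulate a transient error.
calls = {"transform": 0}


def extract():
    pass


def transform():
    calls["transform"] += 1
    if calls["transform"] == 1:
        raise RuntimeError("simulated transient failure")


def load():
    pass


tasks = {"extract": extract, "transform": transform, "load": load}
dependencies = {"transform": {"extract"}, "load": {"transform"}}
order, attempts = run_pipeline(tasks, dependencies)
print(order, attempts)
```

The scripts themselves stay simple. The ordering and the failure handling live in one place, and that separation is the idea I want to carry over into Airflow.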

6. Databricks: Data Platform

At some point, you need to choose a data platform and dive deep into it. I’m going with Databricks. It’s built on Spark, it’s in high demand, and it has a free Community Edition that lets me practice without paying for cloud credits.

There are also plenty of alternatives. Snowflake is a clean, fast SQL warehouse that many enterprises love. BigQuery is Google’s fully managed serverless option, which is great if you’re on Google Cloud. But Databricks sits at the intersection of big data, machine learning, and data engineering in a way that aligns with where I want to go. It made the most sense for my goals.

Why it’s important: Employers want platform experience. Knowing one platform deeply is more valuable than knowing a little about all of them.

How am I structuring the 12 months?

The honest answer is that this could take more than 12 months. That’s OK. I’d rather take 15 months and actually understand what I’m doing than rush and end up with a shaky foundation in 12.

My approach is to learn each skill in sequence and not move forward until I’ve built something with what I just learned. Tutorials are good for orientation, but projects are where the real learning happens. My plan is to document each phase on “Toward Data Science”: concepts, projects, setbacks, and successes.

To track my progress, I’m using Data With Baraa’s Notion roadmap as my backbone. It breaks each skill down into core topics, so I can keep track of where I am without being overwhelmed by the big picture all at once.

The target time commitment is 3 to 4 hours a day. Part of that will be structured learning. Part will be building. And part will be writing about what I’ve learned, which is itself a form of learning.

What does success look like?

The goal is to land a well-paying data engineering job. It’s true and I’m not trying to dress it up.

But at the same time, I want to be a trusted voice in this area: someone who can build something worth talking about, document the journey without leaving out the hard parts, and make the path a little clearer for whoever comes after me.

Writing and learning feed each other. The portfolio is the proof. That proof builds the brand. That’s the vision.

Starting today

This article marks my official start date. I’m not waiting until I feel ready or until everything is perfectly planned. I’m starting now and figuring it out as I write, in public, even if the process is a little messy.

If you’re on a similar path, whether you’re thinking about moving from analytics into engineering, wondering what’s next in IT, or building the skills to stay valuable in an AI-accelerated world, follow along.

I think we have a lot to talk about. I will also be sharing what I learn on my YouTube channel. Please feel free to subscribe and follow me below.


This is the first article in an ongoing series chronicling my data engineering journey. I’ll be posting regularly about my progress, the projects I’m building, and everything I’ve learned along the way.

Also, for those of you on the same journey, if you would like to access the Notion templates, you can do so here.

Follow my journey below.

YouTube

Medium

LinkedIn

Twitter


