Data Engineering Guide: From Excel to Professional Pipelines
TL;DR
- Data Engineering = building infrastructure so data flows reliably
- It’s not Data Science (they analyze, we build the pipes)
- 80% of the job is cleaning data, not glamorous stuff
- Basic stack: SQL + Python + an orchestrator (Airflow, Prefect)
- If you come from Excel/Power BI, you have more foundation than you think
What Data Engineering Really Is
A Data Engineer builds and maintains the infrastructure that allows data to flow from point A to point B reliably.
Point A: Databases, APIs, files, sensors, logs, whatever. Point B: Data warehouses, dashboards, ML models, applications.
Your job is making that flow work. Always. Without errors. At scale.
What It’s NOT
- Not Data Science: They build predictive models. You build the pipeline that feeds those models.
- Not Data Analysis: They extract insights from data. You make sure they have clean data to analyze.
- Not DBA: They optimize databases. You move data between systems.
There’s overlap, especially in small companies. But the specialization is real.
The Real Day-to-Day
If you think a Data Engineer spends their day designing elegant architectures… not exactly.
What I do 80% of the time
1. Clean data (see the sketch after this list)
- Dates in 15 different formats
- Fields that should be numeric but contain “N/A”, “-”, “ ”
- Duplicates where you can’t tell whether they’re true duplicates or different records
- Broken encoding (goodbye ñ, hello Ã±)
2. Debug pipelines
- “Yesterday’s dashboard is empty” → vendor’s SFTP changed IP
- “Numbers don’t match” → someone changed a column in the ERP
- “The process took 6 hours” → a query without an index
3. Convince people
- “But my Excel works fine” → until it doesn’t
- “Why can’t we use Google Sheets?” → because 500 MB of data won’t fit in a spreadsheet
- “This is urgent” → everything is urgent
4. Document (or regret not doing it)
- “What does this 2019 script do?” → nobody knows, the author left
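What that cleaning looks like in practice, as a minimal pandas sketch (the DataFrame and column names are invented for illustration):
# Hypothetical cleanup: messy dates, fake-numeric columns, suspicious duplicates
import pandas as pd
df = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "signup_date": ["2026-01-05", "2026-01-05", "not a date"],
    "amount": ["120.5", "N/A", "-"],
})
# Dates: parse what parses, everything else becomes NaT instead of crashing the job
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
# "N/A", "-", " ": map to missing, then force a real numeric dtype
df["amount"] = pd.to_numeric(df["amount"].replace({"N/A": None, "-": None, " ": None}), errors="coerce")
# Possible duplicates: flag them so a human decides, instead of silently dropping rows
df["possible_duplicate"] = df.duplicated(subset=["customer_id", "signup_date"], keep=False)
print(df)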
What I do 20% of the time
- Design new pipelines
- Evaluate tools
- Optimize existing processes
- Learn new things
That 20% is the fun part. But the 80% pays the bills.
The Modern Stack (2026)
Level 1: The minimum you need
| Tool | What it’s for |
|---|---|
| SQL | Query and transform data in databases |
| Python | Scripts, automation, APIs |
| Git | Version control (yes, for data too) |
| Terminal/Bash | Basic automation |
With this you can do 70% of the work. Seriously.
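To make that concrete, here’s a toy end-to-end script that uses nothing beyond Level 1: Python plus SQL, with SQLite standing in for a real database just so it runs anywhere (file and table names are made up):
# Toy Level 1 pipeline: Python + SQL, nothing else
import sqlite3
conn = sqlite3.connect(":memory:")  # stand-in for your company database
conn.execute("CREATE TABLE daily_sales (day TEXT, amount REAL)")
conn.executemany("INSERT INTO daily_sales VALUES (?, ?)",
                 [("2026-01-15", 120.5), ("2026-01-15", 30.0), ("2026-01-16", 80.0)])
# The "transform" step is plain SQL
for row in conn.execute("SELECT day, SUM(amount) FROM daily_sales GROUP BY day ORDER BY day"):
    print(row)
conn.close()
Version it with Git and schedule it with cron from Bash, and you’ve touched all four rows of that table.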
Level 2: The typical company stack
| Category | Popular options |
|---|---|
| Orchestration | Airflow, Prefect, Dagster |
| Transformation | dbt, Spark, pandas |
| Storage | Snowflake, BigQuery, Databricks, Redshift |
| Ingestion | Fivetran, Airbyte, Singer |
| Quality | Great Expectations, dbt tests, Soda |
Level 3: What large companies add
- Data Catalog: Datahub, Amundsen, Atlan
- Governance: Collibra, Alation
- Monitoring: Monte Carlo, Bigeye
- Streaming: Kafka, Kinesis, Flink
My recommendation to start
- Solid SQL (not basic, solid)
- Python with pandas and requests
- One orchestrator (Prefect is easier than Airflow)
- dbt for transformations
- A warehouse (BigQuery has a generous free tier)
With that you can work at 90% of companies.
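If “an orchestrator” sounds abstract, this is roughly what a minimal flow looks like, sketched with Prefect 2-style decorators (the task bodies are placeholders, not a real pipeline):
# Minimal Prefect-style flow: extract -> transform -> load, with retries handled by the framework
from prefect import flow, task

@task(retries=2)
def extract():
    return [{"id": 1, "amount": "120.5"}]  # pretend this calls an API

@task
def transform(rows):
    return [{**r, "amount": float(r["amount"])} for r in rows]

@task
def load(rows):
    print(f"would load {len(rows)} rows into the warehouse")

@flow
def daily_pipeline():
    load(transform(extract()))

if __name__ == "__main__":
    daily_pipeline()
The point versus a plain script: retries, logging, and scheduling come from the framework instead of from hand-rolled cron jobs and try/except blocks.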
ETL vs ELT: The Paradigm Shift
ETL (Extract, Transform, Load) - The Classic Approach
Source → [Transform] → Data Warehouse
You transform data BEFORE loading. Requires knowing what you need upfront.
Pros: Less data in warehouse, more control. Cons: Less flexible, more complex pipelines.
ELT (Extract, Load, Transform) - The Modern Approach
Source → Data Warehouse → [Transform]
You load everything and transform INSIDE the warehouse using SQL.
Pros: More flexible, tools like dbt make it elegant. Cons: More data stored, storage costs.
The Reality
Most modern projects use ELT because:
- Storage is cheap
- Modern warehouses are very fast
- dbt makes transformations elegant
- You can iterate without rebuilding pipelines
But ETL still makes sense for:
- Sensitive data that shouldn’t arrive raw
- Very large volumes where transform reduces a lot
- Legacy systems that don’t support the modern approach
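A toy version of the ELT pattern, with SQLite standing in for the warehouse purely so it runs locally (in real life the warehouse would be BigQuery or Snowflake and the SQL would live in a dbt model):
# ELT sketch: load the raw data first, transform later with SQL inside the "warehouse"
import sqlite3
import pandas as pd
raw = pd.DataFrame({"day": ["2026-01-15", "2026-01-15", "2026-01-16"], "amount": ["10", "20", "5"]})
warehouse = sqlite3.connect(":memory:")  # stand-in for the warehouse
raw.to_sql("raw_orders", warehouse, index=False)  # Load: untouched, types and all
# Transform: plain SQL you can iterate on without re-ingesting anything
clean = pd.read_sql_query(
    "SELECT day, SUM(CAST(amount AS REAL)) AS total FROM raw_orders GROUP BY day ORDER BY day",
    warehouse,
)
print(clean)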
Data Quality: 80% of the Job
The prettiest data is useless if it’s wrong, and in practice most raw data needs real cleaning before it’s useful.
Typical Problems
| Problem | Example |
|---|---|
| Completeness | 30% of emails empty |
| Uniqueness | Customer duplicates |
| Consistency | “USA”, “US”, “usa”, “United States” |
| Validity | Future dates in “birth date” |
| Accuracy | Incorrect rounding |
| Timeliness | 3-month-old data as “current” |
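Two of those, consistency and uniqueness, look roughly like this in pandas (the country mapping and column names are invented for the example):
# Hypothetical fix for "USA"/"US"/"usa"/"United States" plus duplicate customers
import pandas as pd
df = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "country": ["USA", "usa", "United States"],
})
country_map = {"usa": "US", "us": "US", "united states": "US"}
df["country"] = df["country"].str.lower().map(country_map).fillna(df["country"])
# Uniqueness: keep one row per customer_id (in real life you'd decide which row survives)
df = df.drop_duplicates(subset=["customer_id"], keep="first")
print(df)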
How We Attack It
1. Validation at ingestion
# Simple example with Great Expectations (classic pandas API; raw_df is a DataFrame loaded earlier)
import great_expectations as ge
batch = ge.from_pandas(raw_df)
batch.expect_column_values_to_not_be_null("email")
batch.expect_column_values_to_match_regex("email", r".*@.*\..*")
2. Tests in transformations
-- dbt test
SELECT * FROM customers
WHERE created_at > CURRENT_DATE -- Should return no records
3. Continuous monitoring
- Alerts when metrics deviate
- Data quality dashboards
- Automatic anomaly detection
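An alert on a deviating metric doesn’t have to be fancy. A toy version of the idea (the table, the 50% threshold, and printing instead of paging someone are all placeholders):
# Toy volume check: complain if today's row count drops far below the recent average
import sqlite3
conn = sqlite3.connect(":memory:")  # stand-in for the warehouse connection
conn.executescript("""
    CREATE TABLE orders (day TEXT);
    INSERT INTO orders VALUES (DATE('now')), (DATE('now','-1 days')), (DATE('now','-1 days')), (DATE('now','-1 days'));
""")
today = conn.execute("SELECT COUNT(*) FROM orders WHERE day = DATE('now')").fetchone()[0]
recent_avg = conn.execute(
    "SELECT AVG(n) FROM (SELECT COUNT(*) AS n FROM orders WHERE day < DATE('now') GROUP BY day)"
).fetchone()[0]
if recent_avg and today < 0.5 * recent_avg:
    print(f"ALERT: only {today} rows today vs ~{recent_avg:.0f} on average")  # send to Slack/email in real life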
4. Clear ownership
- Who is responsible for each dataset?
- Who approves definition changes?
- Who responds when something fails?
If You Come From Excel/Power BI
Good news: you have more foundation than you think.
What You Already Know (even if you don’t call it that)
| In Excel/Power BI | In Data Engineering |
|---|---|
| VLOOKUP, INDEX/MATCH | JOINs |
| Pivot tables | GROUP BY, aggregations |
| Power Query | ETL/ELT |
| Table relationships | Dimensional modeling |
| Macros/VBA | Python scripts |
| Data connections | Data ingestion |
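The first two rows of that table, translated into code with pandas (hypothetical column names):
# VLOOKUP -> merge (a JOIN); pivot table -> groupby (a GROUP BY)
import pandas as pd
orders = pd.DataFrame({"customer_id": [1, 2, 1], "amount": [10, 25, 5]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ana", "Luis"]})
# VLOOKUP: bring the customer name onto each order
enriched = orders.merge(customers, on="customer_id", how="left")
# Pivot table: total amount per customer
summary = enriched.groupby("name", as_index=False)["amount"].sum()
print(summary)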
The Transition Path
Step 1: Deep SQL
- Not just SELECT, but window functions, CTEs, subqueries
- Query optimization
- Different dialects (SQL Server, PostgreSQL, BigQuery)
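For instance, “solid” means being comfortable writing something like this CTE plus window function. It’s shown here through Python’s built-in sqlite3 only so you can run it without installing anything (window functions need SQLite 3.25+):
# CTE + window function: running total per customer
import sqlite3
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INT, day TEXT, amount REAL);
    INSERT INTO orders VALUES (1,'2026-01-01',10),(1,'2026-01-02',20),(2,'2026-01-01',5);
""")
query = """
WITH daily AS (
    SELECT customer_id, day, SUM(amount) AS total
    FROM orders GROUP BY customer_id, day
)
SELECT customer_id, day, total,
       SUM(total) OVER (PARTITION BY customer_id ORDER BY day) AS running_total
FROM daily
"""
for row in conn.execute(query):
    print(row)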
Step 2: Basic Python
- pandas for data manipulation
- requests for APIs
- Automating repetitive tasks
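A sketch of what Step 2 buys you: pull JSON from an API with requests and shape it with pandas (the URL is a placeholder; a real API will need auth and pagination):
# Hypothetical ingestion script: API -> DataFrame -> CSV
import pandas as pd
import requests
resp = requests.get("https://api.example.com/orders", timeout=30)  # placeholder URL
resp.raise_for_status()
df = pd.json_normalize(resp.json())
df.to_csv("orders_raw.csv", index=False)
print(f"saved {len(df)} rows")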
Step 3: A real project
- Automate something you do manually
- Move data from A to B with a script
- Schedule a daily execution
Step 4: The ecosystem
- Git to version your code
- An orchestrator to schedule jobs
- A warehouse to store results
Power Query is literally an ETL tool. If you master it, you already understand the concepts. You just need to translate them to “grown-up” tools.
The Future: Data Engineering + AI
2026 is changing the role:
What AI already does well
- Generate basic SQL queries
- Write simple transformations
- Document existing code
- Detect data anomalies
What remains human
- Understanding the business
- Designing architectures that scale
- Debugging complex problems
- Deciding what data matters
My prediction
The junior Data Engineer who only knows how to run queries has their days numbered. The one who understands the business, designs systems, and uses AI as a tool… that one has a future.
How to Start (3-Month Plan)
Month 1: Fundamentals
- Advanced SQL (window functions, CTEs)
- Basic Python (pandas, files, APIs)
- Basic Git
Month 2: Tools
- One orchestrator (Prefect or Airflow)
- dbt for transformations
- A warehouse (BigQuery free tier)
Month 3: Real Project
- Choose a real problem (your data, public data)
- Complete pipeline: ingestion → transformation → output
- Documentation and basic tests
At the end of 3 months you’ll have something to show in interviews.
Recommended Resources
To Start
- Mode SQL Tutorial - Free and well-explained SQL
- Dataquest - Python for data
- The Data Engineering Handbook - Community resource
To Go Deeper
- “Fundamentals of Data Engineering” (Reis & Housley) - The reference book
- dbt Learn - Free official courses
- Data Engineering Zoomcamp - Free bootcamp
Communities
- Data Engineering Discord
- r/dataengineering on Reddit
- Locally Optimistic (Slack)
Conclusion
Data Engineering is building the infrastructure that makes data flow. 80% is cleaning data and debugging pipelines. 20% is designing elegant things.
If you come from Excel/Power BI, you already have the mindset. You just need to learn professional tools: solid SQL, Python, an orchestrator, dbt.
The field is growing, salaries are good, and AI won’t replace you if you understand the business beyond the technology.
Start with SQL. Really. Everything else comes after. As you advance, you might explore more sophisticated approaches like Data Fabric architecture for complex enterprise environments.