Zach Wilson's banner

Zach Wilson

@EcZachly • 51,233 subscribers

Founder @ https://t.co/CWvLDHU2Lx $80k/month | https://t.co/F5VqLpyMZn $5k/month | ADHD | 10 yrs big data experience | ex @meta, @netflix, and @airbnb

Videos

Anya Rossi

sweetdream.ai

SweetDream.ai•Sponsored•Livecam

Watch Anya Live

Anya is streaming live right now! Join her private show and enjoy exclusive content.

Exclusive private shows

1.2k viewers online

Private Show

Join now for exclusive access

Free preview available • Premium content

SQL has levels to it: - level 1 SELECT, FROM, WHERE, GROUP BY, HAVING, LIMIT Master these basic keywords and you’ll be well on your way to mastering SQL. - level 2 Mastering JOINs: Most common JOINs: INNER and LEFT Less common JOINs: FULL OUTER Joins you should avoid almost always: RIGHT and CROSS JOIN Mastering common table expressions (CTEs). The WITH keyword defines a CTE which you can imagine as a “variable” that you can query later. Using variables like this you can master algorithm techniques like recursion, breadth first search and more! CTEs also make your SQL much more readable and make your coworkers hate you less compared to nested sub queries. - level 3 Mastering window functions Window functions have 3 pieces: The function (i.e. SUM, RANK, AVG) The over clause to start the window The window definition which has 3 pieces: - how to split the window up with PARTITION BY - how to order the window with ORDER BY - how to restrict the window size with ROWS clause (useful for rolling monthly averages) Understand RANK vs DENSE_RANK vs ROW_NUMBER, I have been asked this in interviews a million times. - level 4 You understand table scans, b-tree indexes, and partitioning schemes to increase performance. Doing something like COUNT(CASE WHEN) is much better than doing multiple queries with a UNION ALL. UNION ALL is terrible for all sorts of reasons that I don’t want to get into in this post. B-trees indexes allow for efficient scanning of data in the WHERE clause. Use explain plans to understand if an index is actually being used or not! Partitioning is similar to indexes except it’s a “poor mans” index. It just keeps data in specific folders and skips the folders that don’t include the data I question. What else did I miss for mastering SQL?

SQL has levels to it: - level 1 SELECT, FROM, WHERE, GROUP BY, HAVING, LIMIT Master these basic keywords and you’ll be well on your way to mastering SQL. - level 2 Mastering JOINs: Most common JOINs: INNER and LEFT Less common JOINs: FULL OUTER Joins you should avoid almost always: RIGHT and CROSS JOIN Mastering common table expressions (CTEs). The WITH keyword defines a CTE which you can imagine as a “variable” that you can query later. Using variables like this you can master algorithm techniques like recursion, breadth first search and more! CTEs also make your SQL much more readable and make your coworkers hate you less compared to nested sub queries. - level 3 Mastering window functions Window functions have 3 pieces: The function (i.e. SUM, RANK, AVG) The over clause to start the window The window definition which has 3 pieces: - how to split the window up with PARTITION BY - how to order the window with ORDER BY - how to restrict the window size with ROWS clause (useful for rolling monthly averages) Understand RANK vs DENSE_RANK vs ROW_NUMBER, I have been asked this in interviews a million times. - level 4 You understand table scans, b-tree indexes, and partitioning schemes to increase performance. Doing something like COUNT(CASE WHEN) is much better than doing multiple queries with a UNION ALL. UNION ALL is terrible for all sorts of reasons that I don’t want to get into in this post. B-trees indexes allow for efficient scanning of data in the WHERE clause. Use explain plans to understand if an index is actually being used or not! Partitioning is similar to indexes except it’s a “poor mans” index. It just keeps data in specific folders and skips the folders that don’t include the data I question. What else did I miss for mastering SQL?

79,538 Aufrufe • vor 1 Jahr

Apache Spark has levels to it: - Level 0 You can run spark-shell or pyspark, it means you can start - Level 1 You understand the Spark execution model: •RDDs vs DataFrames vs Datasets •Transformations (map, filter, groupBy, join) vs Actions (collect, count, show) •Lazy execution & DAG (Directed Acyclic Graph) Master these concepts, and you’ll have a solid foundation - Level 2 Optimizing Spark Queries •Understand Catalyst Optimizer and how it rewrites queries for efficiency. •Master columnar storage and Parquet vs JSON vs CSV. •Use broadcast joins to avoid shuffle nightmares •Shuffle operations are expensive. Reduce them with partitioning and good data modeling •Coalesce vs Repartition—know when to use them. •Avoid UDFs unless absolutely necessary (they bypass Catalyst optimization). Level 3 Tuning for Performance at Scale •Master spark.sql.autoBroadcastJoinThreshold. •Understand how Task Parallelism works and set spark.sql.shuffle.partitions properly. •Skewed Data? Use adaptive execution! •Use EXPLAIN and queryExecution.debug to analyze execution plans. - Level 4 Deep Dive into Cluster Resource Management •Spark on YARN vs Kubernetes vs Standalone—know the tradeoffs. •Understand Executor vs Driver Memory—tune spark.executor.memory and spark.driver.memory. •Dynamic allocation (spark.dynamicAllocation.enabled=true) can save costs. •When to use RDDs over DataFrames (spoiler: almost never). What else did I miss for mastering Spark and distributed compute?

Apache Spark has levels to it: - Level 0 You can run spark-shell or pyspark, it means you can start - Level 1 You understand the Spark execution model: •RDDs vs DataFrames vs Datasets •Transformations (map, filter, groupBy, join) vs Actions (collect, count, show) •Lazy execution & DAG (Directed Acyclic Graph) Master these concepts, and you’ll have a solid foundation - Level 2 Optimizing Spark Queries •Understand Catalyst Optimizer and how it rewrites queries for efficiency. •Master columnar storage and Parquet vs JSON vs CSV. •Use broadcast joins to avoid shuffle nightmares •Shuffle operations are expensive. Reduce them with partitioning and good data modeling •Coalesce vs Repartition—know when to use them. •Avoid UDFs unless absolutely necessary (they bypass Catalyst optimization). Level 3 Tuning for Performance at Scale •Master spark.sql.autoBroadcastJoinThreshold. •Understand how Task Parallelism works and set spark.sql.shuffle.partitions properly. •Skewed Data? Use adaptive execution! •Use EXPLAIN and queryExecution.debug to analyze execution plans. - Level 4 Deep Dive into Cluster Resource Management •Spark on YARN vs Kubernetes vs Standalone—know the tradeoffs. •Understand Executor vs Driver Memory—tune spark.executor.memory and spark.driver.memory. •Dynamic allocation (spark.dynamicAllocation.enabled=true) can save costs. •When to use RDDs over DataFrames (spoiler: almost never). What else did I miss for mastering Spark and distributed compute?

36,123 Aufrufe • vor 1 Jahr

Getting a big tech data engineer job in 2016: - do you know SQL? - yes - here’s $500k Getting a big tech data engineer job in 2024: - do you know Spark, Kafka, Iceberg? - yes - did you shake hands with Bill Inmon when he invented the data warehouse? - no - rejected and ghosted Here’s why! Follow to stay up to date on data engineering!

Getting a big tech data engineer job in 2016: - do you know SQL? - yes - here’s $500k Getting a big tech data engineer job in 2024: - do you know Spark, Kafka, Iceberg? - yes - did you shake hands with Bill Inmon when he invented the data warehouse? - no - rejected and ghosted Here’s why! Follow to stay up to date on data engineering!

43,547 Aufrufe • vor 1 Jahr

There’s only 7 patterns in analytics! - aggregation - experimentation - prediction - decision trees - funnels - clustering - cumulation and derivative What other techniques do you know? Follow my blog at for more in depth analytical tips!

There’s only 7 patterns in analytics! - aggregation - experimentation - prediction - decision trees - funnels - clustering - cumulation and derivative What other techniques do you know? Follow my blog at for more in depth analytical tips!

25,429 Aufrufe • vor 1 Jahr

Here's how I would learn data engineering basics in 2025: - Find a data source you care about (examples: gaming APIs, stock market, web scraping, etc) - Use Python to interact and ingest your source. Initially just write the data to a CSV. - Setup an account with Snowflake or Google BigQuery. - update your Python script to load a table in Snowflake/BigQuery - schedule your script with CRON in the cloud with a service like Heroku. - build aggregations and visualizations on top of your ingested data Only thing this misses is data quality and complex job orchestration which you can learn later! How would you learn data engineering nowadays?

Here's how I would learn data engineering basics in 2025: - Find a data source you care about (examples: gaming APIs, stock market, web scraping, etc) - Use Python to interact and ingest your source. Initially just write the data to a CSV. - Setup an account with Snowflake or Google BigQuery. - update your Python script to load a table in Snowflake/BigQuery - schedule your script with CRON in the cloud with a service like Heroku. - build aggregations and visualizations on top of your ingested data Only thing this misses is data quality and complex job orchestration which you can learn later! How would you learn data engineering nowadays?

20,368 Aufrufe • vor 1 Jahr

Building Data Pipelines has levels to it: - level 0 Understand the basic flow: Extract → Transform → Load (ETL) or ELT This is the foundation. - Extract: Pull data from sources (APIs, DBs, files) - Transform: Clean, filter, join, or enrich the data - Load: Store into a warehouse or lake for analysis You’re not a data engineer until you’ve scheduled a job to pull CSVs off an SFTP server at 3AM! level 1 Master the tools: - Airflow for orchestration - dbt for transformations - Spark or PySpark for big data - Snowflake, BigQuery, Redshift for warehouses - Kafka or Kinesis for streaming Understand when to batch vs stream. Most companies think they need real-time data. They usually don’t. level 2 Handle complexity with modular design: - DAGs should be atomic, idempotent, and parameterized - Use task dependencies and sensors wisely - Break transformations into layers (staging → clean → marts) - Design for failure recovery. If a step fails, how do you re-run it? From scratch or just that part? Learn how to backfill without breaking the world. level 3 Data quality and observability: - Add tests for nulls, duplicates, and business logic - Use tools like Great Expectations, Monte Carlo, or built-in dbt tests - Track lineage so you know what downstream will break if upstream changes Know the difference between: - a late-arriving dimension - a broken SCD2 - and a pipeline silently dropping rows At this level, you understand that reliability > cleverness. level 4 Build for scale and maintainability: - Version control your pipeline configs - Use feature flags to toggle behavior in prod - Push vs pull architecture - Decouple compute and storage (e.g. Iceberg and Delta Lake) - Data mesh, data contracts, streaming joins, and CDC are words you throw around because you know how and when to use them. What else belongs in the journey to mastering data pipelines?

Building Data Pipelines has levels to it: - level 0 Understand the basic flow: Extract → Transform → Load (ETL) or ELT This is the foundation. - Extract: Pull data from sources (APIs, DBs, files) - Transform: Clean, filter, join, or enrich the data - Load: Store into a warehouse or lake for analysis You’re not a data engineer until you’ve scheduled a job to pull CSVs off an SFTP server at 3AM! level 1 Master the tools: - Airflow for orchestration - dbt for transformations - Spark or PySpark for big data - Snowflake, BigQuery, Redshift for warehouses - Kafka or Kinesis for streaming Understand when to batch vs stream. Most companies think they need real-time data. They usually don’t. level 2 Handle complexity with modular design: - DAGs should be atomic, idempotent, and parameterized - Use task dependencies and sensors wisely - Break transformations into layers (staging → clean → marts) - Design for failure recovery. If a step fails, how do you re-run it? From scratch or just that part? Learn how to backfill without breaking the world. level 3 Data quality and observability: - Add tests for nulls, duplicates, and business logic - Use tools like Great Expectations, Monte Carlo, or built-in dbt tests - Track lineage so you know what downstream will break if upstream changes Know the difference between: - a late-arriving dimension - a broken SCD2 - and a pipeline silently dropping rows At this level, you understand that reliability > cleverness. level 4 Build for scale and maintainability: - Version control your pipeline configs - Use feature flags to toggle behavior in prod - Push vs pull architecture - Decouple compute and storage (e.g. Iceberg and Delta Lake) - Data mesh, data contracts, streaming joins, and CDC are words you throw around because you know how and when to use them. What else belongs in the journey to mastering data pipelines?

16,688 Aufrufe • vor 1 Jahr

Building contextual pipelines for AI is the future! The stack you’re looking for: - compute: Spark or DuckDB - storage: S3 - catalog: Glue catalog - metadata: Apache Iceberg - vector DB: Milvus - foundational model: Claude Sonnet 4.5 - end-to-end AI workflow: n8n or OpenAI AgentKit If you want to build workflows and pipelines like this, join my October 20th AI boot camp!

Building contextual pipelines for AI is the future! The stack you’re looking for: - compute: Spark or DuckDB - storage: S3 - catalog: Glue catalog - metadata: Apache Iceberg - vector DB: Milvus - foundational model: Claude Sonnet 4.5 - end-to-end AI workflow: n8n or OpenAI AgentKit If you want to build workflows and pipelines like this, join my October 20th AI boot camp!

11,130 Aufrufe • vor 9 Monaten

Keine weiteren Inhalte verfügbar