Zach Wilson's banner
Zach Wilson's profile picture

Zach Wilson

@EcZachly50,736 subscribers

Founder @ https://t.co/CWvLDHU2Lx $150k/month | https://t.co/F5VqLpyMZn $5k/month | ADHD | 10 yrs big data experience | ex @meta, @netflix, and @airbnb

Videos

EcZachly's profile picture

SQL has levels to it: - level 1 SELECT, FROM, WHERE, GROUP BY, HAVING, LIMIT Master these basic keywords and you’ll be well on your way to mastering SQL. - level 2 Mastering JOINs: Most common JOINs: INNER and LEFT Less common JOINs: FULL OUTER Joins you should avoid almost always: RIGHT and CROSS JOIN Mastering common table expressions (CTEs). The WITH keyword defines a CTE which you can imagine as a “variable” that you can query later. Using variables like this you can master algorithm techniques like recursion, breadth first search and more! CTEs also make your SQL much more readable and make your coworkers hate you less compared to nested sub queries. - level 3 Mastering window functions Window functions have 3 pieces: The function (i.e. SUM, RANK, AVG) The over clause to start the window The window definition which has 3 pieces: - how to split the window up with PARTITION BY - how to order the window with ORDER BY - how to restrict the window size with ROWS clause (useful for rolling monthly averages) Understand RANK vs DENSE_RANK vs ROW_NUMBER, I have been asked this in interviews a million times. - level 4 You understand table scans, b-tree indexes, and partitioning schemes to increase performance. Doing something like COUNT(CASE WHEN) is much better than doing multiple queries with a UNION ALL. UNION ALL is terrible for all sorts of reasons that I don’t want to get into in this post. B-trees indexes allow for efficient scanning of data in the WHERE clause. Use explain plans to understand if an index is actually being used or not! Partitioning is similar to indexes except it’s a “poor mans” index. It just keeps data in specific folders and skips the folders that don’t include the data I question. What else did I miss for mastering SQL?

Zach Wilson

32,891 görüntüleme • 23 gün önce

EcZachly's profile picture

SQL has levels to it: - level 1 SELECT, FROM, WHERE, GROUP BY, HAVING, LIMIT Master these basic keywords and you’ll be well on your way to mastering SQL. - level 2 Mastering JOINs: Most common JOINs: INNER and LEFT Less common JOINs: FULL OUTER Joins you should avoid almost always: RIGHT and CROSS JOIN Mastering common table expressions (CTEs). The WITH keyword defines a CTE which you can imagine as a “variable” that you can query later. Using variables like this you can master algorithm techniques like recursion, breadth first search and more! CTEs also make your SQL much more readable and make your coworkers hate you less compared to nested sub queries. - level 3 Mastering window functions Window functions have 3 pieces: The function (i.e. SUM, RANK, AVG) The over clause to start the window The window definition which has 3 pieces: - how to split the window up with PARTITION BY - how to order the window with ORDER BY - how to restrict the window size with ROWS clause (useful for rolling monthly averages) Understand RANK vs DENSE_RANK vs ROW_NUMBER, I have been asked this in interviews a million times. - level 4 You understand table scans, b-tree indexes, and partitioning schemes to increase performance. Doing something like COUNT(CASE WHEN) is much better than doing multiple queries with a UNION ALL. UNION ALL is terrible for all sorts of reasons that I don’t want to get into in this post. B-trees indexes allow for efficient scanning of data in the WHERE clause. Use explain plans to understand if an index is actually being used or not! Partitioning is similar to indexes except it’s a “poor mans” index. It just keeps data in specific folders and skips the folders that don’t include the data I question. What else did I miss for mastering SQL?

Zach Wilson

79,419 görüntüleme • 11 ay önce

EcZachly's profile picture

Apache Spark has levels to it: - Level 0 You can run spark-shell or pyspark, it means you can start - Level 1 You understand the Spark execution model: •RDDs vs DataFrames vs Datasets •Transformations (map, filter, groupBy, join) vs Actions (collect, count, show) •Lazy execution & DAG (Directed Acyclic Graph) Master these concepts, and you’ll have a solid foundation - Level 2 Optimizing Spark Queries •Understand Catalyst Optimizer and how it rewrites queries for efficiency. •Master columnar storage and Parquet vs JSON vs CSV. •Use broadcast joins to avoid shuffle nightmares •Shuffle operations are expensive. Reduce them with partitioning and good data modeling •Coalesce vs Repartition—know when to use them. •Avoid UDFs unless absolutely necessary (they bypass Catalyst optimization). Level 3 Tuning for Performance at Scale •Master spark.sql.autoBroadcastJoinThreshold. •Understand how Task Parallelism works and set spark.sql.shuffle.partitions properly. •Skewed Data? Use adaptive execution! •Use EXPLAIN and queryExecution.debug to analyze execution plans. - Level 4 Deep Dive into Cluster Resource Management •Spark on YARN vs Kubernetes vs Standalone—know the tradeoffs. •Understand Executor vs Driver Memory—tune spark.executor.memory and spark.driver.memory. •Dynamic allocation (spark.dynamicAllocation.enabled=true) can save costs. •When to use RDDs over DataFrames (spoiler: almost never). What else did I miss for mastering Spark and distributed compute?

Zach Wilson

36,123 görüntüleme • 1 yıl önce

EcZachly's profile picture

Building Data Pipelines has levels to it: - level 0 Understand the basic flow: Extract → Transform → Load (ETL) or ELT This is the foundation. - Extract: Pull data from sources (APIs, DBs, files) - Transform: Clean, filter, join, or enrich the data - Load: Store into a warehouse or lake for analysis You’re not a data engineer until you’ve scheduled a job to pull CSVs off an SFTP server at 3AM! level 1 Master the tools: - Airflow for orchestration - dbt for transformations - Spark or PySpark for big data - Snowflake, BigQuery, Redshift for warehouses - Kafka or Kinesis for streaming Understand when to batch vs stream. Most companies think they need real-time data. They usually don’t. level 2 Handle complexity with modular design: - DAGs should be atomic, idempotent, and parameterized - Use task dependencies and sensors wisely - Break transformations into layers (staging → clean → marts) - Design for failure recovery. If a step fails, how do you re-run it? From scratch or just that part? Learn how to backfill without breaking the world. level 3 Data quality and observability: - Add tests for nulls, duplicates, and business logic - Use tools like Great Expectations, Monte Carlo, or built-in dbt tests - Track lineage so you know what downstream will break if upstream changes Know the difference between: - a late-arriving dimension - a broken SCD2 - and a pipeline silently dropping rows At this level, you understand that reliability > cleverness. level 4 Build for scale and maintainability: - Version control your pipeline configs - Use feature flags to toggle behavior in prod - Push vs pull architecture - Decouple compute and storage (e.g. Iceberg and Delta Lake) - Data mesh, data contracts, streaming joins, and CDC are words you throw around because you know how and when to use them. What else belongs in the journey to mastering data pipelines?

Zach Wilson

16,688 görüntüleme • 1 yıl önce

Daha fazla içerik yok.