Building Data Pipelines has levels to it:

- level 0
Understand the basic flow: Extract → Transform → Load (ETL) or ELT. This is the foundation.
  - Extract: Pull data from sources (APIs, DBs, files)
  - Transform: Clean, filter, join, or enrich the data
  - Load: Store into a...
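The Extract → Transform → Load flow from level 0 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post is cut off before it names a load target, so SQLite is used here purely as a stand-in, and the CSV data, table name, and function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for an API, database, or file source.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,alice,4.25
"""


def extract(raw: str) -> list[dict]:
    """Extract: read rows from the source (here, an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows: list[dict]) -> list[tuple]:
    """Transform: drop rows with a missing amount and cast column types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # filter out incomplete records
        cleaned.append((int(row["order_id"]), row["customer"], float(row["amount"])))
    return cleaned


def load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    """Load: write the cleaned rows into a target table (SQLite is an assumption)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(raw: str = RAW_CSV) -> sqlite3.Connection:
    """Run extract, transform, and load in order and return the target connection."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(raw)), conn)
    return conn


if __name__ == "__main__":
    conn = run_pipeline()
    print(conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer").fetchall())
```

In an ELT variant, the raw rows would be loaded first and the cleaning step would run inside the target system instead.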

16,688 views • 1 year ago • via X (Twitter)

9 Comments

Zach Morris Wilson • 1 year ago

Subscribe to for more insights!

HUDI • 1 year ago

🚀 Introducing DATA DATA DATA: The Official HUDI Podcast! 🔥 💡 The podcast where data becomes power. We dive into Web3, crypto, data ownership, and the future. Engage, learn, and be part of the revolution. 🌍🔗 Knowledge is power—join the journey. #DataDataData #HUDI #Web3 #Podcast #DataOwnership #BigTech #Privacy #Crypto #AI #DigitalRevolution

A. Ibrahim, PhD • 1 year ago

Great insights Zach. Do you cover all these in any of your bootcamps? If yes link please. Met you last year in San Francisco btw. I am a fan. Thanks.

Mitch Call M • 1 year ago

Classic, these are big value posts Zach

Ulises FV • 1 year ago

Data

N • 1 year ago

Lovely summary!

Mamadou Tafsir Diallo • 1 year ago

Data

Quophi Amakye • 1 year ago

Data

The South Indian • 1 year ago

Data

Similar Videos

Apache Spark has levels to it:

- Level 0
You can run spark-shell or pyspark, it means you can start

- Level 1
You understand the Spark execution model:
• RDDs vs DataFrames vs Datasets
• Transformations (map, filter, groupBy, join) vs Actions (collect, count, show)
• Lazy execution & DAG (Directed Acyclic Graph)
Master these concepts, and you'll have a solid foundation

- Level 2
Optimizing Spark Queries
• Understand Catalyst Optimizer and how it rewrites queries for efficiency.
• Master columnar storage and Parquet vs JSON vs CSV.
• Use broadcast joins to avoid shuffle nightmares
• Shuffle operations are expensive. Reduce them with partitioning and good data modeling
• Coalesce vs Repartition: know when to use them.
• Avoid UDFs unless absolutely necessary (they bypass Catalyst optimization).

- Level 3
Tuning for Performance at Scale
• Master spark.sql.autoBroadcastJoinThreshold.
• Understand how Task Parallelism works and set spark.sql.shuffle.partitions properly.
• Skewed Data? Use adaptive execution!
• Use EXPLAIN and queryExecution.debug to analyze execution plans.

- Level 4
Deep Dive into Cluster Resource Management
• Spark on YARN vs Kubernetes vs Standalone: know the tradeoffs.
• Understand Executor vs Driver Memory: tune spark.executor.memory and spark.driver.memory.
• Dynamic allocation (spark.dynamicAllocation.enabled=true) can save costs.
• When to use RDDs over DataFrames (spoiler: almost never).

What else did I miss for mastering Spark and distributed compute?
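The Level 1 distinction between transformations and actions is easier to see in a toy model. The sketch below is plain Python, not Spark's API: `LazyPipeline` and its methods are hypothetical names chosen to mirror the idea that transformations only record a plan, and nothing runs until an action such as `collect` or `count` is called.

```python
from typing import Callable, Iterable, Iterator


class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (an analogy, NOT Spark)."""

    def __init__(self, source: Iterable, steps: tuple = ()):
        self._source = source
        self._steps = steps  # the recorded plan; nothing has executed yet

    # "Transformations": return a new pipeline with one more recorded step.
    def map(self, fn: Callable) -> "LazyPipeline":
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyPipeline":
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def _run(self) -> Iterator:
        # Replay the recorded steps over the source, one element at a time.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # "Actions": these trigger execution of the whole recorded plan.
    def collect(self) -> list:
        return list(self._run())

    def count(self) -> int:
        return sum(1 for _ in self._run())


if __name__ == "__main__":
    calls = []

    def traced_double(x: int) -> int:
        calls.append(x)  # record when the function actually runs
        return x * 2

    pipeline = LazyPipeline(range(5)).map(traced_double).filter(lambda x: x > 4)
    print("calls before action:", calls)
    print("collected:", pipeline.collect())
    print("calls after action:", calls)
```

Each action replays the full plan from the source in this toy version, which is loosely why caching intermediate results matters when a dataset is reused.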

Zach Wilson

36,123 views • 1 year ago

Traditional data pipelines don't work for RAG applications. There are 3 issues with them:

1. Traditional data engineering solutions are optimized to handle structured data. RAG applications rely primarily on unstructured data.
2. The connector ecosystem to load data from unstructured data sources is very immature.
3. Traditional solutions do not offer any way to transform unstructured data into an optimized vector search index.

The goal of a RAG Pipeline is to solve these problems.

The number one objective is to create a reliable vector search index using factual knowledge and relevant context. This sounds easy, but it's one of the biggest challenges we face when building RAG applications.

At a high level, there are four different stages in the architecture of a RAG pipeline:

1. Ingestion: Here is where the pipeline loads the information from the data source.
2. Extraction: Where the pipeline processes the input data and decides how to retrieve the text contained inside it.
3. Transform: Where the pipeline chunks the data and generates document embeddings.
4. Load: Where the pipeline creates a search index in a vector database and loads the document embeddings.

There are different rabbit holes at each one of these stages. Here are three of them:

1. Ingesting data once is simple. The hard part is refreshing the vector database whenever the original data source changes.
2. Extracting the content of a plain text document is simple. The hard part is to extract content from complex documents containing tables, images, or cross-references.
3. A continuous chunking strategy with an overlap is simple. The hard part is to find the optimal strategy for your specific knowledge base and the way you are planning to query it.

In the attached video, I'll show you how you can build an enterprise-grade RAG Pipeline that solves every one of the above problems.

I'll use Vectorize. They partnered with me on this post. You can use them to build RAG pipelines optimized for accurate context retrieval.

If you have a few documents lying around, set up a free account and give it a try.
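The "simple" chunking strategy with an overlap mentioned above can be sketched as follows. This is a minimal illustration, assuming character-based windows; the post does not say whether chunks are measured in characters, tokens, or sentences, and the function name and default sizes are hypothetical. Embedding generation and the vector database load are left out because they depend on the specific tools used.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(repr(chunk))
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighboring chunk, at the cost of storing some text twice.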

Santiago

40,441 views • 1 year ago

Major program launch: Data Analytics Professional Certificate!

This large, five-course sequence takes you all the way to being job-ready as a data analyst, and shows how to use Generative AI as a thought partner to enhance your work in this role. Offered by on Coursera, this is taught by Sean Barnes, Ph.D., a Data Science & Engineering Leader at Netflix.

Analyzing data remains one of the most important skills in where the world is going with AI. This comprehensive certificate takes you all the way to being job-ready.

Each course comes with practical projects demonstrated in real-world contexts, such as analyzing sales data for a Korean bakery, video game sales trends across different regions, or identifying factors impacting customer retention for a communications company. You'll also work on estimating fire distribution for forest fire prevention, analyzing how a diamond's properties affect its market value, and developing predictive models for retail sales analysis, carbon emissions, and coral reef conservation.

Here's some of what you'll learn:
- How to define data and categorize it into its many types such as discrete & continuous numerical, structured & unstructured, time series, categorical, and know what insights can be derived from the different types of data categories.
- How to differentiate between data-related job roles and their responsibilities, and how data flows through an organization from the moment of capture to decision-making.
- How to perform data processing functions and apply conditional formatting in spreadsheets to extract business value from your data using statistical calculations and best practices for visualizing and interpreting data.
- How to use LLMs for stakeholder analysis, data exploration, and data visualization.
- Best practices for using LLMs as a thought partner in data analysis work

By the end of this professional certificate program, you will have learned core statistical concepts, analysis techniques, and visualization methodologies that will serve as the foundation for working as a data analyst.

The world needs more data analysts, especially ones who know how to use modern generative AI. With data science roles projected to grow 36% by 2033, the skills taught in this program create new professional opportunities in data.

Sign up here!

Andrew Ng

84,686 views • 1 year ago

We just launched a major new Data Engineering Professional Certificate on Coursera!

Data underlies all modern AI systems, and engineers who know how to build systems to store and serve it are in high demand. If you're interested in learning this skill, please check out this 4-course sequence, which is designed to make you job-ready to be a Data Engineer.

This is a new specialization taught by Joe Reis, the co-author of the best-selling book “Fundamentals of Data Engineering,” in collaboration with AWS. (Disclosure, I serve on Amazon's board.)

For many AI systems, data engineering is 80% of the work, and modeling is 20%. But people's attention on these two topics is often flipped. This makes the job of the data engineer particularly important.

In this professional certificate, you'll learn foundational data engineering skills while implementing modern data architectures using open-source tools:
- Learn the key steps of the data lifecycle, to generate, ingest, store, transform, and serve data.
- Learn to align with organizational goals to design the data pipeline right for your business' needs.
- Understand how to make necessary trade-offs between speed, scalability, security, and cost.

Joe has distilled into this specialization decades of experience helping startups and large companies with data infrastructure. He is also joined by 17 other industry leaders in the data field, who will help you learn in-demand skills for the growing field of data engineering.

Please sign up here:

Andrew Ng

118,937 views • 1 year ago

SQL has levels to it:

- level 1
SELECT, FROM, WHERE, GROUP BY, HAVING, LIMIT
Master these basic keywords and you'll be well on your way to mastering SQL.

- level 2
Mastering JOINs:
Most common JOINs: INNER and LEFT
Less common JOINs: FULL OUTER
Joins you should avoid almost always: RIGHT and CROSS JOIN

Mastering common table expressions (CTEs).
The WITH keyword defines a CTE which you can imagine as a “variable” that you can query later. Using variables like this you can master algorithm techniques like recursion, breadth first search and more! CTEs also make your SQL much more readable and make your coworkers hate you less compared to nested subqueries.

- level 3
Mastering window functions
Window functions have 3 pieces:
The function (e.g. SUM, RANK, AVG)
The OVER clause to start the window
The window definition, which has 3 pieces:
- how to split the window up with PARTITION BY
- how to order the window with ORDER BY
- how to restrict the window size with the ROWS clause (useful for rolling monthly averages)

Understand RANK vs DENSE_RANK vs ROW_NUMBER. I have been asked this in interviews a million times.

- level 4
You understand table scans, b-tree indexes, and partitioning schemes to increase performance.
Doing something like COUNT(CASE WHEN) is much better than doing multiple queries with a UNION ALL. UNION ALL is terrible for all sorts of reasons that I don't want to get into in this post.
B-tree indexes allow for efficient scanning of data in the WHERE clause. Use explain plans to understand if an index is actually being used or not!
Partitioning is similar to indexes except it's a “poor man's” index. It just keeps data in specific folders and skips the folders that don't include the data in question.

What else did I miss for mastering SQL?
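The RANK vs DENSE_RANK vs ROW_NUMBER distinction from level 3 is easiest to see on data with ties. The sketch below runs the three functions side by side through Python's built-in `sqlite3` module, assuming the bundled SQLite is version 3.25 or newer (the first release with window functions). The `scores` table and its rows are hypothetical.

```python
import sqlite3

QUERY = """
SELECT player, team, points,
       RANK()       OVER (PARTITION BY team ORDER BY points DESC)         AS rnk,
       DENSE_RANK() OVER (PARTITION BY team ORDER BY points DESC)         AS dense_rnk,
       ROW_NUMBER() OVER (PARTITION BY team ORDER BY points DESC, player) AS row_num
FROM scores
ORDER BY team, row_num
"""


def rank_scores() -> list[tuple]:
    """Build a small table with a tie and return all three rankings per row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (player TEXT, team TEXT, points INTEGER)")
    conn.executemany(
        "INSERT INTO scores VALUES (?, ?, ?)",
        [
            ("ann", "red", 30),
            ("bob", "red", 30),  # ties with ann on points
            ("cat", "red", 20),
            ("dan", "blue", 15),
            ("eve", "blue", 10),
        ],
    )
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in rank_scores():
        print(row)
```

On the tied rows, RANK leaves a gap after the tie (1, 1, 3), DENSE_RANK does not (1, 1, 2), and ROW_NUMBER always assigns distinct values (1, 2, 3), here with `player` as an explicit tie-breaker.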

Zach Wilson

79,419 views • 11 months ago

SQL has levels to it:

- level 1
SELECT, FROM, WHERE, GROUP BY, HAVING, LIMIT
Master these basic keywords and you'll be well on your way to mastering SQL.

- level 2
Mastering JOINs:
Most common JOINs: INNER and LEFT
Less common JOINs: FULL OUTER
Joins you should avoid almost always: RIGHT and CROSS JOIN

Mastering common table expressions (CTEs).
The WITH keyword defines a CTE which you can imagine as a “variable” that you can query later. Using variables like this you can master algorithm techniques like recursion, breadth first search and more! CTEs also make your SQL much more readable and make your coworkers hate you less compared to nested subqueries.

- level 3
Mastering window functions
Window functions have 3 pieces:
The function (e.g. SUM, RANK, AVG)
The OVER clause to start the window
The window definition, which has 3 pieces:
- how to split the window up with PARTITION BY
- how to order the window with ORDER BY
- how to restrict the window size with the ROWS clause (useful for rolling monthly averages)

Understand RANK vs DENSE_RANK vs ROW_NUMBER. I have been asked this in interviews a million times.

- level 4
You understand table scans, b-tree indexes, and partitioning schemes to increase performance.
Doing something like COUNT(CASE WHEN) is much better than doing multiple queries with a UNION ALL. UNION ALL is terrible for all sorts of reasons that I don't want to get into in this post.
B-tree indexes allow for efficient scanning of data in the WHERE clause. Use explain plans to understand if an index is actually being used or not!
Partitioning is similar to indexes except it's a “poor man's” index. It just keeps data in specific folders and skips the folders that don't include the data in question.

What else did I miss for mastering SQL?
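The level 4 point about COUNT(CASE WHEN) versus several queries glued together with UNION ALL can be sketched with Python's built-in `sqlite3` module. The `orders` table and its status values are hypothetical. Both queries return the same counts, but the conditional aggregation is written as a single pass over the table, while the UNION ALL form contains one filtered scan per branch.

```python
import sqlite3

# One pass: each COUNT only counts rows where its CASE yields a non-NULL value.
SINGLE_PASS = """
SELECT COUNT(CASE WHEN status = 'shipped'   THEN 1 END) AS shipped,
       COUNT(CASE WHEN status = 'pending'   THEN 1 END) AS pending,
       COUNT(CASE WHEN status = 'cancelled' THEN 1 END) AS cancelled
FROM orders
"""

# Same answer, but expressed as three separate filtered queries.
UNION_ALL = """
SELECT 'shipped' AS status, COUNT(*) AS n FROM orders WHERE status = 'shipped'
UNION ALL
SELECT 'pending', COUNT(*) FROM orders WHERE status = 'pending'
UNION ALL
SELECT 'cancelled', COUNT(*) FROM orders WHERE status = 'cancelled'
"""


def build_orders() -> sqlite3.Connection:
    """Create a small in-memory orders table with a few statuses."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO orders (status) VALUES (?)",
        [("shipped",), ("shipped",), ("pending",), ("cancelled",)],
    )
    return conn


if __name__ == "__main__":
    conn = build_orders()
    print(conn.execute(SINGLE_PASS).fetchone())
    print(conn.execute(UNION_ALL).fetchall())
```

How much the single-pass form actually saves depends on the engine and on available indexes, so checking the explain plan, as the post suggests, is the reliable way to compare them.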

Zach Wilson

32,891 views • 23 days ago

Your agents can't keep up with real-time data. Especially when it's scattered across dozens of sources.

Most teams waste weeks building custom connectors for every database, API, and data warehouse. Then they build ETL pipelines to sync everything. By the time your agent retrieves the data, it's already outdated.

Picture this: Your Postgres database updated 5 minutes ago. Your MongoDB collection changed 2 minutes ago. Your agent is still pulling from yesterday's snapshot. This is why most production RAG systems fail.

There's a better approach: MindsDB is an open-source AI platform with a federated data engine that lets you query multiple data sources in real-time using SQL, without moving any data.

Here's what makes it different:
↳ Your data stays in place. No ETL pipelines or data duplication
↳ Query Postgres, MongoDB, REST APIs, and more using consistent SQL
↳ JOIN across different sources in real-time with a unified interface
↳ Works with both structured and unstructured data

And here's the best part: You don't even need to write SQL. Just describe what you want in plain English, and MindsDB converts it to SQL automatically. The system does all the heavy lifting.

The breakthrough for AI agents is simple: When data updates at the source, your agent gets fresh results immediately. No sync delays. No stale embeddings. No custom code for each integration.

You can literally write a SQL query that joins a Postgres table with a MongoDB collection and gets live results. This is what production AI applications need but rarely get.

In this video, I give you a complete walkthrough of what we just discussed and how to actually do it. Make sure you watch this till the end.

I've shared the link to MindsDB's GitHub repo in the next tweet!
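The idea of one SQL query joining tables that live in separate databases can be illustrated with a small analogy. The sketch below does not use MindsDB or its syntax; it uses SQLite's `ATTACH DATABASE` through Python's built-in `sqlite3` module to join across two independent in-memory databases. The database alias, table names, and rows are hypothetical, and both databases live in the same process, so this shows only the query shape, not real federation across Postgres and MongoDB.

```python
import sqlite3


def federated_join() -> list[tuple]:
    """Join a table in one database with a table in a second, attached database."""
    conn = sqlite3.connect(":memory:")  # first "source" (schema name: main)
    conn.execute("ATTACH DATABASE ':memory:' AS events_db")  # second, separate database

    conn.execute("CREATE TABLE main.users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE events_db.events (user_id INTEGER, action TEXT)")
    conn.executemany("INSERT INTO main.users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
    conn.executemany(
        "INSERT INTO events_db.events VALUES (?, ?)",
        [(1, "login"), (1, "purchase"), (2, "login")],
    )

    # A single query spans both databases; no rows are copied between them first.
    return conn.execute(
        """
        SELECT u.name, COUNT(*) AS n_events
        FROM main.users AS u
        JOIN events_db.events AS e ON e.user_id = u.id
        GROUP BY u.name
        ORDER BY u.name
        """
    ).fetchall()


if __name__ == "__main__":
    print(federated_join())
```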

Akshay 🚀

65,672 views • 6 months ago