Building Data Pipelines has levels to it:

- level 0
Understand the basic flow: Extract → Transform → Load (ETL) or ELT. This is the foundation.
- Extract: Pull data from sources (APIs, DBs, files)
- Transform: Clean, filter, join, or enrich the data
- Load: Store into a... warehouse or lake for analysis
You’re not a data engineer until you’ve scheduled a job to pull CSVs off an SFTP server at 3AM!

- level 1
Master the tools:
- Airflow for orchestration
- dbt for transformations
- Spark or PySpark for big data
- Snowflake, BigQuery, Redshift for warehouses
- Kafka or Kinesis for streaming
Understand when to batch vs stream. Most companies think they need real-time data. They usually don’t.

- level 2
Handle complexity with modular design:
- DAGs should be atomic, idempotent, and parameterized
- Use task dependencies and sensors wisely
- Break transformations into layers (staging → clean → marts)
- Design for failure recovery. If a step fails, how do you re-run it? From scratch or just that part?
Learn how to backfill without breaking the world.

- level 3
Data quality and observability:
- Add tests for nulls, duplicates, and business logic
- Use tools like Great Expectations, Monte Carlo, or built-in dbt tests
- Track lineage so you know what downstream will break if upstream changes
Know the difference between:
- a late-arriving dimension
- a broken SCD2
- and a pipeline silently dropping rows
At this level, you understand that reliability > cleverness.

- level 4
Build for scale and maintainability:
- Version control your pipeline configs
- Use feature flags to toggle behavior in prod
- Push vs pull architecture
- Decouple compute and storage (e.g. Iceberg and Delta Lake)
- Data mesh, data contracts, streaming joins, and CDC are words you throw around because you know how and when to use them.

What else belongs in the journey to mastering data pipelines?
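To make the level 0 flow and the level 2 "idempotent and parameterized" idea concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module as a stand-in warehouse. The `orders` table, the sample rows, and the function names are assumptions made for illustration; they are not from the post. The load step replaces the whole partition for a given `run_date`, so re-running or backfilling a day does not duplicate rows.

```python
import sqlite3

def extract(run_date):
    # Hypothetical source data; a real job would pull from an API, DB, or file.
    return [
        {"order_id": 1, "amount": "10.50", "day": run_date},
        {"order_id": 2, "amount": None, "day": run_date},  # bad row: missing amount
        {"order_id": 3, "amount": "4.25", "day": run_date},
    ]

def transform(rows):
    # Clean: drop rows with a missing amount and cast the rest to float.
    return [
        (r["order_id"], float(r["amount"]), r["day"])
        for r in rows
        if r["amount"] is not None
    ]

def load(conn, run_date, rows):
    # Idempotent load: delete the partition for run_date, then insert,
    # all inside one transaction (committed or rolled back together).
    with conn:
        conn.execute("DELETE FROM orders WHERE day = ?", (run_date,))
        conn.executemany(
            "INSERT INTO orders (order_id, amount, day) VALUES (?, ?, ?)", rows
        )

def run_pipeline(conn, run_date):
    # Parameterized by run_date so any single day can be re-run or backfilled.
    load(conn, run_date, transform(extract(run_date)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, day TEXT)")

run_pipeline(conn, "2024-01-01")
run_pipeline(conn, "2024-01-01")  # re-run of the same day: no duplicates

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)
```

Delete-then-insert per partition is only one way to get idempotency; a merge/upsert keyed on `order_id` would be another, depending on what the target system supports.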

Zach Wilson

49,778 subscribers

16,688 views • 1 year ago • via X (Twitter)

Science & Technology, Education

Comments: 9

Zach Morris Wilson • 1 year ago

Subscribe to for more insights!

HUDI • 1 year ago

🚀 Introducing DATA DATA DATA: The Official HUDI Podcast! 🔥 💡 The podcast where data becomes power. We dive into Web3, crypto, data ownership, and the future. Engage, learn, and be part of the revolution. 🌍🔗 Knowledge is power—join the journey. #DataDataData #HUDI #Web3 #Podcast #DataOwnership #BigTech #Privacy #Crypto #AI #DigitalRevolution

A. Ibrahim, PhD • 1 year ago

Great insights Zach. Do you cover all these in any of your bootcamps? If yes link please. Met you last year in San Francisco btw. I am a fan. Thanks.

Mitch Call M • 1 year ago

Classic, these are big value posts Zach

Ulises FV • 1 year ago

Data

N • 1 year ago

Lovely summary!

Mamadou Tafsir Diallo • 1 year ago

Data

Quophi Amakye • 1 year ago

Data

The South Indian • 1 year ago

Data

Related videos

Apache Spark has levels to it:

- Level 0
You can run spark-shell or pyspark, it means you can start

- Level 1
You understand the Spark execution model:
•RDDs vs DataFrames vs Datasets
•Transformations (map, filter, groupBy, join) vs Actions (collect, count, show)
•Lazy execution & DAG (Directed Acyclic Graph)
Master these concepts, and you’ll have a solid foundation

- Level 2
Optimizing Spark Queries
•Understand Catalyst Optimizer and how it rewrites queries for efficiency.
•Master columnar storage and Parquet vs JSON vs CSV.
•Use broadcast joins to avoid shuffle nightmares
•Shuffle operations are expensive. Reduce them with partitioning and good data modeling
•Coalesce vs Repartition: know when to use them.
•Avoid UDFs unless absolutely necessary (they bypass Catalyst optimization).

- Level 3
Tuning for Performance at Scale
•Master spark.sql.autoBroadcastJoinThreshold.
•Understand how Task Parallelism works and set spark.sql.shuffle.partitions properly.
•Skewed Data? Use adaptive execution!
•Use EXPLAIN and queryExecution.debug to analyze execution plans.

- Level 4
Deep Dive into Cluster Resource Management
•Spark on YARN vs Kubernetes vs Standalone: know the tradeoffs.
•Understand Executor vs Driver Memory: tune spark.executor.memory and spark.driver.memory.
•Dynamic allocation (spark.dynamicAllocation.enabled=true) can save costs.
•When to use RDDs over DataFrames (spoiler: almost never).

What else did I miss for mastering Spark and distributed compute?
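The "broadcast joins avoid shuffles" point can be illustrated without a cluster. The sketch below is plain Python, not the Spark API: it only mimics the idea that a small table is copied to every partition as a hash map, so each partition of the large table joins locally and no rows move between partitions. The function name `broadcast_join` and the sample `orders` and `countries` data are hypothetical.

```python
def broadcast_join(large_partitions, small_rows):
    # Build the lookup once; conceptually this is the copy that gets
    # "broadcast" to every worker holding a partition of the large table.
    lookup = dict(small_rows)

    joined = []
    for partition in large_partitions:
        # Inner join done locally per partition: rows never leave their
        # partition, which is what avoids the shuffle.
        joined.append(
            [(key, value, lookup[key]) for key, value in partition if key in lookup]
        )
    return joined

# Large side: already split into two partitions of (country_code, amount).
orders = [
    [("us", 10), ("de", 7)],
    [("us", 3), ("fr", 5)],
]
# Small side: fits comfortably in memory on every worker.
countries = [("us", "United States"), ("de", "Germany")]

result = broadcast_join(orders, countries)
for partition in result:
    print(partition)
```

The trade-off is memory: this only works while the small side fits on every worker, which is the kind of limit a setting such as spark.sql.autoBroadcastJoinThreshold is meant to control.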

Zach Wilson

36,123 views • 1 year ago

In the next 2-3 years, the skills you need to build an ETL process will fundamentally differ from anything we've done in the last decade. The EXTRACT and LOAD steps of an ETL pipeline have been solved forever. They are boring, and I suspect they will continue to be boring. But the TRANSFORM step is where the money is at (and where innovation will continue happening!) An ETL pipeline that can use Large Language Models effectively to transform data is a whole new ballgame. I recorded a video to talk about this. And more importantly, I talk about how you can build ETL pipelines to transform UNSTRUCTURED data (the kind of data that's really hard to handle).

Santiago

56,412 views • 1 year ago

Enrollment is now open for the Data Engineering Professional Certificate! Data engineers are the architects of modern organizations, ensuring data is reliable, accessible, and ready for analytics and machine learning. This professional certificate is tailored to equip you with the critical skills, through frameworks and hands-on practice, to excel in this role. Taught by industry expert Joe Reis, co-author of the best-selling book "Fundamentals of Data Engineering," along with 17 guest instructors from the data field, you will gain expertise to start and further your career in the high-demand field of data engineering. Key focus areas: 🗂️ Data Engineering Lifecycle: Learn the important stages of building an efficient data pipeline that creates business value. 📥 Data Ingestion: Learn how to efficiently gather data from various sources. 💾 Data Storage: Master the techniques for storing data securely and cost-effectively. 🔄 Data Transformation: Understand how to clean, organize, and prepare data for analysis and machine learning. 🏗️ Data Architecture Design: Build robust architectures that support scalable, efficient data workflows. 📊 Serving Data: Ensure that data is available to stakeholders when and where they need it to drive business decisions. Enroll now!

DeepLearning.AI

20,833 views • 1 year ago

Traditional data pipelines don't work for RAG applications. There are 3 issues with them:
1. Traditional data engineering solutions are optimized to handle structured data. RAG applications rely primarily on unstructured data.
2. The connector ecosystem to load data from unstructured data sources is very immature.
3. Traditional solutions do not offer any way to transform unstructured data into an optimized vector search index.

The goal of a RAG Pipeline is to solve these problems. The number one objective is to create a reliable vector search index using factual knowledge and relevant context. This sounds easy, but it's one of the biggest challenges we face when building RAG applications.

At a high level, there are four different stages in the architecture of a RAG pipeline:
1. Ingestion: Where the pipeline loads the information from the data source.
2. Extraction: Where the pipeline processes the input data and decides how to retrieve the text contained inside it.
3. Transform: Where the pipeline chunks the data and generates document embeddings.
4. Load: Where the pipeline creates a search index in a vector database and loads the document embeddings.

There are different rabbit holes at each one of these stages. Here are three of them:
1. Ingesting data once is simple. The hard part is refreshing the vector database whenever the original data source changes.
2. Extracting the content of a plain text document is simple. The hard part is to extract content from complex documents containing tables, images, or cross-references.
3. Chunking at a fixed size with an overlap is simple. The hard part is to find the optimal strategy for your specific knowledge base and the way you are planning to query it.

In the attached video, I'll show you how you can build an enterprise-grade RAG Pipeline that solves every one of the above problems. I'll use Vectorize. They partnered with me on this post. You can use them to build RAG pipelines optimized for accurate context retrieval. If you have a few documents lying around, set up a free account and give it a try.
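The "simple" chunking strategy mentioned in the Transform stage can be sketched in a few lines. This is a minimal character-based version, assuming fixed-size chunks and a fixed overlap; the function name `chunk_text` and its default sizes are illustrative choices, not something the post or Vectorize prescribes. Real pipelines often chunk by tokens, sentences, or document structure instead.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    # Each chunk starts (chunk_size - overlap) characters after the previous
    # one, so consecutive chunks share `overlap` characters of context.
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if overlap < 0 or overlap >= chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a trailing chunk that
        # is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(chunks)
```

Each chunk would then be embedded and loaded into the vector index; that part depends on the embedding model and vector database in use, so it is left out here.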

Santiago

40,441 views • 1 year ago

Major program launch: Data Analytics Professional Certificate! This large, five-course sequence takes you all the way to being job-ready as a data analyst, and shows how to use Generative AI as a thought partner to enhance your work in this role. Offered by on Coursera, this is taught by Sean Barnes, Ph.D., a Data Science & Engineering Leader at Netflix.

Analyzing data remains one of the most important skills in where the world is going with AI. This comprehensive certificate takes you all the way to being job-ready. Each course comes with practical projects demonstrated in real-world contexts, such as analyzing sales data for a Korean bakery, video game sales trends across different regions, or identifying factors impacting customer retention for a communications company. You'll also work on estimating fire distribution for forest fire prevention, analyzing how a diamond's properties affect its market value, and developing predictive models for retail sales analysis, carbon emissions, and coral reef conservation.

Here's some of what you'll learn:
- How to define data and categorize it into its many types, such as discrete & continuous numerical, structured & unstructured, time series, and categorical, and know what insights can be derived from the different types of data categories.
- How to differentiate between data-related job roles and their responsibilities, and how data flows through an organization from the moment of capture to decision-making.
- How to perform data processing functions and apply conditional formatting in spreadsheets to extract business value from your data using statistical calculations and best practices for visualizing and interpreting data.
- How to use LLMs for stakeholder analysis, data exploration, and data visualization.
- Best practices for using LLMs as a thought partner in data analysis work.

By the end of this professional certificate program, you will have learned core statistical concepts, analysis techniques, and visualization methodologies that will serve as the foundation for working as a data analyst. The world needs more data analysts, especially ones who know how to use modern generative AI. With data science roles projected to grow 36% by 2033, the skills taught in this program create new professional opportunities in data. Sign up here!

Andrew Ng

84,686 views • 1 year ago

We just launched a major new Data Engineering Professional Certificate on Coursera! Data underlies all modern AI systems, and engineers who know how to build systems to store and serve it are in high demand. If you're interested in learning this skill, please check out this 4-course sequence, which is designed to make you job-ready to be a Data Engineer. This is a new specialization taught by Joe Reis, the co-author of the best-selling book “Fundamentals of Data Engineering," in collaboration with AWS. (Disclosure, I serve on Amazon's board.) For many AI systems, data engineering is 80% of the work, and modeling is 20%. But people’s attention on these two topics is often flipped. This makes the job of the data engineer particularly important. In this professional certificate, you'll learn foundational data engineering skills while implementing modern data architectures using open-source tools: - Learn the key steps of the data lifecycle, to generate, ingest, store, transform, and serve data. - Learn to align with organizational goals to design the data pipeline right for your business' needs. - Understand how to make necessary trade-offs between speed, scalability, security, and cost. Joe has distilled into this specialization decades of experience helping startups and large companies with data infrastructure. He is also joined by 17 other industry leaders in the data field, who will help you learn in-demand skills for the growing field of data engineering. Please sign up here:

Andrew Ng

118,937 views • 1 year ago

DataTransfer iOS app wip. Ver 1.0.0 Perfect for Anyone that wants to quickly, easily, and reliably pull out all media data from device and can use to restore on another iOS device. Made for iOS only. Mainly started this project because there’s no app I can reliably use without it crashing or too much $ for a simple data backup, plus I need something like this for iOS data recovery situations when customers come in with iOS device for data transfer but FMI on and they cannot remember the login. So this way I can backup data FAST and have multiple ways to transfer the backup files, (Air drop wired or wireless, iCloud, etc) Also, main focus for this app will be elegant UI and high reliability and speed for backups.

Mr. Creator - EuphoriaTools.com

20,698 views • 1 year ago

$BTC Statistical Study using Claude - A Beginner's Workflow

Here's an example of a z-score study on $BTC - still tinkering so don't take this as overtly useful information but the creation of a dashboard for the visualization of statistical data is phenomenal.

my current workflow:
> import $BTC time data - .csv file (can get this information from multiple venues - Binance is where I got mine)
> creating bins - if you have a larger data set - you can use recent data to prevent overt bias in the long direction or filter consolidation and trending regime data into separate bins for statistical analysis - however, you will have to define thresholds and determine what that entails.
> defining metrics in Claude that you want to use for statistical analysis e.g. for z-score what is it based on and what type of calculation? make sure you understand the calculations being performed for any metrics that you are doing a study for and modify them accordingly.
> prompting Claude to do a statistical analysis with specific instructions and then tell it to create visualization for this.

I've been messing around with this and I'm seriously impressed by the output.
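The post stresses understanding what a z-score is based on before asking a model to compute it. One possible definition, sketched below, is a rolling z-score: each closing price is compared to the mean and population standard deviation of the last `window` closes, itself included. The window length, the use of population rather than sample standard deviation, and the sample prices are all assumptions for illustration; the post does not say which calculation was used.

```python
from statistics import mean, pstdev

def rolling_zscores(values, window):
    # z = (latest value - window mean) / window standard deviation.
    # Positions without a full window, or with zero deviation, get None.
    scores = []
    for i in range(len(values)):
        if i + 1 < window:
            scores.append(None)
            continue
        sample = values[i + 1 - window : i + 1]
        sd = pstdev(sample)
        scores.append(None if sd == 0 else (values[i] - mean(sample)) / sd)
    return scores

# Made-up closing prices, not real market data.
closes = [100.0, 102.0, 101.0, 105.0, 110.0]
z = rolling_zscores(closes, window=3)
print(z)
```

Changing the window, switching to sample standard deviation, or computing the score on returns instead of prices all give different numbers, which is why the definition needs to be pinned down first.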

Stoic

19,757 views • 1 year ago

SQL has levels to it:

- level 1
SELECT, FROM, WHERE, GROUP BY, HAVING, LIMIT
Master these basic keywords and you’ll be well on your way to mastering SQL.

- level 2
Mastering JOINs:
Most common JOINs: INNER and LEFT
Less common JOINs: FULL OUTER
Joins you should avoid almost always: RIGHT and CROSS JOIN

Mastering common table expressions (CTEs). The WITH keyword defines a CTE which you can imagine as a “variable” that you can query later. Using variables like this you can master algorithm techniques like recursion, breadth first search and more! CTEs also make your SQL much more readable and make your coworkers hate you less compared to nested subqueries.

- level 3
Mastering window functions
Window functions have 3 pieces:
The function (e.g. SUM, RANK, AVG)
The OVER clause to start the window
The window definition which has 3 pieces:
- how to split the window up with PARTITION BY
- how to order the window with ORDER BY
- how to restrict the window size with the ROWS clause (useful for rolling monthly averages)
Understand RANK vs DENSE_RANK vs ROW_NUMBER. I have been asked this in interviews a million times.

- level 4
You understand table scans, b-tree indexes, and partitioning schemes to increase performance.
Doing something like COUNT(CASE WHEN) is much better than doing multiple queries with a UNION ALL. UNION ALL is terrible for all sorts of reasons that I don’t want to get into in this post.
B-tree indexes allow for efficient scanning of data in the WHERE clause. Use explain plans to understand if an index is actually being used or not!
Partitioning is similar to indexes except it’s a “poor man’s” index. It just keeps data in specific folders and skips the folders that don’t include the data in question.

What else did I miss for mastering SQL?
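The RANK vs DENSE_RANK vs ROW_NUMBER difference from level 3 is easiest to see on a table with a tie. The sketch below runs the query through Python's standard-library `sqlite3` module; it assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added. The `scores` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores (player, points) VALUES (?, ?)",
    [("a", 30), ("b", 30), ("c", 20), ("d", 10)],  # a and b are tied
)

rows = conn.execute(
    """
    SELECT player,
           RANK()       OVER (ORDER BY points DESC)         AS rnk,
           DENSE_RANK() OVER (ORDER BY points DESC)         AS dense_rnk,
           ROW_NUMBER() OVER (ORDER BY points DESC, player) AS row_num
    FROM scores
    ORDER BY row_num
    """
).fetchall()

for row in rows:
    print(row)
# ('a', 1, 1, 1)
# ('b', 1, 1, 2)
# ('c', 3, 2, 3)
# ('d', 4, 3, 4)
```

RANK leaves a gap after the tie (1, 1, 3), DENSE_RANK does not (1, 1, 2), and ROW_NUMBER always numbers rows uniquely; the extra `player` sort key is there only to make the tie-break deterministic.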

Zach Wilson

79,538 views • 1 year ago

SQL has levels to it:

- level 1
SELECT, FROM, WHERE, GROUP BY, HAVING, LIMIT
Master these basic keywords and you’ll be well on your way to mastering SQL.

- level 2
Mastering JOINs:
Most common JOINs: INNER and LEFT
Less common JOINs: FULL OUTER
Joins you should avoid almost always: RIGHT and CROSS JOIN

Mastering common table expressions (CTEs). The WITH keyword defines a CTE which you can imagine as a “variable” that you can query later. Using variables like this you can master algorithm techniques like recursion, breadth first search and more! CTEs also make your SQL much more readable and make your coworkers hate you less compared to nested subqueries.

- level 3
Mastering window functions
Window functions have 3 pieces:
The function (e.g. SUM, RANK, AVG)
The OVER clause to start the window
The window definition which has 3 pieces:
- how to split the window up with PARTITION BY
- how to order the window with ORDER BY
- how to restrict the window size with the ROWS clause (useful for rolling monthly averages)
Understand RANK vs DENSE_RANK vs ROW_NUMBER. I have been asked this in interviews a million times.

- level 4
You understand table scans, b-tree indexes, and partitioning schemes to increase performance.
Doing something like COUNT(CASE WHEN) is much better than doing multiple queries with a UNION ALL. UNION ALL is terrible for all sorts of reasons that I don’t want to get into in this post.
B-tree indexes allow for efficient scanning of data in the WHERE clause. Use explain plans to understand if an index is actually being used or not!
Partitioning is similar to indexes except it’s a “poor man’s” index. It just keeps data in specific folders and skips the folders that don’t include the data in question.

What else did I miss for mastering SQL?
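The level 4 remark about COUNT(CASE WHEN) refers to conditional aggregation: computing several filtered counts in one pass over the table instead of one query per filter stitched together with UNION ALL. A minimal sketch, again through Python's `sqlite3` module, with a made-up `events` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, kind TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, kind) VALUES (?, ?)",
    [(1, "click"), (1, "view"), (2, "view"), (2, "view"), (3, "click")],
)

# One scan of the table produces both counts. CASE yields NULL when the
# condition is false, and COUNT ignores NULLs.
clicks, views = conn.execute(
    """
    SELECT COUNT(CASE WHEN kind = 'click' THEN 1 END) AS clicks,
           COUNT(CASE WHEN kind = 'view'  THEN 1 END) AS views
    FROM events
    """
).fetchone()

print(clicks, views)
```

The equivalent UNION ALL version would read the table once per branch, which is the cost the post is warning about.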

Zach Wilson

34,010 views • 2 months ago

Your agents can't keep up with real-time data. Especially when it's scattered across dozens of sources.

Most teams waste weeks building custom connectors for every database, API, and data warehouse. Then they build ETL pipelines to sync everything. By the time your agent retrieves the data, it's already outdated.

Picture this: Your Postgres database updated 5 minutes ago. Your MongoDB collection changed 2 minutes ago. Your agent is still pulling from yesterday's snapshot. This is why most production RAG systems fail.

There's a better approach: MindsDB is an open-source AI platform with a federated data engine that lets you query multiple data sources in real time using SQL - without moving any data.

Here's what makes it different:
↳ Your data stays in place. No ETL pipelines or data duplication
↳ Query Postgres, MongoDB, REST APIs, and more using consistent SQL
↳ JOIN across different sources in real time with a unified interface
↳ Works with both structured and unstructured data

And here's the best part: You don't even need to write SQL. Just describe what you want in plain English, and MindsDB converts it to SQL automatically. The system does all the heavy lifting.

The breakthrough for AI agents is simple: When data updates at the source, your agent gets fresh results immediately. No sync delays. No stale embeddings. No custom code for each integration. You can literally write a SQL query that joins a Postgres table with a MongoDB collection and gets live results. This is what production AI applications need but rarely get.

In this video, I give you a complete walkthrough of what we just discussed and how to actually do it. Make sure you watch this till the end. I've shared the link to MindsDB's GitHub repo in the next tweet!

Akshay 🚀

65,672 views • 8 months ago

Data teams spend weeks on simple requests. (This AI answers them in minutes.) Most data analysis is repetitive manual tasks. Data teams spend more time on setup than actual analysis. The workflow usually looks like this: → Run some exploratory data analysis in a local Jupyter notebook or environment → Pull data from multiple disconnected sources → Write code from scratch for every analysis → Export static charts that stakeholders can't explore (or wrestle with legacy BI to create a dashboard) → Manually send updates via email or Slack when data changes → Start over for each new request Most teams accept this as "how data analysis works." While business decisions wait for insights. That's where Fabi changes the entire approach. It's a powerful, AI-native platform built for teams that want to boost productivity and supercharge their data workflows. Instead of working on separate tools and manual processes, you collaborate on analysis that automatically delivers insights where teams work. Here's what makes Fabi different: AI-Native Analysis Environment ↳ SQL and Python work together with AI assistance that handles coding and debugging automatically. Smart Automation Workflows ↳ Automatically send AI-powered reports and summaries right where business works in Slack, email, and spreadsheets. Universal Data Integration ↳ Analyze data from files, Google Sheets, Airtable, plus your data warehouse and databases in one place. Collaborative Data Apps ↳ Create interactive dashboards that stakeholders can explore and ask follow-up questions directly. What you can do with Fabi that legacy BI can't: ➟ Send AI-generated insights directly to Slack channels ➟ Automatically email data summaries to stakeholders ➟ Analyze uploaded files without complex ETL processes ➟ Collaborate on analysis like Google Docs for data ➟ Build workflows that push insights to spreadsheets Perfect for teams that want to move beyond the constraints of legacy and increase their impact. 
Teams using Fabi see immediate results: ✓ Insights delivered in minutes instead of days ✓ Reduced context switching between tools ✓ Stakeholders explore data independently ✓ Workflows automated to save hours of manual work From analysis to automated delivery - all in one AI-native environment. 📌 Try Fabi today: 👉 Follow Fabi.ai and marc for Fabi updates. 🔄 Repost to help other teams streamline data analysis #DataAnalysis #ModernBI #DataOps #InteractiveDashboards #FabiPartnership #SponsoredByFabi

Andrew Bolis

36,504 views • 10 months ago

Today, Box is announcing major new AI agent capabilities to let customers tap into the full value of their unstructured data. First, we’re announcing all new updates to the Box AI Studio to make it even easier to build AI agents that tap into your enterprise content for any job function, business process, or industry specific use case. We are also expanding our set of foundational agents that customers will be able to use to work with their enterprise content, including new features like search and research on unstructured data. Next, we’re announcing Box Extract to enable customers to use AI agents seamlessly for complex data extraction from any type of document or content. This makes it easier than ever to pull out data from contracts, invoices, research data, marketing assets, medical charts, and more. Finally, we’re introducing Box Automate, a new workflow automation solution within Box that lets you deploy AI agents across enterprise content-centric workflows. With Box Automate, you can design your business process in a simple drag and drop builder and then drop in AI agents at any step in the process. This ensures agents execute tasks at the right steps in a workflow every time. Best of all, our AI agents and workflow tools are designed to work across any system our customers work within, whether it’s leveraging pre-built integrations, Box APIs, or the new Box MCP Server. Ultimately, all of these capabilities come together to transform how companies can work with their enterprise content. Software has historically only been good at automating work that deals with structured data, which is why ERP, CRM, and HR systems have been mainstays of enterprise software for so long. The data in these systems fits neatly into a database, and the workflows are very ripe for automation. But it turns out most of the work in the world deals with unstructured data. 
It’s ideating through research documents, working with a client on contracts, reviewing details for a new product launch, looking at a patient’s healthcare record to make a diagnosis, working through due diligence documents for an M&A deal, and so on. For the first time ever, we can begin to bring all new insights and automation to this work with AI agents. At Box, we’re incredibly excited to be on this journey to help customers transform how they work with their most important data.

Aaron Levie

91,863 views • 10 months ago

How to extract data from papers for literature review in seconds?

1. Go to
2. Upload 2-3 papers you already know are relevant
3. Start by clicking on one paper
4. ResearchCollab does the following for the paper:
   ✦ Provides a short overview of the paper
   ✦ Identifies weaknesses in the paper
   ✦ Shares papers that contrast with the existing paper
   ✦ Extracts metadata about the paper
   ✦ Evaluates each part (e.g., methodology) of the paper
   ✦ Provides AI chat to extract any other data
5. Now click on the Related Papers button at top right
6. You will get papers related to the seed paper
7. You can click on each paper and read its abstract
8. If you find it relevant, just add it to your list of papers
9. Do the same for the other 2 seed papers
10. This way you will collect all relevant papers
11. And extract data from those papers

After this, analyse the data and report the findings.

Try ResearchCollab today:

Faheem Ullah

26,494 views • 5 months ago

If you could only learn one thing that will be relevant for the next 10-20 years, focus on learning how to deal with data. The future is not about faster hardware, smarter algorithms, or better ideas. The future is about DATA, and those who know how to deal with it will stay relevant much longer than anyone else. I recorded a video to show you how easy it is to get started. In the video, I'm using Kestra. For a long time, I was a fan of Airflow. Then, I moved to AWS Step Functions. Today, I only use Kestra. Kestra is open-source (repo link below) and kind enough to sponsor my work. The video will show you how easy it is to do the following: 1. Run Kestra locally (literally, one command) 2. Build a simple flow 3. Run Python scripts as part of your flow 4. Connect to Hugging Face models. If you have never built a data pipeline, open Kestra's Quick Start Guide and follow their examples. (I think it will take you one weekend to feel comfortable with the application and build the courage you need to get into more serious work.)

Santiago

51,012 views • 1 year ago

most people stick to just 1 or 2 tables they are familiar with when writing their SQL queries. in we built the ability to search for tables, preview tables and easily sample rows from a table so you can find the ones you need + even discover new tables. data discovery is a big problem even in small companies. most data doesn't get analyzed because most of the company doesn't know it exists

rahul

11,550 views • 11 months ago

No longer a ‘conspiracy theory’

Online retailers are accessing your data and then adjusting prices for items based on that data. Airlines are being caught adjusting prices based on data too.

“This holiday season, you and your neighbor could be shopping for the exact same item and pay different prices - surveillance pricing is and how your personal data plays a role. You may be shopping for gifts from the comfort of your couch, but stores can still see who you are. Most of these companies are charging you a different price based on some data they have about you. It could be your location, it could be your proximity to the store, it could be your shopping history, knowing that you bought a product previously and are willing to pay for it again.”

“Consumer watchdog president Jamie Court says surveillance pricing is nearly impossible to avoid. Court says his organization has evidence all the major retailers use personal algorithms to nudge a price up or down, targeting data from your Amazon cart, Uber rides, or even the brand of laptop you're using to shop.”

“For those traveling for the holidays: airlines will charge you a different price depending upon how many times you go to the website.”

“Currently, there is no federal regulation for companies that use surveillance pricing. We reached out to the FTC”

Wall Street Apes

518,986 views • 7 months ago

Introducing the New Silencio Map Explorer. Explore real-world noise levels across streets, hotels and restaurants. All powered by community-contributed data from over 1 million sensors. This isn't just a map; it's a window into community-powered data that allows you to: • Assess outdoor noise levels globally using Street, Hotel, and Restaurant data layers. • Identify potentially quieter areas for living or working. • Observe noise level trends within specific locations. We’re making decentralized data both accessible and actionable. And this is just the beginning. Coming soon: • AI-driven analysis for deeper environmental understanding. • Noise source identification capabilities. • Downloadable, detailed environmental reports for specific areas. This is what DePIN should feel like. No speculation. Just real-world tools powered by real people. Start exploring:

Silencio | Voice Data for AI

15,221 views • 1 year ago

🚨 BREAKING: Typeless as a tool is Now HIPAA & GDPR Compliant and giving everyday users total control over our data! Whether you're a professional or an everyday user of Typeless, you can now feel confident in the security and safety of #Typeless, which has successfully met the rigorous high level privacy & safety standards set in both the US and the EU. — HIPAA (Health Insurance Portability and Accountability Act) compliance guarantees that Typeless manages protected health information (PHI) with the utmost privacy and security standards required in the healthcare sector. — The General Data Protection Regulation (GDPR) is a robust data protection law that outlines how organizations should handle the collection, storage, and processing of personal data belonging to EU residents. Typeless' compliance underscores its dedication to safeguarding user privacy and data rights. The 3 Privacy Pillars: 1. Zero cloud data retention 2. Never trained on your data 3. On-device history storage 🔗 To learn more, visit the Typeless Trust Center at 👉 for detailed compliance policies.

Wesley

285,526 views • 4 months ago

Right now the main paradigm that we think of agents in is chatting back and forth, but the biggest use of tokens will come from agents that are just always on, running in the background doing work for us, or ones triggered from a workflow. Agents will be working 24/7 in our workflows processing data, reviewing and generating documents, moving data between systems, writing code, accelerating decision-making steps, and more. In Claude's new Managed Agents feature, in a couple of minutes you can wire up an agent that can read contracts when they come into Box to review them, and then assign a task in Linear with the critical information from the contract. But this could have been any workflow, like reviewing documents for client onboarding, invoice processing, M&A due diligence, data extraction pipelines, and millions of other use cases. And integrating data across any system. This is only possible when you can have long-running agents that can complete real work in the background, accurately. Agents having the ability to execute code safely, leverage tools, access a compute sandbox, and connect across systems is clearly the architecture of the future. The industry is now making it easier and easier for enterprises to build and deploy these agents.

Aaron Levie

17,543 views • 3 months ago