
My new app is out! ✨ The Common Crawl Pipeline Creator ✨

Create your pipeline easily:
✔ Run Text Extraction ✂️
✔ Define Language Filters 🌐
✔ Customize text quality 💯
✔ See Live Results 👀
✔ Get Python code 🐍

Based on famous LLM research like Gopher, C4 or FineWeb

Quentin Lhoest 🤗

4,759 subscribers

14,995 views • 1 year ago • via X (Twitter)

Science & Technology, Education


7 comments

Quentin Lhoest 🤗 • 1 year ago

Keep an eye on the data you are filtering out: you don't want to discard interesting, high-quality and diverse data! You can also check how much data is filtered out at each step.
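A minimal sketch of what "keeping an eye on filtered-out data" can look like in plain Python. The two filters, their thresholds, and the sample documents are hypothetical and chosen only for illustration; they are not the app's or `datatrove`'s actual filters.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical quality filters, for illustration only.
def has_min_words(text: str, min_words: int = 5) -> bool:
    """Keep documents with at least `min_words` whitespace-separated words."""
    return len(text.split()) >= min_words

def is_not_repetitive(text: str, min_unique_ratio: float = 0.5) -> bool:
    """Keep documents whose share of unique words is high enough."""
    words = text.lower().split()
    return bool(words) and len(set(words)) / len(words) >= min_unique_ratio

def run_filters(
    docs: List[str], steps: List[Tuple[str, Callable[[str], bool]]]
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Apply filters in order and remember what each step dropped."""
    kept = list(docs)
    dropped: Dict[str, List[str]] = {}
    for name, keep in steps:
        dropped[name] = [doc for doc in kept if not keep(doc)]
        kept = [doc for doc in kept if keep(doc)]
    return kept, dropped

DOCS = [
    "Common Crawl contains web pages collected over many years.",
    "Buy now",
    "click click click click click click here",
    "A short haiku about autumn rain",
]
STEPS = [("min_words", has_min_words), ("repetition", is_not_repetitive)]

kept, dropped = run_filters(DOCS, STEPS)
for name, docs in dropped.items():
    # Inspect the dropped texts, not just the counts.
    print(f"{name}: dropped {len(docs)} -> {docs}")
print(f"kept {len(kept)} of {len(DOCS)} documents")
```

Printing the dropped texts alongside the counts makes it easier to notice when a threshold removes documents that were worth keeping.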

Quentin Lhoest 🤗 • 1 year ago

The Python code is based on the `datatrove` library that was used to build the FineWeb dataset, which has a text quality so good it makes training LLMs faster!

Quentin Lhoest 🤗 • 1 year ago

Try the app yourself!

Sinclair Wang • 1 year ago

@qlhoest awesome app!

hannah • 1 year ago

@qlhoest this is so cool!

Giocobon • 1 year ago

@qlhoest Amazing job! What is the purpose of the line `partial(increment_num_warc_samples, num_warc_samples_per_doc=2000 / 1687)` in the app code? I mean, why 2000 / 1687?

Quentin Lhoest 🤗 • 1 year ago

I load the data from a cache of an intermediate step that already has 300+ examples filtered out.
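One plausible reading of the ratio, sketched below under stated assumptions: if the cache holds 1,687 documents that survived from 2,000 original WARC samples (2000 - 1687 = 313, matching the "300+" above), then counting 2000 / 1687 samples per cached document recovers the original total. Only the function name and the keyword argument come from the question; the function body and the `stats` dictionary are hypothetical.

```python
from functools import partial

# Hypothetical body: only the name and keyword argument appear in the thread.
def increment_num_warc_samples(stats: dict, num_warc_samples_per_doc: float = 1.0) -> dict:
    """Add the number of WARC samples that one loaded document stands for."""
    stats["num_warc_samples"] = stats.get("num_warc_samples", 0.0) + num_warc_samples_per_doc
    return stats

# Assumption: 1,687 cached documents represent 2,000 original WARC samples.
increment = partial(increment_num_warc_samples, num_warc_samples_per_doc=2000 / 1687)

stats: dict = {}
for _ in range(1687):  # one call per document read from the cache
    increment(stats)

estimated_total = round(stats["num_warc_samples"])
already_filtered = 2000 - 1687
print(estimated_total, already_filtered)
```

With this scaling, any "share of data filtered out" figure is computed against the original sample count rather than the smaller cached set.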
