My new app is out!! ✨ The Common Crawl Pipeline Creator ✨

Create your pipeline easily:
✔ Run Text Extraction ✂️
✔ Define Language Filters 🌐
✔ Customize text quality 💯
✔ See Live Results 👀
✔ Get Python code 🐍

Based on famous LLM research like Gopher, C4 or FineWeb

14,995 views • 1 year ago • via X (Twitter)

7 comments

Quentin Lhoest 🤗 • 1 year ago

Keep an eye on the data you are filtering out: you don't want to discard interesting, high-quality, and diverse data! You can also check how much data is filtered out at each step.
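As a rough illustration of that advice, here is a minimal plain-Python sketch that keeps the documents each step drops so they can be inspected afterwards. It does not use the `datatrove` API; `run_filters`, the two toy filters, and the sample documents are all hypothetical and only stand in for real quality filters.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A filter is a (name, predicate) pair; the predicate returns True to keep a document.
Filter = Tuple[str, Callable[[str], bool]]


def run_filters(docs: Iterable[str], filters: List[Filter]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Apply filters in order and remember what each step removed."""
    kept = list(docs)
    dropped: Dict[str, List[str]] = {}
    for name, keep in filters:
        survivors, removed = [], []
        for doc in kept:
            (survivors if keep(doc) else removed).append(doc)
        dropped[name] = removed  # inspect these to see what is being discarded
        kept = survivors
    return kept, dropped


# Hypothetical sample documents and toy filters, for illustration only.
docs = [
    "A short but complete sentence about web crawling.",
    "buy buy buy buy buy",
    "ok",
    "Another readable paragraph that ends properly.",
]
filters: List[Filter] = [
    ("min_words", lambda text: len(text.split()) >= 3),
    ("ends_with_punctuation", lambda text: text.rstrip().endswith((".", "!", "?"))),
]

kept, dropped = run_filters(docs, filters)
for name, removed in dropped.items():
    print(f"{name}: dropped {len(removed)} document(s): {removed}")
print(f"kept {len(kept)} of {len(docs)} documents")
```

Reading through `dropped` per step shows both how much each filter removes and whether it is removing text worth keeping.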

Quentin Lhoest 🤗 • 1 year ago

The Python code is based on the `datatrove` library that was used to build the FineWeb dataset, whose text quality is so good it makes training LLMs faster!

Quentin Lhoest 🤗 • 1 year ago

Try the app yourself!

Sinclair Wang • 1 year ago

@qlhoest awesome app!

hannah • 1 year ago

@qlhoest this is so cool!

Giocobon • 1 year ago

@qlhoest Amazing job! What is the purpose of the code line `partial(increment_num_warc_samples, num_warc_samples_per_doc=2000 / 1687)` in the app code? I mean, why 2000 / 1687?

Quentin Lhoest 🤗 • 1 year ago

I load the data from a cache of an intermediate step that already has 300+ examples filtered out.
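One possible reading of this reply, sketched below: if 2000 WARC samples were read originally and 1687 documents remain in the cache, then 313 were already filtered out, and counting each cached document as 2000 / 1687 samples brings the running total back to about 2000. Both the meaning of the two numbers and the body of `increment_num_warc_samples` are assumptions here; the app's real function is not shown in the thread, so this is a hypothetical stand-in.

```python
from functools import partial

# Assumed meaning of the two numbers from the question and the reply.
ORIGINAL_WARC_SAMPLES = 2000  # samples read before the cached step (assumption)
CACHED_DOCS = 1687            # documents left in the cache (assumption)


def increment_num_warc_samples(stats: dict, num_warc_samples_per_doc: float = 1.0) -> dict:
    """Hypothetical stand-in: credit one cached document with its share of WARC samples."""
    stats["num_warc_samples"] = stats.get("num_warc_samples", 0.0) + num_warc_samples_per_doc
    return stats


# Same call shape as the line quoted in the question.
count_doc = partial(
    increment_num_warc_samples,
    num_warc_samples_per_doc=ORIGINAL_WARC_SAMPLES / CACHED_DOCS,
)

stats: dict = {}
for _ in range(CACHED_DOCS):
    count_doc(stats)

filtered_out = ORIGINAL_WARC_SAMPLES - CACHED_DOCS
estimated_total = round(stats["num_warc_samples"])
print(f"already filtered out before the cache: {filtered_out}")
print(f"estimated WARC samples after counting all cached docs: {estimated_total}")
```

Under these assumptions each cached document counts for roughly 1.19 samples, so statistics computed from the cache stay comparable to the original sample size.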
