Today we introduce T-Free, a new paradigm in language processing. Tokenization is one of the core building blocks of large language models (LLMs), transforming natural language into numeric representations for further processing. (1/3) 🔗 #writtenbyalephalpha

18,120 views • 1 year ago • via X (Twitter)

Comments: 2

Aleph Alpha • 1 year ago

Our innovation, T-Free, offers a novel approach to tokenization, boosting tokenizer fertility across various languages, and reducing the size of the embedding layer by up to 75% compared to traditional tokenizers. Early experiments with T-Free show promising results and could unlock new possibilities in LLMs, including:
- Up to 50% reduction in training and inference costs
- Improved semantic encoding of language
- Enhanced performance in multilingual models
(2/3)
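
To make the two quantities in this post concrete, here is a minimal Python sketch of the arithmetic behind them. It does not model how T-Free works. It assumes only the usual definitions: an embedding matrix holds vocab_size × hidden_dim weights, and fertility is the average number of tokens a tokenizer emits per word. The function names and all sizes are hypothetical, chosen for illustration, and are not taken from the post or the paper.

```python
def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Number of weights in an embedding matrix of shape (vocab_size, hidden_dim)."""
    return vocab_size * hidden_dim


def fertility(tokens_per_word: list[int]) -> float:
    """Average number of tokens emitted per word (the usual fertility metric)."""
    return sum(tokens_per_word) / len(tokens_per_word)


# Hypothetical sizes for illustration only (not figures from the post or paper).
baseline = embedding_params(vocab_size=64_000, hidden_dim=4_096)
smaller = embedding_params(vocab_size=16_000, hidden_dim=4_096)

# Relative saving in embedding-layer weights between the two configurations.
reduction = 1 - smaller / baseline
print(f"embedding-layer reduction: {reduction:.0%}")

# Three words split into 1, 2 and 3 tokens respectively.
print(f"fertility: {fertility([1, 2, 3])}")
```

With these made-up sizes, a table a quarter as tall gives a 75% saving in embedding weights, which shows the scale of the "up to 75%" figure quoted above.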

Aleph Alpha • 1 year ago

Read our full paper here:
Dive into the source code of T-Free:
Try out our interim research model checkpoints:
(3/3)

Related videos

Today, we're joined by Julie Kallini ✨, PhD student at Stanford NLP Group, to discuss her recent papers, “MrT5: Dynamic Token Merging for Efficient Byte-level Language Models” and “Mission: Impossible Language Models.”

For the MrT5 paper, we explore the importance and failings of tokenization in large language models, including inefficient compression rates for under-resourced languages, and dig into byte-level modeling as an alternative. We discuss the architecture of MrT5, its ability to learn language-specific compression rates, its results on multilingual benchmarks and character-level manipulation tasks, and its inference efficiency.

For the “Mission: Impossible Language Models” paper, we review the core idea behind the research, the definition and creation of impossible languages, the creation of impossible language training datasets, and explore the bias of language model architectures towards natural language.

🎧 / 🎥 Listen or watch the full episode on our page:

📖 CHAPTERS
===============================
00:00 - Introduction
04:28 - Issues of tokenization for LLMs
11:26 - Sub-word tokenization versus byte-level tokenization
16:28 - Inefficiencies of byte T5
17:08 - MrT5 architecture
22:05 - Language-specific compression rate
24:10 - Benchmarks
27:15 - Inference efficiency
28:50 - Applying MrT5 to other decoder models
31:15 - Future directions of MrT5
33:51 - Mission: Impossible Language Models paper
39:59 - Languages tested
45:13 - Language architectures biased toward natural languages vs impossible languages
48:19 - Future directions for Mission Impossible

The TWIML AI Podcast

11,758 views • 1 year ago