We built Talos - a full CNN inference engine running directly on silicon. Every multiply, buffer, and data path lives as real digital logic on the FPGA. This is what deep learning looks like when the model becomes hardware👇

92,200 views • 3 months ago • via X (Twitter)

Similar Videos

The Machine That Learns The Law Behind The Data

A very, very interesting US patent, US10963540B2 (Physics Informed Learning Machine), describes a learning system that does not begin with data alone. It begins with a physical model, usually written as a differential equation (ODE or PDE), for example dx/dt = f(x,t).

A normal machine learning model sees scattered data and tries to fit it. A physics-informed learning machine starts with a law. Then it treats the data as evidence that updates what the model believes about the physical system.

For this application, I use the patent idea on NASA C-MAPSS turbofan engine data. The machine watches multivariate telemetry from a degrading engine and infers a hidden health state that is not measured directly. From that posterior belief, it estimates the engine’s remaining useful life (RUL).

In the main 3D scene, the engine lifetime is turned into a tunnel. The spiral ribbons are real sensor channels evolving over cycle-time. The glowing core is the inferred health state. The surrounding cloud is uncertainty. The orange wall ahead is the predicted failure horizon. So the big picture is: sensor evidence comes in, posterior belief tightens, and the machine moves from uncertainty toward a concrete failure prediction.

The inset posteriors make that explicit. The health posterior shows where the model believes the hidden engine condition sits at the current moment, and how sharply it believes it. The RUL posterior shows the same idea for remaining life: early on it is broad, later it shifts left and narrows as the machine becomes more certain about how close failure is.

This idea is not limited to engines. It can apply to data centers, CPUs, GPUs, cooling systems, power grids, robotics, batteries, and any machine that produces telemetry while obeying physical constraints. In an age where machine learning runs on massive hardware infrastructure, this kind of model matters: it can turn noisy sensor streams into early warnings before expensive systems fail.

Mathelirium

17,696 views • 1 month ago
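
The loop described in the post above (start from a law, then let sensor evidence tighten a posterior over a hidden health state) can be illustrated with a generic grid-based Bayes filter. This is a minimal sketch, not the method of patent US10963540B2 and not the author's implementation. The constant-wear law dh/dt = -wear_rate, the Gaussian noise models, the failure level, and every function and parameter name are assumptions chosen for illustration.

```python
import math


def gaussian(x, mean, std):
    """Normal probability density."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def normalize(weights):
    """Rescale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def predict(grid, belief, wear_rate, process_std):
    """Physics step: advance one cycle under the ASSUMED law dh/dt = -wear_rate."""
    return normalize([
        sum(b * gaussian(h_new, h_old - wear_rate, process_std)
            for h_old, b in zip(grid, belief))
        for h_new in grid
    ])


def update(grid, belief, reading, sensor_std):
    """Evidence step: reweight each candidate health value by the sensor likelihood."""
    return normalize([b * gaussian(reading, h, sensor_std) for h, b in zip(grid, belief)])


def summarize(grid, belief):
    """Posterior mean and standard deviation of the hidden health state."""
    mean = sum(h * b for h, b in zip(grid, belief))
    var = sum(b * (h - mean) ** 2 for h, b in zip(grid, belief))
    return mean, math.sqrt(var)


def run_filter(readings, wear_rate=0.01, process_std=0.01, sensor_std=0.05,
               failure_level=0.3, points=101):
    """Return the (mean, std) history and a point estimate of remaining cycles.

    All defaults are hypothetical values, not taken from C-MAPSS or the patent.
    """
    grid = [i / (points - 1) for i in range(points)]  # health from 0 (failed) to 1 (new)
    belief = [1.0 / points] * points                  # flat prior: no opinion yet
    history = [summarize(grid, belief)]
    for reading in readings:
        belief = predict(grid, belief, wear_rate, process_std)
        belief = update(grid, belief, reading, sensor_std)
        history.append(summarize(grid, belief))
    mean, _ = history[-1]
    rul = max(0.0, (mean - failure_level) / wear_rate)
    return history, rul


if __name__ == "__main__":
    # Synthetic readings: health falling by 0.01 per cycle plus a small alternating error.
    readings = [1.0 - 0.01 * t + (0.02 if t % 2 else -0.02) for t in range(1, 31)]
    history, rul = run_filter(readings)
    print("prior (mean, std):", history[0])
    print("posterior (mean, std):", history[-1])
    print("estimated remaining cycles:", rul)
```

The posterior standard deviation shrinks as readings arrive, which corresponds to the narrowing health and RUL posteriors the post describes. A real system would replace the constant-wear law with the physical model of the machine being monitored.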

After 8+ years on the Tesla Autopilot team and 3 years at Intel, I started Apex Compute to design a new architecture for efficient AI inference. For the past 9 months, we’ve been building our custom inference accelerator. Today we’re releasing Unified Engine v1.

Last June we raised our seed round with Maxitech, DeepFin Research, Soma Capital and an incredible group of angel investors. In less than 9 months, we completed our RTL architecture and brought our first pre-silicon prototype to life on FPGA.

Our architecture combines systolic array and vector processing in a single compute engine with multiple architectural optimizations, achieving very high FLOPs utilization. A single engine is super lean: it uses less than 90K LUTs and 1 MB of Block RAM. It may also be one of the smallest logic-footprint compute engines developed so far.

Our Unified Engine v1 supports:
- matrix-matrix multiplication (~95% FLOPs utilization)
- softmax (~90% FLOPs utilization)
- broadcast and element-wise operations
- RMSNorm / LayerNorm
- block quantization/dequantization (fp4, int4)
- multi-engine synchronization
and many other operations.

We even implemented memory-efficient attention similar to FlashAttention, reaching ~90% FLOPs utilization. Full benchmarks and the software stack are available on our GitHub:

We have a basic compiler written in Python, and it supports PyTorch tensors directly, so it is easy to test and transfer tensors between the accelerator and host using bf16, fp4 and int4 formats.

Our FPGA prototype can already run LLM inference and outperform NVIDIA Jetson Orin Nano, even on a mid-tier FPGA setup (6.4x lower memory bandwidth, 18% slower clock speed at 4.5 Watts). Check the side-by-side comparison video below.

Our GitHub includes low-level operator implementations, examples for tiled matrix multiplication, operation chaining, tensor parallelism, an attention kernel and a full Gemma 3 1B model implementation. Many more models (Vision Transformers and VLA) are coming soon.

Our accelerator IP is AXI-ready for deployment on any AMD (Xilinx) FPGA platform today. Even better, our two-engine prototype runs on an entry-level AMD (Xilinx) FPGA as a PCIe accelerator card. You can purchase it here for $50 to experiment with our pre-silicon prototype on your desktop PC or Raspberry Pi 5. We will be releasing hardware bitstream updates as the architecture gets new features.

More to come soon! We are expanding our team and looking for compiler engineers and floating-point hardware design engineers. If you're interested, please send me a DM.

Hasan

37,366 views • 3 months ago
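
The post above mentions memory-efficient attention "similar to FlashAttention". The core trick behind that family of kernels is a streaming (online) softmax: keys and values are consumed block by block while a running maximum, a running normalizer, and a running weighted sum are kept, so the full score vector never has to be stored. The sketch below shows that idea for a single query in plain Python and checks it against the naive computation. It is a generic illustration, not Apex Compute's kernel; the function names and the `block_size` parameter are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def naive_attention(query, keys, values):
    """Reference: compute all scores first, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * dot(query, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def streaming_attention(query, keys, values, block_size=2):
    """Same result, but only one block of scores is alive at any time."""
    scale = 1.0 / math.sqrt(len(query))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized output, relative to running_max

    for start in range(0, len(keys), block_size):
        key_block = keys[start:start + block_size]
        value_block = values[start:start + block_size]
        scores = [scale * dot(query, k) for k in key_block]

        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum to the new one.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for score, value in zip(scores, value_block):
            weight = math.exp(score - new_max)
            running_sum += weight
            acc = [a + weight * v for a, v in zip(acc, value)]
        running_max = new_max

    return [a / running_sum for a in acc]


if __name__ == "__main__":
    query = [0.5, -1.0, 2.0]
    keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5],
            [-1.0, 0.5, 2.0], [0.2, 0.2, 0.2]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(naive_attention(query, keys, values))
    print(streaming_attention(query, keys, values, block_size=2))
```

On an accelerator the same bookkeeping lets each block of keys and values stay in on-chip memory while it is processed, which is why this style of kernel is attractive when memory bandwidth is the limiting resource.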