Imagine if language models could tap into the app ecosystem of your iPhone. Would plugins and assistants become obsolete if we simply allowed a model to orchestrate our existing user interfaces, which have been made robust over many years? This demonstrates the extent to which GPT-4V excels as a Generalist...

30,819 views • 2 years ago • via X (Twitter)

10 comments

Francesco • 2 years ago

Over the last few months, I've been dabbling with using vision models not just in one area, but across web, desktop, and mobile platforms. It's become clear to me that there's a lot of untapped potential in these technologies. The closer we get them to our everyday gadgets, the better we can make use of what they have to offer. This shift could make our connection with AI feel more intuitive and seamless, moving away from a ChatGPT-esque interaction with AI assistants.

Francesco • 2 years ago

Finally got around to writing up my thoughts on UI-focused AI agents – it's not super deep, but it's filled with my takes and a bit of nerdy exploration. Slapped on my Medium hat for this one and dove right in.

Aditya P. Advani • 2 years ago

Consider joining, will be looking into remote control next

Rahul Janagouda • 2 years ago

I’ve been pondering a similar idea. As an Android engineer, I am working on using multimodal models to automate apps. A world where we interact by voice (through glasses, pins, or some kind of wearable) and use the phone only when we need to do some complex UI task is not far off.

Francesco • 2 years ago

Over the next couple of days I’ll be working on a series of posts on the glue that made all of this possible, and I'll be publishing the latest on my GH – if you really are curious, some of the latest changes are in the appium branch already!

Francesco • 2 years ago

Kudos to @Daniel1Paulus for the extensive iOS 17 work with go-ios

Francesco • 2 years ago

/cc @mreflow

Francesco • 2 years ago

/cc @karpathy

Francesco • 2 years ago

/cc @praeclarum

小韭菜👁️💎,🐦‍⬛🔑 • 2 years ago

@PublicAI_ #AI

Related videos

Blended-NeRF: Zero-Shot Object Generation and Blending in Existing Neural Radiance Fields paper page: Editing a local region or a specific object in a 3D scene represented by a NeRF is challenging, mainly due to the implicit nature of the scene representation. Consistently blending a new realistic object into the scene adds an additional level of difficulty. We present Blended-NeRF, a robust and flexible framework for editing a specific region of interest in an existing NeRF scene, based on text prompts or image patches, along with a 3D ROI box. Our method leverages a pretrained language-image model to steer the synthesis towards a user-provided text prompt or image patch, along with a 3D MLP model initialized on an existing NeRF scene to generate the object and blend it into a specified region in the original scene. We allow local editing by localizing a 3D ROI box in the input scene, and seamlessly blend the content synthesized inside the ROI with the existing scene using a novel volumetric blending technique. To obtain natural looking and view-consistent results, we leverage existing and new geometric priors and 3D augmentations for improving the visual fidelity of the final result. We test our framework both qualitatively and quantitatively on a variety of real 3D scenes and text prompts, demonstrating realistic multi-view consistent results with much flexibility and diversity compared to the baselines. Finally, we show the applicability of our framework for several 3D editing applications, including adding new objects to a scene, removing/replacing/altering existing objects, and texture conversion.

AK

62,768 views • 3 years ago

Cerebras inference is very fast. So fast that it changes how we think about configuring our LLMs for voice agent use cases.

Kimi K2.6 is a 1T parameter reasoning model that Cerebras serves at 650 - 1,000 tokens per second (end-to-end throughput), with time to first token metrics as low as 150ms (latency). These numbers are two to three times faster than other similarly capable models.

The biggest lever we get from this kind of speed is that we can use the model in reasoning mode, and still have excellent "time to first non-thinking token." This solves a big pain point we have in 2026 for voice agent use cases. Almost all recent innovation in post-training has focused on making models good at reasoning ("test time compute"). This is great, but it makes the user-facing model latency much, much slower, which is a problem for conversational voice agents.

We can run Kimi K2.6 with reasoning turned on, and get responses faster than other models produce with reasoning disabled. On my 30-turn voice agent benchmark, Kimi K2.6 with reasoning enabled ties GPT 5.1 and Haiku 4.5 with reasoning disabled, and is still about 200ms faster!

On my primary task agent benchmark, Kimi K2.6 is now the #2 model. It ranks just behind Gemini 3.5 Flash in "high" reasoning mode, and is tied with GLM 5, Sonnet 4.6, and GPT 5.4 with reasoning set to "low." But Kimi K2.6 completes each turn in the agent loop in under 500ms. The other four models are all at least 3x slower. (Models only qualify for this benchmark if they can complete task turns at a P50 <4s.)

A couple of other things that this speed buys us, for production voice agents:

- Tool calls happen fast enough that we don't have to work around tool call latency in our pipeline design.
- We can prompt the model to output structured data at the beginning of a response, followed by plain text for voice generation. This opens up possibilities like asking the model to do complex classification/generation tasks that influence the rest of the pipeline. For example, the model could create a detailed style prompt for a steerable TTS model, for each individual conversation turn.

And, of course, you can use Kimi K2.6 with reasoning turned off. Cerebras calls this "instant" mode.

Here's a video of a Cerebras Kimi K2.6 voice agent with voice-to-voice response time, measured at the client, under 500ms. This is the true response latency as perceived by the user, including all network and audio codec overhead, transcription and turn detection, Kimi K2.6 token generation, and voice generation. 500ms is, effectively, instant. So the Cerebras naming for this mode is apropos. :-)
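A rough sketch of the "structured data first, then plain text for voice generation" idea from the post, in Python. The response format (a single JSON object at the very start of the reply), the function name `split_structured_prefix`, and the `tts_style` key are all assumptions made for illustration; the post does not say how the actual pipeline encodes or parses this data.

```python
import json


def split_structured_prefix(response: str):
    """Split a model reply into (metadata dict, text to speak).

    Assumes (hypothetically) that the model was prompted to emit one JSON
    object first, followed by the plain text meant for the TTS stage.
    """
    stripped = response.lstrip()
    if not stripped.startswith("{"):
        # No structured prefix: treat the whole reply as speech text.
        return {}, stripped
    try:
        # raw_decode parses one JSON value and reports where it ended,
        # so anything after that index is the plain-text remainder.
        metadata, end = json.JSONDecoder().raw_decode(stripped)
    except json.JSONDecodeError:
        # Malformed prefix: fall back to speaking the reply as-is.
        return {}, stripped
    return metadata, stripped[end:].strip()


# Hypothetical reply: a per-turn style prompt, then the spoken answer.
example = '{"tts_style": "warm, unhurried"}\nSure, I can help with that.'
metadata, speech = split_structured_prefix(example)
```

In a pipeline like the one described, `metadata` could be handed to a steerable TTS model as the style prompt for that turn, while `speech` is the text it voices.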

kwindla

40,319 views • 1 month ago