Joy Hsu

@joycjhsu • 2,624 subscribers

CS PhD-ing @stanford & @knighthennessy. Studying visual reasoning, neuro-symbolic learning, and visual concepts @stanfordailab & @stanfordsvl.

Videos

What makes a maze look like a maze? Humans can reason about infinitely many instantiations of mazes—made of candy canes, sticks, icing, yarn, etc. But VLMs often struggle to make sense of such visual abstractions. We improve VLMs' ability to interpret these abstract concepts.

43,025 views • 1 year ago

What’s left w/ foundation models? We found that they still can't ground modular concepts across domains. We present Logic-Enhanced FMs:🤝FMs & neuro-symbolic concept learners. We learn abstractions of concepts like “left” across domains & do domain-independent reasoning w/ LLMs.

48,289 views • 2 years ago

How can we learn 3D visual grounding with natural supervision—only looking at QA pairs, without ground truth bounding boxes or object classification labels? We inject explicit language priors, e.g., the symmetric property that A near B ⇒ B near A, in structured vision models.

14,843 views • 2 years ago
