A multimodal model that processes both visual inputs (images/video) and text, enabling tasks such as captioning, visual question answering, and document understanding.
A multimodal model that processes both visual inputs (images/video) and text, enabling tasks such as captioning, visual question answering, and document understanding.