Vision-language model

A multimodal model that takes both visual inputs (images or video) and text as input, enabling tasks such as image captioning, visual question answering, and document understanding.
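
As a rough illustration of how visual and text inputs can end up in one model, the sketch below follows one common design: the image is cut into patches, each patch is projected into the same vector space as the text tokens, and the two are concatenated into a single input sequence. This entry does not specify any particular architecture, so everything here is an assumption made for illustration: the names (`image_to_patches`, `build_input_sequence`), the sizes (`EMBED_DIM`, `PATCH`), the whitespace tokenizer, and the random weights standing in for trained ones.

```python
import random

EMBED_DIM = 8   # hypothetical shared embedding width
PATCH = 2       # hypothetical patch side length, in pixels


def image_to_patches(image, patch=PATCH):
    """Split a 2D grayscale image (list of rows) into flattened patch vectors.

    Assumes the image height and width are multiples of `patch`.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def make_projection(in_dim, out_dim, rng):
    """Random linear map standing in for a trained vision projector."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(vec, weights):
    """Apply the linear map: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def embed_text(tokens, table, rng, dim=EMBED_DIM):
    """Look up (or lazily create) a random vector for each text token."""
    for tok in tokens:
        if tok not in table:
            table[tok] = [rng.uniform(-1, 1) for _ in range(dim)]
    return [table[tok] for tok in tokens]


def build_input_sequence(image, prompt, rng):
    """Return one sequence of vectors: visual tokens first, then text tokens."""
    patches = image_to_patches(image)
    weights = make_projection(len(patches[0]), EMBED_DIM, rng)
    visual = [project(p, weights) for p in patches]
    textual = embed_text(prompt.lower().split(), {}, rng)  # toy whitespace tokenizer
    return visual + textual


rng = random.Random(0)
image = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]  # toy 4x4 image
sequence = build_input_sequence(image, "What is in the image?", rng)
print(len(sequence), len(sequence[0]))  # 4 visual + 5 text tokens, each of width 8
```

In a real system a language model would then consume this combined sequence to produce a caption or an answer; the sketch stops at building the input and does not model that step.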