As the war over artificial intelligence (AI) chatbots heats up, Microsoft has unveiled Kosmos-1, a new AI model. The model is capable of responding to visual cues and images, in addition to text prompts and messages.
The multimodal large language model (MLLM) can help users with an array of new tasks, including visual question answering and image captioning.
Kosmos-1 could pave the way for the next stage beyond ChatGPT’s text prompts.
In a paper, Microsoft’s AI researchers wrote: “A big convergence of language, multimodal perception, action, and world modelling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context and follow instructions.”
The paper further suggests that multimodal perception, or knowledge acquisition and ‘grounding’ in the real world, is needed to move beyond ChatGPT-like capabilities toward artificial general intelligence (AGI), reports ZDNet.