I am a professor of Computer Engineering at Koç University in Istanbul and the founding director of the KUIS AI Center. Previously I was at the MIT AI Lab for 12 years and later co-founded Inquira, Inc. My research is in natural language processing and machine learning.
For prospective students, here are some research topics, papers, classes, blog posts, and past students.
I am a faculty member in the Department of Computer Engineering at Koç University and the founding director of the KUIS AI Center. Before that, I worked at the MIT Artificial Intelligence Laboratory for 12 years and founded Inquira, Inc. My research areas are natural language processing and machine learning. For interested students: research topics, papers, the courses I teach, my writings in Turkish, and our graduates.
October 18, 2024
August 08, 2024
Emre Can Açıkgöz, M.S. 2024
Current position: PhD Student, University of Illinois Urbana-Champaign (Homepage)
MS Thesis: Grounding Language in Motor Space: Exploring Robot Action Learning and Control from Proprioception. August 2024. (PDF, Presentation)
Language development, particularly in its early stages, is deeply correlated with sensory-motor experiences. For instance, babies develop progressively through unsupervised exploration and incremental learning, such as labeling the action of "walking" only after discovering how to move their legs via trial and error. Drawing inspiration from this developmental process, our study explores robot action learning by mapping linguistic meaning onto non-linguistic experiences in autonomous agents, specifically for a 7-DoF robot arm. While current grounded language learning (GLL) in robotics emphasizes visual grounding, our focus is on grounding language in a robot's internal motor space. We investigate this through two key aspects, Robot Action Classification and Language-Guided Robot Control, both within a 'Blind Robot' scenario that relies solely on proprioceptive information without any visual input in pixel space. In Robot Action Classification, we enable robots to understand and categorize their actions using internal sensory data by leveraging Self-Supervised Learning (SSL): we pretrain an Action Decoder for better state representation. Our SSL-based approach significantly surpasses other baselines, particularly in scenarios with limited data. Language-Guided Robot Control poses a greater challenge by requiring robots to follow natural language instructions, interpret linguistic commands, generate a sequence of actions, and continuously interact with the environment. To achieve this, we utilize another Action Decoder pretrained on sensory state data and then fine-tune it alongside a Large Language Model (LLM) for better linguistic reasoning abilities. This integration enables the robot arm to execute language-guided manipulation tasks in real time. We validated our approach on the popular CALVIN benchmark, where our SSL-based methodology significantly outperformed traditional architectures on action classification, particularly in low-data scenarios. Moreover, on the instruction-following tasks, our Action Decoder-based framework achieved results on par with large Vision-Language Models (VLMs) in the CALVIN table-top environment. Our results underscore the importance of robust state representations and the potential of the robot's internal motor space for learning embodied tasks.
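The following is a minimal sketch, in PyTorch, of the two-stage recipe the abstract describes: self-supervised pretraining of a proprioceptive state encoder by masked reconstruction, followed by a small classification head for action labels. This is my own illustrative code, not the thesis implementation; the dimensions, the masking objective, and all names (StateEncoder, ssl_pretrain_step, etc.) are assumptions.

# Minimal sketch (not the thesis code): SSL pretraining of a proprioceptive
# state encoder by masked reconstruction, then supervised action classification.
# All shapes, names, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

STATE_DIM = 7          # e.g. joint angles of a 7-DoF arm (assumed)
SEQ_LEN = 32           # proprioceptive window length (assumed)
NUM_ACTIONS = 10       # number of action classes (assumed)

class StateEncoder(nn.Module):
    """GRU encoder mapping a proprioceptive sequence to a latent vector."""
    def __init__(self, hidden=128):
        super().__init__()
        self.gru = nn.GRU(STATE_DIM, hidden, batch_first=True)

    def forward(self, x):                 # x: (B, SEQ_LEN, STATE_DIM)
        _, h = self.gru(x)                # h: (1, B, hidden)
        return h.squeeze(0)               # (B, hidden)

encoder = StateEncoder()
decoder = nn.Linear(128, SEQ_LEN * STATE_DIM)   # reconstructs the full sequence
classifier = nn.Linear(128, NUM_ACTIONS)

def ssl_pretrain_step(batch, mask_prob=0.3):
    """Self-supervised step: zero out random timesteps, reconstruct the sequence."""
    mask = (torch.rand(batch.shape[:2]) < mask_prob).unsqueeze(-1)
    corrupted = batch.masked_fill(mask, 0.0)
    recon = decoder(encoder(corrupted)).view_as(batch)
    return nn.functional.mse_loss(recon, batch)

def classification_step(batch, labels):
    """Supervised fine-tuning on (possibly few) labeled action segments."""
    logits = classifier(encoder(batch))
    return nn.functional.cross_entropy(logits, labels)

# Toy usage with random data standing in for robot trajectories.
states = torch.randn(8, SEQ_LEN, STATE_DIM)
labels = torch.randint(0, NUM_ACTIONS, (8,))
print(ssl_pretrain_step(states).item(), classification_step(states, labels).item())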
Full post...
April 09, 2024
IPN interview (in Turkish): The Artificial Intelligence Revolution: How Will We Survive in the Future
Full post...
October 26, 2023
Batuhan Özyurt, M.S. 2023
Current position: AI Research Engineer, Codeway Studios (LinkedIn)
MS Thesis: Localizing Knowledge in Large Language Model Representations. October 2023. (PDF)
Large language models (LLMs) are highly proficient at NLP tasks. In the first part of this work, we evaluate the performance of LLMs on the task of finding the locations of characters inside a long narrative. The objective is to generate the correct answer when the input is a piece of a narrative followed by a question asking for the location of a character. To evaluate the task, we create two new datasets, Andersen and Persuasion, by annotating the characters and their locations in the narratives. We show that LLM performance on these datasets is unsatisfactory compared to a simple baseline we designed that does not use machine learning. We also experiment with in-context learning to improve performance and report the results. Moreover, we address the problem that LLMs are limited by their bounded context length. We hypothesize that if we can localize the character-location information among the activations inside an LLM, we can store those activations and inject them into other models run with a different prompt, so that the LLM can answer questions about information carried over from another prompt even though the character-location relation is not mentioned explicitly in the current prompt. We develop five techniques to localize the character-location information in LLMs: moving and adding LLM activations to other prompts, adding noise to LLM activations, checking cosine similarity between LLM activations, editing LLM activations, and visualizing attention scores during answer generation. We report the observations made using these techniques.
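As a rough illustration of the first of these techniques (moving and adding activations from one prompt to another), here is a hedged sketch using GPT-2 and Hugging Face Transformers. The prompts, the layer choice, and the patching position are my own assumptions for demonstration, not the thesis setup or models.

# Minimal sketch of one localization probe: copy hidden activations from a
# "source" prompt (which states where a character is) into a run on a
# "target" prompt that omits that fact. GPT-2 stands in for larger models;
# layer, position, and prompts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which transformer block to patch (assumed)

source = "Anna walked into the kitchen and sat down."
target = "Question: Where is Anna? Answer:"

# 1) Capture the source prompt's hidden states at the chosen layer.
with torch.no_grad():
    src_out = model(**tok(source, return_tensors="pt"), output_hidden_states=True)
src_hidden = src_out.hidden_states[LAYER]        # (1, src_len, d_model)

# 2) During the target run, add the source's last-token activation to the
#    target's last-token activation at the same layer via a forward hook.
def patch(module, inputs, output):
    hidden = output[0]                           # GPT-2 blocks return a tuple
    if hidden.shape[1] > 1:                      # patch only the prompt pass
        hidden[:, -1, :] = hidden[:, -1, :] + src_hidden[:, -1, :]
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(patch)
with torch.no_grad():
    out = model.generate(**tok(target, return_tensors="pt"), max_new_tokens=5)
handle.remove()

print(tok.decode(out[0]))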
Full post...
September 15, 2023
İlker Kesen, Ph.D. 2023
Current position: Postdoctoral Scientist, Department of Computer Science, University of Copenhagen - DIKU (LinkedIn, Website, Scholar, Github, Twitter)
PhD Thesis: Advancing Toward Temporal and Commonsense Reasoning in Vision-Language Learning. September 2023. (PDF, Presentation)
Humans learn to ground language to the world through experience, primarily visual observations. Devising natural language processing (NLP) approaches that can reason in a manner similar to humans is a long-standing objective of the artificial intelligence community. Recently, transformer models have exhibited remarkable performance on numerous NLP tasks. This was followed by breakthroughs in vision-language (V&L) tasks, like image captioning and visual question answering, which require connecting language to the visual world. These successes of transformer models encouraged the V&L community to pursue more challenging directions, most notably temporal and commonsense reasoning. This thesis focuses on V&L problems that require temporal reasoning, commonsense reasoning, or both simultaneously. Temporal reasoning is the ability to reason over time; in the context of V&L, this means going beyond static images, i.e., processing videos. Commonsense reasoning requires capturing the implicit general knowledge about the world surrounding us and making accurate judgments using this knowledge within a particular context. The thesis comprises four distinct studies that connect language and vision by exploring various aspects of temporal and commonsense reasoning. Before advancing to these challenging directions, (i) we first focus on the localization stage: we experiment with a model that enables systematic evaluation of how language conditioning should affect the bottom-up and top-down visual processing branches, and show that conditioning the bottom-up branch on language is crucial for grounding visual concepts like colors and object categories. (ii) Next, we investigate whether existing video-language models succeed at answering questions about complex dynamic scenes. We choose the CRAFT benchmark as our test bed and show that state-of-the-art video-language models fall behind human performance by a large margin, failing to process dynamic scenes proficiently. (iii) In the third study, we develop a zero-shot video-language evaluation benchmark to assess the language understanding abilities of pretrained video-language models. Our experiments reveal that, when it comes to everyday dynamic actions, current video-language models are no better than vision-language models that process static images as input. (iv) In the last study, we work on a figurative language understanding problem called euphemism detection. Euphemisms tone down expressions about sensitive or unpleasant issues, and the ambiguous nature of euphemistic terms makes it challenging to detect their actual meaning within a context where commonsense knowledge and reasoning are necessities. We show that incorporating additional textual and visual knowledge in low-resource settings helps detect euphemistic terms. Nonetheless, our findings across these four studies demonstrate a substantial gap between current V&L models' abilities and human cognition.
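One common way to implement a zero-shot language-understanding probe like the one in study (iii) is a caption-versus-foil comparison: the model is judged correct when it scores the true caption above a minimally edited foil. The sketch below shows that general recipe, not necessarily the exact protocol of the thesis; the Example fields, the accuracy function, and the dummy scorer are illustrative assumptions standing in for a real pretrained video-language model.

# Minimal, hypothetical sketch of a foil-based zero-shot evaluation harness.
# `score(video_id, text)` is assumed to return a video-text compatibility score
# from some pretrained video-language model.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    video_id: str      # pointer to a video clip (assumed identifier)
    caption: str       # description that matches the clip
    foil: str          # minimally edited caption that no longer matches

def accuracy(examples: List[Example],
             score: Callable[[str, str], float]) -> float:
    """Fraction of examples where the model prefers the caption over the foil."""
    correct = sum(score(ex.video_id, ex.caption) > score(ex.video_id, ex.foil)
                  for ex in examples)
    return correct / len(examples)

# Toy usage with a dummy scorer standing in for a real video-language model.
examples = [Example("vid_001",
                    "The person opens the drawer.",
                    "The person closes the drawer.")]
dummy_score = lambda video_id, text: float("open" in text)   # placeholder
print(accuracy(examples, dummy_score))                        # 1.0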
Full post...