PhD Offer - M/F - Computer science - Multimodal representations for Multimedia Question Answering


General information

Reference: UPR3251-CAMGUI-001
Workplace: ORSAY
Date of publication: Thursday, June 25, 2020
Scientific supervisor: Camille Guinaudeau
Type of contract: PhD Student contract / Thesis offer
Contract period: 36 months
Start date of the thesis: 1 October 2020
Proportion of work: Full time
Remuneration: €2,135.00 gross monthly

Description of the thesis topic

Context of the thesis

The PhD thesis will take place in the context of the MEERQAT project, which aims to tackle the problem of analyzing ambiguous visual and textual content by learning and combining their representations and by taking into account existing knowledge about entities. The objective of the project is not only to disambiguate one modality by using the other appropriately, but also to disambiguate both jointly by representing them in a common space.

The full set of contributions proposed in the project will be used to solve a new task, namely Multimedia Question Answering (MQA). This task requires relying on three different sources of information to answer a textual question about visual data, together with a knowledge base (KB) containing millions of unique entities and associated texts. In a simple form, the MQA task is commonly carried out in everyday life through a decomposed process. For example, while watching a film or a TV series, one may wonder "In which movie did I already see this actress?". Answering usually requires first determining the actress's name from the credits of the film, then accessing a knowledge base such as IMDb or Wikipedia to obtain the list of films she previously appeared in. In a simpler form, such a scenario also responds to industrial needs. For example, in the context of maintenance or technical support, one may have to determine the reference of a particular product in order to access the information required to perform a technical operation: the reference can be obtained through a visual query (taking a picture of the object, as other means are not always available), and access to the relevant information can then be posed as a QA problem.

MQA is related to the recent Visual Question Answering (VQA) problem, which consists of answering questions about the content of given images, but differs from it in that we propose to consider questions whose meaning results from the combination of both text and image: the image can provide context for understanding the text, the text can help focus on a particular image region, or both modalities can give clues for finding an answer. Likewise, existing VQA systems consider only fairly general categories [1], even when they use a KB [2], whereas we propose to deal with a large number of entities. Moreover, we propose to study how to model the collaboration between different modalities for answering questions, which is quite a new topic: even in text and KB Question Answering (QA), most systems search for an answer either in text or in a knowledge base, but not both. Only a few recent hybrid text and KB QA approaches have developed a collaborative strategy. [3] defined query expansion and relaxation techniques to search for information in the text contexts associated with triples. [4], on the contrary, first searches for information in texts annotated with KB entities, and then uses SPARQL queries if the text strategy is not successful. In [5], a hybrid search is performed by decomposing questions into subparts that are searched in both kinds of resources; the resulting answers are then aggregated for the final answer selection.

In the project, three types of modality are considered: (1) the visual modality, extracted from the pixels of images; (2) the textual modality, extracted from natural-language questions, from captions and other textual content that is "near" an image, and from the textual documents used to populate a knowledge base with information about entities; and (3), by a slight misnomer that will ease understanding in the following, the structural modality, which reflects the links identified between entities and recorded in the knowledge base. MEERQAT thus aims at answering a question composed of textual and visual modalities by relying on a knowledge base that contains information relative to the visual, textual and structural modalities.
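To make this distinction concrete, a minimal sketch of a KB entry exposing the three modalities is given below; all field names and the example entity are illustrative assumptions, not part of the MEERQAT design.

# Toy illustration (assumed, not from MEERQAT) of a KB entry carrying the
# three modalities discussed above.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class KBEntity:
    name: str
    # Visual modality: embeddings extracted from the pixels of images.
    visual: List[List[float]] = field(default_factory=list)
    # Textual modality: descriptions, captions and associated documents.
    textual: List[str] = field(default_factory=list)
    # "Structural" modality: typed links to other entity identifiers.
    structural: Dict[str, List[str]] = field(default_factory=dict)

kb = {
    "Q1": KBEntity(
        name="Some Actress",  # hypothetical entity
        textual=["An actress who appeared in ..."],
        structural={"cast_member_of": ["Q2", "Q3"]},
    ),
}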

Objectives of the thesis

The objective of the thesis is to exploit representations shared by several modalities, more specifically text, image and knowledge, to propose, implement and evaluate models for the MQA task. This objective will be pursued following an incremental approach: first focusing on the selection of answers from texts or the KB, then integrating all the modalities, first through late fusion and then through early fusion.

To this end, the candidate will first define a typology of questions based on the use of image, text and KB, and study the role of each modality according to the type of question, i.e. which modality is the most suitable for answering a given type of question. This taxonomy will guide the selection of existing corpora and the automatic construction of new ones.

Then, the candidate will focus on the late merging of modalities to propose a first MQA model. S/he will apply an extractive approach to answer selection in which entities act as a pivot: the entities mentioned in the question are searched for within the KB, and the information linked to them is extracted in order to match the question and select candidate answers. These pieces of information are then aligned with subparts of the question, and the answer is defined as the entity or piece of text that best justifies the alignment, i.e. whose links within the KB best match the question. This approach allows great flexibility for the input: both textual and visual inputs can be used to find entities in the KB. Hence, the QA problem consists in decomposing the question into elementary sub-questions that can be aligned with known pieces of information; the decomposition can be based on linguistic knowledge, as in [5], or learned along with the alignment process. The candidate will first study two modalities (text and KB), using existing word and entity representations. S/he will explore a late fusion of the modalities by decomposing questions into sub-questions and learning their alignment with the different sources (texts, entities or triples) using neural models with an attention mechanism that selects the relevant information in the two representations being compared [6] (a minimal sketch is given below). S/he will then aggregate the results, for example by relying on an Integer Linear Programming (ILP) approach [5], which makes it possible to model constraints for selecting the best candidate as the answer.
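As an illustration only, the following PyTorch sketch shows one plausible form of this alignment step, assuming sub-questions and candidate evidence (texts, entities or triples) are already embedded as vectors; the module, the bilinear scoring function and all dimensions are assumptions, not the project's prescribed design.

import torch
import torch.nn as nn

class SubQuestionAligner(nn.Module):
    # Scores alignments between sub-question and evidence embeddings,
    # in the spirit of attention-based comparison models such as [6].
    def __init__(self, dim=256):
        super().__init__()
        # Bilinear compatibility between a sub-question and a piece of evidence.
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def forward(self, sub_questions, evidence):
        # sub_questions: (n_sub, dim); evidence: (n_ev, dim)
        n_sub, n_ev, dim = sub_questions.size(0), evidence.size(0), evidence.size(1)
        q = sub_questions.unsqueeze(1).expand(n_sub, n_ev, dim).reshape(-1, dim)
        e = evidence.unsqueeze(0).expand(n_sub, n_ev, dim).reshape(-1, dim)
        scores = self.bilinear(q, e).view(n_sub, n_ev)
        # Attention over evidence: which pieces of information best match
        # each sub-question.
        attention = torch.softmax(scores, dim=1)
        # A simple per-candidate score: keep each candidate's best alignment.
        candidate_scores = attention.max(dim=0).values
        return attention, candidate_scores

aligner = SubQuestionAligner(dim=256)
sub_q = torch.randn(3, 256)   # e.g. 3 sub-questions from the decomposition
evid = torch.randn(10, 256)   # e.g. 10 candidate entities, passages or triples
attention, candidate_scores = aligner(sub_q, evid)

In the full model, scores of this kind would feed the aggregation step mentioned above (e.g. the ILP formulation of [5]), which enforces consistency constraints across sub-questions instead of scoring each candidate independently.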

Finally, the candidate will study deep neural network architectures for modeling the complete task and will test several strategies for integrating different types of embeddings. More particularly, s/he will adapt different attention mechanisms to compare the multimodal question representation with the multimodal information extracted from the KB in order to guide the answer selection process (a sketch is given below). S/he will also study learning the decomposition of questions jointly with the alignments and the answer prediction, through joint or multi-task learning. We expect such an integrated architecture for MQA to result in (1) enhancing the recognition of entities in the KB sources and (2) improving the question representation by providing a richer context and learning a holistic semantic representation.
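Purely as a hedged illustration of the kind of architecture involved: the sketch below uses cross-attention to let the textual question attend to image regions, producing a multimodal question representation that is then matched against multimodal KB embeddings; all module names, pooling choices and dimensions are assumptions, not the thesis design.

import torch
import torch.nn as nn

class MultimodalQuestionEncoder(nn.Module):
    # Fuses the textual and visual parts of a question with cross-attention.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Question tokens (queries) attend to image region features (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_tokens, image_regions):
        # text_tokens: (batch, n_tokens, dim); image_regions: (batch, n_regions, dim)
        fused, _ = self.cross_attn(text_tokens, image_regions, image_regions)
        # Pool the visually grounded token representations into one vector.
        return self.proj(fused.mean(dim=1))

def rank_kb_entries(question, kb_embeddings):
    # kb_embeddings: (n_entities, dim) entity representations combining the
    # visual, textual and structural modalities in the common space.
    return torch.softmax(question @ kb_embeddings.T, dim=-1)

encoder = MultimodalQuestionEncoder()
text = torch.randn(1, 12, 256)     # e.g. 12 question tokens
regions = torch.randn(1, 36, 256)  # e.g. 36 detected image regions
question_repr = encoder(text, regions)
answer_probs = rank_kb_entries(question_repr, torch.randn(1000, 256))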


References
[1] Q. Wu, C. Shen, P. Wang, A. Dick, and A. van den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 163:21–40, 2017.
[2] Q. Wu, C. Shen, P. Wang, A. Dick, and A. van den Hengel. Image captioning and visual question answering based on attributes and external knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[3] M. Yahya, K. Berberich, S. Elbassuoni, and G. Weikum. Robust question answering over the web of linked data. In CIKM, 2013.
[4] S. Park, S. Kwon, B. Kim, and G. G. Lee. ISOFT at QALD-5: Hybrid question answering system over linked data and text data. In Working Notes of CLEF 2015, 2015.
[5] K. Xu, S. Reddy, Y. Feng, S. Huang, and D. Zhao. Question answering on Freebase via relation extraction and textual evidence. In ACL, 2016.
[6] H.-Y. Huang, C. Zhu, Y. Shen, and W. Chen. FusionNet: Fusing via fully-aware attention with application to machine comprehension. In ICLR, 2018.

Work Context

The PhD student will join the Spoken Language Processing group at LIMSI, located in Orsay (91400).
