By continuing to browse the site, you are agreeing to our use of cookies. (More details)

Efficient Exploration in Structured Bandits and Reinforcement Learning

This offer is available in the following languages:
Français - Anglais

Ensure that your candidate profile is correct before applying. Your profile information will be added to the details for each application. In order to increase your visibility on our Careers Portal and allow employers to see your candidate profile, you can upload your CV to our CV library in one click!

Faites connaître cette offre !

General information

Reference : UMR9189-EMIKAU-001
Date of publication : Tuesday, July 30, 2019
Scientific Responsible name : Emilie Kaufmann
Type of Contract : PhD Student contract / Thesis offer
Contract Period : 36 months
Start date of the thesis : 1 November 2019
Proportion of work : Full time
Remuneration : 2 135,00 € gross monthly

Description of the thesis topic

The goal of this PhD is to propose new algorithms for complex sequential decision making problems, that are provably efficient and use the right amount of exploration.

Sequential decision making in a random environement is relevant for many applications, that range from recommender systems and adaptive clinical trials to the control of complex systems such as smart grids. Good decision policies require a trade-off between exploring the environment and exploiting actions that look more promising so far, in order to maximize some rewards(in a so-called reinforcement learning setting). In simple cases, where an agent repeatedly choose from different actions without any information on their structure (as in a stochastic multi-armed bandit problem), this trade-off is well understood and optimal algorithms exist. More recently, such algorithms have been proposed in so-called structured bandit models, where information on different actions may be acquired when trying a particular action (this is the case for example in the precense of a linear or low-rank structure) [1]. Those algorithms remain very inefficient in practice despite their asymptotic optimal guarantees. They are based on the knowledge of an "oracle" allocation between the different actions and require some forced-exploration that ensure that the proportions in which each action has been tried converges to the optimal allocation. This Tracking mechanism also emergent for other bandit problems, called active testing problems, in which there is no need to maximize rewards, but the goal is to learn something about the system as fast as possible (typically discover the action with maximal mean payoff [2] or the best current action in a game [3]).

In this PhD, we will investigate the use of alternative methods to replace the mechanism described above, which does not work well in practice. In particular, we will revisit the use of the optimism-in-face-of-uncertainty principle (algorithms based on confidence intervals taking the structure into account) or the use of Thompson Sampling [4] for structured bandits and for adaptive testing problems. Our focus will be on proposing a finite-time analysis of the new algorithms we will develop.

More broadly, we will try to understand what is the minimal exploration required beyond those variants of bandit problems with a finite number of arms. For example, in linear contextual bandits, it has been recently shown that no exploration is needed at all if the contexts are diverse enough [5]. We want to understand in which settings this property is preserved, in order to better understand exploration for reinforcement learning.

[1] Minimal Exploration in Structured Stochastic Bandits. Combes et al. NIPS 2017
[2] Optimal Best Arm Identification with Fixed Confidence. Garivier and Kaufmann. COLT 2016
[3] Monte-Carlo Tree Search by Best Arm Identification. Kaufmann and Koolen. NIPS 2017
[4] Analysis of Thompson Sampling for the Multi-armed Bandit Problem. Agrawal and Goyal. COLT 2012
[5] Mostly exploration-free algorithms for contextual bandits. Bastani et al. ArXiv:1704.09011, 2018.

Work Context

The candidate will work in the SequeL team of the CRIStAL laboratory (UMR 9189). This team in located at Inria Lille, 40 av. Halley in Villeneuve d'Asqu. The main advisor will be Emilie Kaufmann (CR CNRS, soon HDR) and Odalric-Ambrym Maillard (CR INRIA, HDR) will be the official advisor..

Constraints and risks

No constraint to report.

We talk about it on Twitter!