
	Reinforcement Learning and Artificial Intelligence (RLAI)
	Project Proposals

The aim of this page is to present project proposals for RLIP and to encourage discussion and participation among class members.

Reinforcement Learning and Music Composition

Music composition is a problem that has not been explored in the context of reinforcement learning. While the exact formalism of music composition as a reinforcement learning problem may be complicated, the general idea is quite intuitive. A reinforcement learning agent can select notes to play, while an observer gives rewards to the agent based on whether the observer enjoys the resulting melody.

There are some interesting issues that can be explored in this project:

Learning Quickly: A human observer will quickly lose interest if the agent continually produces dissonant passages, so the agent must be able to produce acceptable passages within a small number of episodes.
Temporal Abstraction: This issue arises in two ways. First, notes are held for varying numbers of beats, so some notes span several timesteps. In a real-time setting this requires handling actions that span multiple timesteps. Second, there are certain sequences of notes that may be desirable to repeat. An agent that can learn to group such a sequence as a single unit may be better equipped to produce enjoyable music.
Complex Reward Structure: A reinforcement learning agent attempts to maximize its reward. However, in the case of music composition, this attempt to maximize reward will not necessarily result in a coherent melody line.
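The multi-timestep actions described under Temporal Abstraction can be sketched as a Q-learning update in which one action (a note held for several beats) collects one reward per beat. This is only a minimal sketch under stated assumptions: the function name smdp_q_update, the (pitch, duration) action encoding, and the step-size and discount values are all hypothetical and not part of the proposal.

```python
from collections import defaultdict


def smdp_q_update(q, state, action, rewards, next_state, next_actions,
                  alpha=0.1, gamma=0.9):
    """One Q-learning update for an action that lasted len(rewards) beats.

    `rewards` holds one reward per beat while the note was held
    (hypothetical encoding; the proposal does not fix one).
    """
    k = len(rewards)
    # Discounted sum of the per-beat rewards collected during the action.
    g = sum((gamma ** i) * r for i, r in enumerate(rewards))
    # Value of the best follow-up action, discounted by the action's duration.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = g + (gamma ** k) * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Usage: a note "C" held for 2 beats, then the agent may play "E" for 1 beat.
q = defaultdict(float)
smdp_q_update(q, "s0", ("C", 2), [1.0, 0.5], "s1", [("E", 1)])
```

Discounting the next state's value by gamma ** k rather than gamma is what lets long and short notes be compared on the same scale.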

Formulating composition as a reinforcement learning task is itself a difficult problem. We need to choose the granularity of the agent's action selection (does the agent select individual notes or entire passages?). We also need to define the agent's state. We can consider the previous note played as the agent's 'state', but it may be more useful to remember the previous n notes selected. Another facet of the formulation is that we are thinking in terms of a discretized set of notes. In reality, the space of tones is continuous, and an agent could operate over this continuous space. A further consideration is the use of domain knowledge. Certain intervals, such as thirds and fifths, are commonly found in music, and it may be desirable to give the agent such knowledge.
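One possible formulation can be sketched as tabular Q-learning in which the state is the previous n notes and each action is a single note. Everything concrete here is an assumption made for illustration: the five-note SCALE, the listener_reward function (a stand-in for the human observer that simply rewards thirds and fifths), and all parameter values.

```python
import random
from collections import defaultdict, deque

# Assumed five-note scale, as semitone offsets from the root (hypothetical choice).
SCALE = [0, 3, 5, 7, 10]


def listener_reward(prev_note, note):
    """Hypothetical stand-in for the observer: rewards thirds and fifths."""
    interval = abs(note - prev_note)
    return 1.0 if interval in (3, 4, 7) else 0.0


def train(n=2, episodes=200, length=16, alpha=0.2, gamma=0.9,
          epsilon=0.2, seed=0):
    """Epsilon-greedy Q-learning where the state is the last n notes."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        history = deque([SCALE[0]] * n, maxlen=n)
        for _ in range(length):
            state = tuple(history)
            if rng.random() < epsilon:
                action = rng.choice(SCALE)  # explore
            else:
                action = max(SCALE, key=lambda a: q[(state, a)])  # exploit
            reward = listener_reward(history[-1], action)
            history.append(action)
            next_state = tuple(history)
            best_next = max(q[(next_state, a)] for a in SCALE)
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)])
    return q


q = train()
start = (SCALE[0], SCALE[0])
print("greedy first note:", max(SCALE, key=lambda a: q[(start, a)]))
```

Increasing n enlarges the table quickly (5**n states for a five-note scale), which is one reason the choice of state matters for learning speed.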

In order to make this a feasible project, we plan on using a reduced scale (perhaps a 5-note blues scale), and we are considering an agent that improvises on a given melody (rather than creating a new melody line from scratch). This will be mainly an empirical study and we will experiment with different learning agents with different parameters in order to see what gets the best results.

People involved in this project:
Eddie (erafols@cs)
Jonas (jonas@cs)

Clustering for Function Approximation

I have long been interested in applying clustering algorithms to continuous-valued action spaces in the hope of creating an adaptive input representation for reinforcement learning.

The idea is fairly simple. As with tile coding, we will look at various combinations of the input variables. If we have 3 continuous variables x, y, z, we would look at combinations like {x, y, z, xy, xz, xyz}. However, unlike tile coding, where you decide on the tilings (which generate the input features) ahead of time, we will define a clustering algorithm that calculates these input features. With this approach, the agent could perhaps adapt its interpretation of the input variables dynamically, allowing it to solve problems with different distributions of inputs. That is, hook the algorithm up to a black box that takes an action and returns continuous variables and a reward, and the function approximator could be adapted online.
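A rough sketch of this idea, assuming online k-means as the clustering algorithm (the proposal does not name one): one clusterer is kept per subset of input variables, and the index of the nearest centroid in each clusterer plays the role of the active tile in a tiling. The class and function names, the use of all non-empty subsets, and k=4 are hypothetical choices.

```python
import itertools
import math
import random


class OnlineKMeans:
    """Online k-means over a chosen subset of the input variables."""

    def __init__(self, dims, k, rng):
        self.dims = dims  # indices of the variables this clusterer sees
        self.centroids = [[rng.random() for _ in dims] for _ in range(k)]
        self.counts = [0] * k

    def nearest(self, x):
        point = [x[d] for d in self.dims]
        return min(range(len(self.centroids)),
                   key=lambda i: math.dist(point, self.centroids[i]))

    def update(self, x):
        """Move the nearest centroid toward x (running mean); return its index."""
        i = self.nearest(x)
        self.counts[i] += 1
        rate = 1.0 / self.counts[i]
        point = [x[d] for d in self.dims]
        self.centroids[i] = [c + rate * (p - c)
                             for c, p in zip(self.centroids[i], point)]
        return i


def make_clusterers(num_vars, k=4, seed=0):
    """One clusterer per non-empty subset of variables (an assumption)."""
    rng = random.Random(seed)
    subsets = [s for r in range(1, num_vars + 1)
               for s in itertools.combinations(range(num_vars), r)]
    return [OnlineKMeans(s, k, rng) for s in subsets]


def features(clusterers, x, learn=True):
    """Return one active binary-feature index per clusterer."""
    k = len(clusterers[0].centroids)
    return [j * k + (c.update(x) if learn else c.nearest(x))
            for j, c in enumerate(clusterers)]


clusterers = make_clusterers(num_vars=3)
print(features(clusterers, (0.1, 0.5, 0.9)))
```

The returned indices can feed a linear value function in the same way as tile-coding features, with the difference that the centroids keep moving as new inputs arrive.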

More complex approaches would try to incorporate rewards or values into the clustering mechanism, grouping inputs not only by their values but by some more complex measure.

We could apply this to several 'standard' RL problems such as mountain car, Acrobot, etc.

If you have interest in this sort of project, please let me know.

Alternatively, if nobody wants to do this, we could make a pinball player. That would be fun too.
Brian (btanner@cs.ualberta.ca)

Computer Go

Go is a challenging board game that has very simple rules and yet takes years to master. Traditional minimax search methods perform poorly due to the size of the search space (estimated at 10**360) and, most importantly, the difficulty of evaluating a position. Reinforcement Learning can be applied to learn the state-value function (e.g. Markus Enzenberger's NeuroGo at UoA), but there are many challenges that still need to be overcome.

This project will use a simplified version of Go, known as Atari-Go, on a 5x5 board. In this version, the game is won with the first capture, guaranteeing straightforward termination (whereas the full game is finished by agreement between the players). Many of the other subtleties of the full game (e.g. ko - infinite loops, difficulties in scoring, discrepancies between different rule-sets) can also be avoided. In addition, the state-value function for Atari-Go will hopefully learn a good prediction of the likelihood of capture, itself a useful feature for the full game.
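The first-capture termination rule can be sketched as a liberty check on a 5x5 board. This is an illustrative sketch, not the project's code: the board encoding, the function names, and the treatment of a self-capturing move as a loss for the mover are assumptions.

```python
SIZE = 5
EMPTY, BLACK, WHITE = ".", "X", "O"


def new_board():
    return [[EMPTY] * SIZE for _ in range(SIZE)]


def neighbors(r, c):
    """Yield the on-board points orthogonally adjacent to (r, c)."""
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < SIZE and 0 <= nc < SIZE:
            yield nr, nc


def has_liberty(board, r, c):
    """Flood-fill the group containing (r, c); True if it touches an empty point."""
    color = board[r][c]
    stack, seen = [(r, c)], {(r, c)}
    while stack:
        cr, cc = stack.pop()
        for nr, nc in neighbors(cr, cc):
            if board[nr][nc] == EMPTY:
                return True
            if board[nr][nc] == color and (nr, nc) not in seen:
                seen.add((nr, nc))
                stack.append((nr, nc))
    return False


def play(board, r, c, color):
    """Place a stone; return the winner's color if the game ends, else None."""
    assert board[r][c] == EMPTY
    board[r][c] = color
    opponent = WHITE if color == BLACK else BLACK
    # A move that leaves any adjacent opposing group without liberties wins.
    for nr, nc in neighbors(r, c):
        if board[nr][nc] == opponent and not has_liberty(board, nr, nc):
            return color
    # Assumption: a move that leaves the mover's own group without liberties loses.
    if not has_liberty(board, r, c):
        return opponent
    return None


board = new_board()
play(board, 0, 0, WHITE)
play(board, 0, 1, BLACK)
print(play(board, 1, 0, BLACK))  # the white corner stone has no liberties left
```

Because the game ends at the first capture, no stones ever need to be removed from the board, which is part of what keeps this variant simple.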

The intention of the project is to learn to approximate the state-value function in terms of shape. In other words, a variety of function approximators will be explored that can learn patterns in high dimensional spaces. These could include (depending on how many people join the project): random representations (RRs), RRs with supervised learning, neural networks, tile coding.
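As one simple baseline for learning a value function in terms of shape, the sketch below uses linear TD(0) over binary features that encode the contents of every 2x2 window of the board. Both the 2x2 window features and the linear approximator are assumptions chosen for brevity; they are not among the approximators listed above, and all names and parameter values are hypothetical.

```python
SIZE = 5
CODES = {".": 0, "X": 1, "O": 2}  # empty, black, white
NUM_WINDOWS = (SIZE - 1) * (SIZE - 1)
NUM_FEATURES = NUM_WINDOWS * 3 ** 4  # 16 windows x 81 possible contents


def pattern_features(board):
    """Return one active feature index per 2x2 window (hypothetical shape features)."""
    active = []
    window = 0
    for r in range(SIZE - 1):
        for c in range(SIZE - 1):
            code = 0
            for dr, dc in ((0, 0), (0, 1), (1, 0), (1, 1)):
                code = code * 3 + CODES[board[r + dr][c + dc]]
            active.append(window * 81 + code)
            window += 1
    return active


def value(weights, active):
    """Linear value estimate: the sum of the weights of the active features."""
    return sum(weights[i] for i in active)


def td0_update(weights, active, reward, next_active,
               alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) update; the step size is split across the active features."""
    bootstrap = 0.0 if terminal else gamma * value(weights, next_active)
    delta = reward + bootstrap - value(weights, active)
    for i in active:
        weights[i] += (alpha / len(active)) * delta
    return delta


weights = [0.0] * NUM_FEATURES
empty = [["."] * SIZE for _ in range(SIZE)]
phi = pattern_features(empty)
td0_update(weights, phi, reward=1.0, next_active=[], terminal=True)
print(value(weights, phi))
```

With a reward of 1 for a win by first capture and 0 otherwise, the learned value can be read as a rough estimate of the likelihood of capturing first from a position.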

Interested in this project? Put your name down here

Dave
Sverrir

I guess I should be interested in Go, since I haven't managed to come up with a project proposal yet. I hope there's not much knowledge of the game required, since I have absolutely none (so even a current computer program could probably beat me ;) )
I've been thinking a lot about doing something that comes from a real problem, and there would probably be numerous examples here, like controlling industrial processes, driving (or racing) a car, deciding how much supply to order if you're a store owner - or, most importantly deciding what to do for your course project. But, since I know close to nothing about all these, the "building the environment" part just seemed too complicated.
Anyway, I am considering all proposals from other people interested in working with me - I could even try to make up a CV if required :) . I hope tomorrow after the class I will be able to decide upon something, or that.

Cosmin Anonymous, Wed Feb 9 21:25:57 2005


I think the Go project is especially interesting, since almost none of the most widely known two-player search methods perform well on it (due to the high branching factor). So a new solution approach for this kind of problem is really needed.

Go is a really simple game. You can find a good tutorial on it at http://gobase.org/studying/rules/?id=0&ln=uk and can play online via the Internet Go Server at http://panda-igs.joyjoy.net/

David also sent me the following links to some papers that get you up to speed on the current work in the field.

[SDS94]
Nicol N. Schraudolph, Peter Dayan, and Terrence J. Sejnowski. Temporal
difference learning of position evaluation in the game of Go. In Advances
in Neural Information Processing Systems 6. Morgan Kaufmann, 1994.
[ http://www.gatsby.ucl.ac.uk/~dayan/papers/sds94.pdf ]

[Enz96]
Markus Enzenberger. The integration of a priori knowledge into a Go
playing neural network, 1996. Available by Internet.
[ http://www.markus-enzenberger.de/neurogo.ps.gz ]

[Enz03]
Markus Enzenberger. Evaluation in Go by a neural network using soft
segmentation. In 10th Advances in Computer Games conference, pages 97-108,
2003.
[ http://www.markus-enzenberger.de/neurogo3.ps.gz ]

-Sverri

Sverrir, Wed Feb 9 22:05:24 2005

This open web page is hosted at the University of Alberta.