Learning What to Remember: Strategies for Selective External Memory in Online Reinforcement Learning Agents
Kenneth Young
Copyright © 2019 Kenneth Young
In realistic environments, intelligent agents must learn to integrate information from their past to inform present decisions. An agent's immediate observations are often limited, and some degree of memory is necessary to complete many everyday tasks. However, an agent cannot remember everything it observes: the history of observations may be arbitrarily long, making it impractical to store and process in full. In this thesis, we will develop a novel method, called online policy gradient over a reservoir (OPGOR), for selecting what to remember from the stream of observations. We will also explore a number of alternative methods for handling this selective memory problem.
OPGOR operates within the framework of external memory mechanisms for selective memory, which provide an agent with read/write access to a memory consisting of a fixed number of slots. Such mechanisms give rise to three key questions: what to read from memory, what to write to memory, and what to drop from memory when something is written.
We will focus on the question of how to learn to prioritize which information is written to and retained in an external memory. We concentrate on the online case, in which a single agent acts and learns concurrently with a limited amount of memory and compute time. In doing so, we hope to produce agents that learn to perform well while storing far less information than the full history. Our primary approach, OPGOR, applies policy gradient to the process of selecting which state variables from the entire trajectory to store in memory.
Naively applying policy gradient to draw a subset of the full history of state variables would require us to store that full history and then draw a sample from it, which is not feasible for an online method. However, a variety of algorithms exist that maintain a fixed-size sample with particular statistical properties from a stream observed one item at a time. Such algorithms are called reservoir sampling algorithms, named for the fixed-size sample, or reservoir, of items they maintain while processing the stream.
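The classic instance of this idea is Vitter's Algorithm R, which keeps a uniform random sample of k items from a stream in a single pass. The sketch below is the standard textbook algorithm, not code from this thesis, given only to make the one-pass, fixed-memory property concrete:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: keep a uniform random sample of at most k items
    from a stream, in one pass and O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1); this
            # leaves every item seen so far equally likely to be stored.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

The key property for our setting is that memory and per-item compute stay bounded no matter how long the stream grows.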
OPGOR will use a reservoir sampling algorithm to maintain an external memory in which the inclusion probability of each state variable in the history is given by a differentiable, closed-form expression. This allows us to efficiently train our memory to retain useful state variables.
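OPGOR's particular sampling scheme is developed in Chapter 5. Purely as a point of reference for weight-dependent inclusion, one standard weighted reservoir method is the Efraimidis-Spirtes A-Res algorithm, sketched below; it illustrates weighted reservoir sampling in general and is not OPGOR itself:

```python
import heapq
import random

def weighted_reservoir_sample(stream, k, rng=random):
    """A-Res (Efraimidis & Spirtes): one-pass weighted sampling without
    replacement. Each (item, weight) pair draws u ~ Uniform(0, 1) and
    gets key u ** (1 / weight); the k items with the largest keys are
    kept, so higher-weight items are more likely to be retained."""
    heap = []  # min-heap of (key, item); heap[0] is the weakest survivor
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

Note that A-Res does not by itself provide the differentiable, closed-form inclusion probabilities described above; it only shows how a stream can be subsampled in proportion to per-item weights with bounded memory.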
We test OPGOR, along with a number of alternative selective memory strategies, on a set of psychology-inspired problems, simplified to focus on the specific aspects of the problem we aim to investigate. In doing so, we explore the challenges of deciding what to retain in memory and how well the various methods handle them.
Keywords: reinforcement learning, external memory, online learning
Date: 2019-01-29
Contents

1  Learning What to Remember
2  Background
   2.1  Reinforcement Learning
   2.2  Partial Observability
   2.3  External Memory Systems and Writing Mechanisms
3  Advantage Actor-Critic with External Memory
4  Reservoir Sampling
5  Online Policy Gradient Over a Reservoir (OPGOR)
   5.1  OPGOR for a Single-State Memory
   5.2  OPGOR for a Multiple-State Memory
   5.3  OPGOR with Multiple Read Heads
   5.4  OPGOR(λ), OPGOR with Generalized Advantage Estimation
   5.5  Soft-OPGOR(λ), OPGOR with Soft Queries
   5.6  Integrating OPGOR with Actor-Critic
   5.7  Online Policy Gradient Over a Reservoir with Denominator Sampling (OPGOR-DS)
   5.8  Time Complexity
6  Experiments
   6.1  Single Decision Keychain
   6.2  Two Decision Keychain
   6.3  Rapid Reward Valuation
   6.4  Randomized Maze
   6.5  Simple Counterexample
7  Conclusion
   7.1  Contributions and Insights
   7.2  Future Work
        7.2.1  Learned State Representation
        7.2.2  Assumptions and Approximations
        7.2.3  Improved Understanding of Soft Query OPGOR
        7.2.4  Additive Networks with Stochastically Sampled Updates
        7.2.5  Value Based Selective Memory
   7.3  Summary
References