24. 20170704 24
Appendix 1: Pseudocode for pseudo-count
Data structure (with initial values)
When a separate pseudo-count is kept for each room, each thread holds the following data:
psc_vcount = np.zeros((24, maxval + 1, frsize * frsize), dtype=np.float64)
24 is the number of rooms in Montezuma's Revenge.
It is currently a constant.
In the future, the room currently being played and the connection structure of the rooms
should be detected automatically.
This would be useful for evaluating the value of exploration.
The value of exploration could then be used as an additional reward.
maxval is the maximum pixel value used in the pseudo-count.
It can be changed by an option. Default: 128
Real pixel values are scaled to fit within maxval.
frsize is the side length of the image used in the pseudo-count.
It can be changed by an option. Default: 42
The game screen is scaled to fit the image size (frsize * frsize).
When a single pseudo-count is used for all rooms, each thread holds the following data:
psc_vcount = np.zeros((maxval + 1, frsize * frsize), dtype=np.float64)
Either of the two cases above can be selected by an option.
The order of the dimensions matters for good memory locality.
If the pixel-value dimension comes last, training performance drops by
roughly 20%, because the pixel values that occur are sparse and cause many cache misses.
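As a rough illustration of the data structure above, the following sketch allocates both variants of the count table and prepares a frame so that it can index the table. The helper names `alloc_counts` and `to_psc_image`, the 210 x 160 grayscale input, and the nearest-neighbour downscaling are assumptions made for this example; the slides do not say how the screen is resized.

```python
import numpy as np

NUM_ROOMS = 24   # rooms in Montezuma's Revenge (constant, per the slide)
maxval = 128     # default maximum pixel value in the pseudo-count image
frsize = 42      # default side length of the pseudo-count image


def alloc_counts(per_room):
    """Allocate a per-thread count table; pixel-value axis precedes position axis."""
    if per_room:
        shape = (NUM_ROOMS, maxval + 1, frsize * frsize)
    else:
        shape = (maxval + 1, frsize * frsize)
    return np.zeros(shape, dtype=np.float64)


def to_psc_image(frame):
    """Hypothetical helper: shrink a 2-D uint8 frame to frsize x frsize by
    nearest-neighbour sampling (an assumption) and rescale 0..255 to 0..maxval."""
    rows = np.arange(frsize) * frame.shape[0] // frsize
    cols = np.arange(frsize) * frame.shape[1] // frsize
    small = frame[np.ix_(rows, cols)].astype(np.float64)
    scaled = np.floor(small * maxval / 255.0).astype(np.int64)
    return scaled.reshape(-1)  # flat vector of length frsize * frsize


psc_vcount = alloc_counts(per_room=True)
frame = np.random.default_rng(0).integers(0, 256, size=(210, 160), dtype=np.uint8)
psc_image = to_psc_image(frame)
```

The flat integer vector returned by `to_psc_image` has one entry per pixel position, which is the shape needed to index the pixel-value axis of the table together with a vector of positions.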
25. 20170704 25
Appendix 1: Pseudocode for pseudo-count (continued)
Algorithm (calculating the pseudo-reward)
vcount = psc_vcount[room_no, psc_image, range_k]
The result is neither a scalar nor a view: advanced (fancy) indexing returns a temporary array (a copy), so vcount keeps the counts as they were before the increment.
room_no is the index of the room currently being played.
psc_image is the screen image scaled to size frsize * frsize with pixel values up to maxval, flattened to a vector.
range_k = np.arange(frsize * frsize) (computed once at initialization)
psc_vcount[room_no, psc_image, range_k] += 1.0
The count of each observed pixel value is incremented.
r_over_rp = np.prod(nr * vcount / (1.0 + vcount))
ρ / ρ' is calculated for each pixel, and the product over all pixels gives ρ / ρ' for the screen image.
ρ / ρ' = {N/n} / {(N+1)/(n+1)} = nr * N / (1.0 + N) = nr * vcount / (1.0 + vcount), where N is the pixel's count (vcount)
nr = (n + 1.0) / n, where n is the number of observations; counting starts at initialization
psc_count = r_over_rp / (1.0 - r_over_rp)
This is the pseudo-count. As is easily confirmed, r_over_rp / (1.0 - r_over_rp) = ρ/(ρ' - ρ).
ρ/(ρ' - ρ) is not calculated directly,
because both ρ' and ρ are very small, so the calculation error in ρ' - ρ becomes large.
psc_reward = psc_beta / math.pow(psc_count + psc_alpha, psc_rev_pow)
This is the pseudo-reward calculated from the pseudo-count.
psc_beta = β; it can be changed by an option in each thread.
psc_rev_pow = 1/P; P is a float value and can be changed by an option in each thread.
psc_alpha = math.pow(0.1, P), so
math.pow(psc_count + psc_alpha, psc_rev_pow) = 0.1 for any P when psc_count is almost 0.
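The steps on this slide can be put together as a single function. This is only a sketch under stated assumptions: the function name `pseudo_reward` and its signature are hypothetical, the observation counter n is supplied by the caller and is assumed to be at least 1, and the guard against r_over_rp reaching 1 is an addition for this example that does not appear in the slides.

```python
import math
import numpy as np

maxval, frsize = 128, 42
range_k = np.arange(frsize * frsize)  # pixel positions, built once at initialization


def pseudo_reward(psc_vcount, room_no, psc_image, n, psc_beta, P):
    """Return (psc_reward, psc_count) for one frame and update the counts in place.

    n is the number of observations so far (assumed n >= 1 in this sketch).
    """
    # Advanced indexing copies, so vcount holds the counts from before the update.
    vcount = psc_vcount[room_no, psc_image, range_k]
    psc_vcount[room_no, psc_image, range_k] += 1.0

    nr = (n + 1.0) / n
    r_over_rp = np.prod(nr * vcount / (1.0 + vcount))  # rho / rho' for the frame

    # Guard added for this sketch: r_over_rp == 1 would divide by zero.
    if r_over_rp >= 1.0:
        psc_count = math.inf
    else:
        psc_count = r_over_rp / (1.0 - r_over_rp)

    psc_alpha = math.pow(0.1, P)
    psc_rev_pow = 1.0 / P
    psc_reward = psc_beta / math.pow(psc_count + psc_alpha, psc_rev_pow)
    return psc_reward, psc_count


# Usage: a frame never seen before has psc_count 0, so the reward is about psc_beta / 0.1.
counts = np.zeros((24, maxval + 1, frsize * frsize), dtype=np.float64)
image = np.zeros(frsize * frsize, dtype=np.int64)
reward, count = pseudo_reward(counts, 0, image, n=1, psc_beta=0.01, P=2.0)
```

Because each per-pixel factor is at most 1 whenever a pixel's count does not exceed n, the product stays in [0, 1] and the reward is bounded above by roughly psc_beta / 0.1.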
28. 20170704 28
Appendix 4: Effect of thread diversity
Same parameters in every thread: the score went down to 0 and did not recover.
Different parameters in each thread (diversity of parameters across threads): the score went down to 0 but recovered.
See: http://52.199.15.161/OpenAIGym/montezuma-x1/00index.html