Want to Improve the Long-Term Reward of Mixed Ranking in WeChat Top Stories (看一看)? Try Deep Reinforcement Learning
Author: rysanwang
Source: WeChat AI (微信AI)
Introduction
What Is Reinforcement Learning?
(1) Basic concepts
(2) Differences from supervised and unsupervised learning
(3) Multi-armed bandit (see the sketch after the CartPole example below)
(4) Reinforcement learning algorithms and AlphaGo
(5) Reinforcement learning in practice
import gym
import random
import numpy

# Tabular Q-learning on CartPole-v0, written against the classic gym API
# (reset() returns the observation; step() returns a 4-tuple).

# Discretize each of the 4 continuous observation dimensions into 5 bins
# so the state space is small enough for a Q-table.
N_BINS = [5, 5, 5, 5]
LEARNING_RATE = 0.05
DISCOUNT_FACTOR = 0.9
EPS = 0.3  # epsilon-greedy exploration rate
MIN_VALUES = [-0.5, -2.0, -0.5, -3.0]
MAX_VALUES = [0.5, 2.0, 0.5, 3.0]
BINS = [numpy.linspace(MIN_VALUES[i], MAX_VALUES[i], N_BINS[i]) for i in range(4)]

def discretize(obs):
    # Map a continuous observation to a tuple of bin indices (the state key).
    return tuple(int(numpy.digitize(obs[i], BINS[i])) for i in range(4))

qv = {}  # Q-table keyed by (state, action)

env = gym.make('CartPole-v0')
print(env.action_space)
print(env.observation_space)
an = env.action_space.n  # number of discrete actions (2 for CartPole)

def get(s, a):
    # Q(s, a), defaulting to 0 for unseen state-action pairs.
    return qv.get((s, a), 0)

def update(s, a, s1, r):
    # One-step Q-learning update:
    # Q(s,a) <- Q(s,a) + lr * (r + gamma * max_a' Q(s',a') - Q(s,a))
    nows = get(s, a)
    best_next = max(get(s1, 0), get(s1, 1))
    qv[(s, a)] = nows + LEARNING_RATE * (r + DISCOUNT_FACTOR * best_next - nows)

# Training loop: epsilon-greedy Q-learning.
for i in range(500000):
    obs = env.reset()
    if i % 1000 == 0:
        print(i)
    for _ in range(5000):
        s = discretize(obs)
        # Greedy action under the current Q-table...
        nowa = 1 if get(s, 1) > get(s, 0) else 0
        # ...flipped with probability EPS to keep exploring.
        if random.random() <= EPS:
            nowa = 1 - nowa
        obs, reward, done, info = env.step(nowa)
        s1 = discretize(obs)
        if done:
            reward = -10  # penalize the transition that ends the episode
        update(s, nowa, s1, reward)
        if done:
            break

# Evaluation: one fully greedy episode, rendered.
for i_episode in range(1):
    obs = env.reset()
    for t in range(5000):
        env.render()
        s = discretize(obs)
        maxa = 1 if get(s, 1) > get(s, 0) else 0
        obs, reward, done, info = env.step(maxa)
        if done:
            print('Episode finished after {} timesteps'.format(t + 1))
            break
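Item (3) above names the multi-armed bandit, the simplest reinforcement-learning setting: no state, just repeated choices among arms with unknown payoffs. Below is a minimal epsilon-greedy bandit sketch in the same spirit as the Q-learning example above; the arm payout probabilities in TRUE_PROBS are invented for illustration.

import random

TRUE_PROBS = [0.3, 0.5, 0.7]  # hypothetical Bernoulli payout probability per arm
EPS = 0.1                     # exploration rate
N_STEPS = 10000

counts = [0] * len(TRUE_PROBS)    # pulls per arm
values = [0.0] * len(TRUE_PROBS)  # running mean reward per arm

for _ in range(N_STEPS):
    if random.random() < EPS:
        arm = random.randrange(len(TRUE_PROBS))  # explore: random arm
    else:
        arm = values.index(max(values))          # exploit: best arm so far
    reward = 1 if random.random() < TRUE_PROBS[arm] else 0
    counts[arm] += 1
    # Incremental mean update: new_mean = old_mean + (r - old_mean) / n
    values[arm] += (reward - values[arm]) / counts[arm]

print(counts, [round(v, 3) for v in values])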
Why Use Reinforcement Learning?
(1) Mixed ranking in Top Stories
(2) Unified click-through-rate (CTR) prediction and ranking
(3) Introducing reinforcement learning to optimize long-term reward
(4) Advantages of reinforcement learning
Mixed ranking merges three recall channels: mp (official accounts), video, and news.
Example:
mp, video, video (clicks: 0, 1, 1)
video, mp, mp (clicks: 1, 0, 0)
video, video, video (clicks: 1, 0, 0)
Supervised learning predicts that the third arrangement is optimal: it fills each slot with the item whose individual click-through rate is highest.
Reinforcement learning predicts that the first arrangement is optimal: it chooses the arrangement whose total reward is highest.
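To make the contrast concrete, here is a toy script that scores the three arrangements both ways. The click vectors are the ones from the case above; the per-item CTR estimates in ctr are hypothetical numbers standing in for what a pointwise supervised model might predict.

# Each candidate arrangement maps to its observed per-slot clicks.
arrangements = {
    ('mp', 'video', 'video'):    (0, 1, 1),
    ('video', 'mp', 'mp'):       (1, 0, 0),
    ('video', 'video', 'video'): (1, 0, 0),
}

# Hypothetical per-item CTR estimates: video looks better in isolation.
ctr = {'video': 0.10, 'mp': 0.08}

# Pointwise supervised view: score each slot by its own predicted CTR,
# which greedily fills every slot with the single highest-CTR type.
greedy = max(arrangements, key=lambda arr: sum(ctr[item] for item in arr))
print('pointwise CTR picks:', greedy)  # ('video', 'video', 'video')

# Reinforcement-learning view: score each arrangement by its total session
# reward (total clicks), which captures interactions between slots.
best = max(arrangements, key=lambda arr: sum(arrangements[arr]))
print('total-reward picks:', best)     # ('mp', 'video', 'video')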
Applying Reinforcement Learning to Mixed Ranking in Top Stories
(1) Session-wise recommendation
(2) Personal DQN
(3) Offline evaluation: AUC?
(4) Online results
(5) Model refinement: session-based recommendation
(6) Model refinement: Bloom embedding & Dueling DQN
(7) Model refinement: Double DQN & Dueling Double DQN (DDDQN)
(8) Negative-feedback reward & focal loss
(Minimal sketches of the dueling head, the Double DQN target, and focal loss follow this list.)
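As promised above, here are minimal numpy sketches of three techniques named in items (6) through (8): the dueling head, the Double DQN target, and focal loss. These are illustrative sketches, not the production models; the function names, signatures, and default hyperparameters (gamma, alpha) are assumptions.

import numpy

def dueling_q(value, advantages):
    # Dueling head (item (6)): Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
    # Separating the state value from per-action advantages helps when many
    # actions have similar value in a given state.
    advantages = numpy.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()

def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.9):
    # Double DQN target (item (7)): the online network *selects* the argmax
    # action for s', the target network *evaluates* it, which reduces the
    # over-estimation bias of taking max(q_target_next) directly.
    a_star = int(numpy.argmax(q_online_next))
    return reward + (0.0 if done else gamma * q_target_next[a_star])

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    # Binary focal loss (item (8)): the (1 - p_t)^gamma factor down-weights
    # easy examples so training focuses on hard ones, e.g. sparse negative
    # feedback. p is the predicted positive-class probability, y is in {0, 1}.
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * numpy.log(p_t + 1e-12)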
Closing Thoughts
Actor-Critic (AC) and GAN
I'm not an RL expert either, but I see GANs as a way of using RL to solve generative modeling problems. What makes GANs different is that the reward function is fully known and differentiable with respect to the actions, the reward is non-stationary, and the reward is a function of the agent's own policy. But I think GANs are basically RL.
— Ian Goodfellow (creator of generative adversarial networks)