0. Abstract
We consider an alternative approach: converting feedback to instruction by relabeling the original instruction and training the model for better alignment in a supervised manner.

1. Introduction
There are roughly two existing directions for human alignment:
- Proximal Policy Optimization (PPO): rather complex, sensitive to hyperparameters, and requires additional training of a reward model and a value network
- imitation learning: less data-efficient, as it only makes use of the successful instruction-output pairs and completely discards the ones that do not align
Hindsight Instruction Relabeling (HIR) adopts the central idea of relabeling instructions in a hindsight fashion. HIR alternates between two phases:
- an online sampling phase that generates a dataset of instruction-output pairs
- an offline learning phase that relabels the instructions of each pair and performs standard supervised learning

2. Hindsight Instruction Relabeling
We can formulate the language model alignment as a goal-conditioned RL problem.
(1) Instruction Following as Goal-conditioned RL
A language model $\mathcal{M}$ takes an instructional prompt $\textbf{p}$ and an initial query token sequence $\textbf{q} = \{\text{q}_{0},...,\text{q}_{i}\}$ as input, and autoregressively predicts the next token $\text{e}_{i+1} = \mathcal{M}\left( \textbf{p}, \textbf{q}, \{\text{e}_{0},...,\text{e}_{i}\} \right)$.
We can view standard prompt-conditioned language tasks (e.g. multi-step reasoning) as a goal-reaching problem.
- Goal Space $\mathcal{G}$: space of instructional prompt $\textbf{p}$
- State space $\mathcal{S}$: space of input token sequence $\textbf{q} \cup \{\text{e}_{i} \}$
- Action space $\mathcal{A}$: space of output token $\text{e}_{i+1}$
- Transition probability $\mathcal{P}$: $ \mathcal{M}\left( \text{e}_{i+1} | \textbf{p}, \textbf{q}, \{\text{e}_{0},...,\text{e}_{i}\}\right) $
- Reward $\mathcal{R}$: alignment score of $\{\text{e}_{0},...,\text{e}_{i+1}\}$ with instruction $\textbf{p}$ and query $\textbf{q}$, from human or scripted feedback; HIR does not optimize this reward directly.
Here all of $\mathcal{G}, \mathcal{S}, \mathcal{A}$ are spaces of token embeddings, but $\mathcal{G}$ corresponds to instructional prompts, while $\mathcal{S}$ and $\mathcal{A}$ correspond to model inputs and outputs. In this way, we can also view the language model as a goal-conditioned policy:
$$\pi \equiv \mathcal{M}\left( \text{e}_{i+1} | \textbf{p}, \textbf{q}, \{\text{e}_{0},...,\text{e}_{i}\} \right)$$
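To make the policy view concrete, here is a minimal sketch of a single decoding step, i.e. one "action" of $\pi$, using Hugging Face transformers with `gpt2` purely as a stand-in base model (the model choice and the plain string concatenation of $\textbf{p}$ and $\textbf{q}$ are assumptions for illustration, not the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a stand-in backbone for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def policy_step(p: str, q: str, generated: str) -> str:
    """One action of the goal-conditioned policy pi = M(e_{i+1} | p, q, e_0..e_i)."""
    # State: goal (instruction p) + query q + tokens generated so far.
    state = tokenizer(p + q + generated, return_tensors="pt")
    with torch.no_grad():
        logits = model(**state).logits[0, -1]          # distribution over the next token
    next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
    return tokenizer.decode(next_id)                   # action: the sampled token e_{i+1}

# Example: one decoding step under the instruction (goal) "Answer with a single number."
print(policy_step("Answer with a single number. ", "Q: 2 + 3 = ", "A:"))
```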
(2) Algorithm Overview
i. Online Sampling
Given instruction $\textbf{p}$ and query $\textbf{q}$, we sample with temperature $\tau = 1$ to get the output sequence $\textbf{o} = \{\text{e}_{0},\text{e}_{1},...,\text{e}_{L}\}$, which gives us the online replay dataset $\mathcal{D}_{\text{online}}$.
$$\mathcal{D}_{\text{online}} = \bigcup_{ i=1}^{N}\left\{ \textbf{p}_{i},\textbf{q}_{i},\textbf{o}_{i} \right\}$$
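The sampling phase can be sketched as plain rollouts at temperature $\tau = 1$. The function below assumes a causal LM and tokenizer like the ones in the previous sketch and a list of $(\textbf{p}, \textbf{q})$ task pairs; the prompt format and `max_new_tokens` value are illustrative assumptions:

```python
def sample_online_dataset(model, tokenizer, tasks, max_new_tokens=64):
    """Online sampling phase: roll out the current model at temperature tau = 1
    to collect D_online = { (p_i, q_i, o_i) }."""
    D_online = []
    for p, q in tasks:
        inputs = tokenizer(p + q, return_tensors="pt")
        out = model.generate(
            **inputs,
            do_sample=True,            # sample instead of greedy decoding
            temperature=1.0,           # tau = 1 during the sampling phase
            max_new_tokens=max_new_tokens,
        )
        # For a causal LM, generate() returns prompt + continuation; keep the continuation only.
        o = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        D_online.append((p, q, o))
    return D_online

# Example:
# D_online = sample_online_dataset(model, tokenizer, [("Answer with a single number. ", "Q: 2 + 3 = A:")])
```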
ii. Offline Relabeling
For every instruction-output pair $(\textbf{p},\textbf{q},\textbf{o})$, which is not necessarily aligned, we relabel the pair with a new instruction that aligns with the model's outcome: $(\textbf{p}^{*},\textbf{q},\textbf{o})$.
The new instruction $\textbf{p}^{*}$ is generated based on the feedback function $\mathcal{R}(\textbf{p},\textbf{q},\textbf{o})$ and the instruction generation function $\phi(\textbf{p},\textbf{q},\textbf{o}, \textbf{r})$, which can be either learned or scripted. For simplicity, $\phi$ is also scripted, based on the correctness of the reasoning outcome.
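For a task with a known gold answer, a scripted $\mathcal{R}$ and $\phi$ could look roughly like the sketch below; `extract_answer` and the relabeled instruction template are hypothetical placeholders, not the paper's actual templates:

```python
def extract_answer(o: str) -> str:
    """Hypothetical helper: pull the final answer out of the generated text."""
    return o.strip().split()[-1] if o.strip() else ""

def feedback(p: str, q: str, o: str, gold_answer: str) -> float:
    """Scripted R(p, q, o): 1 if the model's final answer matches the gold answer, else 0."""
    return 1.0 if extract_answer(o) == gold_answer else 0.0

def relabel(p: str, q: str, o: str, r: float) -> str:
    """Scripted phi(p, q, o, r): return an instruction that the output o actually satisfies."""
    if r == 1.0:
        return p                # the output already aligns with the original instruction
    # Hypothetical hindsight instruction built from the answer the model actually gave.
    return f"Answer the question with '{extract_answer(o)}'."
```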
(3) Instruction Relabeling
HIR also conducts instruction relabeling at intermediate time steps on the generated sub-outputs.
i. Sub-output Relabeling
It is important to sample partial outputs and relabel the instruction as well: this provides denser feedback through instruction relabeling.
Suppose we relabel at the $i$-th time step. The input to the model is $\textbf{q} \cup \{\text{e}_{0},...,\text{e}_{i-1}\}$. We can edit the instruction as a future goal based on the future alignment score:
$$\textbf{p}^{*} = \phi\left( \textbf{p}, \textbf{q}, \{\text{e}_{0},...,\text{e}_{L}\}, \mathcal{R}\left( \textbf{p}, \textbf{q}, \{\text{e}_{0},...,\text{e}_{L}\} \right) \right)$$
where $\phi$ and $\mathcal{R}$ are the instruction generation function and feedback function.
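A sketch of how sub-output relabeling could produce supervised training examples: partial outputs become new model inputs, and the instruction is replaced by the hindsight goal $\textbf{p}^{*}$ computed from the full output. Treating the remaining tokens as the supervised target is an assumption for illustration:

```python
import random

def relabel_sub_outputs(p, q, o_tokens, reward_fn, phi, num_samples=4):
    """Sample intermediate time steps of the output e_0..e_L and pair them with the
    hindsight instruction p* computed from the full output's alignment score."""
    L = len(o_tokens)
    full_output = "".join(o_tokens)
    r = reward_fn(p, q, full_output)              # R(p, q, {e_0, ..., e_L})
    p_star = phi(p, q, full_output, r)            # p* = phi(p, q, {e_0, ..., e_L}, R(...))
    examples = []
    for i in random.sample(range(1, L), k=min(num_samples, max(L - 1, 0))):
        model_input = q + "".join(o_tokens[:i])   # input: q U {e_0, ..., e_{i-1}}
        target = "".join(o_tokens[i:])            # assumed supervised target: the remaining tokens
        examples.append((p_star, model_input, target))
    return examples
```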
ii. Contrastive Instruction Following
Suppose $\textbf{o}_{i} = \mathcal{M}(\textbf{q}_{i}, \textbf{p}_{i})$. Define the log probability of $\textbf{o}_{i}$ conditioned on $\textbf{q}_{k}, \textbf{p}_{k}$ as:
$$\mathcal{P}_{ik} = \log P_{\mathcal{M}}(\textbf{o}_{i}|\textbf{q}_{k}, \textbf{p}_{k})$$
We define the following contrastive loss:
$$\mathcal{L}_{contrastive} = -\sum_{i=1}^{n}\log\frac{\exp(\mathcal{P}_{ii})}{\sum_{k=1}^{n}\exp(\mathcal{P}_{ik})}$$
This helps prevent the model from learning to map different instructions to the same output.
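Since $-\log\frac{\exp(\mathcal{P}_{ii})}{\sum_{k}\exp(\mathcal{P}_{ik})}$ is exactly a cross-entropy over row $i$ of the score matrix with the diagonal as the target, the contrastive term can be implemented in a few lines (PyTorch sketch; the sum reduction is an assumption):

```python
import torch
import torch.nn.functional as F

def contrastive_instruction_loss(P: torch.Tensor) -> torch.Tensor:
    """P[i, k] = log P_M(o_i | q_k, p_k), an (n, n) matrix of sequence log-probabilities.
    Returns -sum_i log( exp(P_ii) / sum_k exp(P_ik) )."""
    targets = torch.arange(P.size(0), device=P.device)  # o_i should match its own (q_i, p_i)
    return F.cross_entropy(P, targets, reduction="sum")

# Example with n = 3 instruction-output pairs:
# loss = contrastive_instruction_loss(torch.randn(3, 3))
```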
iii. Entropy Regularization
As a common practice in RL, we apply entropy regularization to the output distribution given a particular instruction. This negative-entropy term keeps the sampling phase from collapsing too early, which helps exploration.
$$\mathcal{L}_{entropy} = \sum_{k} p_{k}\log p_{k}$$
where $p_{k}$ is the probability the model assigns to output token $k$ given the instruction and input.
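A minimal sketch of this negative-entropy term computed from the model's token logits (assuming per-token entropies averaged over positions and batch; the exact reduction may differ):

```python
import torch
import torch.nn.functional as F

def entropy_regularizer(logits: torch.Tensor) -> torch.Tensor:
    """Negative entropy of the output token distributions: sum_k p_k log p_k.
    logits: (batch, seq_len, vocab_size) for outputs conditioned on their instructions."""
    log_p = F.log_softmax(logits, dim=-1)
    neg_entropy = (log_p.exp() * log_p).sum(dim=-1)   # per-position sum_k p_k log p_k
    return neg_entropy.mean()                         # assumed reduction: mean over positions and batch
```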
3. Experiments
