
[2023] The Wisdom of Hindsight Makes Language Models Better Instruction Followers

2023. 12. 1. 09:21
Contents
  1. 0. Abstract
  2. 1. Introduction
  3. 2. Hindsight Instruction Relabeling
  4. (1) Instruction Following as Goal-conditioned RL
  5. (2) Algorithm Overview
  6. (3) Instruction Relabeling
  7. 3. Experiments

0. Abstract

We consider an alternative approach: converting feedback to instruction by relabeling the original one and training the model for better alignment in a supervised manner.

 

 

1. Introduction

There are roughly two existing directions for human alignment:

  • Proximal Policy Optimization (PPO): rather complex, sensitive to hyperparameters, and requires additional training of the reward model and value network
  • Imitation learning: less data-efficient, as it only makes use of the successful instruction-output pairs, completely abandoning the ones that do not align

 

Hindsight Instruction Relabeling(HIR) adopts the central idea of relabeling the instructions in a hindsight fashion. HIR alternates between two phases:

  • an online sampling phase to generate a dataset of instruction-output pairs
  • an offline learning phase that relabels the instructions of each pair and performs standard supervised learning

Conceptual Comparison between HIR and baseline methods

 

 

2. Hindsight Instruction Relabeling

We can formulate the language model alignment as a goal-conditioned RL problem.

 

(1) Instruction Following as Goal-conditioned RL

A language model $M$ can take an instructional prompt $p$ and an initial query token sequence $q = \{q_0, \dots, q_i\}$ as input, and autoregressively predict the next token $e_{i+1} = M(p, q, \{e_0, \dots, e_i\})$.

We can view standard prompt-conditioned language tasks (e.g. multi-step reasoning) as a goal-reaching problem.

  • Goal space $\mathcal{G}$: space of instructional prompts $p$
  • State space $\mathcal{S}$: space of input token sequences $q \cup \{e_i\}$
  • Action space $\mathcal{A}$: space of output tokens $e_{i+1}$
  • Transition probability $\mathcal{P}$: $M(e_{i+1} \mid p, q, \{e_0, \dots, e_i\})$
  • Reward $\mathcal{R}$: alignment score of $\{e_0, \dots, e_{i+1}\}$ with instruction $p$ and query $q$, from human or scripted feedback, which is not used in HIR.

Here all of $\mathcal{G}$, $\mathcal{S}$, $\mathcal{A}$ are spaces of token embeddings, but $\mathcal{G}$ corresponds to instructional prompts, while $\mathcal{S}$ and $\mathcal{A}$ correspond to model inputs and outputs. In this way, we can also view the language model as a goal-conditioned policy:

$$\pi \equiv M(e_{i+1} \mid p, q, \{e_0, \dots, e_i\})$$
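To make this mapping concrete, below is a minimal sketch (not the paper's code) of reading an autoregressive LM as a goal-conditioned policy: the prompt $p$ acts as the goal, the query $q$ plus the tokens generated so far act as the state, and the next token is the action. TinyLM and policy_step are illustrative stand-ins introduced here, not components from the paper.

```python
import torch

class TinyLM(torch.nn.Module):
    """Stand-in for the language model M (illustrative only)."""
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, vocab_size)

    def forward(self, token_ids):              # token_ids: (seq_len,)
        h = self.embed(token_ids).mean(dim=0)  # crude summary of the sequence
        return self.head(h)                    # logits for the next token

def policy_step(model, p, q, e, temperature=1.0):
    """One step of the goal-conditioned policy: sample e_{i+1} ~ M(. | p, q, e_0..e_i)."""
    state = torch.cat([p, q, e])               # goal + current state as one token sequence
    probs = torch.softmax(model(state) / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # the "action" = next token

model = TinyLM()
p = torch.tensor([1, 2, 3])                    # instructional prompt = goal
q = torch.tensor([10, 11])                     # query = initial state
e = torch.tensor([], dtype=torch.long)
for _ in range(5):                             # autoregressive rollout
    e = torch.cat([e, policy_step(model, p, q, e)])
print(e.tolist())
```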

 

(2) Algorithm Overview

i. Online Sampling

Given an instruction $p$ and a query $q$, we sample from the model with temperature $\tau = 1$ to get the output sequence $o = \{e_0, e_1, \dots, e_L\}$, which gives us the online replay dataset $\mathcal{D}_{\text{online}}$.

$$\mathcal{D}_{\text{online}} = \bigcup_{i=1}^{N} \{p_i, q_i, o_i\}$$
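A rough sketch of this phase under the same toy setup (reusing the hypothetical policy_step from the sketch above): each pair $(p_i, q_i)$ is rolled out at temperature $\tau = 1$ and the resulting triple is stored.

```python
import torch

def collect_online_dataset(model, prompts, queries, max_len=32, temperature=1.0):
    """Build D_online = union_i {p_i, q_i, o_i} by rolling out the current model."""
    dataset = []
    for p, q in zip(prompts, queries):
        e = torch.tensor([], dtype=torch.long)
        for _ in range(max_len):                                   # sample o = {e_0, ..., e_L}
            e = torch.cat([e, policy_step(model, p, q, e, temperature)])
        dataset.append((p, q, e))                                  # store the triple
    return dataset
```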

 

ii. Offline Relabeling

For every instruction-output pair $(p, q, o)$, which is not necessarily aligned, we relabel the pair with a new instruction that aligns with the outcome of the model: $(p^*, q, o)$.

The new instruction $p^*$ is generated based on the feedback function $\mathcal{R}(p, q, o)$ and the instruction generation function $\phi(p, q, o, r)$, which can either be learned or scripted. For simplicity, $\phi$ is also scripted, based on the correctness of the reasoning outcome.
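As an illustration of a scripted $\phi$, assume a question-answering setting where $\mathcal{R}$ simply checks the decoded output against a gold answer; the outputs are treated as decoded strings for readability, and the instruction template is a hypothetical placeholder, not necessarily the paper's exact wording.

```python
def reward_fn(p, q, o, gold_answer):
    """Scripted feedback R(p, q, o): 1 if the decoded output matches the gold answer."""
    return 1.0 if o.strip() == gold_answer.strip() else 0.0

def relabel_instruction(p, q, o, r):
    """Scripted phi(p, q, o, r): return an instruction that the observed output satisfies."""
    if r > 0:
        return p                                                      # output already follows p
    return "Generate a wrong answer to the following question."       # hypothetical template

def relabel_dataset(dataset, gold_answers):
    """Turn every sampled triple into a usable supervised example (p*, q, o)."""
    relabeled = []
    for (p, q, o), gold in zip(dataset, gold_answers):
        r = reward_fn(p, q, o, gold)
        relabeled.append((relabel_instruction(p, q, o, r), q, o))
    return relabeled
```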

 

(3) Instruction Relabeling

HIR also conducts instruction relabeling at intermediate time steps, on the generated sub-outputs.

 

i. Sub-output Relabeling

It is important to sample partial outputs and relabel the instruction. In this way, we can provide denser feedback through instruction relabeling.

Suppose we relabel at the $i$-th time step. The input to the model is $q \cup \{e_0, \dots, e_{i-1}\}$. We can edit the instruction as a future goal based on the future alignment score:

$$p^* = \phi\big(p, q, \{e_0, \dots, e_L\}, \mathcal{R}(p, q, \{e_0, \dots, e_L\})\big)$$

where $\phi$ and $\mathcal{R}$ are the instruction generation function and the feedback function, respectively.
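A sketch of sub-output relabeling under a simplified list-of-tokens representation; reward_fn and phi here stand for $\mathcal{R}$ and $\phi$ (scripted or learned), and the choice of intermediate steps is illustrative.

```python
import random

def relabel_sub_outputs(p, q, o_tokens, reward_fn, phi, num_samples=2):
    """Sample intermediate time steps i and build relabeled training examples.

    The model input at step i is q + e_0..e_{i-1}; the instruction is replaced by the
    future goal p* = phi(p, q, {e_0..e_L}, R(p, q, {e_0..e_L}))."""
    L = len(o_tokens)
    r = reward_fn(p, q, o_tokens)                 # alignment score of the full future output
    p_star = phi(p, q, o_tokens, r)               # relabeled instruction (future goal)
    examples = []
    for i in random.sample(range(1, L), k=min(num_samples, L - 1)):
        model_input = q + o_tokens[:i]            # q ∪ {e_0, ..., e_{i-1}}
        target = o_tokens[i:]                     # remaining sub-output to supervise on
        examples.append((p_star, model_input, target))
    return examples
```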

 

ii. Contrastive Instruction Following

Suppose $o_i = M(q_i, p_i)$. Define the log probability of $o_i$ conditioned on $(q_k, p_k)$ as:

$$P_{ik} = \log P_M(o_i \mid q_k, p_k)$$

We define the following contrastive loss:

$$\mathcal{L}_{\text{contrastive}} = -\sum_{i=1}^{n} \log \frac{\exp(P_{ii})}{\sum_{k=1}^{n} \exp(P_{ik})}$$

This helps prevent the model from learning a degenerate behavior that maps different instructions to the same output.
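A PyTorch sketch of this loss: given the $n \times n$ matrix of scores $P_{ik} = \log P_M(o_i \mid q_k, p_k)$, the sum over $i$ is exactly a cross-entropy whose target for row $i$ is the diagonal entry $P_{ii}$ (how the score matrix itself is computed is omitted here).

```python
import torch
import torch.nn.functional as F

def contrastive_instruction_loss(P):
    """L_contrastive = -sum_i log( exp(P_ii) / sum_k exp(P_ik) ).

    P is an (n, n) tensor with P[i, k] = log P_M(o_i | q_k, p_k)."""
    n = P.size(0)
    targets = torch.arange(n, device=P.device)     # each output's own instruction index
    return F.cross_entropy(P, targets, reduction="sum")

# Example: well-defined for any batch size n of relabeled examples.
loss = contrastive_instruction_loss(torch.randn(4, 4))
```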

 

iii. Entropy Regularization

As is common practice in RL, we apply entropy regularization to the output distribution given a particular instruction. This negative-entropy term keeps the sampling phase from converging too early, which allows better exploration.

$$\mathcal{L}_{\text{entropy}} = \sum_{i=1}^{n} P_i \log P_i$$
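A sketch of one way to compute such a negative-entropy term from next-token logits; whether it is applied per token or per sequence is an assumption here, not something stated above.

```python
import torch
import torch.nn.functional as F

def negative_entropy(logits):
    """Negative entropy sum_v P(v) log P(v) of the next-token distributions.

    logits: (seq_len, vocab_size) produced for one output under one instruction.
    Adding this term (with a small weight) to the training loss penalizes overly
    peaked distributions, so the sampling phase keeps exploring."""
    log_p = F.log_softmax(logits, dim=-1)
    return (log_p.exp() * log_p).sum()

# Example usage with dummy logits:
reg = negative_entropy(torch.randn(10, 100))
```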

 

3. Experiments

 

 

 

 
