
AI Paper Research

AI paper survey and notes


AI Safety & Alignment — 2024

1 paper

arXiv (Anthropic)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training


Evan Hubinger, Carson Denison, Jesse Mu et al. (2024)
