
AI Paper Research

A survey and write-up of AI papers


AI Safety & Alignment — 2021

1 paper

ACL 2022 · 1,500+ citations

TruthfulQA: Measuring How Models Mimic Human Falsehoods


Stephanie Lin, Jacob Hilton, Owain Evans (2021)
