
AI Paper Research

A survey and summary of AI papers


AI Safety & Alignment — 2022

3 papers

arXiv · 1,500+ citations

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu et al. (2022)

arXiv · 2,000+ citations

Training a Helpful and Harmless Assistant with RLHF

Yuntao Bai, Andy Jones, Kamal Ndousse et al. (2022)

arXiv · 800+ citations

Red Teaming Language Models to Reduce Harms

Deep Ganguli, Liane Lovitt, Jackson Kernion et al. (2022)
