
AI Paper Research

A survey and summary of AI papers


AI Safety & Alignment — 2023

3 papers

Anthropic Research · 500+

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning


Trenton Bricken, Adly Templeton, Joshua Batson et al. (2023)

arXiv · 400+

Representation Engineering: A Top-Down Approach to AI Transparency


Andy Zou, Long Phan, Sarah Chen et al. (2023)

arXiv (OpenAI) · 300+

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision


Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner et al. (2023)
