Posts tagged 'linear representation hypothesis'

linear representation hypothesis 2

[Paper review] Refusal in Language Models Is Mediated by a Single Direction

https://arxiv.org/pdf/2406.11717
Introduction
Related works
1. Features as directions: in model steering and similar work, contrastive pairs are used to extract the model's activations and identify a feature, and adding these feature vectors to the residual stream can change the model's behaviour.
1.1. Several works also perform concept removal on a model under the assumption that features are represented linearly.
2. Undoing safety tuning: using a dataset of harmful instructions and completions, the model's learned refusal responses are ignored without a loss in performance ..

activation steering 2024.07.10
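The "features as directions" idea in the excerpt above can be illustrated with a small sketch. It is not the paper's code: it assumes a difference-of-means estimate of the direction, uses synthetic NumPy arrays in place of real residual-stream activations, and the helper names (`difference_of_means`, `steer`, `ablate`) are hypothetical.

```python
import numpy as np

def difference_of_means(pos_acts, neg_acts):
    """Estimate a unit feature direction as mean(pos) - mean(neg)."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(resid, direction, alpha):
    """Add the feature vector (scaled by alpha) to every activation row."""
    return resid + alpha * direction

def ablate(resid, direction):
    """Remove the component along a unit direction (linear concept removal)."""
    return resid - np.outer(resid @ direction, direction)

# Synthetic stand-in for activations from contrastive prompt pairs (assumption).
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
neg = rng.normal(size=(200, d_model))                   # prompts without the feature
pos = rng.normal(size=(200, d_model)) + 3.0 * true_dir  # prompts with the feature

direction = difference_of_means(pos, neg)
steered = steer(neg, direction, alpha=4.0)  # push activations toward the feature
ablated = ablate(pos, direction)            # project the feature out
```

In a real model the same two operations would be applied to residual-stream activations through forward hooks; here they only act on the toy arrays.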

[Paper review] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

https://arxiv.org/pdf/2306.03341
Introduction
Previous works
1. Large language models were found to have latent, interpretable structure related to real-world correctness (https://arxiv.org/abs/2212.03827).
2. Large language models were found to 'know' more than they actually output (https://arxiv.org/abs/2010.11967) (not in the paper, but concurrent work: https://arxiv.org/pdf/2304.13734).
2-1. Indeed, for the TruthfulQA dataset used in this paper, the probe accuracy and the actual g..

activation steering 2024.07.05

mech. interp blogpost

mechanistic interpretability: research on reverse-engineering deep learning models. alien neuroscience :)
