Posts tagged 'activation patching'

activation patching 3

[Paper review] LINEAR REPRESENTATIONS OF SENTIMENT IN LARGE LANGUAGE MODELS

https://arxiv.org/pdf/2310.15154 A study of the direction LLMs use when solving sentiment-related tasks. https://github.com/curt-tigges/eliciting-latent-sentiment/tree/main Contribution: 1. Finds a linear representation of sentiment on synthetic data. 2. With the direction above, on real dat..

mechanistic interpretability 2024.09.03

[Paper review] Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries

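https://arxiv.org/pdf/2406.12775 How do LLMs perform multi-hop QA (closed book, zero shot)? When they get it wrong, why? Multi-hop query: a problem that requires step-by-step reasoning. For example, to predict the next token after "The spouse of the performer of Imagine is", the model has to solve two steps in order: 1. Performer of Imagine is: John Lennon. 2. Spouse of John Lennon is: Yoko Ono. The model could in principle store and read out this information in a single step, but the paper also discusses why that is not what happens. Strictly speaking, the paper only covers 2-hop queries. Setup. Dataset: a KG-format dataset, Wikid..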

mechanistic interpretability 2024.08.07
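
The two-step structure of the query in the excerpt above can be sketched as a lookup over a toy knowledge graph. This is only an illustration of what a 2-hop query asks for, not the paper's method or a model of how an LLM resolves it; the `KG` dictionary, the relation names, and `answer_two_hop` are hypothetical names introduced here.

```python
# Toy knowledge graph: (subject, relation) -> object.
# The two facts come from the example in the excerpt; the layout is an assumption.
KG = {
    ("Imagine", "performer"): "John Lennon",
    ("John Lennon", "spouse"): "Yoko Ono",
}


def answer_two_hop(entity, first_relation, second_relation, kg=KG):
    """Resolve a 2-hop query by composing two single-hop lookups."""
    # Hop 1: resolve the bridge entity (e.g. the performer of "Imagine").
    bridge = kg.get((entity, first_relation))
    if bridge is None:
        return None  # the first hop failed, so the second cannot start
    # Hop 2: look up the target relation on the bridge entity.
    return kg.get((bridge, second_relation))


# "The spouse of the performer of Imagine is"
answer = answer_two_hop("Imagine", "performer", "spouse")
```

If either hop is missing from the graph the function returns `None`, which mirrors the point that the second hop depends on the first being resolved.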

[Paper review] ReFT: Representation Finetuning for Language Models

https://arxiv.org/abs/2404.03592 Even without taking the parameter-performance tradeoff into account, ReFT achieves SOTA on instruction tuning and commonsense. On GLUE and arithmetic it also performs far better at a comparable parameter count. Related works. Adapter: PEFT by attaching an MLP (adapter) to the attention or MLP output. Unlike LoRA, the weights cannot be folded into another component, so inference incurs additional overhead. LoRA: approximates the weight update during training with low-rank matrices a, b. Because the weights can be folded..

activation steering 2024.08.02
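
The weight folding mentioned for LoRA in the excerpt above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not code from the paper or from any LoRA library: the matrix sizes, the random values, and the function names are hypothetical, and the scaling factor that LoRA usually applies to the update is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 8, 2  # hypothetical sizes, chosen only for the demo

W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight
A = rng.normal(size=(rank, d_in))   # low-rank factor (the "a" in the excerpt)
B = rng.normal(size=(d_out, rank))  # low-rank factor (the "b" in the excerpt)


def lora_forward(x):
    """Training-time view: base path plus a separate low-rank path."""
    return x @ W.T + (x @ A.T) @ B.T


# Folding: merge the low-rank update into the base weight once.
W_folded = W + B @ A


def folded_forward(x):
    """Inference-time view: a single matmul, no extra branch."""
    return x @ W_folded.T


x = rng.normal(size=(4, d_in))
outputs_match = np.allclose(lora_forward(x), folded_forward(x))
```

Because the update `B @ A` has the same shape as `W`, it merges into the existing weight. An adapter with a nonlinearity has no such merge, which is the source of the inference overhead the excerpt mentions.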

mech. interp blogpost

mechanistic interpretability. Research on reverse engineering deep learning models. alien neuroscience :)

multitoken, paper review, representation engineering, mechanistic interpretability, toxicity, future lens, linear representation hypothesis, answer attribution, mechanistical interpretability, XAI, activation steering, controllable generation, input attribution, reft, tuned lens, linear representation, patch patching, multi-hop qa, supporting factor, activation patching
