Posts in the 'mechanistic interpretability' category

mechanistic interpretability 7

[Paper review] Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation

https://arxiv.org/pdf/2406.13663 Generates attributions (citations) to the documents retrieved by RAG when producing an answer. Related works: Answer attribution: identifying which of the retrieved documents in RAG support the generated answer. https://aclanthology.org/2023.emnlp-main.398.pdf uses a hard prompt plus ICL (few-shot examples) to have the LLM write citations; when sampling responses, it draws several (4), evaluates their citations with NLI as above, and uses the response with the best citation recall. https:..

mechanistic interpretability 2024.09.09

[Paper review] LINEAR REPRESENTATIONS OF SENTIMENT IN LARGE LANGUAGE MODELS

https://arxiv.org/pdf/2310.15154 A study of the direction LLMs use when solving sentiment-related tasks. Code: https://github.com/curt-tigges/eliciting-latent-sentiment/tree/main Contribution: 1. Finds a linear representation of sentiment in synthetic data. 2. With this direction, real dat..

mechanistic interpretability 2024.09.03

[Paper review] Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

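https://arxiv.org/pdf/2406.20086 How do LLMs process multi-token sequences (multi-token words or idioms)? Contribution: 1. At the last-token position of a multi-token word or named entity, the residual stream has information about the preceding tokens erased in the early layers of decoder-only models. 2. This erasure is used to identify the effective vocabulary the model uses internally. Method: Most prior interpretability work assumed that the last position of a sentence or prompt encodes the semantics of the whole and used it accordingly, so the paper examines what information that position actually holds. For example, "Star"," Wars"..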

mechanistic interpretability 2024.08.21

[Paper review] Function Vectors in Large Language Models

https://arxiv.org/pdf/2310.15213 How does a model perform ICL (in-context learning)? Related works: ICL (in-context learning): a language model 'learning' at inference time, from a small number of demonstrations, which task it is being asked to solve. 1. Language models are few-shot learners; first presented in the GPT-3 paper. 2. https://arxiv.org/abs/2211.15661 indirectly shows that ICL on a synthetic task (linear regression) amounts to stochastic gradient descent. 3. In https://arxiv.org/pdf/2212.10559, ICL in general..

mechanistic interpretability 2024.08.12

[Paper review] Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries

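https://arxiv.org/pdf/2406.12775 How do LLMs do multi-hop QA (closed book, zero shot)? When they get it wrong, why? Multi-hop query: a problem that requires step-by-step reasoning. For example, to predict the next token after "The spouse of the performer of Imagine is", the model has to solve two steps in order: 1. Performer of Imagine is: John Lennon. 2. Spouse of John Lennon is: Yoko Ono. The model could of course store and read this information in a single step, but the paper also covers the case where it does not. Strictly speaking, the paper deals only with 2-hop queries. Setup: Dataset: a KG-format dataset, Wikid..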

mechanistic interpretability 2024.08.07

[Paper review] Future Lens: Anticipating Subsequent Tokens from a Single Hidden State

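https://arxiv.org/pdf/2311.04897 Does a model's hidden state (residual stream) predict not only the immediate next token but also the tokens after it? Related works: Logit lens, tuned lens, and similar methods make it possible to see, in a human-interpretable way, how the next-token prediction is built up layer by layer in the residual stream. Relatedly, there is also work on early decoding directly from an intermediate layer during the model's prediction. As a follow-up to these, this paper shows that the tokens after the immediate next token can be predicted as well. Methods: Preliminaries: In the paper, GPT (decoder-only transformer)..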

mechanistic interpretability 2024.07.31

[Paper review] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

https://arxiv.org/pdf/2401.01967 Introduction: Related works: 1. Transformer MLP unembedding: this paper and this blog post interpret each transformer MLP output neuron by passing its weights through the unembedding layer and reading the resulting logits. 2. https://arxiv.org/pdf/2311.12786 and others interpret the effect of fine-tuning mechanistically. Contribution: 1. Uses the MLP unembedding above to find the neurons in gpt2-medium that make toxic contributions, then 2. uses them to suppress toxic generation, and 3. through DPO, t..

mechanistic interpretability 2024.07.25

mech. interp blogpost

mechanistic interpretability. Research on reverse-engineering deep learning models. alien neuroscience :)

linear representation hypothesis, activation steering, paper review, supporting factor, controllable generation, reft, mechanistical interpretability, patch patching, answer attribution, multi-hop qa, toxicity, input attribution, representation engineering, future lens, XAI, activation patching, multitoken, tuned lens, mechanistic interpretability, linear representation
