Posts in the 'All categories' category

All categories (13)

ICLR 2025 preview

A collection of interesting-looking ICLR 2025 submissions (reading list), with notes on what I read while skimming through them. https://openreview.net/pdf?id=SfNmgDqeEa LOOKING BEYOND THE TOP-1: TRANSFORMERS DETERMINE TOP TOKENS IN ORDER. For GPT-style models, a study of what the later layers do once a lower layer has already settled (as seen through the logit lens) on which token to predict. It also improves early-exiting performance. https://openreview.net/pdf?id=z1mLNhWFyY GRADIENT ROUTING: MASKING GRADIENTS TO LOCALIZE COMPUTATION IN NEURAL NETWORKS. When training a model, specific..

Personal 2024.10.16

[Paper Review] Programming Refusal with Conditional Activation Steering

https://arxiv.org/pdf/2409.05907 Existing activation steering can already make a model generate the way we want, so the goal now is to make it generate the way we want only *when* we want. In particular, the model should refuse only when it receives a harmful query. https://ro1ex-ai.tistory.com/2 [Paper review] Refusal in Language Models Is Mediated by a Single Direction https://arxiv.org/pdf/2406.11717 Introduction. Related works. 1. Features as direction: in model steering and similar work, contrastive pairs are used to get the model's activat..
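To make the "only when we want" idea concrete, here is a minimal sketch, not the paper's implementation. It assumes the condition is checked with a plain cosine similarity between a hidden state and a condition direction; `condition_vec`, `steer_vec`, `threshold`, and `alpha` are all hypothetical names and values chosen for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def conditional_steer(h, condition_vec, steer_vec, threshold=0.5, alpha=4.0):
    """Add alpha * steer_vec to h only if h aligns with condition_vec.

    Simplified assumption: the condition is a cosine-similarity threshold.
    Returns the (possibly steered) hidden state and whether steering fired.
    """
    if cosine(h, condition_vec) > threshold:
        return h + alpha * steer_vec, True
    return h, False

rng = np.random.default_rng(0)
d = 16
condition_vec = rng.normal(size=d)  # hypothetical "harmful query" direction
steer_vec = rng.normal(size=d)      # hypothetical "refusal" direction

# A hidden state that points strongly along the condition direction.
harmful_h = 3.0 * condition_vec + 0.1 * rng.normal(size=d)

# A hidden state made orthogonal to the condition direction.
benign_h = rng.normal(size=d)
benign_h -= (benign_h @ condition_vec) / (condition_vec @ condition_vec) * condition_vec

steered_h, fired_harmful = conditional_steer(harmful_h, condition_vec, steer_vec)
same_h, fired_benign = conditional_steer(benign_h, condition_vec, steer_vec)
print(fired_harmful, fired_benign)
```

Plain activation steering would add `steer_vec` unconditionally; the only change here is the gate in front of it.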

activation steering 2024.09.26

SVD notes for ML

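0) SVD: any matrix A can be decomposed in the form A = (U)(S)(V.T). U and V are orthonormal and S is diagonal. The shapes are as shown in the figure. If we think of A as data, A can be seen as built from the K direction components given by the rows of V.T, each contributing by an amount U*S. In other words, K matrices like the blue and purple ones at the upper right of the figure are added together (each one a contribution) to form A. Here the blue component of A takes the D-dimensional direction vector V.T[0] and expands along that direction the component that the N data points (A) have in it. Also, because the directions are orthogonal, the contributions do not overlap. 0.5) U and V can be thought of as rotations (possibly combined with a reflection) and S as scaling. That is, S..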
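The "sum of K contributions" view above can be checked numerically. This is a small sketch with NumPy; the sizes `N` and `D` and the random data are assumptions for illustration, and `K = min(N, D)` comes from using the reduced SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 4                      # assumed: N data points, D dimensions
A = rng.normal(size=(N, D))

# Reduced SVD: U is (N, K), s is (K,), Vt is (K, D), with K = min(N, D).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
K = s.shape[0]

# One rank-1 matrix per direction Vt[k]; adding all K of them rebuilds A.
components = [s[k] * np.outer(U[:, k], Vt[k]) for k in range(K)]
A_rebuilt = sum(components)

# The first component equals the data projected onto Vt[0],
# expanded back along that same direction.
proj0 = np.outer(A @ Vt[0], Vt[0])

# Orthogonality means distinct components do not overlap:
# their element-wise (Frobenius) inner product is zero.
overlap = float(np.sum(components[0] * components[1]))

print(np.allclose(A_rebuilt, A), np.allclose(proj0, components[0]), abs(overlap) < 1e-10)
```

Keeping only the first few entries of `components` gives the usual low-rank approximation of A.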

Personal 2024.09.25

[Paper review] Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation

https://arxiv.org/pdf/2406.13663 Generates attributions (citations) when producing an answer from the documents retrieved in RAG. Related works. Answer attribution: in RAG, identifying which of the retrieved documents support the generated answer. https://aclanthology.org/2023.emnlp-main.398.pdf uses a hard prompt + ICL (few-shot examples) to have the LLM write citations, samples several (4) responses, evaluates the citations with NLI as shown above, and selects the response with the best citation recall. https:..

mechanistic interpretability 2024.09.09

[Paper review] LINEAR REPRESENTATIONS OF SENTIMENT IN LARGE LANGUAGE MODELS

https://arxiv.org/pdf/2310.15154 A study of the direction LLMs use when solving sentiment-related tasks. https://github.com/curt-tigges/eliciting-latent-sentiment/tree/main Contribution. 1. Finds a linear representation of sentiment on synthetic data. 2. With this direction, on real dat..

mechanistic interpretability 2024.09.03

[Paper review] Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

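https://arxiv.org/pdf/2406.20086 How do LLMs process multi-token sequences (multi-token words or idioms)? Contribution. 1. At the last token position of a multi-token word / named entity, decoder-only models erase the information about the preceding tokens from the residual stream in the early layers. 2. This is used to identify the effective vocabulary the model uses internally. Method. Most earlier interpretability work assumed that the last position of a sentence or prompt encodes the semantics of the whole, so the paper checks what information that position actually holds. For example, "Star"," Wars"..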

mechanistic interpretability 2024.08.21

[Paper review] Function Vectors in Large Language Models

https://arxiv.org/pdf/2310.15213 How do models perform ICL (in-context learning)? Related works. ICL (in-context learning): a language model 'learning' at inference time, from a small number of demonstrations, which task it is supposed to solve. 1. "Language Models are Few-Shot Learners": first presented in the GPT-3 paper. 2. https://arxiv.org/abs/2211.15661 indirectly shows that ICL on a synthetic task (linear regression) amounts to stochastic gradient descent. 3. In https://arxiv.org/pdf/2212.10559, ICL in the general..

mechanistic interpretability 2024.08.12

[Paper review] Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries

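https://arxiv.org/pdf/2406.12775 How do LLMs do multi-hop QA (closed-book, zero-shot)? When they get it wrong, why? Multi-hop query: a problem that requires step-by-step reasoning. For example, to predict the token that follows "The spouse of the performer of Imagine is", the model has to solve, in order, 1. Performer of Imagine is: John Lennon, 2. Spouse of John Lennon is: Yoko Ono. The model could of course store and read out this information in one step, and the paper also covers the fact that it does not. Strictly speaking, the paper deals only with 2-hop queries. Setup. Dataset: a KG-format dataset, Wikid..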

mechanistic interpretability 2024.08.07

[Paper review] ReFT: Representation Finetuning for Language Models

https://arxiv.org/abs/2404.03592 Even without taking the parameter-performance tradeoff into account, it reaches SOTA on instruction tuning and commonsense. On GLUE and arithmetic it also performs far better at a comparable parameter count. Related works. Adapter: PEFT by attaching an MLP (adapter) to the attention or MLP output. Unlike LoRA, the weights cannot be folded into another component, so inference incurs extra overhead. LoRA: approximates the weight update during training with low-rank matrices a, b. Because the weights can be folded..
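The point about folding can be seen in a few lines. This is a minimal sketch assuming the standard LoRA form W + b·a; the shapes, the rank `r`, and the random values are illustrative, and the usual scaling factor is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2          # assumed sizes; r is the low rank

W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight
a = rng.normal(size=(r, d_in))      # low-rank down-projection
b = rng.normal(size=(d_out, r))     # low-rank up-projection
x = rng.normal(size=(5, d_in))      # a batch of 5 inputs

# During training: the low-rank path runs next to the frozen weight.
y_unfolded = x @ W.T + (x @ a.T) @ b.T

# For inference: fold the update into W once, then use a single matmul.
W_folded = W + b @ a
y_folded = x @ W_folded.T

print(np.allclose(y_unfolded, y_folded))
print("full update params:", d_out * d_in, "LoRA params:", r * (d_in + d_out))
```

An adapter has a nonlinearity between its two projections, so it cannot be collapsed into `W` this way, which is where the extra inference overhead comes from.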

activation steering 2024.08.02

[Paper review] Future Lens: Anticipating Subsequent Tokens from a Single Hidden State

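https://arxiv.org/pdf/2311.04897 Does a model's hidden state (residual stream) predict not only the immediate next token but also the tokens after it? Related works. With the logit lens, tuned lens, etc., the way each layer builds up the next-token prediction in the residual stream can be viewed in a human-interpretable form. Relatedly, there is also work on early decoding directly from an intermediate layer of the model. As a follow-up to that, this paper shows that the tokens after the immediate next one can be predicted as well. Methods. Preliminaries: in the paper, GPT (decoder-only transformer) is..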
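As a rough illustration of the logit lens mentioned above, here is a toy sketch, not code from the paper. It assumes a residual stream that is just a running sum of per-layer updates, an unembedding matrix `W_U` with orthonormal columns, and it skips the final layer norm that a real model would apply before unembedding. All names and sizes are hypothetical.

```python
import numpy as np

def logit_lens(hidden_states, W_U):
    """Decode each layer's residual stream with the unembedding matrix.

    hidden_states: (n_layers, d_model) residual stream at one position.
    W_U: (d_model, vocab) unembedding matrix.
    Returns the top-1 token id at every layer.
    Simplification: no final layer norm is applied here.
    """
    logits = hidden_states @ W_U
    return logits.argmax(axis=-1)

rng = np.random.default_rng(0)
d_model, vocab = 16, 10

# Toy unembedding with orthonormal columns, so each token has its own direction.
Q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
W_U = Q[:, :vocab]

# Toy per-layer updates: layer 0 writes token 3, later layers write token 7.
updates = np.stack([
    1.0 * W_U[:, 3],
    0.5 * W_U[:, 7],
    1.0 * W_U[:, 7],
    1.0 * W_U[:, 7],
])
hidden_states = np.cumsum(updates, axis=0)  # residual stream after each layer

top1_per_layer = logit_lens(hidden_states, W_U)
print(top1_per_layer.tolist())
```

In this toy the top-1 token switches from 3 to 7 at the third layer. Early decoding amounts to stopping at a layer where the prediction has already stabilized.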

mechanistic interpretability 2024.07.31


mech. interp blogpost

mechanistic interpretability. Research on reverse-engineering deep learning models. alien neuroscience :)

patch patching, tuned lens, linear representation hypothesis, multi-hop qa, linear representation, activation patching, supporting factor, input attribution, multitoken, toxicity, paper review, future lens, reft, controllable generation, representation engineering, mechanistical interpretability, mechanistic interpretability, XAI, activation steering, answer attribution
