‘Discovering Variable Binding Circuitry With Desiderata’

“Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach which extends causal mediation experiments to automatically identify model components responsible for performing a specific subtask by solely specifying a set of desiderata, or causal attributes of the model components executing that subtask.”

The paper and its full list of authors are available on arXiv.
