Anthropic Scientists Discover AI Analysis Method: A Step Towards Understanding the Digital Brain

Studying the functions that groups of neurons compute, rather than the individual neurons themselves, allowed researchers to better understand how a neural network works.

A recent study by former OpenAI employees, now working at Anthropic, proposes a new approach to understanding artificial neural networks. These networks, loosely modeled on the human brain, can perform tasks ranging from playing chess to translating languages.

Rather than scrutinizing individual neurons, the researchers focused on combinations of neurons that collectively form distinguishable patterns, or features. These features turn out to be more precise and consistent than individual neurons, making the network's behavior easier to understand.

The main drawback of studying individual neurons is that a single neuron rarely has one clearly defined role. In a language model, for example, the same neuron can fire in entirely unrelated contexts, varying its activity across them.

The paper presents a new approach to analyzing transformer models. The technique uses dictionary learning to decompose a layer of 512 neurons into more than 4,000 distinct features, covering a wide range of topics and concepts: DNA sequences, legal terminology, web queries, Hebrew text, and nutritional data, among others.
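The article does not include code, but dictionary learning over neural activations is commonly implemented as a sparse autoencoder. The sketch below is only illustrative: the 512-neuron layer and ~4,000-feature dictionary match the figures quoted above, while the architecture, loss coefficients, and random data are assumptions, not the authors' actual setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes d_model-dimensional activations into n_features sparse features."""
    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        # ReLU keeps feature activations non-negative; combined with the
        # L1 penalty below, this encourages sparsity.
        features = torch.relu(self.encoder(acts))
        reconstruction = self.decoder(features)
        return features, reconstruction

model = SparseAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
acts = torch.randn(64, 512)  # stand-in for real transformer activations

for step in range(100):
    features, recon = model(acts)
    # Objective: reconstruct the activations while penalizing feature
    # activity, pushing toward a sparse, overcomplete "dictionary".
    loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Each learned feature is a direction in activation space built from many neurons at once, which is why a 512-neuron layer can yield thousands of them.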

Such multi-faceted features remain largely invisible when studying individual neurons. The researchers use two different methods to demonstrate that these features are more interpretable than neurons.

In the first experiment, human evaluators rated how easy it was to understand what each pattern does. The features significantly outperformed neurons in interpretability.

In the second experiment, a language model generated a brief description of each feature, and a second model then predicted the feature's activation from that description alone.
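One plausible way to score such an experiment is to correlate predicted activations with measured ones. In the sketch below, describe_feature and predict_activations are hypothetical stubs standing in for the two language-model calls, and all data are toy values; only the scoring step reflects the idea described above.

```python
import numpy as np

def describe_feature(snippets):
    """Hypothetical stand-in for the first model: given snippets on which
    a feature fires strongly, return a short description of the feature."""
    return "fires on legal terminology"

def predict_activations(description, snippets):
    """Hypothetical stand-in for the second model: given only the
    description, guess how strongly the feature fires on each snippet."""
    return np.array([1.0 if "court" in s else 0.0 for s in snippets])

snippets = ["the court ruled", "GATTACA sequence", "appeal to the court"]
true_acts = np.array([0.9, 0.0, 0.8])  # measured feature activations (toy)

desc = describe_feature(snippets)
pred = predict_activations(desc, snippets)

# Interpretability score: how well the description predicts activations.
score = np.corrcoef(true_acts, pred)[0, 1]
print(f"{desc!r}: correlation {score:.2f}")
```

A feature whose description reliably predicts its activations is, by this measure, more interpretable than a polysemantic neuron whose behavior no single description captures.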

The new features also make it possible to steer the network's behavior more precisely, and similar patterns recur across different models, suggesting the features are universal. The researchers additionally experimented with varying the number of learned features, creating a "knob" for adjusting the resolution of the analysis.
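In the sparse-autoencoder framing above, that "knob" is simply the size of the learned dictionary. A hypothetical sweep, reusing the SparseAutoencoder class from the earlier sketch (the specific sizes are assumptions), might look like this:

```python
# Reusing SparseAutoencoder from the sketch above.
for n_features in (1024, 4096, 16384):
    sae = SparseAutoencoder(d_model=512, n_features=n_features)
    # Train as before; smaller dictionaries yield coarse, merged features,
    # while larger ones split them into finer-grained variants.
```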

The work is a milestone in Anthropic's pursuit of mechanistic interpretability and reflects a long-term commitment to AI safety. The research builds a bridge between computer science and neuroscience, opening new avenues for understanding artificial neural networks.