Casper van Elteren · 13 min read

The Difference Operator

On the nature of information and higher order interactions


We live in an information age where we are confronted with increasing amounts of data, choices, and decisions. But what exactly is information? And how can we quantify it? Can we have negative amounts of it? In this post, we’ll explore the nature of information and its relationship to the difference operator (XOR gate).

What is Information?

Information is commonly spoken of as a concrete thing: the more information we have, the better-informed decisions we can make. But information isn’t necessarily something physical; rather, it refers to insights accessible to “the mind’s eye.” We invoke it in everyday speech with phrases like “could you provide me with more information about product/person X?”. It is often associated with facts, data, knowledge, or understanding. In general terms, information refers to the abstract notion that having more of it makes one more certain about some outcome.

comic explaining information and knowledge

Nearly a century ago, a mathematician by the name of Claude Shannon pondered the nature of information. He was interested in how to quantify information in a way that could be transmitted over a noisy channel. He came up with a mathematical framework that treats information as a measure of reduced uncertainty. In this framework, having more information doesn’t mean knowing more facts; it means having less uncertainty about something. Information here should be understood in terms of the degree of uncertainty about an outcome: the more uncertain an outcome is beforehand, the more information its observation carries.

Shannon defined information as the expected value of the logarithm of the inverse probability of an event. He named the unit of information the “bit”: the amount of information needed to cut the uncertainty in half. This is more readily understood through a concrete example, to which I turn next.

The Story of Yes and No

Consider, for example, a simple die. A fair die will produce each face with equal probability (1/6). It may seem an odd question, but information theory allows us to ask: “How much information is there in throwing a die?”. Intuitively, one may think that there is none. After all, the die produces each face with probability 1/6.

Information theory, however, casts the problem in terms of the yes/no questions required to determine the outcome of a die roll. Say we throw a die; how much information would I need, on average, to determine the outcome? In other words, what sequence of questions could I ask to pin down the result? One question could be: “Was the outcome a 1?”. Another could be: “Was the outcome larger than 3?”. It turns out that information theory quantifies the minimum number of yes/no questions needed, on average, to determine the outcome of a die roll. For a fair die, this is 2.58 bits, meaning you would need about 2.58 well-chosen yes/no questions on average. This number is a lower bound on how few questions can possibly suffice.
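As a quick sanity check on that number, here is a minimal Python sketch (my own illustration, not part of Shannon’s original treatment) that computes the entropy of a fair die directly from its face probabilities:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: the expected surprise -log2(p) over all outcomes."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A fair six-sided die: each face comes up with probability 1/6.
fair_die = [1 / 6] * 6
print(f"Entropy of a fair die: {entropy(fair_die):.2f} bits")  # ~2.58 bits
```

Each well-posed yes/no question can at best halve the remaining possibilities, which is why the answer comes out as log2(6) ≈ 2.58 rather than a whole number.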

Shannon originally reasoned that this new concept of information should abide by a few principles.

First, information must be additive. When two independent events occur, the total information content should equal the sum of the information from each individual event. For example, receiving two unrelated messages should yield information equal to the sum of what each message individually conveys.

Second, information must exhibit continuity. Small changes in the probability of an event should result in correspondingly small changes in its information content. There should be no sudden jumps or discontinuities in how we measure information as probabilities smoothly vary.

Third, information must be unit-invariant. The measured information content should remain constant regardless of how we choose to represent it. Just as the length of a table remains the same whether measured in inches or centimeters, the information content of a message should not depend on our choice of units or notation system.

Fourth, information must be monotonic with probability. As an event becomes more probable, its information content should decrease. This aligns with our intuition that rare or surprising events carry more information than common or expected ones. For instance, learning that it rained in Seattle (a common event) carries less information than learning it snowed in Miami (a rare event).

Shannon originally worked on this theory in the practical context of improving long-distance telecommunication. In his time, voices were encoded directly into fluctuating electrical signals. These signals would degrade along the way, and the original voice would be lost. Shannon’s work allowed voice signals to be encoded as a series of bits that could be transmitted over long distances reliably. Using information theory, one could figure out how much information was needed to send a signal over a channel, effectively laying the groundwork for the digital computer age we live in today.

From Wires to Communication

The informational bit lets us reason about communication in an abstract sense. Its origins lie in solving the practical problem of transmitting information over a noisy channel, but the theory has since been applied to a variety of fields. One of the more interesting applications of information theory is the study of multivariate interactions. In this context, we can ask questions like: “How much information is there in the interaction between two variables?” or “Does knowing the outcome of A reduce the uncertainty about B?”.

It turns out that extending Shannon’s notion of information to two variables leads to a natural way of describing how two variables correlate. Mutual information quantifies how much knowing the outcome of one variable reduces the uncertainty about another; put more plainly, it is the amount of information shared between two variables. Formally, mutual information is defined as the difference between the entropy of a variable and the conditional entropy of that variable given another variable:

$$I(X;Y) = H(X) - H(X \mid Y),$$

where $X$ and $Y$ are stochastic variables, $H(X)$ is the entropy of $X$, and $H(X \mid Y)$ is the conditional entropy of $X$ given $Y$. The entropy is defined as

$$H(X) = -\sum_{x} p(x) \log_2 p(x),$$

where $-\log_2 p(x)$ encodes the “surprise” of an event $x$ happening; low-probability events generate lots of surprise and consequently carry more information.
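To make the surprise interpretation tangible, here is a small sketch (the probabilities are my own example values) comparing rare and common events, and a fair versus a heavily biased coin:

```python
import math

def surprise(p):
    """Surprise (information content) of an event with probability p, in bits."""
    return -math.log2(p)

def entropy(probabilities):
    """Expected surprise over all outcomes: H = -sum p * log2(p)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Rare events are more surprising than common ones.
print(surprise(0.01))  # ~6.64 bits: a 1-in-100 event
print(surprise(0.5))   # 1.00 bit:   a coin flip

# A fair coin is maximally uncertain; a heavily biased coin much less so.
print(entropy([0.5, 0.5]))    # 1.00 bit
print(entropy([0.99, 0.01]))  # ~0.08 bits
```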

The name (information) entropy was famously suggested to Shannon by John von Neumann, who, when Shannon was looking for a name for his new measure, said:

You should call it entropy. In the first place, a mathematical development very much like yours already exists in Boltzmann’s statistical mechanics, and in the second place, no one knows what entropy really is, so in a debate you will always have the advantage.

Entropy in physics has a rich history that would take a few books to really do justice. All I can say here is that entropy in physics relates to the amount of disorder in a system, while in information theory it is the amount of uncertainty in a variable; hence the similarity. Interesting parallels were later drawn with Szilard’s work on Maxwell’s demon and the relationship between information and entropy. But without getting into the weeds, I want to push ahead and talk about information entropy and its relation to multivariate interactions.

Information Theory in Complexity Science

The interactions inside a complex system can be viewed as a set of interconnected communication channels. The state of each entity can be treated as a random variable. Random in this context means that the variable can assume different configurations, each occurring with a certain frequency. Which configurations an entity can assume depends on the process that governs the system.

The interactions within complex systems are often depicted as a simple or directed graph, with interactions between entities indicated by (potentially weighted) edges. Such a set of entities naturally calls for an extension of information theory from bivariate to multivariate interactions. And this is where information theory gets interesting.

From Pairwise to Higher Order

In bivariate interactions, information can reside in either part. When information is shared, it creates an overlap. This gives rise to the information diagram visualized below. Let’s run through an example. Consider two random variables: $X$, whether it rains tomorrow, and $Y$, the current state of the weather. We may wonder how much uncertainty is left about tomorrow’s weather once we know the current state of the weather. This is the conditional entropy $H(X \mid Y)$: the area of the circle of $X$ that is left after knowing the outcome of $Y$. The mutual information $I(X;Y)$ is the amount of information shared between the two variables. The entropy $H(X)$ is the total amount of uncertainty about the weather tomorrow. The Venn diagram below shows the relationship between these three quantities.

Venn diagram of the entropies $H(X)$ and $H(Y)$, their overlap $I(X;Y)$, and the conditional entropy $H(X \mid Y)$.
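To put rough numbers on this picture, the sketch below uses a made-up joint distribution for the two weather variables (the probabilities are purely illustrative, not data) and checks the identity $I(X;Y) = H(X) - H(X \mid Y)$:

```python
import math

def H(probabilities):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical joint distribution p(x, y):
# x = rain tomorrow (rain/dry), y = weather today (cloudy/clear).
p_xy = {("rain", "cloudy"): 0.30, ("rain", "clear"): 0.10,
        ("dry",  "cloudy"): 0.20, ("dry",  "clear"): 0.40}

# Marginal distributions p(x) and p(y).
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

H_x = H(p_x.values())
H_y = H(p_y.values())
H_xy = H(p_xy.values())
H_x_given_y = H_xy - H_y    # H(X|Y) = H(X,Y) - H(Y)
I_xy = H_x - H_x_given_y    # I(X;Y) = H(X) - H(X|Y)

print(f"H(X) = {H_x:.3f} bits, H(X|Y) = {H_x_given_y:.3f} bits, I(X;Y) = {I_xy:.3f} bits")
```

Knowing today’s weather removes part of the uncertainty about tomorrow’s; that removed part is exactly the overlap in the diagram.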

Information theory works well for bivariate cases where one can reason about the nature of interactions between variables. However, it gets more interesting for three or more variables.

The XOR gate

An XOR (exclusive OR) gate is a digital logic component that takes two binary inputs and produces a single binary output. It is an essential logical operator at the core of all modern computers, physically realized with transistors. If inputs $X$ and $Y$ are connected to an XOR gate producing output $Z$, the gate follows this rule: the output $Z$ is 1 only when the inputs differ (one input is 1 and the other is 0). When both inputs are the same (both 0 or both 1), the output is 0. In other words, it acts as a difference operator, detecting when there’s a mismatch between its two inputs.

Input X   Input Y   Output Z
   0         0         0
   0         1         1
   1         0         1
   1         1         0

The interaction between $X$ and $Y$ in the XOR gate is quite remarkable. Examining their relationship from an information-theoretic perspective reveals that knowing the value of one input tells you absolutely nothing about the value of the other input. For example, if we know input $X$ is 1, input $Y$ could be either 0 or 1 with equal probability, leading to outputs of 1 or 0 respectively. Similarly, knowing $Y$ provides no information about $X$. This means there is zero mutual information between the inputs: they are statistically independent. Yet paradoxically, together they fully determine the output. This exemplifies a pure higher-order interaction, where the relationship between variables cannot be decomposed into pairwise correlations.

Three-way interaction diagram
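A short sketch (my own, assuming both inputs are fair and independent random bits) makes the paradox explicit: either input alone shares zero information with the other, yet the two together pin the output down completely:

```python
import math
from itertools import product

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The four equally likely configurations (X, Y, Z = X xor Y) of the XOR gate.
states = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]

def marginal(indices):
    """Marginal distribution over the chosen variables (0 = X, 1 = Y, 2 = Z)."""
    counts = {}
    for s in states:
        key = tuple(s[i] for i in indices)
        counts[key] = counts.get(key, 0) + 1
    return [c / len(states) for c in counts.values()]

# I(X;Y) = H(X) + H(Y) - H(X,Y): zero bits, the inputs are independent.
print(H(marginal([0])) + H(marginal([1])) - H(marginal([0, 1])))        # 0.0
# I(X,Y;Z) = H(X,Y) + H(Z) - H(X,Y,Z): one bit, the pair fixes the output.
print(H(marginal([0, 1])) + H(marginal([2])) - H(marginal([0, 1, 2])))  # 1.0
```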

Negative Information

The XOR gate serves as the simplest example of how collective behavior can emerge from parts that show no pairwise relationships. Higher-order interactions are often heralded as the cornerstone of many real-world problems: they are what puts the “complex” in complex systems.

It further reveals a counterintuitive aspect of information theory: negative information. Negative information arises when conditioning on a third variable increases the information shared between two variables, so the three-way (interaction) information becomes negative. In the case of the XOR gate, the mutual information between $X$ and $Y$ is zero, yet once the output $Z$ is known, the two inputs become completely dependent: knowing $Z$ and $X$ pins down $Y$ exactly. As a result, the XOR gate carries negative information.

Let’s compute the values exhaustively to see how it arises. The mutual information between $X$, $Y$, and $Z$ (also known as the interaction information) is defined as

$$I(X;Y;Z) = H(X) + H(Y) + H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z).$$

In the case of the XOR gate, the individual entropies are $H(X) = H(Y) = H(Z) = 1$ bit, since each variable on its own behaves as a uniformly random bit.

The joint entropy $H(X,Y,Z)$ represents the total uncertainty in the system. Intuitively, our system represents a deterministic relationship: given the two inputs, the output is fixed. Therefore, in the case of the XOR gate, the joint entropy is 2 bits. Now let’s compute it. The joint probability is given as

$$p(x,y,z) = \begin{cases} \tfrac{1}{4} & \text{for } (x,y,z) \in \{(0,0,0),\, (0,1,1),\, (1,0,1),\, (1,1,0)\}, \\ 0 & \text{otherwise,} \end{cases}$$

therefore our joint entropy is

$$H(X,Y,Z) = -\sum_{x,y,z} p(x,y,z) \log_2 p(x,y,z) = -4 \cdot \tfrac{1}{4} \log_2 \tfrac{1}{4} = 2 \text{ bits}.$$

Next, we move on to the pairwise terms. The joint entropy $H(X,Y)$ is the uncertainty about the pair of values $X$ and $Y$ taken together (and likewise for $H(X,Z)$ and $H(Y,Z)$). In the case of the XOR gate, these pairwise entropies are 2 bits each. To check this intuitively, note that knowing the outcome of any two variables lets us deduce the state of the third, so each pair is uniformly distributed over four configurations. The computation is effectively the same as above, except that one “hides” one of the columns of the truth table.

Therefore our three-way mutual information is

$$I(X;Y;Z) = (1 + 1 + 1) - (2 + 2 + 2) + 2 = -1 \text{ bit},$$

which is negative! Negative information is somewhat of a mystery in information theory, as it currently has no physical interpretation. It also shows a limitation of the Venn diagram representation above: the mutual information shared by all three variables would have a negative surface area, which makes for a confusing picture of the nature of information in multivariate interactions.
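For completeness, the same exhaustive bookkeeping can be done as a short Python sketch (a sketch of the calculation above, under the same assumption of uniform random inputs), which reproduces the minus one bit:

```python
import math
from itertools import product

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# All four equally likely configurations (X, Y, Z = X xor Y) of the XOR gate.
states = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]

def marginal(indices):
    """Marginal distribution over the chosen variables (0 = X, 1 = Y, 2 = Z)."""
    counts = {}
    for s in states:
        key = tuple(s[i] for i in indices)
        counts[key] = counts.get(key, 0) + 1
    return [c / len(states) for c in counts.values()]

# Inclusion-exclusion form of the three-way (interaction) information:
# I(X;Y;Z) = H(X) + H(Y) + H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z)
singles = sum(H(marginal([i])) for i in range(3))
pairs = sum(H(marginal(list(pair))) for pair in [(0, 1), (0, 2), (1, 2)])
triple = H(marginal([0, 1, 2]))

print(f"I(X;Y;Z) = {singles - pairs + triple:.1f} bits")  # -1.0 bits
```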

The Relation Between Parts and Wholes

Information theory began as a set of powerful tools for analyzing message transmission across channels. While its applications in pairwise settings are intuitive, the concept of information becomes more abstract in multivariate contexts. The XOR gate serves as an elegant example of how components can interact in ways that transcend simple pairwise relationships.

This example demonstrates that information can emerge from the interactions between components: the system’s structure and connections are fundamental to its behavior. While some interpret negative information as evidence that “the whole exceeds the sum of its parts,” this perspective unnecessarily invokes mystical thinking. As I discuss in detail in my recent paper here, the whole isn’t greater than its parts but fundamentally different in nature.

In the XOR gate, three binary variables give 2³ = 8 potential states, yet only 4 configurations actually occur. The emergence of negative information is a mathematical consequence of how information is computed, rather than an indication of any supernatural properties. The negative sign can be understood as indicating that the variables only become informative about one another jointly, which necessitates a higher-order interpretation beyond pairwise interactions.

These concepts demonstrate that complex systems can exhibit behaviors that require analyzing interactions at multiple levels, without resorting to metaphysical explanations.

In my mind, negative information is akin to the concept of imaginary numbers: it reflects that information moves along a plane rather than the line we are used to. Shannon intended the concept of information to be a measure of uncertainty that abides by intuitive principles we all share about what information is. The XOR gate highlights that, in a multivariate context, the concept of information can be quite counterintuitive.
