Accessing the hidden structure of complex systems using information theory

One of the more useful tools in the complexity scientist’s toolbox is information theory. Now, don’t worry, I’m not going to dive into this much, but I do want to talk about the central concept of information theory: Shannon entropy (also known as information entropy).

In 1948, Claude Shannon – a researcher at Bell Laboratories – was working on the problem of improving the transmission of information between a transmitter and a receiver. His answer – which he called ‘a mathematical theory of communication’ – was based on a very simple idea: surprising events carry more information than routine events. In other words, if you wake up in the morning and the sun has turned green, then that is going to jolt you into a hyper-aware state of mind where your brain is working overtime to try and make sense of what is going on. When our interactions with friends or our environment reveal information that we were not expecting, we seek to make sense of it. We process that information with a heightened sense of consciously doing so.

This response to surprise is no different whether we are individuals, in a team (discovering that a colleague is also a part-time taxidermist), an organisation (the sacking of a well-respected CEO) or an entire country (the death of Princess Diana). We seek to understand why and, in seeking to answer this, we traverse Judea Pearl’s ladder of causation. However, there is one key difference. When we are dealing with a complex system, or situation, there is uncertainty over cause and effect. This uncertainty is the result of a structural motif of complex systems – feedback loops, which I will discuss in a future post – that leads to what is called non-linear behaviour.

Information as a level of surprise is measured in binary digits (bits). The more unlikely an event is to occur, the higher the information that is generated if it should occur. Let me illustrate this with the example of flipping a coin. 

When you flip an unbiased coin there is a 50/50 chance of it landing on heads or tails. Because both outcomes are equally likely, our uncertainty about the result is at its peak: we have no more reason to expect heads than tails. Here, the Shannon entropy of flipping an unbiased coin is 1 bit, which is the maximum information that can be obtained from a system (a coin flip) that can only generate two outcomes (heads or tails).

Now, let’s assume that we’ve been given a biased coin that always lands on tails. We know that the coin is biased and so there is no surprise for us when the coin always lands on tails. If there is no surprise, then there is no information. The chance of the coin landing on tails is 100%. In this case, the Shannon entropy is 0 bits. Certainty does not yield new information.
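
The two coin examples above can be checked with a few lines of Python. This is only a minimal sketch, using the standard definition of Shannon entropy as the average surprise over all outcomes; the function names (surprisal, shannon_entropy) are my own choices for illustration, not taken from any library.

```python
import math

def surprisal(p):
    """Information, in bits, gained when an event of probability p occurs."""
    return math.log2(1 / p)

def shannon_entropy(probabilities):
    """Average surprisal, in bits, over all outcomes of a system."""
    # Outcomes that never happen (p == 0) contribute nothing to the average.
    return sum(p * surprisal(p) for p in probabilities if p > 0)

fair_coin = [0.5, 0.5]      # heads, tails
always_tails = [0.0, 1.0]   # heads, tails

print(shannon_entropy(fair_coin))     # 1.0 bit: maximum uncertainty for two outcomes
print(shannon_entropy(always_tails))  # 0.0 bits: certainty, no new information
```

A rarer event carries more surprise: surprisal(0.25) is 2 bits, compared with 1 bit for an event with probability 0.5.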

Now, we don’t need to be too concerned with whether something is 1 bit, or 0.5 bits, or 0 bits or whatever. The point I am making here is that the greater the uncertainty we have about something, the more information we can gain from that situation. Likewise, if we have total certainty then there is no information, no knowledge, to be gained. Intuitively this makes sense – if I am observing something that is not changing then I am not learning anything new about it. However, if I perturb that system – add a component, remove a component – then I may be cajoling the system into a different state. This new state may yield new information, especially if I have managed to move the system into an improbable state. (Incidentally, this is why the modes of creativity – breaking, bending, blending – are fundamental to discovering new knowledge).

For Shannon entropy to be used in more practical ways, a probabilistic model of a system would need to be constructed. This simply means that we have identified the different states that a system can occupy, and we have estimated the likelihood of the system being in each of those states at a moment in time. We can construct a probabilistic model by observing and recording the frequency with which different states occur. The more frequently we observe the system in a given state over time, the more likely we may infer it is to be found in that state at a future point. Ordinarily we need to capture enough of the history of the system to have sufficient confidence in the probabilistic model we are building. This learning takes time and requires continual sampling of the environment; and there are some challenges to solve – like how to represent the environment – but the idea is to invest time in building a probability distribution, a probabilistic model, of our environment. Novelty is a previously unseen state and so that too should trigger a response, not least an update of our probabilistic model.
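
To make the idea of building a model from observed frequencies concrete, here is a small Python sketch. It assumes the environment has already been reduced to a sequence of discrete state labels, which is the hard representation problem mentioned above; the state names ("calm", "busy", "outage") and the helper names (build_model, is_novel) are invented purely for illustration.

```python
from collections import Counter

def build_model(observations):
    """Estimate state probabilities from the relative frequency of each observed state."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {state: count / total for state, count in counts.items()}

def is_novel(state, model):
    """A state is novel if it has never been observed while building the model."""
    return state not in model

# Hypothetical history of observed states (illustrative data only).
history = ["calm", "calm", "busy", "calm", "busy", "calm", "calm", "outage"]
model = build_model(history)

print(model)                        # {'calm': 0.625, 'busy': 0.25, 'outage': 0.125}
print(is_novel("overload", model))  # True: an unseen state, so the model needs updating
```

With only eight observations these estimates would be far too rough to rely on; the sketch shows the mechanics, not how much history is enough.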

As we build our probabilistic model we are forming a hypothesis, an untested belief, about how the environment behaves. Every time we observe and capture the state that the system is in, we are testing that hypothesis. The Law of Large Numbers is relevant here. We expect to see a system move in and out of different states. It may spend more time in one state than we have observed before, or the opposite. We would need to see a persistent, recurring change in the frequency with which each state of the system is observed before we begin to suspect that our hypothesis of the system may need to be re-visited.

Now that we have constructed a probabilistic model of our environment (or, indeed, any system of interest), we can calculate its Shannon entropy. If we have a good degree of confidence that our probabilistic model is sufficiently correct, then we can baseline this measure. We can then set a sampling rate for how often we re-calculate the Shannon entropy of the probabilistic model (we may use machine learning techniques to optimise the sampling rate). If the Shannon entropy measurement begins to diverge from the baseline value – by some pre-determined tolerance of +/- x bits – then we could infer from this that the system may be changing in some way. This out-of-tolerance measurement could flag the need for further investigation – either by an intelligent agent or a human.
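
A rough Python sketch of this baseline-and-tolerance check follows. Everything specific in it is an assumption made for illustration: the state labels, the window of eight observations, the tolerance of 0.25 bits and the function names (entropy_bits, out_of_tolerance) are not part of any established technique.

```python
import math
from collections import Counter

def entropy_bits(observations):
    """Shannon entropy, in bits, of the empirical distribution of observed states."""
    counts = Counter(observations)
    total = sum(counts.values())
    # Each state's probability is count / total, so its surprise is log2(total / count).
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def out_of_tolerance(baseline_bits, window, tolerance_bits):
    """Flag a window whose entropy differs from the baseline by more than the tolerance."""
    return abs(entropy_bits(window) - baseline_bits) > tolerance_bits

# Hypothetical history used to establish the baseline (illustrative data only).
baseline_history = ["calm"] * 6 + ["busy"] * 2
baseline = entropy_bits(baseline_history)  # roughly 0.81 bits

steady_window = ["calm", "busy", "calm", "calm", "calm", "busy", "calm", "calm"]
shifted_window = ["calm", "busy", "outage", "busy", "calm", "outage", "busy", "calm"]

print(out_of_tolerance(baseline, steady_window, tolerance_bits=0.25))   # False
print(out_of_tolerance(baseline, shifted_window, tolerance_bits=0.25))  # True
```

One limitation of a check like this is that entropy depends only on the shape of the distribution, not on which state holds which probability. If "calm" and "busy" simply swapped frequencies, the entropy would be unchanged and this check alone would not notice.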

What I am describing here is an idea. I am not aware of any existing technique, or concept, that achieves this. Neither do I know if there is much utility in what I have described. I believe it is technically feasible – updating a probabilistic model and calculating its Shannon entropy can be done in polynomial time (i.e. efficiently). As such, you should interpret this for what it is: an idea that I hope interests people enough to pursue it further.

I believe this technique – parsing the environment and comparing it against a probabilistic model – could be a very efficient way to manage a vast amount of automated monitoring of an environment for changes that may warrant further investigation. Of course, this ‘further investigation’ would call into play more expensive resources such as AI and/or humans.

My motivation for conceiving of this idea comes back to the need for any organisation to become highly proficient at anticipating change. When the organisation’s environment (internal or external) may be changing in unexpected ways, we want to be observing the change as it is happening in real time, rather than analysing it after the event. Why is this important? If we are observing the genesis of an enduring change in our operating environment, then we have the opportunity to gain insights into the causes that led to that change.

Applied as an early warning system, Shannon entropy raises an alarm that our knowledge of our environment may no longer be accurate. We can respond to these warning signals by expending effort to understand the changes that may be occurring. From this we may create new knowledge and, therefore, update our semantic graph to represent that new understanding. The semantic graph is critical, because all of our collective intelligence draws on it to make good decisions. If that semantic graph is erroneous or significantly out of date, the quality of our decisions suffers. As an organisation harnesses AI to the fullest – where we are talking about millions, if not billions, of decisions being taken every second – the accuracy of the semantic graph becomes a critical and protected asset.

Anticipation gives us time to prepare; yet to accurately anticipate our environment we need to be sufficiently open to detecting changes that suggest that our understanding of the environment may no longer be up-to-date.

I’d like to finish this discussion by making one final point. The use of information theory to measure the behaviour of a dynamic system is not a new concept. Indeed, information theory is one of the most promising tools in the complexity scientist’s toolbox for unravelling the mysteries of a complex system. One of the biggest challenges for the complexity scientist is having access to information about the system of interest. Most of the time we simply cannot access a complex system with the tools we have. To give just a few examples: the brain, the weather, genetics. It is neither practical, nor feasible, for a complexity scientist to have access to every element or aspect of systems of this kind. Yet we are not without hope. As long as we can capture the signals, the data, the transmissions, from these systems then we can begin to understand the system, even though it is hidden from us. Of course, as we gain more knowledge of these systems, we can then devise precise interventions that may yield crucial insights that either confirm our hypotheses or take us completely by surprise.

Up until recently, I had been researching the use of information theory to infer the causal architecture of a system. Techniques such as Feldman & Crutchfield’s causal state reconstruction, or Schreiber’s Transfer Entropy, or Tononi’s Integrated Information Theory were all part of my toolkit. They are all valuable as they can tell us something interesting about a complex system. However, they do not have the explanatory power of causality, especially Judea Pearl’s do-calculus. I pass on this observation here to those readers who may be more familiar with these subjects.