Part 3. A model for an artificial general intelligence
Sensory Signal Graph
In Part 2 we established a concept of a sensory signal graph. It was described as:
Time stacked, feature clustered sensory signal nodes interconnected with other nodes through multiple edge dimensions — all receiving and often generating new data from both external inputs and internal simulations.
This is nice from a theory point of view, but how would something like this work in practice? (If it sounds like gobbledygook, go back and read Parts 1 and 2, or at least Part 2.) There are groups studying this phenomenon in situ with animals and people, but we’re not here to talk about biological intelligence; we’re ready to make machines think.
You can’t change the past
The first concept we need to make real is a place to store sensory input. We can’t have a sensory graph without a medium within which to connect it all together, can we? This isn’t going to be a tutorial or an implementation guide but it will describe simplified examples in sufficient detail.
What is needed is an immutable state graph. What is immutable state you ask? Immutable means that it isn’t mutated or changed, it’s a write once recording system (read only thereafter). Think about a log or a journal or an “undo” list of actions. We can also think of immutable state as a timeline of all input where the input is sensory data. This effectively gives us a time series of each node in the graph. What’s a graph? A graph is a set of connected nodes or data points. Here I’ve illustrated an example of a 3 step graph data time series. Each step represents a new record rather than a change over time. It can of course be interpreted as a change over time but that’s not how it is stored.
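To make the write-once idea concrete, here is a minimal sketch of an append-only journal of graph snapshots. The names `Snapshot`, `Journal` and `history` are made up for illustration, and the node values are arbitrary numbers; a real sensory graph would be far richer than this.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One immutable record of the whole graph at a single time step."""
    step: int
    nodes: Mapping[str, float]          # node id -> observed value
    edges: Tuple[Tuple[str, str], ...]  # (from, to) pairs


class Journal:
    """Append-only log: records can be added and read, never altered."""

    def __init__(self):
        self._records = []

    def append(self, nodes, edges):
        snap = Snapshot(
            step=len(self._records),
            nodes=MappingProxyType(dict(nodes)),  # read-only view of a private copy
            edges=tuple(edges),
        )
        self._records.append(snap)
        return snap

    def history(self, node_id):
        """Time series for one node, read back from the log."""
        return [(s.step, s.nodes[node_id])
                for s in self._records if node_id in s.nodes]


# A 3 step graph data time series: each step is a new record, not an edit.
journal = Journal()
journal.append({"a": 1.0, "b": 2.0}, [("a", "b")])
journal.append({"a": 1.5, "b": 2.0, "c": 0.3}, [("a", "b"), ("b", "c")])
journal.append({"a": 1.7, "c": 0.9}, [("a", "c")])
series_a = journal.history("a")
```

Reading `history("a")` gives the change over time as an interpretation of the log; nothing in the log itself was ever changed.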
This is a super simple graph of arbitrary information. It’s kind of useless other than that it shows a set of interconnected points that change over time. However, imagine this graph changing over time a few thousand times a second with all of the data getting logged in a journal.
This graph data time series is, to be quite honest, a hot mess to try and visualize. We can, however, pick attributes to chart, like density of entries, frequency of entries, and self-similarity (or measured difference) of entries. In fact, measuring these aspects, this meta-information about the incoming signals, will become pretty important.
Why do we need immutable state again? Well, think of it this way. Ever have trouble remembering something, or remember something differently from the way others may have remembered it? Yeah, well, that’s mutable state and it’s not good. It messes with your ability to function in a predictive way; it makes you unsure of the past and so adds uncertainty to the future. When the past is immutable and recorded in this way, future predictions are grounded in at least a self-consistent experience. We don’t want the past to change on us, but we do want to learn from it as we move forward.
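As a rough sketch of what that meta-information could look like, the function below summarizes one node’s time series. The function name `describe` and the specific measures are my own assumptions, picked as simple stand-ins for density, frequency and measured difference.

```python
def describe(entries):
    """Summarize a list of (timestamp_seconds, value) pairs sorted by time."""
    if len(entries) < 2:
        return {"count": len(entries), "rate_per_second": 0.0,
                "mean_gap": 0.0, "mean_step_change": 0.0}
    times = [t for t, _ in entries]
    values = [v for _, v in entries]
    span = times[-1] - times[0]
    gaps = [b - a for a, b in zip(times, times[1:])]          # time between entries
    steps = [abs(b - a) for a, b in zip(values, values[1:])]  # change between entries
    return {
        "count": len(entries),                                 # density
        "rate_per_second": (len(entries) - 1) / span if span > 0 else 0.0,  # frequency
        "mean_gap": sum(gaps) / len(gaps),
        "mean_step_change": sum(steps) / len(steps),           # measured difference
    }


stats = describe([(0.0, 1.0), (0.5, 1.0), (1.0, 3.0), (2.0, 3.0)])
```

A low `mean_step_change` suggests a stream that looks a lot like itself from one entry to the next; a high one suggests the opposite.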
You can only change the future
The next concept we need is a way to take the past and turn it into a prediction of the future. Not just any old future, the future we prefer. We want to predict a future that best fits into our agenda based on what we know — then we can plan a path to get there.
Let’s not dwell on the existing solutions. Plenty of information is available about how neural networks, generative adversarial networks and deep learning networks work, and classic search tree algorithms are literally at your fingertips.
We need something new-ish. We need a sensory signal classifier that outputs models for pattern recognition. It needs to be immune to noise and able to backtest like a pro. We’ll use it to test incoming data streams and make predictions based on high bandwidth but low throughput data signals (lots of data in a short period of time).
Storing the models in an immutable structure is also important, as it allows you to compare models, constantly generate new ones without losing access to the older ones, and promote an older model forward for some edge-case scenario.
There are a few terms here that will need clarification. What is meant by classifier, backtesting and model? Each of these has a somewhat variable meaning in the current state of the art.
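One way to picture an immutable model store is as two append-only lists: one of model versions and one recording which version was active at each point. This is only a sketch; `ModelStore`, `add`, `promote` and `active` are hypothetical names, and a model here is just a dictionary.

```python
class ModelStore:
    """Keeps every model version; promoting an old one never deletes anything."""

    def __init__(self):
        self._versions = []  # append-only list of (label, model) pairs
        self._active = []    # append-only history of active version numbers

    def add(self, label, model):
        """Store a new model version and make it the active one."""
        self._versions.append((label, dict(model)))  # private copy
        version = len(self._versions) - 1
        self._active.append(version)
        return version

    def promote(self, version):
        """Make an older version active again by recording a new decision."""
        if not 0 <= version < len(self._versions):
            raise IndexError("unknown model version")
        self._active.append(version)

    def active(self):
        label, model = self._versions[self._active[-1]]
        return label, dict(model)  # hand out a copy, keep the original intact


store = ModelStore()
first = store.add("daylight", {"pattern": "bright"})
store.add("night", {"pattern": "dark"})
store.promote(first)  # an edge case shows up where the old model fits better
label, _ = store.active()
```

Because promotion is itself just another appended record, the history of which model was trusted when is preserved alongside the models.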
An example classifier might take a sample of a data stream and compare it to the same stream looking for similar sequences. When a sample gets a close match, a copy of that sample goes into the model with a positive weight. When a sample doesn’t find a match over some period of time it goes into the model with a negative weight. Samples that find matches over and over again continue to be tagged or clustered, which is how we begin to deal with noisy inputs. Noise fades away over time as signal is identified and coded for.
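Here is a toy version of that idea, heavily simplified. It assumes a numeric stream, fixed-width samples, and a mean absolute difference as the notion of a “close match”; it also searches the whole stream rather than some period of time. All names are made up for the sketch.

```python
def window_distance(a, b):
    """Mean absolute difference between two equal-length samples."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def build_model(stream, width, tolerance):
    """Return {sample: weight} for every window of `width` values.

    A sample gains +1 for each close, non-overlapping match elsewhere in
    the stream and gets -1 if it never finds one.
    """
    windows = [tuple(stream[i:i + width])
               for i in range(len(stream) - width + 1)]
    model = {}
    for i, sample in enumerate(windows):
        matches = sum(
            1
            for j, other in enumerate(windows)
            if abs(i - j) >= width  # skip overlapping windows
            and window_distance(sample, other) <= tolerance
        )
        model[sample] = model.get(sample, 0) + (matches if matches else -1)
    return model


model = build_model([1, 2, 3, 0, 1, 2, 3, 7], width=3, tolerance=0.0)
```

In this run the repeated sample `(1, 2, 3)` ends up with a positive weight, while one-off samples such as `(2, 3, 7)` end up negative, which is the noise starting to fade.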
Coding the samples involves looking at timing, frequency, density, intensity and other meta-attributes. When, how often, how much — connected together in a sort of graph, with nodes and edges.
An example algorithm that achieves most of these goals is DTW, or dynamic time warping.(1) It has been used successfully to classify attributes in areas like motion capture, sensor networks and web analytics. There is a plethora of potential signal processing and data analysis algorithms available to try, but the goal remains the same: find those patterns.
It’s models all the way to the top but don’t forget to backtest
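For a feel of what DTW does, here is a minimal sketch of the textbook dynamic-programming formulation, using absolute difference as the local cost. It is not necessarily the variant described in reference (1), and it skips the windowing and lower-bounding tricks that make DTW fast on large data.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b holds
                cost[i][j - 1],      # b advances, a holds
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # stretched copy
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

The stretched copy scores 0 because DTW is allowed to hold one sequence still while the other catches up, which is exactly why it tolerates patterns that play out at different speeds.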
As you may have figured out, a model is just a collection of patterns found in a stream of data. The goal with a model is to discover which patterns are most important. For now the best we can do is to look at this meta-information (remember when I told you it would be important) which will eventually be eclipsed by higher order insights.
Backtesting is when you take a recently modeled time-series pattern and you go back and test it against older data to see if it matches historical data trends. When this happens it’s a good indication that the model may be predictive of future trends as well.
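A bare-bones backtest might look like the sketch below. It assumes the “model” is a single claim of the form “after this pattern, expect this value”, and it checks that claim against every earlier occurrence in the history. The function name and its arguments are illustrative only.

```python
def backtest(history, pattern, predicted_next, tolerance=0.0):
    """Fraction of past occurrences of `pattern` followed by `predicted_next`.

    Returns None when the pattern never occurred, so there is nothing to test.
    """
    width = len(pattern)
    hits = trials = 0
    # Stop one short of the end so every occurrence has a following value.
    for i in range(len(history) - width):
        if list(history[i:i + width]) == list(pattern):
            trials += 1
            if abs(history[i + width] - predicted_next) <= tolerance:
                hits += 1
    return hits / trials if trials else None


history = [1, 2, 3, 1, 2, 3, 1, 2, 4, 1, 2]
score = backtest(history, pattern=[1, 2], predicted_next=3)
```

Here the pattern `[1, 2]` was followed by `3` in two of its three testable occurrences, so the claim holds up most of the time but not always.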
The above classifier strategy would be applied to multiple data streams and even transformed copies of the data stream (where transformations come from is yet to be described). Then you go up a level. Start classifying the classifier models. Do any of them have similar sub-graphs? Keep the best performing aggregate models to do higher order predictions, and shorten the feedback loop by using these to do guesstimates.
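As a very rough stand-in for “similar sub-graphs”, the sketch below compares two models by how many positively weighted patterns they share (Jaccard similarity). It assumes each model is a dictionary of pattern to weight, which is my simplification; real sub-graph matching would be a much harder problem.

```python
def overlap(model_a, model_b):
    """Jaccard similarity between the positively weighted patterns of two models."""
    a = {pattern for pattern, weight in model_a.items() if weight > 0}
    b = {pattern for pattern, weight in model_b.items() if weight > 0}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two hypothetical models built from different streams.
stream_one = {(1, 2, 3): 2, (2, 3, 7): -1, (4, 4, 4): 1}
stream_two = {(1, 2, 3): 5, (9, 9, 9): 3}
similarity = overlap(stream_one, stream_two)
```

Models with high overlap are candidates for being merged into an aggregate model one level up.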
Hey we need some real feedback
Here’s where it gets interesting. We don’t have anything useful yet… it’s eerily similar to when you get a big puzzle and you’ve just started sorting out all the pieces by edge geometry and color, trying to find the sides, corners and major regions. Of course it’s entirely possible to continue in this way and solve the puzzle — it’s a nice challenge on a rainy day but really a much faster way to solve the puzzle is to cheat by looking at the picture. The picture provides a much needed feedback loop in our example.
Feedback from inputs, it turns out, is super important. In state of the art systems this is currently done via supervised training or GANs. These techniques have their uses, but what’s amazing is that they haven’t taken the obvious next step: embodiment. Yeah. Give that brain a body. I know, right? We’ve been pitting brains against brains this whole time. Seems obvious to me, but maybe it’s not, or maybe people haven’t worked out how to provide the right kind of “body” to their systems. Body here simply means a way to get new input autonomously (versus having a person/brain/GAN feed it in).
Go big or go home, a cat isn’t a cat
A common AI example is one where the system identifies pictures of cats. A typical implementation uses a library of cat photos to train with and some smaller set of photos (not all cats) to validate with. An embodied version of this would be a system that has access to a search API that will get it more photos of cats and other things. Pretty simple. Maybe not even significant as typical training sets have hundreds or thousands of images so how would a few more help?
Here’s where we diverge from the current state of the art. A cat isn’t a cat. A thing does not exist as a binary entity of ‘thing’ or ‘not thing’. A cat is a composition of qualities that altogether equal ‘catness’; a thing is a composition of those qualities it has in common with others of its ‘type’. So to truly develop an intuition for what a cat is, we must also develop an intuition for what is hair, eyes, a tail, feet, legs, claws, whiskers, ears, body… then layer on top of those an arrangement of these attributes that maps to the concept we call ‘cat’. While we’re doing this we will likely also end up mapping out a lot of other concepts. This can’t be helped and should be considered a feature, not a bug.
The reality is that unsupervised learning requires breadth of experience, not depth of experience, and this is the big difference. What is really needed in a training set of images is thousands of animals, some of which are cats… with those same cats (and other animals) each represented by tens of photos from different angles. In this way the system can learn what a cat is, distinct from what a dog is or a squirrel or a ferret or a groundhog. Then, when asked to identify cats in a set of photos, it can first reject anything that doesn’t include an animal, then reject any with animals that match a different species, and finally reject any that are not a close enough match to a cat.
Cross-training makes champions
Have you heard about how cross-training leads to better overall results? It’s not a particularly controversial position to take. There is something about how one set of skills maps on to another set of skills. There is evidence that where there is overlap the skillset becomes deeper, more nuanced and more robust. When a maths expert takes up an instrument, an artist learns to cook or a gymnast takes up ballet, unexpected (or sometimes expected) benefits begin to arise from the hybridization of each domain.
As you may have guessed, this is something we should want for AGI as well. So what does that mean in terms of an implementation? It means that we want speech recognition, optical character recognition and text parsing to be a part of our natural language processing system. For that matter we want speech generation and text creation and general computer vision with object identification and scene parsing to be there as well. Yeah baby, we want it all.
Okay, so the only thing missing here for some kind of android is embodiment within an organic system with muscles and vocal cords and stuff… I hear you, that would be cool, but that’s outside the scope of this document (you were probably thinking that nothing would qualify for that honor but there you go). Don’t worry though, there are options that should provide the kind of sensorimotor feedback we organics get from sub-vocalizing words and subtly acting out gestures with our body language.
Practically speaking this means a lot of classification needs to be happening. Using the previously described methods, each sensory channel can be creating a catalog of models that match up with various clusters of patterns found in the incoming data stream. Using the time code meta-information available in the immutably stored sensory graph, we should be able to cross-reference model data from one channel with model data in another channel that shares the same time codes. Backtesting this match-up should lead to a positive or negative consensus on whether or not, for example, the visual pattern for a cat syncs up with an audio pattern for cat noises like purring or meowing.
Even though at this point we don’t even have a word for ‘cat’ (it’s never been labeled), we may have a multi-faceted mental model for ‘catness’ that might include visual, motion/gait, audio, olfactory, tactile or other kinds of sensory information, including non-anthropomorphic kinds like a heat signature, an electromagnetic field signature or anything else that could be measured. It’s not critical to have so many different channels but it is critical that the channels are in sync.
Temporally synchronized stimuli are a great way to establish relationships between otherwise disparate data sets (we’ll go into other ways to do inferencing later). In this case, however, if it so happens that our system is exposed to a photo of a cat that has a caption that says ‘cat’, and maybe it speaks out loud the syllables for ‘kuh-ahhh-tuh, khahth, kat, cat’ and hears itself say the word, then we’ll have the final piece of the puzzle in place for communicating about cats in the English language. That temporal coding will provide the connectors between sensory input patterns to establish a concept model with a multi-dimensional profile.
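To sketch how time codes could tie two channels together, the function below takes the timestamps at which a pattern was recognized in one channel and asks how often a pattern from another channel showed up within a small time window. The function name, the example timestamps and the half-second window are all assumptions for illustration.

```python
from bisect import bisect_left


def co_occurrence(times_a, times_b, window):
    """Fraction of events in times_a with an event in times_b within `window` seconds.

    Both lists must be sorted in ascending order.
    """
    if not times_a:
        return 0.0
    hits = 0
    for t in times_a:
        # First event in times_b at or after the start of the window.
        i = bisect_left(times_b, t - window)
        if i < len(times_b) and times_b[i] <= t + window:
            hits += 1
    return hits / len(times_a)


# Hypothetical detections, in seconds since the start of recording.
cat_shape_seen = [1.0, 5.0, 9.0, 20.0]
purr_heard = [1.2, 5.1, 14.0]
sync = co_occurrence(cat_shape_seen, purr_heard, window=0.5)
```

A score that stays high when backtested over older recordings would count toward a positive consensus that the two patterns belong to the same concept.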
There’s a secret weapon
Under the hood, of course, we’re still looking at a bit of a mess. The relationships between these different pattern samples, these attribute models, are going to be all over the place. Something like the concept of ‘fur’ is going to be connected to thousands of lower order, peer level and higher order models. Some connections will be stronger than others. Connections can be unidirectional or bi-directional. How will our system ever be able to take advantage of all this information? Won’t it get bogged down in recursion, or in linear (or worse) search times? Not at all. There’s a secret weapon.
Massively parallel processes. Remember there are thousands of classifiers making predictions about the data coming in. There’s no reason why they can’t all be working on the problem at the same time.
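A tiny sketch of the fan-out idea, using Python’s standard `concurrent.futures` module: every classifier gets the same stream and they are all submitted at once. The classifiers here are hypothetical pattern counters. Note that in standard CPython, threads mostly interleave CPU-bound work rather than run it truly in parallel, so this shows the shape of the approach, not its performance.

```python
from concurrent.futures import ThreadPoolExecutor


def make_classifier(pattern):
    """Build a toy classifier that counts occurrences of `pattern` in a stream."""
    def classify(stream):
        width = len(pattern)
        return sum(
            1
            for i in range(len(stream) - width + 1)
            if stream[i:i + width] == pattern
        )
    return classify


def run_all(classifiers, stream):
    """Submit every classifier at once and collect results by name."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, stream)
                   for name, fn in classifiers.items()}
        return {name: future.result() for name, future in futures.items()}


classifiers = {
    "rise": make_classifier([1, 2]),
    "fall": make_classifier([2, 1]),
    "peak": make_classifier([2, 3]),
}
results = run_all(classifiers, [1, 2, 1, 2, 3, 1, 2])
```

Since the classifiers only read the stream and never write to it, the immutable store means they can all work at the same time without stepping on each other.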
… to be continued
Notes // additional related conversations I had with myself
This is classic AI stuff: setting up a scoring system to maximize your goal by assigning value to metrics that represent success, evaluating all possible paths, then picking the one that scores highest in the metrics you care about the most (sometimes you also want to minimize a metric, e.g. fastest travel time using the least amount of fuel). The problem with the ‘evaluate all paths’ approach is that it’s a really expensive process when you first start to do it. So computer science folks have come up with all sorts of ways to avoid it. There are many very clever algorithms that have been optimized to start from nothing and achieve greatness in what is now a pretty short time frame. So let’s use some of them, but not quite yet.
Let’s dwell for a moment on this ‘evaluate all paths’ problem. I don’t know about you, but when I first encounter something completely new I get really overwhelmed by the possible ways to understand it. Like a new language or a new device or new software. There are whole domains of study dedicated to how to get people from ignorant to competent in the shortest time possible.
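To see why ‘evaluate all paths’ gets expensive, here is a minimal brute-force sketch on a made-up road map. Every edge carries a (minutes, fuel) pair, every simple path is enumerated, and the weights decide which metric matters. Lower cost is better here, so picking the minimum plays the role of picking the highest score. The graph and all names are invented for the example.

```python
def all_paths(graph, start, goal, path=None):
    """Yield every path from start to goal that never revisits a node."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for nxt in graph.get(start, {}):
        if nxt not in path:
            yield from all_paths(graph, nxt, goal, path)


def cost(graph, path, time_weight, fuel_weight):
    """Weighted sum of minutes and fuel along a path."""
    total = 0.0
    for a, b in zip(path, path[1:]):
        minutes, fuel = graph[a][b]
        total += time_weight * minutes + fuel_weight * fuel
    return total


# Hypothetical map: node -> {neighbor: (minutes, fuel)}
graph = {
    "home": {"highway": (10, 8), "backroad": (25, 3)},
    "highway": {"work": (10, 8)},
    "backroad": {"work": (20, 2)},
}
fastest = min(all_paths(graph, "home", "work"),
              key=lambda p: cost(graph, p, 1.0, 0.0))
most_frugal = min(all_paths(graph, "home", "work"),
                  key=lambda p: cost(graph, p, 0.0, 1.0))
```

With two routes this is instant, but the number of simple paths can grow explosively as nodes and edges are added, which is the expense being described.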
One of the approaches has been to observe how children learn something for the first time. Another approach that has gotten a lot of attention lately is the ‘fail fast’ approach, which is to say it’s trial and error while trying to minimize any psychological impacts of error. Just keep trying stuff and you’ll learn it eventually. These two approaches seem similar.
Right, so this sounds like one of those computer science, machine learning techniques doesn’t it? Feed forward and back propagation in a “neural net”; or as an algorithm, just “backpropagation”, which essentially is a process of evaluating sample data using a function of some sort and using the difference between that result and your desired result to update the evaluation function so it gets closer to your desired goal. Trial and error.
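The trial-and-error loop can be shown with the smallest possible case: a single linear unit trained by gradient descent on squared error. This is only the one-layer special case, not backpropagation through a deep network, and the learning rate, epoch count and sample data are arbitrary choices for the sketch.

```python
def train(samples, learning_rate=0.05, epochs=500):
    """Fit y = w * x + b by nudging w and b against the prediction error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = (w * x + b) - target    # result minus desired result
            w -= learning_rate * error * x  # gradient of 0.5 * error**2 w.r.t. w
            b -= learning_rate * error      # gradient of 0.5 * error**2 w.r.t. b
    return w, b


# Samples drawn from the line y = 2x + 1.
w, b = train([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
```

After enough passes `w` and `b` settle close to 2 and 1: the function was never told the answer, it just kept shrinking the difference between what it produced and what was wanted.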
Neural nets, deep neural nets, deep deep neural nets, GANs (generative adversarial networks), et al. They all start from scratch for every problem. You set them up with some neurons that are tuned to evaluate some discrete data set and then you train them, or train them and then have them train each other (in the case of GANs). They require hundreds, thousands, sometimes hundreds of thousands or even millions of data samples to become accurate at making predictions. They are the ultimate trial and error statistical champions.
What can we do differently? We’ve looked at the ‘evaluate all paths’ approach and the ‘trial and error but really fast’ approach. Neither of these approaches ever gets better on its own though. They each can learn how to optimize themselves but rely on external input to do anything new.
(1) DTW (dynamic time warping): https://link.springer.com/article/10.1007/s00778-012-0289-3