Artificial Neural Networks


[Image: Go board]

    An amazing thing: Popular culture is fond of hyperbole: this or that thing is ‘amazing’, ‘brilliant’ (in the UK), or ‘incredible’, when in truth the thing is often neither remarkable nor surprising. However, it is not an exaggeration to say that the AlphaGo story is amazing and even ‘awesome’. In March 2016 AlphaGo defeated South Korean Go professional Lee Sedol, one of the strongest human players of the game. That part of the story is told in a 90-minute documentary, AlphaGo - The Movie. But the story does not end there! The team that developed AlphaGo continued to explore ways in which the methodology could be strengthened, and in doing so produced a new version appropriately named AlphaGo Zero. The new version had zero reliance on human knowledge of the game. Through self-teaching alone, AlphaGo Zero discovered strategies, concepts, and plays that had not been seen before in the deep history of Go. In a test match it defeated the original AlphaGo program 100 games to 0.[1]

    Amazing things spring from all fields of science, not just computer science. Every week we read of new discoveries in biology or chemistry or materials science, or some other scientific field. Often the details of these ‘breakthroughs’ are accessible only to specialists in the field of discovery. Nevertheless, from the layperson’s perspective, it seems possible at least to appreciate them in the same way one appreciates or enjoys art or music. Indeed, the effort to absorb this flow of scientific and technological discovery could conceivably enhance one’s enjoyment of life. This page is about my attempt to appreciate more fully one of the ‘amazing’ developments that contributed to the AlphaGo story.

[Images: Tufted Titmouse; Edward Thorndike]

     Learning in birds and humans: About a century ago psychologists and neurophysiologists began to speculate about how humans and animals learn.[2] At first the focus was on external features of the process: the thing to be learned and the response of the learner. As so-called ‘stimulus-response’ theories matured, they gave rise to further speculation as to possible underlying mechanisms.

    Did you know that the tufted titmouse can remember thousands of places where it has hidden seeds? I only recently learned this interesting fact (assuming it is a fact). However, psychology established long ago that even pigeons are capable of learning, and worms too, provided they haven’t been eaten by pigeons.

    Clearly human learning takes place in the brain. Stubbing a toe does not appreciably affect learning, whereas a moderate bump to the head puts at least a temporary stop to it. The human brain is packed full of nerve cells, or neurons. There is other stuff in there too, but mostly neurons and their insulation.[3] In order for learning to occur, some sort of change must happen that involves neurons. Eventually theorists proposed that for learning to take place the connections between neurons must evolve in some fashion. The details came along later, and they are still coming.

     Theory also holds that human learning takes place in stages. Some new fact or concept may lodge in memory briefly and then be lost, unless it is somehow consolidated. Subjectively it seems that a newly acquired fact is learned instantly, but for most people—certainly for myself—the phrase ‘use it or lose it’ applies. At some point in the future, for example, we may recall that certain bird species can remember a great many locations for stored food, but then struggle to dredge up the name of an example species. Artificial neural networks do not suffer this particular concern. Instead they may be designed to reproduce those aspects of real neural networks that appear most relevant to learning, while ignoring or deemphasizing features that are less obviously useful.
[Image: Neuron]

     Neurons: The pretty picture on the right is from Wikipedia (reproduced under the terms of a Creative Commons license). Real neurons can have many inputs (dendrites), up to thousands, and one output (the axon). Neurotransmitter substances convey signals across the junctions between neurons (synapses). Familiar neurotransmitters include acetylcholine, dopamine, and serotonin.

    Neuronal inputs may be differentially sensitive to different neurochemical substances and in different amounts. However, once the total amount of input stimulation exceeds a neuron’s firing threshold, an ‘action potential’ is generated. This electro-chemical event propagates along the axon from the cell body toward the neuron’s output.

    An elementary fact about neurons is that they either fire or they don’t: the ‘action potential’ is an all-or-nothing phenomenon. While artificial neurons are not encumbered by this constraint, some neural network architectures intentionally adopt it. Although variability in the amplitude of the action potential is not considered functionally significant, neurons do fire at different rates. After a neuron fires it is refractory for a brief interval, whereupon it can fire again. Thus, in theory, the rate at which a neuron fires can encode the strength of its signal.

    The more rapidly a neuron fires, the greater its influence on the neurons upon which it impinges, and indirectly upon their successors. Neurons may be excitatory or inhibitory in their effect on the cells that they connect to. If connections between neurons can be strengthened (or weakened) then it is a small step to imagine that neuronal-level changes can affect the totality of an interconnected network in such a way as to promote learning.


    Another ‘fact’ recalled from long-ago school classes is that myelinated neurons conduct impulses much faster than non-myelinated ones. Myelinated neurons are fast (conduction on the order of 100 meters per second, give or take), yet signal propagation in the brain is far slower than in even the slowest computer, let alone the arrays of graphics processors that power some machine learning applications. There is a great deal more to the neuron story, as can be inferred from the Wikipedia image, but these are the basic attributes that carry over to the artificial neural networks context.

[Image: Artificial neuron]

    Artificial neurons,[4] the participants in artificial neural networks, simulate selected features of real neurons. In particular they may have from one to many inputs, but only one output. Each input has an associated sensitivity (a ‘weight’ parameter), and a constant term called the ‘bias’ is added to the weighted sum of the inputs before the output is computed.

    The artificial neuron’s activation function is the analog of ‘all-or-nothing’ firing in real neurons. In some artificial neural networks activation is a ‘step function’ that outputs ‘1’ when the input exceeds a specified value, and ‘0’ when the threshold is not met. This type of activation produces a binary output, just as the action potential either occurs or doesn’t. However, artificial neural networks may also implement other types of activation, thus facilitating a variety of experimental learning strategies.
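
    As a concrete illustration, here is a minimal Python sketch of a single artificial neuron with a step activation (the names, weights, and bias values are mine, chosen for illustration only):

```python
def step(x, threshold=0.0):
    """All-or-nothing activation: 1 when input exceeds the threshold, else 0."""
    return 1 if x > threshold else 0

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs, plus bias, passed through the activation."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return step(total)

# Three inputs with arbitrary illustrative weights and bias.
print(neuron([1.0, 0.5, -0.2], [0.4, -0.6, 0.9], bias=0.1))  # prints 1
```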

    The preceding paragraphs describe essential features of real and artificial neurons. Real neurons interconnect to form networks that are capable of learning. The rest of this story is about how networks of artificial neurons can be programmed to learn.

[Image: Transform rectangular to normal]

     Initializing a neural network: The first step in neural network programming is to construct a network, or specifically to define the architecture of the network to be trained. The working part of the network (between inputs and outputs) consists of interconnected layers of neurons. See, for example, the ‘dense’ neural network depicted in the training cycle diagram below. That network (or the segment reproduced) consists of 3 layers of 8 neurons per layer. Straight lines represent connections. Biases are not shown, but each neuron (circle) should be assumed to have an associated bias and an output activation function, also not shown.
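
    In code, defining such an architecture amounts to choosing layer sizes and allocating one weight matrix and one bias vector per layer. A minimal NumPy sketch (the input width of 3 is an assumption, since the diagram shows only a segment of the network):

```python
import numpy as np

# Illustrative: 3 inputs, then 3 dense layers of 8 neurons, then 1 output.
layer_sizes = [3, 8, 8, 8, 1]

rng = np.random.default_rng()
weights = [rng.standard_normal((n_in, n_out))            # Gaussian weights
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]  # biases start at 0
```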

    Prior to training, neuron weights are generally initialized to random Gaussian values. Biases can be similarly initialized, or they can be set to 0. In most computer languages the built-in random function produces a rectangular (uniform) distribution, although mathematical programming languages generally also have the capability to produce normally distributed random numbers. If the implementation’s computer language lacks a native Gaussian random capability, the deficit is easily remedied using a method called the Box-Muller transform. Alternatively, a Gaussian distribution can be computed externally, for example using FreeMat, and the values imported to the implementation environment as needed.
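
    For instance, a uniform-only language can generate Gaussian values along these lines (a sketch; Python itself already provides random.gauss):

```python
import math
import random

def box_muller():
    """One standard-normal sample built from two uniform samples."""
    u1 = random.random()
    while u1 == 0.0:          # log(0) is undefined; resample
        u1 = random.random()
    u2 = random.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

samples = [box_muller() for _ in range(5)]
```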


[Image: Training cycle diagram]

     Training a neural network: The role of feedback in human learning takes many forms: expression of approval or disapproval from mother to infant, high or low marks on a school exam, the proverbial burned finger on touching a hot stove, the hangover that follows a night on the binge, and so on. Some kinds of feedback may be awkward to quantify, but they nevertheless produce identifiable gains in knowledge or skills. For machine learning, it is essentially the same story.

    School students, and those who remember their school days, are familiar with true/false tests. Suppose that a geography teacher gives her class a true/false test. “Quito is the capital of Ecuador. __(T) __(F)”  The student checks (T). That is worth plus 1 point, or at least no deduction, depending on how the teacher scores the test. “Quebec is the capital of Canada. __(T) __(F)” The student checks (T) again, and this time the teacher counts minus 1 point in scoring the test. For each question there is an Expected answer and an actual answer, the student’s Output. The score for each question reflects whether the Output conforms to the Expected answer or not. Such a score is easily quantified and accumulated on a question-by-question basis.

    True/false tests force the student to make a binary decision, which in some cases is an educated guess. “Qatar borders the Mediterranean Sea. __(T) __(F)”  The student ponders... Qatar is somewhere over there. “I don’t think that it is on the Mediterranean, but then I could be wrong.” What if, instead of checking true or false, the student could indicate a degree of confidence, a percent-sure number, with 100% corresponding to definitely true and 0% meaning certainly false? To hedge his bet the student marks 20% as the answer. How wrong is his answer? A 20% confidence that the statement is true corresponds to 80% confidence that it is false, and since the correct answer is False, the student should be awarded +0.8 on that question, or penalized 0.2, depending on the scoring convention.

[Image: Expression for Error]
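
    The error expression appeared as a figure on the original page, which I cannot reproduce exactly here. For a single output, a common choice, and one consistent with the discussion that follows, is the squared difference between the expected and actual outputs:

```latex
E = (\text{Expected} - \text{Output})^2
```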

    Neural network outputs are not necessarily like true/false answers or degree-of-confidence responses. Indeed, neural networks often have multiple outputs, in some cases large arrays of outputs. However, the examples I have personally studied have been mainly of the single-output type, where the output represented a binary classification such as True/False or Win/Loss. Typically the output is a number between 0 and 1—not a confidence level as such, but a value that evolves toward one of the limiting values 0 and 1 as ‘training’ progresses.

    Neural network training consists of running many learning trials, each a 3-part cycle. In part 1, ‘forward propagation’, the network consumes one or more inputs and produces an output, somewhat analogous to the geography student’s percent-sure guess. In part 2, ‘back propagation’, the teacher scores the network’s output, assigning an Error (also called the loss function) that reflects the degree to which the network missed the mark on that particular training trial (formula above). Back propagation also does something quasi-magical: it apportions responsibility for the error among the neurons that make up the network, not equally, but in relation to their individual contributions to the error. Finally, part 3 of each training cycle updates the network, revising neuron weights and biases using the computations of part 2 together with an overall parameter appropriately named the learning rate. Of the three parts, two are easy to describe and understand, while one, back propagation, is more challenging (or was to me).

[Image: Forward propagation at a single neuron]

     Forward propagation: In a perversely reductionist sense, the simplest possible network (the ‘null’ network) would consist of input and output only, with no middle or ‘hidden’ layers. Forward propagation would perform a simple linear transformation of the input, followed by activation. Output activation might do nothing (identity function), might convert the transformed input to ‘0’ or ‘1’ (a step function), or could activate the result in other ways. Realistically, though, any sort of functional neural network would need to include at least one hidden layer between input and output. As it happens, forward propagation works the same way across the entire network: within a layer, each neuron’s inputs are multiplied by the corresponding weights and the products are summed; after adding the bias, the result is passed through the activation function; the activated outputs of each layer then serve as inputs to the following layer, just as in the null network example.
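
    A sketch of forward propagation in NumPy, reusing the weights and biases lists from the initialization sketch above and assuming sigmoid activation at every layer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    """Propagate input x through every layer; return each layer's activation."""
    activations = [np.asarray(x, dtype=float)]
    for W, b in zip(weights, biases):
        # weighted sum of inputs, plus bias, then activation
        activations.append(sigmoid(activations[-1] @ W + b))
    return activations  # the last entry is the network's output
```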

[Image: Sigmoid activation function]

     Activation: To reiterate, activation is the analog of a biological neuron’s all-or-nothing attribute, and artificial neural network models implement activation in a variety of ways. The sigmoid function is like an asymptotic step function: it goes to 0 or 1 in the limit, but is never exactly 0 or 1. Moreover, its derivative, which is needed in the back propagation step (as will be explained), can be computed cheaply once the function itself has been computed: σ'(x) = σ(x)(1 – σ(x)). [See this page, for example.] Another common choice of activation function is ReLU, which stands for ‘rectified linear unit’. The name is more abstruse than the function itself, which simply maps an input to its unmodified raw value if it is positive, and otherwise to 0. In other words, ReLU(x) = x when x > 0, and 0 when x ≤ 0. This computationally efficient activation function is seen in many instructional exercises, as well as being widely used in production applications.
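
    Both activations, and the sigmoid derivative in the form just given, are one-liners in Python:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)       # cheap once sigmoid(x) is already known

def relu(x):
    return np.maximum(0.0, x)  # x when x > 0, otherwise 0
```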
    
     Back Propagation: Back propagation is based on the ‘rate of change’ concept. Imagine traveling along a road that has hills and valleys. The steeper the hill the greater will be the rate of change in elevation. Going uphill the change is positive (increasing elevation), and downhill it is negative (decreasing elevation). The change in elevation is zero at the top or bottom of a hill or when the road is flat.

    Switching from hilly roads to neural networks: with the latter it is possible to compute the rate of change of the loss function (its slope) with respect to individual neuronal weights and biases throughout the network. To do this you start at the output end of the network and work backwards using a tool from mathematics called the chain rule. It is an iterative process, in which the rate of change with respect to the raw output feeds back to the penultimate network layer, and from there to the next preceding layer, and so forth, all the way to the beginning of the network.

    The rate-of-change computation depends on the type of activation function. As explained above, the sigmoid activation function scrunches the entire number line from –∞ to +∞ into the interval (0, 1). It is not hard to guess how this particular activation function would be useful in taming a neuron’s, or a network’s, raw output. In the graphs of the sigmoid function accompanying the activation paragraph above, it is clear that the rate of change is greatest at the middle (x = 0) and least at the asymptotes. Also, recall that prior to training the neural network is initialized with random Gaussian weights (mean = 0). As the network learns to classify inputs, the corresponding activated outputs will tend toward either ‘0’ or ‘1’ (the expected outputs) and the absolute error will decrease.
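
    For the ‘null’ network case, a single sigmoid neuron scored with squared error, the chain rule works out to just a few lines. This is a minimal Python sketch of the idea; a multi-layer version repeats the same step layer by layer, feeding each layer’s delta backwards:

```python
import math

def single_neuron_gradients(inputs, weights, bias, expected):
    """Chain rule for one sigmoid neuron with squared error.

    E = (expected - out)^2,  out = sigmoid(z),  z = sum(w_i * x_i) + b
    dE/dw_i = (dE/dout) * (dout/dz) * (dz/dw_i)
    """
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    out = 1.0 / (1.0 + math.exp(-z))
    dE_dout = 2.0 * (out - expected)      # derivative of the squared error
    dout_dz = out * (1.0 - out)           # sigmoid derivative
    delta = dE_dout * dout_dz
    grad_w = [delta * x for x in inputs]  # since dz/dw_i = x_i
    grad_b = delta                        # since dz/db = 1
    return grad_w, grad_b
```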

[Image: Update a weight]

     Applying updates: Updating weights and biases is the third and final part of each training cycle. As described above, the back propagation process stores a gradient (a partial rate of change of the loss) for each weight and bias in the network. When back propagation is complete, the ‘update’ component of each learning cycle applies these stored values to revise the network’s weights and biases. The figure above illustrates the process for one selected weight in a network that consists of three hidden layers of three neurons per layer. Specifically, the illustrated calculation updates the third weight of the first neuron in layer 2. This weight applies to the activated output of the third neuron in the preceding layer. Updating of a bias is illustrated similarly below.

[Image: Update a bias]
 
    The learning rate is a constant that controls the fineness or coarseness of updating, and hence the overall rate at which outputs change over successive learning cycles. Its value is typically a small number between 0 and 1, ascertained through trial and error, as are other network training parameters. Bias is a property of neurons, not of inputs; hence each neuron in the network has a single associated bias. Only the gradient and the learning rate participate in updating a bias.
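
    The update itself is then one subtraction per parameter: new value = old value minus learning rate times gradient. A sketch continuing the single-neuron example above:

```python
def apply_updates(weights, bias, grad_w, grad_b, learning_rate=0.1):
    """Step each weight and the bias against its stored gradient."""
    new_weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    new_bias = bias - learning_rate * grad_b
    return new_weights, new_bias
```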

     (Human) Learning Styles: Just as there are different models of machine learning, so also do people learn in different ways. ‘Learning by doing’ is one way to approach the challenge of understanding a new concept. The method does not work for everyone, or for every concept. Some ideas are too abstract or too broad in scope to be tackled in this way. It would not be practical to reproduce at home a scientific finding that was obtained using a million-dollar instrument. Luckily, simple neural networks do not require such costly technology. The following paragraphs summarize my self-study to the time of this writing. The quest continues!

    About four years ago (August 2017) I installed a TensorFlow-based image classifier model on a Raspberry Pi. This was my first direct exposure to a machine learning application. Of course, installing and exercising this application did not require understanding how it was designed or ‘trained’, or how it was able to do what it did.

[Image: Bicycle on dock at Steamboat Creek]

    The Python image classifier labeled about 64,000 images on my computer. Most were photos with either serial-number or date-based file names, in other words uninformative names. The purpose was to facilitate identifying and then locating photos or other images whose file locations had been indexed in a database, and for which I had previously created a browser-based user interface.[5] The labels associated with the selected image above were surprisingly appropriate, and in general the labels generated by classify_image.py were better than having no meaningful labels for the photos. However, a great many were bizarrely wrong.

    Around the same time as I was experimenting with the TensorFlow model for labeling images, my wife installed and examined the Intel Neural Compute Stick 2 demo on an Atomic Pi. Much later I attempted to implement the same or a similar demo under Windows 10, but did not succeed. These explorations were our first introduction to machine learning applications. Although they were useful in the demonstration sense, neither of us learned much from them—other than to persevere in the struggle to make them work!

    Many neural network instructional examples are coded in the Python programming language. That may be due to Python’s rich numerical and matrix assets (the NumPy library, for example), or possibly it is because Python ranks first in ‘popularity’ among programming languages, according to https://pypl.github.io/PYPL.html. My personal favorite language is not on the list! Of course, MUMPS is singularly unsuited to neural network programming. On the other hand it should be possible to implement a ‘Hello, world!’ example in any language.

    Image classification is a favored example for demonstration exercises. More generally, classification problems of various types are found in introductory articles about neural networks. This makes sense, in light of the hope or claim that artificial neural networks should learn to generalize beyond their training data to additional examples that ought to be similarly classified. However, from the point of view of the student who seeks a basic understanding of how neural networks work, classification problems that rely on complex inputs seem daunting. One wishes for a simpler starting point.

    NIM: One thought that occurred to me was the game of NIM. This is a paper-and-pencil game[6] (or it can be played with objects) of about the same complexity as Tic-Tac-Toe. I confess that at first my thoughts erred on the grandiose side: thinking by analogy to AlphaGo, I imagined a program that would learn NIM by playing against itself. After reading many articles and their accompanying illustrative exercises, I revised the goal. Instead of learning to play the game, my program would learn to classify game configurations as either winning or losing for the player on move. Inputs would be simple 3-valued vectors and there would be only one output. In 3, 5, 7 NIM the pile sizes range over 0–3, 0–5, and 0–7, giving 4 ∙ 6 ∙ 8 = 192 configurations; excluding the start (3, 5, 7) and the end (0, 0, 0) leaves 190 game states or patterns.
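
    Generating and labeling those 190 patterns takes only a few lines in any language; a Python sketch, using the XOR rule from endnote 6:

```python
# Label every 3, 5, 7 NIM position for the player on move.
# A position is a loss exactly when n1 XOR n2 XOR n3 == 0 (endnote 6).
patterns = []
for n1 in range(4):           # pile 1: 0..3
    for n2 in range(6):       # pile 2: 0..5
        for n3 in range(8):   # pile 3: 0..7
            if (n1, n2, n3) in ((3, 5, 7), (0, 0, 0)):
                continue      # exclude the start and end states
            win = 1 if (n1 ^ n2 ^ n3) != 0 else 0
            patterns.append(((n1, n2, n3), win))

print(len(patterns))          # 190
```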

[Image: Artificial neural network for NIM]

    The artificial neural network that learned the 190 win/loss patterns in 3, 5, 7 NIM consisted of 3 inputs and 3 ‘hidden’ layers of 16 neurons per layer, plus one output (win or loss). The network architecture shown above[4] is not the first that I tried with NIM, but it was the first to produce 100% ‘learning’ of all game positions, and in a relatively short time frame (roughly 20 million training trials). The programming language was MUMPS.

[Image: NIM percent-correct training report]

    At the start of training, input weights were initialized to random Gaussian values with mean 0 and standard deviation 1. Biases (not shown in the diagram) were initialized to 0. Training consisted of repeating many training cycles. In each cycle a randomly selected game configuration was presented as input. The network ‘guessed’ whether the pattern was a win or a loss for the player on move—the forward propagation part. After each such guess, feedback was propagated from the loss function backwards through the network. After back propagation was complete, neuron weights and biases were updated, and the cycle repeated with another randomly selected input. In the training report (above), the third parameter ‘0’ refers to an input transform that was not used in this specific exercise. The last parameter ‘.01’ is the criterion for correct classification: to be counted as a ‘1’ (win) the sigmoid-activated output must be > 0.99, and to count as classifying a loss the activated output must be < 0.01. This value is not a probability, but can be thought of in a similar way.
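
    Tying the three parts together, here is a self-contained NumPy sketch of the cycle just described. This is not my MUMPS program; the sizes match the NIM network above, but the learning rate and trial count are illustrative, and `patterns` is the labeled list from the earlier NIM sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# 3 inputs, 3 hidden layers of 16 neurons, 1 output; sigmoid throughout.
sizes = [3, 16, 16, 16, 1]
Ws = [rng.standard_normal((a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.01
for cycle in range(200_000):                # far fewer than the ~20M reported
    state, expected = patterns[rng.integers(len(patterns))]
    x = np.array(state, dtype=float)

    # Part 1: forward propagation, remembering each layer's activation.
    acts = [x]
    for W, b in zip(Ws, bs):
        acts.append(sigmoid(acts[-1] @ W + b))

    # Part 2: back propagation of the squared-error loss.
    delta = 2.0 * (acts[-1] - expected) * acts[-1] * (1.0 - acts[-1])
    grads = []
    for k in reversed(range(len(Ws))):
        grads.append((np.outer(acts[k], delta), delta))
        delta = (Ws[k] @ delta) * acts[k] * (1.0 - acts[k])

    # Part 3: apply the updates.
    for k, (gW, gb) in zip(reversed(range(len(Ws))), grads):
        Ws[k] -= lr * gW
        bs[k] -= lr * gb
```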

[Image: Small neural network training example]

    The annotated listing above shows how a different network (5 layers of 5 neurons each) learned to classify a selected subset of training vectors, reflecting a balance of winning and losing configurations, but otherwise randomly selected. In this example, input vectors were transformed. However, I concluded later that transforming inputs had no effect, except possibly to reduce arithmetic precision.

    Study tools: Surely MUMPS would rank near last among programming languages one would choose for learning about artificial neural networks. However, the trouble with Python is that it is too easy to coast through a plethora of published examples and be persuaded of understanding while not exerting sufficient original effort. My personal study did draw on resources other than MUMPS, including many published Python examples.[7] Without them I would have been lost.

    Once each part of the training cycle had been coded and tested, I thought of making the MUMPS code self-documenting. To that end, copies of the forward, back, and update functions were modified to generate LaTeX documentation in place of calculating values. The trouble with this idea was that, except for the simplest test networks, the output was too voluminous to be useful. For what it’s worth, the text of the illustrations accompanying the ‘Applying updates’ section above was generated in this way.

    In general it is not possible to reverse engineer a neural network in order to gain insight into how it does its magic. Numeric values assigned to weights and biases by the training process cannot be interpreted in terms of an external conceptual model. The fact that model-level meaning is completely obscured within the network is one of the features that distinguish this type of machine learning from various other attempts to emulate human thinking and decision making algorithmically.

    In the past the term ‘artificial intelligence’ (AI) was associated with ideas that were not all that intelligent. However, over time the term’s meaning has evolved to encompass a range of specialized research endeavors that individually go by more specifically descriptive names. The study of artificial neural networks is one of many currently active AI research areas. These paragraphs have explored only the simplest of artificial neural networks. Advanced models currently surpass human capability for learning in specialized areas, and could one day overtake us in learning about the world in general. If they do, humans will be compelled to adapt to a new order, perhaps the ultimate challenge of research in this area.



Endnotes

1. Go board image (public domain) from Wikipedia.

2. Edward Thorndike image (public domain) from Wikipedia.

3. Some sources suggest that glial cells outnumber neurons in the brain, but the evidence for this claim is unclear. Suffice it to say that neurons are the most interesting of the brain’s contents; they are what defines a brain in the common functional sense.

4. The artificial neuron diagram along with other similar illustrative diagrams on this page was produced by this wonderfully flexible neural network diagram tool, with annotations added in SnagIt.

5. In another page I described the enterprise web development tool called EWD.js that was used for this project.

6. The goal in the normal form of NIM is to pick up the last remaining object (or cross off the last mark); misère play is the opposite. At the start there are 3 piles of objects (other starting configurations are possible). The most common game starts with 3, 5, and 7 objects in the three piles. Each play consists of picking up one or more objects from one pile only. You cannot pick some from one pile and some from another—that would not be a game! With perfect play, the first player wins. Moreover, it is easy to classify each possible configuration as either a win or a loss for the player on move: if n1, n2, and n3 represent the number of objects in each pile, then the game state is a loss for the player having the move if and only if n1 ⊕ n2 ⊕ n3 = 0, where ⊕ stands for the bitwise XOR operator. For example, 3 ⊕ 5 ⊕ 7 = 1 ≠ 0, so the opening position is a win for the player on move. Since the order of piles has no bearing on the game (0, n, n is the same as n, 0, n, etc.), human players have only to remember a few key patterns in order to play perfectly. Try playing the JavaScript 3, 5, 7 NIM game below.


[Interactive JavaScript 3, 5, 7 NIM game]

How to play: In the vertical bar to the right of the group you wish to reduce, indicate the number of stones to leave (not the number to remove). Then click ‘Enter Play’. After each of your plays, click ‘Computer’ to display the computer’s reply. At the beginning of the game, you may click ‘Computer’ to allow the computer to play first.




7. Among the many resources that facilitated this study, the following stand out: Neural Networks from Scratch in Python, a book by Harrison Kinsley and Daniel Kukieła; a two-part ‘how to’ article by Steven Miller (part 1 and part 2); and another excellent ‘how to’ article, with rich coding examples, by Jason Brownlee.










    Project descriptions on this page are intended for entertainment only. The author makes no claim as to the accuracy or completeness of the information presented. In no event will the author be liable for any damages, lost effort, inability to carry out a similar project, or to reproduce a claimed result, or anything else relating to a decision to use the information on this page.