Using Artificial Intelligence to Evaluate Handwritten Mathematical Expressions

 

A few months ago I discovered a smart phone app, called PhotoMath, which uses your phone's camera to evaluate typed mathematical expressions (such as those found on a homework assignment). Although, this app is undoubtedly very awesome, I was surprised to see that it doesn't support handwritten expression evaluation. With enough users, it seems possible for such an app to harvest "training data" from each user and learn to read a large variety of handwritten mathematical operators/expressions via artificial intelligence algorithms.

To demonstrate this, I've built an algorithm which contains an artificial neural network (ANN) that has learned to read my handwriting from example images.

I provided the algorithm ten images containing hand written digits (0-9), a variety of letters (x, y, and m), and a variety of operators (so far, +, -, /, %, *, and =). Each character was written in my handwriting, on the chalkboard pictured above. I then provided the algorithm with some "segmentation rules" -- that is, rules on how to locate/isolate individual characters for analysis -- and an "answer sheet" which mapped each written character to it's correct value.

Through an iterative learning process called backpropagation, the ANN "studied" each example character, incrementally strengthening the appropriate neural pathways necessary for recognition. Given this small dataset, it took the ANN about 120 seconds to learn all of the provided characters. 

Below is the first test footage taken after a couple weekends of development, in which you can see the algorithm has learned to solve basic arithmetic.

Test 1 - Basic Arithmetic 

After proving the concept on simple problems, the next step was to generalized the algorithm to comprehend compound fractions. This required some tricky recursive processes that rely on evaluating the widths of division bars to infer spatial boundaries. It sounds like an obvious process (our brains do it effortlessly), but it was more challenging than I anticipated :).

Test 2 - Compound Fractions

So far so good. From here, the algorithm must be equipped with a symbolic solver so that it can interpret algebraic equations from alphanumeric expressions. Easy peasy! 

Test 3 - Algebraic Expressions

And there you have it -- it not only works, it outperforms its creator (granted, that doesn't say much)! I'm quite happy with the performance given such a small set of "training" data. I think I'll stop here, for now. Although, it is tempting to provide rules to solve systems of equations. If I take this any further, I'll be sure to post an update. 

 

How does it work?

The study of human visual perception has quite a bit of overlap with topics of computer vision. At a high level, both our brains and the above algorithm perform a four step process to read/interpret visual information. Data collection -> preprocessing -> segmentation -> recognition.

In the above application, data collection and preprocessing are accomplished via a webcam and openCV. A connection is made with a webcam from which frames are collected and sent to preprocessing functions written in python. These functions ultimately add gaussian blur to the image before applying binary thresholding to make the characters more distinct from their background. The threshold employed should ideally be adaptive to match lighting conditions. Here's an example: 

In the example above, the character is not only preprocessed, but also segmented from its frame -- that is, it's been cropped out. This cropping process is not trivial! To automatically detect/crop each character from the frame, a method of topological analysis is applied to first detect contours [1]. Approximations of polynomial representations of each contour are then generated [2] and minimum sized bounding rectangles enclosing each contour are estimated. I know, I know, I just glossed over a lot of details in the last two sentences. However, the references should get you started in the right direction if you're following along at home. And, like in many topics of computer vision, there are more than one way to verb a noun. So, feel free to experiment ;-). 

After representing each characters by a bounding rectangle, characters with unrealistic dimensions (too big relative to others, too small, overlapping, etc.) can be discarded as noise, so to speak. Each remaining character is then cropped from the original frame and resized (typically shrunk) to a predetermined dimension so that the ANN doesn't have to deal with huge or varying-sized inputs.

Before feeding each character to the ANN's recognition logic, some local rules are applied to comprehend compound fractions!

These local rules take into account the aspect ratio of bounding rectangles to identify fraction bars; it uses widths of fraction bars to infer spatial boundaries of numerators and denominators. After fraction bars have been detected, the frame is translated to a one-line expression, which is fed into the artificial neural network's recognition topology. At this point, the algorithm has no clue how to interpret each character. It has only inferred their relationship to one another in a single-line expression. For example, the above frame would be represented as follows:

(C/C) / ( (C/C) C (C/C) ) [eq 1]

Where C is a yet-to-be-recognized character.

Building the brain.

To label each cropped character C, the algorithm flattens the cropped images into vectors of length 5625 (75*75) and feeds them into a 3-layer artificial neural network with the following topology:

  1. Input layer: 5626 neurons -- each pixel of a given cropped character is given its own input neuron. 75x75 pixel characters are therefore flattened and routed to 5625 distinct neurons. There is also one bias neuron (not shown in the diagram below) within the input layer.
  2. Hidden layer: 50 neurons
  3. Output layer: 18 neurons -- 1 for each character that the algorithm has learned (0 1 2 3 4 5 6 7 8 9 + - * / x y m =). 

What does this topology mean, exactly? Each circle represents a mathematical function whose properties are motivated by the biological behavior of a neuron. In rough analogy, ANN's are built out of a densely interconnected set of simple (neural) units, where each unit takes a number of real-valued inputs (possibly the outputs of other units) and produces a single real-valued output (which may become the input to many other units) [3]. Every input to a given neuron has an associated "gain factor" (typically called a weight) which determines the contribution of that input to the neuron's overall output. 

The individual weights in the artificial neural network are configured during the training process. That is, the network must learn the weights from example data. Every pixel from a given cropped image is fed into the inputs in parallel which then propagate through the network ultimately generating a real-valued result for each output neuron. The output neuron which "fires" the strongest (i.e., the largest value)  corresponds to the ANN's "guess" as to which character it is "looking at." Recall, there are 18 output neurons, and 18 possible characters. 

 

Although the analogy isn't perfect, it's interesting to note here that the above artificial neural network contains about 5800 neurons whereas the human brain is estimated to contain a densely interconnected network of roughly 10^11 neurons; each neuron is, on average, connected to 10^4 other neurons. As you can imagine, this allows for highly parallelized biological processing. And in fact, it must allow for high parallelism -- because each biological neuron requires roughly a millisecond or more to "fire." Compared to current digital processing speeds (which are nowhere near as parallelized), this is really damn slow. However, the high parallelism allows healthy biological intelligence systems to do fairly complex things (like recognize a familiar face) in relatively short order (~100 ms is typical for facial recognition, assuming you aren't prosopagnosic.) These same pathways also occasionally result in our seeing faces that aren't there (pareidolia). I digress.

After each character has gone through recognition, the Cs in eq 1 are replaced with the appropriate alphanumeric values, and the expression is fed into a symbolic mathematical solver to be reduced/evaluated. If you're not already familiar with neural networks, you may be stuck on a few high-level implementation questions. Let me take pause here to answer questions that I typically get at this point in the explanation:

1. Okay, I roughly understand the diagram, but what exactly is the neural network at the implementation level? Is it a piece of software? A separate tool? Short answer: it's a few lines of code. In the diagram above, each circle represents an "activation" function (in my case, a sigmoid function, also known as a "squashing function", although this can be implementation specific). This model of a neuron is often called the McCulloch-Pits model, and is well represented by the image below:

 

If each synaptic weight is known (that is, the strength coefficients of each neural connection have been determined through training), then forward propagation through the entire network can be modeled quite simply with a few lines of linear algebra. This is what the code looks like:

import numpy as np

def sigmoid(arr):
    '''
    Compute sigmoid weighing function.
    '''
    return 1.0 / (1 + np.exp(-arr))


def add_ones_col(arr):
    newarr = np.ones((arr.shape[0], arr.shape[1] + 1))
    newarr[:, 1:] = arr
    return newarr


def predict(characters):
    '''
    Run each character through ANN.
    Artificial Neural Network Properties:
        1 input layer (75*75 neurons)
        1 hidden layer (50 neurons)
        1 output layer (18 possible outputs)
    '''
    lookup = ['*', '=', '/', '+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'x', 'y', 'm']
    hypothesis1 = sigmoid(np.dot(add_ones_col(characters), np.transpose(Theta1)))
    hypothesis2 = sigmoid(np.dot(add_ones_col(hypothesis1), np.transpose(Theta2)))
    p = [lookup[np.argmax(xx)] for xx in hypothesis2]
    return p

Variables Theta1 and Theta2 are matrices which contain the synaptic weights between the input/hidden layer and the hidden/output layer, respectively. These are stored into memory after training. They are representative of the ANN's memories -- all that it knows about the world. You'll likely need to do a simple example by hand to fully understand why the above code results in forward propagation through the ANN. Just know that the lines beginning with "hypothesis" are leaning heavily on linear algebra to do many computations simultaneously. Justification for the add_ones function requires further theoretical discussion and has to do with the implementation choice of using a bias neuron in the input layer. You can read about that on your own.

2. But Will, how did you choose the aforementioned neural network topology? Why is there one hidden layer? Why not 2? Or 0? And why 50 hidden neurons? Good questions, fictitious reader! There's a lot to be said about choosing a neural network topology -- a common strategy is to start simple and add complexity (layers/neurons) as needed. I probably could have gotten away with a simpler network; but I wasn't sufficiently constrained to do so. Since the run-time for training was relatively brief in my case (not a lot of training data), my design technique was somewhat heuristic. In a future update, perhaps I'll experiment with different topological in an attempt to optimize performance. 

 

Teaching the brain.

"Training the neural network" is synonymous with "deriving the weight matrices." After all, the weight matrices, mentioned in the code above, are what ultimately determine the ANN output generated from a given input. The conventional lingo used to describe this process sounds something like this: "one must employ a gradient descent algorithm to backpropagate through the neural topology, searching the hypothesis space while iteratively reducing the error between target values and network outputs." Of course, there are many implementation-specific choices one can make at this step, and I won't attempt to say which choices are necessarily "better" than others. If you try to research training methodologies you can easily get lost in these subtleties. The method I employed required creating a symbolic cost function (which associates a given set of weights with an "error" magnitude across a body of training data) then using a conjugate gradient method to minimize said cost function over the hypothesis space.

Unless you're nerdy enough to hand-code the training routine yourself (...sigh...I am), you need not worry about the mathematical nuances and instead just use a canned routine available in Matlab or python. The big picture is that you must supply cropped (preprocessed and segmented) characters to a training algorithm along with an "answer sheet" containing the correct label for each training example. The training routine should analyze the examples and return a series of weights, which you can essentially plug into the code above.

 

Conclusion

I realize this was a fairly high-level fly-by, yet still potentially theoretical enough to cause nausea. If you're still reading, I appreciate you sticking in there! If you have questions, feel free to post them in the comments section below. You can shoot me an email as well, but go ahead and try the comments section first in the chance that others may benefit from your question. 

Until next time,

Will

 

[1] Satoshi Suzuki et al. Topological structural analysis of digitized binary images by border following. Computer Vision, Graphics, and Image Processing, 30(1):32–46, 1985.

[2] Ramer Douglas Peucker Algorithm (https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93Peucker_algorithm)

[3] Machine Learning by Tom Mitchell