Simple Ray OCR

Raymond S. Puzio



Contents

Introduction
Rationale
Preparation
Moment Analysis
Clustering
Distance Analysis
Letter Statistics

Introduction

This is a project to try recognizing characters in scanned books by using statistics and clustering. The name comes from Raymond Pierre de Lacaze's "Simple Ray" clustering algorithm, which will be used to compute the clusters; it also suggests that a primary aim here is to carry out OCR in a conceptually simple fashion.

The process proposed here may be divided into four broad steps:

1. Binarize the scanned page image, separating ink from paper.
2. Collect the connected components (blobs) of ink pixels.
3. Compute moments of the blobs and sort the blobs into clusters on the basis of these moments.
4. Refine the clusters using a distance function between blobs, so that each cluster corresponds to a single character.

To see how feasible this proposal is and to gain some insight into the relevant issues, in addition to simply implementing the above steps, I will also collect and analyze statistics. Examining these should provide a basis for understanding what can reasonably be expected to work, and with what rate of success, and should point out the weak points towards which further attention should be focused.

To keep the first attempt simple, I have started with a page from an old children's book as my subject. While this means that the size of the data sets may be barely statistically significant, it also means that one can proceed quickly without getting bogged down in lengthy computations. Keeping the appropriate cautions in mind, these preliminary results should provide a miniature view of what may be expected and encourage further exploration.

Rationale

Since character recognition is a well-studied field and there are already a good number of well-developed packages written, a rationale for undertaking this project is in order. Thus, in this section, I will discuss objectives and point out what makes this project different from other OCR projects.

Focus

This project is narrowly focused on books. (And, more specifically, math books published between the mid-19th and mid-20th centuries, but that further specialization is not so important for the time being.) Thus, I do not hesitate to use techniques which depend on the specific features of these books and may not generalize to other situations. For instance, this approach is not applicable to short snatches of text such as signs and cards; for those, one needs something like the usual OCR algorithms involving topology, intersection with lines, and the like.

The two main features on which I am relying at the outset are:

The text of a book is lengthy — for instance, if a book has 300 pages, and each page has 30 rows of text 40 characters wide, that totals to some 360,000 characters. With these sorts of numbers, one can expect a statistical procedure to yield good results.

Since the books I am interested in were mass produced by an industrial procedure, one can expect consistent results within a tight tolerance. Specifically, the procedure consists of casting types from a mold, then inking them and making an impression on paper. Because of the advanced state of typographical metallurgy, the difference between two types cast from the same mold is negligible, so we only need to consider the variation arising in subsequent processes.

These variations arise from three main sources: paper texture, ink unevenness, and digitization. Examining a printed page under a microscope, one sees that the fibrous nature of the paper can not be ignored at the scale of the finest features of printed characters. Because of this and because of the printing process, the interior of a character is not uniformly black and the edges are ragged with ink blobs extending past the boundary of the typeface. Finally, the process of digitization discretizes the image into pixels of a certain size, smearing out details at smaller scales.

Understanding exactly what sort of statistical variation is introduced is a key objective of the current project. If one can understand this distribution well enough and devise a test which allows one to conclude with reasonable confidence whether or not two marks on paper were produced by types cast from the same mold, then the project has essentially succeeded.

Monolith vs. Module

Most OCR programs are large, monolithic, black boxes which take graphic files in and return text files, perhaps with a few knobs the user can twiddle to tune the result. By contrast, what I am interested in is a collection of clear boxes which can easily be changed and rearranged by the user so as to produce all sorts of results.

Because this project is based on that sort of programming philosophy, I am using UNIX as a platform, writing several short programs in C and AWK which can be combined into pipes and scripts. Since UNIX is text-based but we will be manipulating graphics here, I am using PBM as the graphics format; while it is ridiculously bulky, this disadvantage is compensated by the ease with which it may be manipulated with text-based tools.

An important aspect of this approach is narrowly focusing on distinct aspects of the overall task. In this connection, I would like to point out that I am using the phrase "character recognition" in the narrow sense of simply identifying which characters appear at which locations on the page. I am leaving out preprocessing of graphic files and postprocessing to combine the characters into larger entities like words and correct scannos, not because I deem these topics uninteresting or unimportant, but because I would rather consider them separately on some other occasion.

Open Source

Another difference between this and most other OCR projects is that the source is open. My primary reason for this choice is not so much to produce an open source program per se as to facilitate crowdsourcing.

Preparation

Image Procurement

As the subject for this study, I took page 12 of "Little Facts for Little People" by S.T.C., a children's book published in 1867. This page was scanned, cropped to the text area (omitting the page number and heading), and then converted into the file VivMop12.pgm.

A histogram of this image was computed using pgmhist and plotted with gnuplot.
Looking at the result, we see two peaks centred about 75 and 225. The former corresponds to inked areas whilst the latter corresponds to blank paper. Thus, a threshold of 150 can be used to distinguish ink from paper; using this threshold, the monochrome file VivMop12.pbm was produced.
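
For illustration, this thresholding step can be expressed as a short AWK filter of the kind used throughout this project. The sketch below assumes a plain (ASCII, P2) PGM on input and a threshold of 150; it is not necessarily how VivMop12.pbm was actually produced.

#!/usr/bin/awk -f
# threshold.awk -- an illustrative sketch, not one of the project's programs.
# Reads a plain (ASCII, "P2") PGM on standard input and writes a plain ("P1")
# PBM on standard output, marking pixels darker than the threshold as ink (1)
# and the rest as paper (0).  Comment lines are assumed to occupy whole lines.
#
# Usage:  awk -v thresh=150 -f threshold.awk < VivMop12.pgm > VivMop12.pbm

BEGIN { if (thresh == "") thresh = 150; state = 0; count = 0 }
/^#/  { next }                                            # skip comment lines
{
    for (i = 1; i <= NF; i++) {
        if      (state == 0) { print "P1"; state = 1 }    # magic number ("P2")
        else if (state == 1) { w = $i; state = 2 }        # width
        else if (state == 2) { print w, $i; state = 3 }   # height
        else if (state == 3) { state = 4 }                # maxval, dropped for PBM
        else {                                            # grey values
            printf "%d", (($i + 0 < thresh) ? 1 : 0)
            if (++count % 60 == 0) printf "\n"            # keep output lines short
        }
    }
}
END { printf "\n" }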

Collecting Components

As described in the outline above, the next step is to collect together the connected components. This was done using the program pbmconnected, which produces output of the form

COMPONENT 0
24 1
24 2
. . . . .
COMPONENT 1
72 20
71 21
. . . . .

in which the connected components are numbered sequentially and each is given as a list of the coordinates of the points which lie in that component. This output was saved as the file component_list.
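
For illustration, a labelling of this kind can be computed with a single union-find pass over the bitmap. The sketch below is not the pbmconnected source; it assumes a plain (P1) PBM on input, uses 4-connectivity, and writes its output in the same general format.

#!/usr/bin/awk -f
# components.awk -- an illustrative sketch of connected-component labelling,
# not the pbmconnected source.  Reads a plain ("P1") PBM on standard input
# and prints each 4-connected component of black pixels in the same general
# format: a "COMPONENT n" header followed by one "x y" pair per pixel.
# Comment lines are assumed to occupy whole lines and the raster is assumed
# to start after the width and height.

function find(a) {                       # union-find with path compression
    while (parent[a] != a) { parent[a] = parent[parent[a]]; a = parent[a] }
    return a
}
function join(a, b,    ra, rb) {
    ra = find(a); rb = find(b)
    if (ra != rb) parent[rb] = ra
}

/^#/ { next }                            # skip comments

nhead < 3 {                              # header: magic number, width, height
    for (i = 1; i <= NF && nhead < 3; i++) {
        nhead++
        if (nhead == 2) w = $i + 0
        if (nhead == 3) h = $i + 0
    }
    next
}

{                                        # raster: every 0/1 character is a pixel
    gsub(/[ \t\r]/, "")
    for (k = 1; k <= length($0); k++) {
        x = npix % w; y = int(npix / w); npix++
        if (substr($0, k, 1) != "1") continue
        p = y * w + x
        parent[p] = p; px[p] = x; py[p] = y
        order[nblack++] = p
        if (x > 0 && ((p - 1) in parent)) join(p - 1, p)   # left neighbour
        if (y > 0 && ((p - w) in parent)) join(p - w, p)   # upper neighbour
    }
}

END {                                    # number the components and print them
    for (i = 0; i < nblack; i++) {
        p = order[i]; r = find(p)
        if (!(r in compno)) compno[r] = ncomp++
        c = compno[r]; m = ++size[c]
        cx[c, m] = px[p]; cy[c, m] = py[p]
    }
    for (c = 0; c < ncomp; c++) {
        print "COMPONENT " c
        for (m = 1; m <= size[c]; m++) print cx[c, m], cy[c, m]
    }
}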

While this file is useful for subsequent analysis, it is not too useful for viewing by humans. Hence, I piped it through the programs align_corner.awk and coord2pic.awk in order to produce a sequence of ascii art images. I then looked at these images and added identifications of the characters represented therein, saving the result as letter.aart.ann.
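
For illustration, the corner alignment and ascii-art rendering can be sketched in AWK roughly as follows; this is not the align_corner.awk or coord2pic.awk source, and it assumes the first coordinate of each pair is the column and the second the row.

#!/usr/bin/awk -f
# aart_sketch.awk -- an illustrative sketch of the corner alignment and
# ascii-art rendering, not the align_corner.awk or coord2pic.awk sources.
# Reads a coordinate list in the format of component_list and prints each
# component as a little picture of '#' (ink) and '.' (paper).

$1 == "COMPONENT" { draw(); comp = $2; n = 0; next }
NF == 2 {
    x[n] = $1 + 0; y[n] = $2 + 0
    if (n == 0 || x[n] < xmin) xmin = x[n]
    if (n == 0 || x[n] > xmax) xmax = x[n]
    if (n == 0 || y[n] < ymin) ymin = y[n]
    if (n == 0 || y[n] > ymax) ymax = y[n]
    n++
}
END { draw() }

function draw(    i, r, c, row) {
    if (n == 0) return
    for (i = 0; i < n; i++)              # shift the component to the corner
        ink[x[i] - xmin, y[i] - ymin] = 1
    print "COMPONENT " comp
    for (r = 0; r <= ymax - ymin; r++) { # one text line per row of pixels
        row = ""
        for (c = 0; c <= xmax - xmin; c++)
            row = row (((c, r) in ink) ? "#" : ".")
        print row
    }
    print ""
    split("", ink)                       # clear the picture for the next component
}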

Of course, the goal of this project is to have the computer make these identifications. The point of doing them by hand is to produce an answer sheet against which any automatically produced results can be compared. Also, this exercise will prove useful for computing statistics for individual characters. For subsequent use, I extracted from this a simple list of component numbers and characters and saved it in the file letter_id.

Moment Analysis

Definition

Having prepared these data, we will now begin our analysis by examining the distribution of moments of our blobs. Given a measurable subset S of the plane (for us, the set of pixels making up a blob), its moments of the first two orders are defined as follows:

m00 = the number of points of S
m10 = (1/m00) Σ x        m01 = (1/m00) Σ y
m20 = (1/m00) Σ x²       m11 = (1/m00) Σ xy       m02 = (1/m00) Σ y²

where the sums run over the points (x, y) of S. In physics terms, m00 is the mass, m10 and m01 are the coordinates of the centre of mass, and m20, m11, m02 are the moments of inertia.

Since our goal is to identify characters irrespective of their position on the page, translation-invariant combinations are useful. m00 is already unaffected by translation. The second order moments can be made translation-invariant by subtracting quadratic combinations of the first order moments. Furthermore, if we divide the result by m00, we obtain combinations which are invariant under scaling as well as under translation:

sxx = (m20 - m10²) / m00
sxy = (m11 - m10 m01) / m00
syy = (m02 - m01²) / m00
These quantities are computed by feeding the file "component_list" to the program comp_stats.awk which, in addition to the quantities m00, m10, m01, sxx, sxy, syy discussed above, also computes the minimum and maximum values of x and y for each component so as to provide bounding boxes for the characters. Its output takes the form of a series of records for the different components, a typical one of which looks as follows:
COMPONENT 6
n_pts: 404
x_min: 824
x_avg: 833.869
x_max: 844
y_min: 23
y_avg: 46.7104
y_max: 70
sd_xx: 0.0257086
sd_xy: 0.0165955
sd_yy: 0.563738
This output has been saved as the file letter_stats for use in subsequent analysis. To convert the dimensional quantities into physical units, I looked at the original book and measured the em to be 3/16 inch long. In the data, this corresponds to 60 pixels, so each pixel is 1/320 inch wide. Thus, for instance, the character described above, at 20 × 47 pixels, is 1/16 in wide and 47/320 in tall.
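
For concreteness, here is a minimal sketch of how such a record can be computed from component_list, following the definitions given above. It is not the comp_stats.awk source, merely an illustration of the computation.

#!/usr/bin/awk -f
# stats_sketch.awk -- a minimal sketch of the computation of such a record
# from component_list, following the definitions given above; it is not the
# comp_stats.awk source.
# Usage:  awk -f stats_sketch.awk component_list > letter_stats

function dump(    mx, my) {
    if (n == 0) return
    mx = sum_x / n; my = sum_y / n
    print "COMPONENT " comp
    print "n_pts: " n
    print "x_min: " xmin; print "x_avg: " mx; print "x_max: " xmax
    print "y_min: " ymin; print "y_avg: " my; print "y_max: " ymax
    print "sd_xx: " ((sum_xx / n - mx * mx) / n)
    print "sd_xy: " ((sum_xy / n - mx * my) / n)
    print "sd_yy: " ((sum_yy / n - my * my) / n)
}

$1 == "COMPONENT" {                       # start of a new component
    dump()
    comp = $2
    n = sum_x = sum_y = sum_xx = sum_xy = sum_yy = 0
    xmin = ymin = 1e9; xmax = ymax = -1e9
    next
}

NF == 2 {                                 # a coordinate pair "x y"
    x = $1 + 0; y = $2 + 0; n++
    sum_x += x; sum_y += y
    sum_xx += x * x; sum_xy += x * y; sum_yy += y * y
    if (x < xmin) xmin = x
    if (x > xmax) xmax = x
    if (y < ymin) ymin = y
    if (y > ymax) ymax = y
}

END { dump() }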

Plotting

In order to understand the distribution of moments, I plotted the data described above. Because they consist of discrete points, something must be done to produce a curve. Since I am only examining a single page, the numbers are too low for histograms to be of much use, so I instead made kernel density plots by convolving with Gaussian kernels.
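
Such a kernel density estimate can be produced with a short AWK filter. The sketch below is illustrative (it is not the script actually used) and expects one numeric value per line, for instance the n_pts values extracted from letter_stats.

#!/usr/bin/awk -f
# kde_sketch.awk -- an illustrative sketch of a Gaussian kernel density
# estimate, not the script actually used.  Reads one numeric value per line
# and writes (t, density) pairs suitable for plotting with gnuplot.
# Usage:  awk -v h=12.0 -f kde_sketch.awk < values > values.kde

{ v[n++] = $1 + 0 }

END {
    if (h == "" || h + 0 <= 0) h = 1.0              # default bandwidth
    lo = hi = v[0]
    for (i = 1; i < n; i++) {
        if (v[i] < lo) lo = v[i]
        if (v[i] > hi) hi = v[i]
    }
    lo -= 3 * h; hi += 3 * h                        # pad so the tails taper off
    steps = 200
    pi = atan2(0, -1)
    for (s = 0; s <= steps; s++) {
        t = lo + (hi - lo) * s / steps
        f = 0
        for (i = 0; i < n; i++) {
            u = (t - v[i]) / h
            f += exp(-u * u / 2)
        }
        print t, f / (n * h * sqrt(2 * pi))         # normalized density
    }
}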

To begin, I plotted m00 (n_pts in the data file), smoothing with a bandwidth of h=12.0:
Looking at this plot, we notice multiple peaks, presumably corresponding to the different characters. Since there is considerable overlap between these peaks, looking at this quantity alone won't suffice for more than a rough separation into small, medium, and large characters.

Also, there is the spike around 0. Looking at the annotated file, the clusters corresponding to this spike are dust and speckles, not characters, so they are uninteresting and can be eliminated by ignoring components with fewer than 30 pixels. Henceforth, this will be done.
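
A filter of this kind is easy to write; the following sketch (not one of the project's scripts) copies letter_stats, dropping every record whose n_pts value is below 30.

#!/usr/bin/awk -f
# prune_sketch.awk -- an illustrative sketch, not one of the project's
# scripts.  Copies letter_stats to standard output, dropping every record
# whose n_pts value is below 30 (dust and speckles rather than characters).
# Usage:  awk -f prune_sketch.awk letter_stats > letter_stats.pruned

$1 == "COMPONENT" { if (keep) printf "%s", rec; rec = ""; keep = 1 }
$1 == "n_pts:"    { if ($2 + 0 < 30) keep = 0 }
                  { rec = rec $0 "\n" }              # buffer the current record
END               { if (keep) printf "%s", rec }     # flush the last record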

We next plot the quantities sxx, sxy, and syy using a bandwidth of h = 0.005:
Again, we see multiple peaks, so this looks promising.

An Experiment

The plot of syy has several minima which are close to zero. As an experiment, I decided to separate the data into bins by range of syy and identify which letters fell into each bin. This led to the following results:
bin 1: 0.0 < syy < 0.05    2 items
2 dash
bin 2: 0.05 < syy < 0.1    29 items
29 dot
bin 3: 0.1 < syy < 0.17    21 items
1 a, 1 h, 9 m, 4 scanno, 6 w
bin 4: 0.17 < syy < 0.27    187 items
24 a, 1 b, 8 c, 4 comma, 1 d, 46 e, 1 g, 3 M, 24 n, 37 o, 20 s, 3 scanno, 12 u, 3 v
bin 5: 0.27 < syy < 0.35    95 items
1 a, 6 b, 1 c, 2 comma, 17 d, 1 fl, 7 g, 20 h, 3 k, 12 p, 22 r, 1 v, 2 V
bin 6: 0.35 < syy < 0.45    65 items
1 b, 24 i, 1 p, 3 scanno, 35 t, 1 y
bin 7: 0.45 < syy < 0.51    18 items
9 f, 1 H, 2 I, 1 j, 1 t, 4 y
bin 8: syy > 0.51    20 items
2 f, 1 I, 13 l, 2 scanno, 2 l
This simple operation did a reasonable job of sorting out the characters since, by and large, each type of character wound up in a particular bin with a few outliers in the neighboring bins. In other words, clustering along this one dimension produces encouraging results, so I feel motivated to try clustering in more than one dimension.
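
For reference, a tally of this kind can be produced by joining letter_id with the syy values in letter_stats. The sketch below is not one of the project's scripts; in particular, it assumes that letter_id holds one component-number/character pair per line.

#!/usr/bin/awk -f
# bin_tally.awk -- an illustrative sketch of the tally above, not one of the
# project's scripts.  It assumes letter_id holds one "component character"
# pair per line; components without an identification are skipped.
# Usage:  awk -f bin_tally.awk letter_id letter_stats | sort -k2n

NR == FNR { id[$1] = $2; next }           # first file: component identifications

$1 == "COMPONENT" { comp = $2 }           # second file: letter_stats records
$1 == "sd_yy:" {
    if (!(comp in id)) next
    syy = $2 + 0
    if      (syy < 0.05) b = 1
    else if (syy < 0.10) b = 2
    else if (syy < 0.17) b = 3
    else if (syy < 0.27) b = 4
    else if (syy < 0.35) b = 5
    else if (syy < 0.45) b = 6
    else if (syy < 0.51) b = 7
    else                 b = 8
    count[b, id[comp]]++
}

END {
    for (key in count) {
        split(key, part, SUBSEP)
        print "bin " part[1] ": " count[key] " " part[2]
    }
}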

Clustering

I began by making a scatter plot. After trying a few possibilities, the best-looking plot was one of m00·sxx versus m00·syy:

Looking at this plot, it appears that there are several clusters there.

The next order of business is to run the data past the Simple Ray algorithm to identify these clusters. This algorithm is implemented in the program srclust.awk. A cluster diameter of 22 seems to give the best result; I saved the output of the program in the file xx_yy.ann.clust.22 and plotted it (omitting clusters of three or fewer points for clarity):

The separation of clusters 0, 1, and 20 came out oddly because they are so close together. The separation of clusters 2 and 6 looks arbitrary; to the eye, it looks like one big cluster, but the algorithm had to split it up into two somehow because it exceeded the maximum diameter. Thus, I next fed the output through the k-means algorithm (implemented in the program kmeans.awk, output in xx_yy.kmean.22) to see what that would do. The result looks as follows:

The separation of the clusters near (75, 100) looks better. We still have what look like big clusters split into several little clusters. To deal with this, I decided to combine clusters, making one big cluster out of clusters 2, 6, and 13, another out of 0 and 1, and yet another out of 20 and 21. After running kmeans again with the centroids of these combined clusters as initial data, I saved the result as xx_yy.kmean.17 and plotted it to obtain the following:

Identifying the letters, we find the following:
Cluster 0: 1 b, 25 a, 45 e, 9 c
Cluster 1: 1 fl, 13 p, 17 d, 20 h, 2 V, 3 k, 7 b, 8 g
Cluster 2: 12 u, 1 d, 2 scanno, 24 n
Cluster 3: 1 scanno, 6 comma
Cluster 4: 1 h, 6 w
Cluster 5: 2 dash, 29 dot, 4 scanno
Cluster 6: 11 f, 1 j
Cluster 7: 12 l, 1 H, 1 I, 1 scanno
Cluster 8: 9 m
Cluster 9: 24 i, 3 scanno
Cluster 10: 1 l, 1 scanno, 2 I
Cluster 11: 3 M
Cluster 13: 1 e, 37 o
Cluster 14: 22 r
Cluster 15: 1 a, 20 s, 4 v
Cluster 16: 36 t
Cluster 17:  5 y
Just as in the table at the end of the last section, we see a pretty good separation of the characters, with a few outliers in neighboring bins. Of course, now that we are using two variables, we have more bins. Furthermore, a good number of these bins already contain only one type of character. To finish the job of sorting out the characters, we will use a different technique in the next section.
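
For reference, the k-means refinement used above can be written as a short AWK program. The sketch below implements the standard algorithm and assumes the initial centroids and the data points are supplied as plain lists of "x y" pairs; it is meant as an illustration rather than the kmeans.awk actually used.

#!/usr/bin/awk -f
# kmeans_sketch.awk -- an illustrative sketch of the standard k-means
# algorithm, not the kmeans.awk actually used.  The first file gives the
# initial centroids, one "x y" pair per line; the second gives the data
# points in the same format.  Each point is printed together with the number
# of the cluster it ends up in.
# Usage:  awk -f kmeans_sketch.awk centroids points > points.clustered

NR == FNR { cx[k] = $1 + 0; cy[k] = $2 + 0; k++; next }             # centroids
          { px[n] = $1 + 0; py[n] = $2 + 0; assign[n] = -1; n++ }   # data points

END {
    do {
        changed = 0
        # assignment step: attach each point to its nearest centroid
        for (i = 0; i < n; i++) {
            best = 0; bestd = -1
            for (j = 0; j < k; j++) {
                d = (px[i] - cx[j]) ^ 2 + (py[i] - cy[j]) ^ 2
                if (bestd < 0 || d < bestd) { bestd = d; best = j }
            }
            if (assign[i] != best) { assign[i] = best; changed = 1 }
        }
        # update step: move each centroid to the mean of its points
        for (j = 0; j < k; j++) { sx[j] = sy[j] = cnt[j] = 0 }
        for (i = 0; i < n; i++) {
            j = assign[i]; sx[j] += px[i]; sy[j] += py[i]; cnt[j]++
        }
        for (j = 0; j < k; j++)
            if (cnt[j] > 0) { cx[j] = sx[j] / cnt[j]; cy[j] = sy[j] / cnt[j] }
    } while (changed)
    for (i = 0; i < n; i++) print px[i], py[i], assign[i]
}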

Distance Analysis

To finish separating the letters, we will make use of a distance function. The distance function we shall use here is a simple mini-max function which may be defined as follows: Given a point P and a subset S of the plane, define the distance from P to S as the minimum of the distances from P to the elements of S. Next, given two subsets A and B of the plane, define the distance from A to B as the maximum of the distances from the elements of A to B. This definition is not symmetric; for instance, if A is a proper subset of B, the distance from A to B will be zero whilst the distance from B to A will be greater than zero (due to points in B which do not belong to A). To obtain a symmetric distance function, we define the distance between A and B to be the greater of the distance from A to B and the distance from B to A.

A naive implementation of this distance function is found in the program blobminmax.c, which simply loops over both sets, minimizing and maximizing. While this takes considerable time when both input sets have around 500 members (as is the case with our data), time which could be saved with more clever programming, it is still workable for the sort of preliminary exploration we are doing here.
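
For illustration, the same naive computation can be rendered in AWK as follows; the sketch simply assumes each blob is given as a plain list of "x y" coordinate pairs, and it is not the blobminmax.c source.

#!/usr/bin/awk -f
# minmax_sketch.awk -- an AWK rendering of the naive min-max computation,
# for illustration; the project's own implementation is the C program
# blobminmax.c.  Each input file is a list of "x y" coordinate pairs for
# one blob; the symmetric distance between the two blobs is printed.
# Usage:  awk -f minmax_sketch.awk blob_a blob_b

NR == FNR { ax[na] = $1 + 0; ay[na] = $2 + 0; na++; next }   # first blob
          { bx[nb] = $1 + 0; by[nb] = $2 + 0; nb++ }         # second blob

# directed distance: the largest, over the points of one set, of the
# distance to the nearest point of the other set
function directed(sx, sy, ns, tx, ty, nt,    i, j, d, dmin, dmax) {
    dmax = 0
    for (i = 0; i < ns; i++) {
        dmin = -1
        for (j = 0; j < nt; j++) {
            d = (sx[i] - tx[j]) ^ 2 + (sy[i] - ty[j]) ^ 2
            if (dmin < 0 || d < dmin) dmin = d
        }
        if (dmin > dmax) dmax = dmin
    }
    return sqrt(dmax)
}

END {
    dab = directed(ax, ay, na, bx, by, nb)
    dba = directed(bx, by, nb, ax, ay, na)
    print (dab > dba ? dab : dba)
}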

Since the identity of a character doesn't depend on its position on the page (of course, its meaning does, but that is not being considered here), we need to take that into account. One way to do this is to minimize the distance over all translations of the two point sets relative to each other. Another is to align the centroids of the two sets. Trying out a few examples, I noted that the two methods give nearly the same answer when dealing with similar sets, such as two instances of the same letter, and hence I will use the second method because it is easier. To that end, I wrote the routine centre.awk to translate letters so that their centroid is at (0,0).
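
A sketch of this centring step, again assuming a plain "x y" coordinate list for a single blob (and not the centre.awk source itself), looks as follows:

#!/usr/bin/awk -f
# centre_sketch.awk -- an illustrative sketch of the centring step, not the
# centre.awk source.  Reads a list of "x y" pairs for a single component and
# reprints them with the centroid moved to (0,0).
# Usage:  awk -f centre_sketch.awk blob > blob.centred

{ x[n] = $1 + 0; y[n] = $2 + 0; sx += $1; sy += $2; n++ }

END {
    mx = sx / n; my = sy / n                 # centroid of the component
    for (i = 0; i < n; i++)
        print x[i] - mx, y[i] - my
}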

We will use this distance function to refine our clustering. Since it's such a jumbled catch-all, let's start with cluster 1. Computing the distances between pairs of characters in that cluster, ordering them, smoothing out the result and plotting it produces the following graph:

Note the peak between 0 and 5. This looks like it should correspond to the distances between different occurrences of the same character, and so suggests what sort of diameter to use for clustering. Running the Simple Ray algorithm with diameter 6 resolves this cluster into one subcluster for each type of character. Encouraged by this success, let us have a look at the remaining clusters. Applying the same algorithm with the same value of diameter, we obtain the following:

Cluster 0
Subcluster 0: 7 b
Subcluster 1: 17 d
Subcluster 2: 8 g
Subcluster 3: 20 h
Subcluster 4: 13 p
Subcluster 5: 3 k
Subcluster 6: 1 fl
Subcluster 7: 2 V
Cluster 1
Subcluster 0: 21 a
Subcluster 1: 9 c, 1 e
Subcluster 2: 45 e
Subcluster 3: 4 a
Cluster 2
Subcluster 0: 12 u, 1 d, 3 n
Subcluster 1: 21 n, 1 scanno
Subcluster 2: 1 scanno
Cluster 3
Subcluster 0: 6 comma
Subcluster 1: 1 scanno
Cluster 4
Subcluster 1: 6 w, 1 h
Cluster 5
Subcluster 0: 2 dash, 29 dot, 4 scanno
Cluster 6
Subcluster 0: 11 f, 1 j
Cluster 7
Subcluster 0: 1 H
Subcluster 1: 1 I, 10 l
Subcluster 2: 2 l, 1 scanno
Cluster 8
Subcluster 0: 9 m
Cluster 9
Subcluster 0: 24 i, 2 scanno
Subcluster 2: 1 scanno
Cluster 10
Subcluster 0: 2 I, 1 scanno
Subcluster 1: 1 l
Cluster 11
Subcluster 0: 3 M
Cluster 13
Subcluster 0: 1 e
Subcluster 1: 37 o
Cluster 14
Subcluster 0: 22 r
Cluster 15
Subcluster 0: 4 v
Subcluster 1: 1 a
Subcluster 2: 20 s
Cluster 16
Subcluster 0: 36 t
Cluster 17
Subcluster 0: 5 y
It also looks like the success rate might be improved by better clustering. For instance, when we redo cluster 2 using a cluster diameter of 5, we instead find 12 u's in subcluster 0, 24 n's, a d and a scanno in subcluster 1, and the other scanno in subcluster 2, so the n's are all in one cluster as they should be. Unlike in the previous case, where we refined our clusters by using k-means, that method does not seem immediately applicable here because there is no obvious notion of the centroid of a cluster, so I will leave this as it is for now.

Letter Statistics

Before finishing this essay, I would like to have a look at the statistics of individual letters. By understanding the probability distributions for the moments and other quantities, one can assess what degree of success is possible for any procedure to distinguish characters based on those quantities.