The process proposed here may be divided into four broad steps:
To see how feasible this proposal is and to gain some insight into the relevant issues, I will not simply implement the above steps but will also collect and analyze statistics along the way. Examining these should provide a basis for understanding what can reasonably be expected to work, and with what rate of success, and should point out the weak points towards which further attention should be focused.
To keep the first attempt simple, I have started with a page from an old
children's book as my subject. While this means that the data sets may be
barely large enough to be statistically significant, it also means that one
can proceed quickly without getting bogged down in lengthy computations;
keeping the appropriate cautions in mind, these preliminary results should
provide a miniature view of what may be expected and encourage further
exploration.
Since character recognition is a well-studied field and there are already
a good number of well-developed packages written, a rationale for undertaking
this project is in order. Thus, in this section, I will discuss objectives
and point out what makes this project different from other OCR projects.
This project is narrowly focused on books. (And, more specifically,
math books published between the mid-19th and mid 20th centuries, but that
further specialization is not so important for the time being.) Thus, I
do not hesitate to use techniques which depend on the specific features of
these books and may not generalize to other situations. For instance, this
approach is not applicable to short snatches of text such as signs and cards;
for those, one needs something like the usual OCR algorithms involving
topology, intersection with lines, and the like.
The two main features on which I am relying at the outset are:
Since the books I am interested in were mass produced by an industrial procedure, one can expect consistent results within a tight tolerance. Specifically, the procedure consists of casting types from a mold, then inking them and making an impression on paper. Because of the advanced state of typographical metallurgy, the difference between two types cast from the same mold is negligible, so we only need to consider the variation arising in subsequent processes.
These variations arise from three main sources: paper texture, ink unevenness, and digitization. Examining a printed page under a microscope, one sees that the fibrous nature of the paper cannot be ignored at the scale of the finest features of printed characters. Because of this and because of the printing process, the interior of a character is not uniformly black, and the edges are ragged, with ink blobs extending past the boundary of the typeface. Finally, the process of digitization discretizes the image into pixels of a certain size, smearing out details at smaller scales.
Understanding exactly what sort of statistical variation is introduced is a
key objective of the current project. If one can understand this distribution
well enough and devise a test which allows one to conclude with a reasonable
confidence whether or not two marks of paper were produced by types cast from
the same mold, then the project has essentially succeeded.
Monolith vs. Module
Most OCR programs are large, monolithic, black boxes which take graphic files
in and return text files, perhaps with a few knobs the user can twiddle to
tune the result. By contrast, what I am interested in is a collection of
clear boxes which can easily be changed and rearranged by the user so as to
produce all sorts of results.
Because it is based on this sort of programming philosophy, I am using UNIX as a platform, writing several short programs in C and AWK which can be combined into pipes and scripts. Since UNIX is text-based but we will be manipulating graphics here, I am using PBM as the graphics format; while it is ridiculously bulky, this disadvantage is compensated by the ease with which it may be manipulated with text-based tools.
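To illustrate why PBM combines so well with text-based tools, here is a complete bitmap in the plain (P1) variant of the format. (This is a made-up 5 × 5 plus sign, not an image from the project.) Each pixel row is an ordinary line of whitespace-separated fields, exactly the kind of input AWK handles natively:

```
P1
# made-up example: a 5x5 plus sign; 1 = ink, 0 = paper
5 5
0 0 1 0 0
0 0 1 0 0
1 1 1 1 1
0 0 1 0 0
0 0 1 0 0
```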
An important aspect of this approach is narrowly focussing on distinct aspects
of the overall task. In this connection, I would like to point out that I am
using the phrase "character recognition" in the narrow sense of simply
identifying which characters are located at what location on the page. I am
leaving out preprocessing of graphic files and postprocessing to combine
the characters into larger entities like words and correct scannos, not
because I deem these topics uninteresting or unimportant, but because I would
rather consider them separately on some other occasion.
Another difference between this and most other OCR projects is that the source
is open. My primary reason for this choice is not so much to produce an
open source program per se as to facilitate crowdsourcing.
A histogram of this image was computed using pgmhist and plotted with gnuplot.
Looking at the result, we see two peaks centred about 75 and 225. The former corresponds to inked areas whilst the latter corresponds to blank paper. Thus, a threshold of 150 can be used to distinguish ink from paper; using this threshold, the monochrome file VivMop12.pbm was produced.
. . . . .
. . . . .
in which the connected components are numbered sequentially and each is given as a list of the coordinates of the points which lie in that component. This output was saved as the file component_list.
While this file is useful for subsequent analysis, it is not too useful for viewing by humans. Hence, I piped it through the programs align_corner.awk and coord2pic.awk in order to produce a sequence of ASCII art images. I then looked at these images and added identifications of the characters represented therein, saving the result as letter.aart.ann.
Of course, the goal of this project is to have the computer make these identifications. The point of doing them by hand is to produce an answer sheet against which to compare any automatically produced results. Also, this exercise will prove useful for computing statistics for individual characters. For subsequent use, I extracted from this a simple list of component numbers and characters and saved it in the file letter_id.
COMPONENT 6 n_pts: 404 x_min: 824 x_avg: 833.869 x_max: 844 y_min: 23 y_avg: 46.7104 y_max: 70 sd_xx: 0.0257086 sd_xy: 0.0165955 sd_yy: 0.563738

This output has been saved as the file letter_stats for use in subsequent analysis. To convert the dimensional quantities into physical units, I looked at the original book and measured that the em is 3/16 inch long. Looking at the data, this corresponds to 60 pixels, so each pixel is 1/320 inch wide. Thus, for instance, the character described above as 20 × 43 pixels is 1/16 in wide and 43/320 in tall.
To begin, I plotted m00 (n_pts in the data file), smoothing with
a bandwidth of h=12.0:
Looking at this plot, we notice multiple peaks, presumably corresponding to the different characters. Since there is considerable overlap between these peaks, looking at this quantity alone won't suffice for more than a rough separation into small, medium, and large characters.
Also, there is a spike around 0. Looking at the annotated file, the clusters corresponding to this spike are dust and speckles, not characters; they are uninteresting and can be eliminated by ignoring components with fewer than 30 pixels. Henceforth, this will be done.
We next plot the quantities sxx, sxy, and syy
using a bandwidth of h = 0.005:
Again, we see multiple peaks, so this looks promising.
The next order of business is to run the data past the simple Ray algorithm
to identify these clusters. This algorithm is implemented in the program
srclust.awk. A cluster diameter
of 22 seems to give the best result; I saved the output of the program in the
file xx_yy.ann.clust.22 and plotted it
(omitting clusters of three or fewer points for clarity):
The separation of clusters 0, 1, and 20 came out oddly because they are so close together. The separation of clusters 2 and 6 looks arbitrary; to the eye, it looks like one big cluster, but the algorithm had to split it up into two somehow because it exceeded the maximum diameter. Thus, I next fed the output through the k-means algorithm (implemented in the program kmeans.awk, output in xx_yy.kmean.22) to see what that would do. The result looks as follows:
The separation of the clusters near (75, 100) looks better. We still have what look like big clusters split into several little clusters. To deal with this, I decided to combine clusters, making one big cluster out of clusters 2, 6, and 13, another out of 0 and 1, and yet another out of 20 and 21. After running kmeans again with the centroids of these combined clusters as initial data, I saved the result as xx_yy.kmean.17 and plotted it to obtain the following:
Identifying the letters, we find the following:
Cluster 0:   1 b, 25 a, 45 e, 9 c
Cluster 1:   1 fl, 13 p, 17 d, 20 h, 2 V, 3 k, 7 b, 8 g
Cluster 2:   12 u, 1 d, 2 scanno, 24 n
Cluster 3:   1 scanno, 6 comma
Cluster 4:   1 h, 6 w
Cluster 5:   2 dash, 29 dot, 4 scanno
Cluster 6:   11 f, 1 j
Cluster 7:   12 l, 1 H, 1 I, 1 scanno
Cluster 8:   9 m
Cluster 9:   24 i, 3 scanno
Cluster 10:  1 l, 1 scanno, 2 I
Cluster 11:  3 M
Cluster 13:  1 e, 37 o
Cluster 14:  22 r
Cluster 15:  1 a, 20 s, 4 v
Cluster 16:  36 t
Cluster 17:  5 y
A naive implementation of this distance function is found in the program blobminmax.c, which simply loops over both sets, minimizing and maximizing. While this takes considerable time when both input sets have around 500 members (as is the case with our data), time which could be saved with cleverer programming, it is still workable for the sort of preliminary exploration we are doing here.
Since the identity of a character doesn't depend on its position on the page (of course, its meaning does, but that is not being considered here), we need to take that into account. One way to do this is to minimize the distance over all translations of the two point sets relative to each other. Another is to align the centroids of the two sets. Trying out a few examples, I noted that the two methods give nearly the same answer when dealing with similar sets, such as two instances of the same letter, and hence I will use the second method because it is easier. To that end, I wrote the routine centre.awk to translate letters so that their centroids are at (0,0).
We will use this distance function to refine our clustering. Since it's
such a jumbled catch-all, let's start with cluster 1. Computing the distances
between pairs of characters in that cluster, ordering them, smoothing out
the result and plotting it produces the following graph:
Note the peak between 0 and 5. This looks like it should correspond to the distances between different occurrences of the same character, and so it suggests what sort of diameter to use for clustering. Running the Simple Ray algorithm with diameter 6 resolves this cluster into seven subclusters, one for each type of character. Encouraged by this success, let us have a look at the remaining clusters. Applying the same algorithm with the same value of the diameter, we obtain the following: