cahn

You guuuuys...I must show off!

So last night I needed a break from assembling material for Rheinsberg posts, and since I was very tired, I thought some mindless OCR cleanup would do the trick.

Being me, I almost immediately decided to see if I could solve the biggest problem in an automated way. And the biggest problem was the lines getting moved around, which I'd talked about before. For instance, the following four lines:

souciant pas de sa perte et relevant toujours les assaillants de nouvelles
troupes, la garnison avait été forcée. Voilà cependant des circon-
stances que je ne saurais vous garantir, n'ayant pas de nouvelles sûres
sur cela.

were rendered by the Google API as:

souciant pas de sa perte et relevant toujours les assaillants de nouvelles
troupes, la garnison avait été forcée.
stances que je ne saurais vous garantir, n'ayant pas de nouvelles sûres
sur cela.
Voilà cependant des circon-

Which has all the right words, but half of line 2 is suddenly a new line 5. And that just seemed weird.

Well, from my reverse-engineering, it looks like Google is doing OCR as a two-step process:

1) Detecting the location of each individual word on the page.
2) Assembling the words together in text form and giving them to the user.

Well, because Google is nice like that (thank you, Google!), the API actually gives you the results of both steps. In other words, I was getting, not only the garbled text printout above, but also each individual word with x and y coordinates. For example:

{ "description": "avait", "boundingPoly": { "vertices": [ { "x": 323, "y": 874 }, { "x": 368, "y": 873 }, { "x": 368, "y": 888 }, { "x": 323, "y": 889 } ] } },

{ "description": "été", "boundingPoly": { "vertices": [ { "x": 386, "y": 875 }, { "x": 412, "y": 875 }, { "x": 412, "y": 889 }, { "x": 386, "y": 889 } ] } },

{ "description": "forcée.", "boundingPoly": { "vertices": [ { "x": 429, "y": 874 }, { "x": 487, "y": 873 }, { "x": 487, "y": 889 }, { "x": 429, "y": 890 } ] } },

{ "description": "stances", "boundingPoly": { "vertices": [ { "x": 95, "y": 903 }, { "x": 160, "y": 903 }, { "x": 160, "y": 916 }, { "x": 95, "y": 916 } ] } }
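To make the structure concrete, here is a minimal Python sketch of pulling a usable position out of entries like these. It's illustrative only: the helper name `word_position` is made up for this example, and using the left edge plus the average y of the four corners is just one reasonable choice, not necessarily what any particular script does.

```python
import json

# Two of the entries shown above, copied as-is.
sample = """[
  {"description": "avait", "boundingPoly": {"vertices": [
    {"x": 323, "y": 874}, {"x": 368, "y": 873},
    {"x": 368, "y": 888}, {"x": 323, "y": 889}]}},
  {"description": "stances", "boundingPoly": {"vertices": [
    {"x": 95, "y": 903}, {"x": 160, "y": 903},
    {"x": 160, "y": 916}, {"x": 95, "y": 916}]}}
]"""

def word_position(entry):
    """Return (text, left_x, center_y) for one word entry (hypothetical helper)."""
    vertices = entry["boundingPoly"]["vertices"]
    # Defaulting to 0 is a defensive assumption, in case a coordinate is absent.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return entry["description"], min(xs), sum(ys) / len(ys)

positions = [word_position(e) for e in json.loads(sample)]
for p in positions:
    print(p)
# ('avait', 323, 881.0)
# ('stances', 95, 909.5)
```

The two y values are about 28 pixels apart, which is what tells you "stances" belongs on the next line down.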

And by closely inspecting the x and y coordinates of individual words that were getting returned out of order, I realized that the coordinates were correct. It was step 2, assembly of individual words, that the Google API was getting wrong.

Well, step 1 requires a team of Google-level engineers and I would never try it, but step 2 is pretty easy. You just have to sort the words by their coordinates (top to bottom, then left to right within each line), and then print them out in that order, sans coordinates.

I did it! I now have a script that bypasses the printout from Google and constructs its own printout based on the raw coordinates.
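For the curious, the sorting idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual script: `rebuild_lines` and `line_tolerance` are names invented for the example, the input is assumed to already be (text, left x, center y) tuples, and the 10-pixel tolerance is a guess that happens to suit the sample coordinates above. A skewed or curved scan would probably need something smarter.

```python
def rebuild_lines(words, line_tolerance=10):
    """Group words into lines by y, then order each line by x.

    words: iterable of (text, left_x, center_y) tuples.
    Returns a list of strings, one per reconstructed line.
    """
    lines = []  # each item: [reference_y, [(x, text), ...]]
    for text, x, y in sorted(words, key=lambda w: w[2]):
        # Same line if this word's y is close to the line's first word.
        if lines and abs(y - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return [
        " ".join(text for _, text in sorted(row, key=lambda item: item[0]))
        for _, row in lines
    ]

# Positions derived from the four sample entries, deliberately scrambled.
scrambled = [
    ("stances", 95, 909.5),
    ("forcée.", 429, 881.5),
    ("avait", 323, 881.0),
    ("été", 386, 882.0),
]
rebuilt = rebuild_lines(scrambled)
print(rebuilt)
# ['avait été forcée.', 'stances']
```

Grouping by y first and only then sorting by x matters, because the y values of words on the same line differ by a pixel or so and a plain sort on (y, x) would shuffle them.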

Why is this important? Well, the detection of individual words looks good enough to me that as long as the words are in the correct order, I think I can get away with not comparing the OCR to the original text.

I still need to do manual cleanup of all 1500 or so letters, because automatically detecting things like footnotes, ends of letters, paragraph breaks, etc. is hard, and I'm going to have to go through and make judgments as to which text I'm interested in passing to the translate API and which text I want to discard.

But the key point here is that I can put all this in one file and eyeball it, and I do not have to open 1500 image files and compare them side-by-side to make sure the text is in the right place. If the first 5 or so letters I've tried this on are indicative, that is a solved problem.

Being able to scan through one long file and reformat things using macros is going to be a million times faster than opening 1500 images and moving my eyeballs back and forth as I try to make sure the OCR text matches the scanned text.

It's still going to be a little while before I can deliver (especially because of my backlog of Rheinsberg posts and also some new posts I want to make, omg), but at least now I'm thinking days instead of weeks.

Um, if all this goes well and I'm able to deliver these OCRed-plus-translated letters, I'm still going to request the Antinous book. :P I think I'll have earned it.