mildred_of_midgard wrote in cahn, 2020-01-17 03:35 am (UTC)

Re: Fritzian library

Progress on OCR APIs is good. I have a proof of concept, and I don't think it would take much more work to get the full set of images submitted in an automated way to the API.

The cost appears to be $1.50, which is more than reasonable.

Now the tricky part: the output is 1,678 images' worth of text that would need to be manually inspected and cleaned up before being fed to Google Translate. The OCR quality seems pretty good for individual words, but it tends to move entire lines around, and of course there's a lot of extraneous text (footnotes and such) that you don't want to feed to Google Translate, and you'd have to correct some of the formatting by hand. That's the kind of thing I could do automatically back when the pages had already been converted to text and were conveniently marked up with HTML tags that my code could detect.
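
A rough sketch of how part of that cleanup might be automated, purely as an illustration. It assumes the OCR API returns a bounding box for each line of text, which is not confirmed here, and that pages are single-column. The `OcrLine` type, the `clean_page` function, and the 0.85 footer cutoff are all made up for this example. The idea is to re-sort lines by their vertical position on the page (to undo the "moved lines" problem) and to drop everything from the first footnote-looking line in the bottom strip of the page onward.

```python
import re
from dataclasses import dataclass


@dataclass
class OcrLine:
    """One line of OCR output (hypothetical shape, not a real API type)."""
    text: str
    top: float   # distance of the line from the top of the page, in pixels
    left: float  # distance of the line from the left edge, in pixels


# A line "looks like" a footnote if it starts with a short number or an
# asterisk followed by whitespace and some text, e.g. "1 See the letter of..."
FOOTNOTE_START = re.compile(r"^\s*(\d{1,2}|\*)[\).]?\s+\S")


def clean_page(lines, page_height, footer_zone=0.85):
    """Return body text in top-to-bottom order with likely footnotes removed.

    footer_zone is an assumed cutoff: only lines below 85% of the page
    height are considered footnote candidates.
    """
    ordered = sorted(lines, key=lambda ln: (ln.top, ln.left))
    body = []
    for ln in ordered:
        in_footer = ln.top > page_height * footer_zone
        if in_footer and FOOTNOTE_START.match(ln.text):
            break  # treat this line and everything below it as footnotes
        body.append(ln.text.strip())
    return "\n".join(body)


sample = [
    OcrLine("second line", top=200, left=50),
    OcrLine("first line", top=100, left=50),
    OcrLine("1 A footnote about the date.", top=900, left=50),
    OcrLine("continued note", top=920, left=50),
]
cleaned = clean_page(sample, page_height=1000)
```

Heuristics like this would not remove the need for a manual pass, since footnote markers and page layouts vary, but they might shrink it to spot-checking.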

I'm now debating whether I want to do that much OCR cleanup by hand. Convince me, guys.

Btw, that's 1541 images for Heinrich, 107 for Ulrike, and 30 for Peter III (because why not), between late 1761 and early 1782.

...Yeah, he wrote to Heinrich a lot.
