Think you know what optical character recognition’s for? As Drew Turney discovers, you ain’t seen nothing yet…
To understand the story of optical character recognition (OCR) is to appreciate one of the fundamental differences between computers and people. To us, everything’s a symbol — whether you see the word ‘cheese’ in a newspaper or on a three-storey billboard, you know the ‘ch’ is a symbol for forcing a short burst of air out of your mouth while moving the tip of your tongue off your palate. As biologists and computer scientists both understand, humans are pattern-recognition machines.
To a computer, there’s text and there’s pictures. Text can be reformed and repurposed ad infinitum; you can type words into an email and they can appear in a magazine across the world without anyone having to retype them. But a picture is fixed and absolute: to the computer the content is immaterial, just an arrangement of numerical values. Bridging that gap is the challenge of OCR.
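To make the distinction concrete, here’s a toy Python sketch (the pixel values are invented purely for illustration): the same word stored once as text a computer can search and transform, and once as nothing more than a grid of brightness numbers.

# The same word seen two ways: as text, which the computer can search and
# edit, and as a picture, which to the machine is only numbers.
as_text = "cheese"                      # six characters: easy to reflow, search or retype
as_pixels = [                           # an invented scrap of a scanned page:
    [0, 255, 255, 0],                   # just brightness values, with no notion
    [0, 255, 0, 255],                   # that they happen to spell anything
    [0, 255, 255, 0],
]

print(as_text.upper())                             # trivial to transform the text version...
print(sum(row.count(0) for row in as_pixels))      # ...the pixel version only answers numeric questions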
It’s actually been around since before computers — the first patent (for a cumbersome photographic system) was granted in Germany in 1929. By 1965 the US Postal Service was using computerised OCR to sort mail. Many of us first came across it in the early desktop publishing era, when compact scanners could take a digital copy of a typewritten (or neatly handwritten) page and OCR software could extract the text for editing. But now, OCR is taking full flight.
Tomorrow’s world in your hand
Forget space shuttles, nanobots and large hadron colliders, the yardstick for human technological development may just be the Star Trek tricorder. Convention geek jokes aside, a good deal of 20th century sci-fi reveals an innate desire for a platform- and information-agnostic device we can point at anything to learn all about it (for when we visit alien worlds).
Closer to home, such a device is coming to fruition, and OCR’s a critical part of it. A confluence of factors has put it right in your pocket. First is the smartphone, a fully featured computer in itself. Second is its embedded imaging device, higher quality than even the first standalone digital cameras. And third is our mobile and wireless networks, which don’t just mean the world’s information at your fingertips on mobile-optimised websites — they offer access to all the computing power the cloud offers.
Google is bringing both new and existing technologies together to lead the way in (dare we say it) OCR 2.0. Goggles (available for Android or iOS) is the online giant’s tool to search the web using a picture. It actually works for landmarks, landscapes, artworks and other text-free imagery as well, comparing your picture of Uluru or van Gogh’s Starry Night with the billions of others across the web. Using the same heuristic margin of error (see The AI of Text), Goggles does the calculation in the cloud and can tell you what you’re looking at. When it comes to text, the OCR’s also done in the cloud, and the result is as fast as an everyday text search.
And as you can imagine, the applications are endless. Take a snap of the ISBN on the back of a book and you can go straight to the Amazon store and buy it for your Kindle. A photo of the label on your wine can take you to the growing region that produced it on Google Maps. Because Google Translate now works with 63 languages, one of Goggles’ coolest uses is translating signs and directions to help you find your way around a foreign city.
All we need now is a chemical analysis port so you can be sure you’ve landed on a class M world before disengaging your space suit…
The AI of Text
Because computers are pattern-blind, abstractions like ‘strange’ or ‘something that doesn’t belong’ mean nothing to them. They need an example of what they’re looking for, which is what makes virus detection so tricky: the bad guys constantly work to keep their software from looking like older nasties.
In ‘flat’ computing, a single line or character of difference would be enough to get past your antivirus utility, so cybersafety vendors know they need to program in some room for error. Let’s imagine a virus can be expressed as ‘12345’. Your PC can be told to treat as suspect any process containing those five digits in a different order, or even just a combination of two or three of them.
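As a rough sketch of the idea (the threshold and ‘signature’ here are made up for illustration, not anything a real antivirus product uses), the rule might look like this in Python:

# Toy heuristic matching: flag anything that shares 'enough' of a known
# signature, even if the digits are reordered or some are missing.
KNOWN_SIGNATURE = set("12345")   # the five digits standing in for a virus pattern
THRESHOLD = 3                    # how much overlap counts as 'suspect'

def looks_suspect(process_content: str) -> bool:
    """Return True if the content shares at least THRESHOLD of the
    signature's digits, in any order."""
    overlap = KNOWN_SIGNATURE & set(process_content)
    return len(overlap) >= THRESHOLD

print(looks_suspect("54321"))    # True  - same digits, different order
print(looks_suspect("12xxx"))    # False - only two of the five digits appear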
This approach is called heuristics, or experience-based programming, and OCR works in a similar way: pre-loading known values but giving the software room to account for variables. OCR software refers to a library of bitmapped patterns that tell it about common font attributes (line height, serif length, stroke width and so on). Simple variables are expressed in rules like ‘consider this figure a “g” if the line that corresponds to the left-hand bowl is curved and anywhere between 1 and 4 pixels wide’. And the commonalities between even the wackiest typefaces (without them we wouldn’t be able to make them out either) do the rest.
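In code, such a rule might look something like the minimal, made-up Python sketch below; it illustrates the tolerance idea only, not how any real OCR engine is implemented.

# A made-up OCR-style rule with built-in tolerance: accept a glyph as a 'g'
# if its left-hand bowl is curved and its stroke width falls anywhere in an
# allowed range, rather than having to match one exact value.
def matches_g(left_bowl_is_curved: bool, stroke_width_px: int) -> bool:
    """Rule with room for error: curvature plus a stroke width of 1 to 4 pixels."""
    return left_bowl_is_curved and 1 <= stroke_width_px <= 4

# The same rule accepts thin and thick renderings of the letter alike.
print(matches_g(True, 1))    # True  - a light typeface
print(matches_g(True, 4))    # True  - a bold one
print(matches_g(False, 2))   # False - no curved bowl, so not a 'g'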