Weekly News Digest
September 2, 2010 — In addition to this week's NewsBreaks article and the monthly NewsLink Spotlight, Information Today, Inc. (ITI) offers Weekly News Digests that feature recent product news and company announcements. Watch for additional coverage to appear in the next print issue of Information Today. For other up-to-the-minute news, check out ITI's Twitter account: @ITINewsBreaks.
IBM and the EU Collaborate on Digitization of Historic European Texts
IBM and the European Union (EU) have unveiled a new initiative called IMPACT (IMProving ACcess to Text). The project seeks to provide technology that will enable highly accurate digitization of rare and culturally significant historical texts on a massive scale. It will use “crowd computing” to verify and correct OCRed text.
The move expands the research collaboration between the EU and IBM, which now includes more than two dozen national libraries, research institutes, universities, and companies across Europe. Unlike past digitization projects, which produced static online libraries of texts, this wide-scale effort will offer institutions across Europe new tools and best practices for efficiently and accurately producing quality digital replicas of historically significant texts and making them widely available, editable, and searchable online.
Funded by the EU, IMPACT's research combines new web-enabled adaptive optical character recognition (OCR) software with crowd computing technology, a fast-growing concept in which individuals, or crowds, enhance a process or product by sharing their knowledge and expertise to improve its quality and efficiency. Combined, these technologies will for the first time allow institutions to adapt digitization to the idiosyncrasies of old fonts, anomalies, and even vocabularies, while reducing error rates by 35% and substitution rates by 75%.
While today’s OCR engines perform well with modern printed texts, the faded ink, age, and unusual shapes of older typefaces can lower recognition rates by up to 50% and require massive manual post-production review. Consequently, for large-scale projects such as this, the efficiency of post-production review of digitized text is crucial.
At the core of the digitization project lies a new collaborative correction system, designed by IBM researchers, that makes it simple and convenient for large groups of volunteers spread across the continent to verify the accuracy of processed texts and correct recognition mistakes using an online system. Moreover, the system can learn from its recognition errors and adapt automatically to the characters of a specific font.
Send correspondence concerning the Weekly News Digest to NewsBreaks Editor