April 26, 2004 - Melingo, Ltd. (http://www.melingo.com), a company that has provided advanced search capabilities for complex languages, has just introduced Morfix CL, its English-Arabic-English Cross-Language Search with Embedded Translation. What that means is that English-speaking researchers can search through Arabic material without knowing any Arabic at all, and see a results page with a translation of each Arabic word or phrase. Melingo, a subsidiary of Encyclopaedia Britannica, Inc., is carefully positioning its Morfix technology as a complement to other search engines. The company is concentrating its efforts on aiding the search process and not on highlighting the process of machine translation, which it says is still very inaccurate. Melingo claims that Morfix CL represents a breakthrough in Arabic language analysis and a boon to intelligence agencies and businesses, which today process growing amounts of Arabic data with limited numbers of qualified human translators.
Morfix (MORphologyFIX) is named after morphology, the science that deals with how words change their forms. It is the changing of forms that bedevils most search engines. Semitic languages, such as Hebrew and Arabic, have extreme morphological complexity and are thus notoriously the most difficult to search, according to Yoni Neeman, founder and CEO of Melingo. Neeman is an expert in natural language processing (NLP) who has worked on unraveling the complexities of Semitic languages since 1989. The company has offered its Morfix search capability for Hebrew, and, due to the increasing demand for handling Arabic content, has now introduced the new Morfix CL for Arabic.
While words in English might have 5 or 6 forms or inflections (word stem alternatives), Arabic words could have up to 10,000 inflections per word. As Neeman explained it, "words and articles are kind of glued together in Arabic." In addition, both the Hebrew and Arabic languages use spelling systems that neglect to include vowels, leading to a lot of ambiguity when words are read out of context. Imagine seeing an English word written as "fnd": it could mean found, find, fend, fund, fiend, or fond.
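The "fnd" ambiguity can be reproduced with a few lines of Python. This is only an illustration of the problem using English words and a naive vowel-stripping rule; it is not Melingo's method, and the function names `skeleton` and `group_by_skeleton` are invented for this sketch.

```python
# Toy illustration: stripping vowels collapses distinct words into one spelling.
# The vowel set and helper names are assumptions made for this example only.
VOWELS = set("aeiou")


def skeleton(word: str) -> str:
    """Return the word with its vowels removed (its consonant skeleton)."""
    return "".join(ch for ch in word.lower() if ch not in VOWELS)


def group_by_skeleton(words):
    """Map each consonant skeleton to the list of words that share it."""
    groups = {}
    for word in words:
        groups.setdefault(skeleton(word), []).append(word)
    return groups


words = ["found", "find", "fend", "fund", "fiend", "fond", "fold"]
groups = group_by_skeleton(words)
print(groups["fnd"])  # the six words from the article share one written form
```

A search engine that sees only the written form has no way to choose among the six candidates without context, which is the ambiguity described above.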
The Morfix technology platform is based on comprehensive lexical databases of meanings and a system that enables programming the complete grammar of a language into the system. "Because Morfix works with actual meanings, not just written words, it can accurately translate an English query into precise Arabic meanings and retrieve all relevant texts at lightning speed," says Neeman.
While I was skeptical about the benefit of even showing me a demo—how would I know what he was typing in the search box?—we conducted the searches in English as well as in Arabic and I was suitably impressed. The Morfix engine will search for an exact match, for the word and its synonyms (thesaurus search), for all words inflected from the same stem (morphological search), or for words derived from the same root (expanded search). The results screen presents abstracts with the search terms highlighted. For a demonstration of the cross-language English-Arabic search, see http://www.morfix.com/arabic.
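The four search modes can be pictured as four ways of expanding a query term before matching. The sketch below is a minimal illustration under loud assumptions: the tiny English lexicon, the thesaurus, and the names `Entry`, `expand`, and `search` are all hypothetical, and nothing here reflects how Morfix is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    stem: str  # shared by all inflections of one word
    root: str  # shared by all words derived from one root


# Hypothetical toy lexicon: surface form -> (stem, root).
LEXICON = {
    "write": Entry("write", "writ"),
    "writes": Entry("write", "writ"),
    "wrote": Entry("write", "writ"),
    "writer": Entry("writer", "writ"),
    "writers": Entry("writer", "writ"),
    "author": Entry("author", "author"),
    "authors": Entry("author", "author"),
}
# Hypothetical thesaurus: stem -> synonyms.
THESAURUS = {"writer": {"author"}, "author": {"writer"}}
MODES = ("exact", "thesaurus", "morphological", "expanded")


def expand(term: str, mode: str) -> set:
    """Return the set of surface forms to match for a term in a given mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    entry = LEXICON.get(term)
    if mode == "exact" or entry is None:
        return {term}
    if mode == "thesaurus":
        return {term} | THESAURUS.get(entry.stem, set())
    if mode == "morphological":
        return {w for w, e in LEXICON.items() if e.stem == entry.stem}
    # "expanded": every word derived from the same root
    return {w for w, e in LEXICON.items() if e.root == entry.root}


def search(docs, term: str, mode: str):
    """Return indices of documents containing any expanded form of the term."""
    forms = expand(term, mode)
    return [i for i, doc in enumerate(docs) if forms & set(doc.lower().split())]


docs = ["the writers met", "she wrote a letter", "two authors spoke"]
for mode in MODES:
    print(mode, search(docs, "writer", mode))
```

Each mode widens the set of matching forms in a different direction, so the same query retrieves different documents depending on the mode chosen.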
It did seem a little kludgy to me, mousing over single words and getting the translated text, with the possible variations for each word—and trying to discern the gist of a document. For example, an Arabic word might be shown translated as "fasten; strengthen; stabilize; establish; convict." It's just not the same as seeing a full rough translation. But, as Neeman explained, he feels that machine translation is currently inadequate to the task. He said that customers of Morfix CL would likely use it to locate relevant material and then refer the documents to human translators to render an accurate text.
Morfix can be licensed as a search engine on its own, with its own spidering and indexing modules, or as a plug-in to search engines and to databases, such as Microsoft SQL Server and Oracle. Neeman said the company has been talking with some Web search engines about possible partnerships. Pricing varies depending on the size of the enterprise, number of users, and volume of data. The new Morfix CL, which offers the cross-language capability, is currently available only for Arabic, though it may be introduced for Hebrew at a later date.
Melingo, Ltd., a wholly owned subsidiary of Encyclopaedia Britannica, Inc. since 2000, has its headquarters in Tel Aviv, Israel. Melingo's other products include automatic text-to-speech and text-to-phoneme conversion products, and the company is developing phonetic search products for speech search. Melingo's current customers include government intelligence agencies and business organizations.
Neeman said the company chose to implement Morfix on Hebrew and Arabic first because the benefit of using a unifying system to overcome variance and ambiguity is the greatest in these languages. But the Morfix technology platform can be applied to meaning-based searches for any language, which expands its application potential considerably. Britannica said it is seeking strategic partnerships for Melingo that would fully leverage the commercial potential of Morfix and Morfix CL both in the security markets worldwide and in consumer markets in the Arab world.
Melingo is not alone in tackling complex language barriers, and some companies are emphasizing machine translation capabilities. In December 2003, Language Weaver, an emerging software company developing statistical machine translation software (SMTS), announced the commercial availability of an Arabic to English language pair module for its automated translation product. Language Weaver's software is based on statistical machine translation research done at the University of Southern California's Information Sciences Institute (ISI). The company was founded in 2002 to commercialize the technology. One of its backers is In-Q-Tel (http://www.in-q-tel.org), a private not-for-profit venture group funded by the Central Intelligence Agency (CIA).
Other companies working in this space include Sakhr Software (http://www.sakhr.com), whose technology is shown implemented at http://www.ajeeb.com and which offers technologies that support FAST's enterprise search platform, and Xerox Research Centre Europe (http://www.xrce.xerox.com/competencies/content-analysis/arabic/). Basis Technology Corp. (http://www.basistech.com) offers the Rosette Arabic Language Analyzer (ARLA), a multi-platform linguistic engine that facilitates the analysis of documents written in Arabic.