LexisNexis (www.lexisnexis.com) recently released its new semantic search technology developed in alliance with PureDiscovery Corp. (www.purediscovery.com). The partnership agreement gives LexisNexis exclusive rights to apply PureDiscovery's KnowledgeGraph technology to intellectual property data. KnowledgeGraph is generally applicable to any unstructured data, including data held within corporate and governmental internal networks. While most common search engines use Boolean logic or ranking algorithms to produce answer sets, semantic search uses the science of meaning in language (semantics) to produce relevant search results. The new technology is available through the patent research and retrieval service LexisNexis TotalPatent, the automated patent application and analysis product LexisNexis PatentOptimizer, and the LexisNexis flagship online legal research service lexis.com.
The LexisNexis semantic search index covers more than 10 million full-text patent documents from the U.S. Patent and Trademark Office and journal articles and other documents from Elsevier. The semantic search function is applied to all the patent documents from authorities worldwide on TotalPatent and from 2,700-plus files on lexis.com covering Elsevier and other scientific articles; technology sources such as IP.com and Research Disclosure; and news sources.
Michael J. Hudelson, director of intellectual property at LexisNexis, described LexisNexis' new approach to semantic searching in his presentation, "Semantic Search - Opening the 'Black Box'" at the Patent Information Users Group (PIUG) 2009 Northeast Conference, in New Brunswick, N.J., on Oct. 13, 2009 (http://wiki.piug.org/x/MYCT). Hudelson described "the semantic advantage" as being able to go beyond keyword matching by matching on the meaning of words in a user's query or source document. He noted that this could produce relevant search results that do not contain any of the query words. Hudelson acknowledged that information professionals had serious concerns about previous implementations of semantic search because of the unexpected results and the lack of transparency and control over the semantic search process. In the past, users have not been able to see how a search result was generated or to enhance or affect search results by "engaging" with the search query.
Addressing 'Black Box' Concerns
LexisNexis has seriously addressed this "black box" perception of semantic search. Users enter search input text of up to 32,000 characters, which may be a substantial portion of a target patent document. That input can be searched immediately (feeling lucky?), which may take several minutes, or it can be sent for semantic analysis before the search is carried out. The technology analyzes input sentences or search terms and creates a set of 20 weighted search terms presented as a "QueryCloud" for review and editing by the searcher. Terms can be replaced with alternative terms, and each term's weighting may be set to one of six values: 4 for a concept that is mandatory in the search results; 3, 2, or 1 for varying prominence in the search results; 0 for a concept to be ignored; and -1 for a concept prohibited in the search results. When the user is satisfied with the search concepts and weighting, the semantic search is conducted with the search statement corresponding to the terms of the QueryCloud. User interaction with the query may continue after the results are obtained. The searcher may revisit the initial search input or may narrow the search results further with either Boolean or semantic search criteria. In other words, specific search terms may be absolutely required, or another QueryCloud may be generated with new terms for the narrowing concept.
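The QueryCloud weighting scale described above can be modeled in a few lines of Python. This is only an illustrative sketch, not LexisNexis code: the names `CloudTerm`, `passes_cloud`, and `active_terms` are hypothetical, and a document is simplified to a set of literal terms, whereas the actual engine matches on concepts rather than exact strings.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Weight values taken from the article's description of the QueryCloud.
MANDATORY = 4    # concept must appear in every result
IGNORED = 0      # concept plays no role in the search
PROHIBITED = -1  # concept must not appear in any result
ALLOWED_WEIGHTS = (-1, 0, 1, 2, 3, 4)


@dataclass
class CloudTerm:
    """One weighted term in a (hypothetical) QueryCloud model."""
    text: str
    weight: int

    def __post_init__(self) -> None:
        if self.weight not in ALLOWED_WEIGHTS:
            raise ValueError(f"weight must be one of {ALLOWED_WEIGHTS}")


def passes_cloud(doc_terms: Set[str], cloud: Iterable[CloudTerm]) -> bool:
    """Apply only the hard constraints: mandatory and prohibited terms."""
    for term in cloud:
        present = term.text in doc_terms
        if term.weight == MANDATORY and not present:
            return False
        if term.weight == PROHIBITED and present:
            return False
    return True


def active_terms(cloud: Iterable[CloudTerm]) -> List[CloudTerm]:
    """Terms that contribute positively to ranking (weights 1 through 4)."""
    return [t for t in cloud if t.weight > 0]


cloud = [
    CloudTerm("polymer", 4),    # mandatory
    CloudTerm("viscosity", 2),  # moderate prominence
    CloudTerm("adhesive", 0),   # ignored
    CloudTerm("metal", -1),     # prohibited
]
```

In this model, a document containing "polymer" and "viscosity" passes the filter, while one that lacks "polymer" or contains "metal" is excluded; terms with weights 1 through 3 influence only the ordering of results, not inclusion.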
While some of the black box nature of the semantic search technology is inherent, LexisNexis has tried to explain some of the more important aspects. Hudelson explained that the search input is analyzed by its semantic engine, known as the BrainSpace, which has "learned" from the set of 10 million U.S. patents and Elsevier nonpatent prior art sources to make relevant connections between concepts specific to patents and technical literature. The learning process continues as new patents and articles are fed to the BrainSpace regularly. This is critically important for nascent technologies whose terminology is new but that must still be searched in older prior art with earlier vocabulary. During the semantic analysis process, the engine directs the input to the most relevant of the 19 technology subject "brains," which is tasked with creating the QueryCloud. Each brain has learned from a technology library that is almost infinitely scalable and can include up to 6 million documents.
The process of ranking documents uses the following criteria, as provided to me by Jason Penrose, intellectual property specialist at LexisNexis:
- The frequency of a given word/concept is squared and then multiplied by the weight of that word/concept.
- The frequency of terms relative to other terms in a given document is scored for each term/concept.
- The uniqueness of the term/concept relative to all other words/concepts in the patent database is scored.
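The three criteria above can be sketched as a toy scoring function. The source does not say how the criteria are combined or how "uniqueness" is measured, so the choices here are assumptions made purely for illustration: the three factors are multiplied per term and summed per document, and uniqueness is approximated with a smoothed inverse document frequency. The function names and sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def score_document(
    doc_tokens: List[str],
    weights: Dict[str, int],
    doc_freq: Dict[str, int],
    n_docs: int,
) -> float:
    """Toy relevancy score built from the three stated criteria."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    score = 0.0
    for term, weight in weights.items():
        freq = counts.get(term, 0)
        if freq == 0 or weight <= 0:
            continue  # absent, ignored, or prohibited terms add nothing
        # Criterion 1: frequency squared, multiplied by the term weight.
        weighted_freq = (freq ** 2) * weight
        # Criterion 2: frequency relative to all terms in the document.
        relative_freq = freq / total
        # Criterion 3 (assumed form): rarer terms in the collection
        # score higher, via a smoothed inverse document frequency.
        uniqueness = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1
        # Assumption: the factors are multiplied, then summed over terms.
        score += weighted_freq * relative_freq * uniqueness
    return score


weights = {"polymer": 3, "viscosity": 1}
doc_freq = {"polymer": 50, "viscosity": 5}  # documents containing each term
doc_a = ["polymer", "polymer", "viscosity", "film"]
doc_b = ["polymer", "film", "film", "film"]

ranked = sorted(
    {"a": doc_a, "b": doc_b}.items(),
    key=lambda item: score_document(item[1], weights, doc_freq, 1000),
    reverse=True,
)
```

Because the frequency is squared, a document that repeats a heavily weighted term moves up the ranking quickly, which is why `doc_a` ranks ahead of `doc_b` in this example.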
Search results are presented in "relevancy" order, although LexisNexis does not provide a generally meaningless numerical relevancy rating, which is sometimes indicated as a percentage on other ranking systems. Penrose says there is high user satisfaction with the semantic search, although some users have said that they do not want or need transparency as long as the search process works for them.
Why Didn't I Find Known References?
I carried out some initial searching with the semantic search function in TotalPatent. The interface is easy to use, and the various aspects of query input, search term analysis, and results review and further narrowing proceeded as anticipated. On the other hand, the actual search results surprised me, as they had with other semantic search engines. I previously described a test of PatentCafe ("Freedom-to-Operate Patent Searching: My Six Basic Rules." Searcher: The Magazine for Database Professionals, Vol. 16, No. 5 (May 2008): pp. 34-39; www.infotoday.com/searcher/may08/index.shtml) in which I could not retrieve a known relevant patent by inputting the text of the first five paragraphs of the "Detailed Description of the Invention" of that patent. I did not retrieve the known patent with TotalPatent semantic search either, but the advantage with TotalPatent semantic search was that I could see why the target patent was not found: Semantic analysis provided a set of search terms that narrowed in on the physical properties and analytical methods to the exclusion of terms for the material or application. These results might have been very helpful under other circumstances, but they were not helpful for the particular search I was conducting.
I discussed the matter of not retrieving the initial model document with Peter Vanderheyden, LexisNexis vice president of global intellectual property. He said that this was more common than one might think, but the retrieved references must have been written in a way that is semantically consistent with the input text. The key for searchers is to evaluate the first few top-ranked documents, turn on hit-term highlighting, learn why the system retrieved those documents, and then manipulate the system accordingly. He has worked with clients who discovered useful references with modest changes in the search input.
I asked Vanderheyden about eliminating duplicate records found by both Boolean and semantic searching on TotalPatent and was pointed to an elegant alternative. One may carry out a Boolean search and then narrow the results with semantic search. Even though the text box for the narrowing search criteria is minuscule, one can paste a full 32K characters and carry out the full semantic search on the Boolean subset. Vanderheyden told me about two further pending enhancements: allowing a term to be mandatory without having high weighting and displaying the QueryCloud in tabular format. The company is working on many other client-requested improvements for its next version of the product. As for implementation on the other LexisNexis platform, nexis.com, Hudelson said that there is interest within the organization but no concrete plans yet.
LexisNexis Hears Customers
Glen Kotapish of PlanetPatent.com (www.planetpatent.com) says that the latent semantic analysis (LSA) search engines of TotalPatent and PatentCafe have been useful tools in his day-to-day patent search projects. He says they provide a complementary technique for filtering large amounts of information to find important documents that Boolean keyword and patent classification searching may miss. He doesn't expect the need for traditional searching to be eliminated. Kotapish considers LexisNexis TotalPatent's transparent and controllable semantic search technology to be innovative and likely to reduce users' concerns about inputting large amounts of data into a black box that produces mysterious results from an unknown process working in the background. He anticipates that semantic analysis will become an increasingly important tool to researchers in many industries.
Vanderheyden appreciates Kotapish's perspective. LexisNexis considers the searcher to be critical to the process. It's OK with LexisNexis if customers always use Boolean searching. According to Vanderheyden, many customers have said about semantic searching: "If LexisNexis gives me a black box solution, I won't buy it. And if you give me control, so much the better." LexisNexis appears to be listening well.