Search and Business Intelligence: The Humble Inverted Index Wins Again
by Avi Rappoport
Posted On October 14, 2010
In this modern age, big institutions have giant piles of data about all their operations: The question is what to do with all those bits. Extracting the right information can help avoid waste, delays, system failures, even terrorist threats. For example, look at Toyota’s customer support and repair data: If management had been looking, it would have noticed that something was going terribly wrong. Business intelligence (BI) means mining through all that digital data—in legacy systems, databases, and even spreadsheets—and reporting what’s going on. This generally requires creating aggregations that need server farms with big hard disks and lots of memory. But text search engine technology, using sophisticated versions of inverted indexing, can create files that are effectively shadow databases in much less space, optimized for fast retrieval. These search/BI hybrids also provide sophisticated access to the contents of text fields, making customers very happy indeed.

Institutions with all that data know its value: Managers need to see regular reports about their core functions, such as flight data and per-store sales, as well as about outsourcing and suppliers. They want to explore interactively, get custom views of the data, and even use the tools to predict trends over days, months, and years. A big barrier is combining disparate sources, such as adding a point-of-sale database and a Twitter feed to sales analytics. And, as with most things, people start expecting more and more: Each dashboard or report can be very useful, but information needs are not static, and BI should be prepared to change on demand.

When implemented well, BI can be a huge success. York Manufacturing used BI analytics for its wall-covering business to “guide manufacturing decisions, reducing production waste by 45% and saving more than $2.4 million USD. York has also realized additional savings such as reducing machine changeover and setup times by 50%.” These kinds of opportunities apply to retail, financial services, libraries, and, most obviously, government intelligence agencies.

BI has grown organically along with databases. Many BI products are based on relatively old technology: SQL (Structured Query Language) queries against relational databases. However, these databases are designed for transactions and simple queries, while the content gets more and more complex. OLAP (Online Analytical Processing) applications extract combined data sets for inspection, performing a prefiltering function that anticipates analytical needs. This can get very resource-intensive, both in hardware and effort, and traditionally the IT department has to make all the changes, often leading to big backlogs.
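
To make that prefiltering idea concrete, here is a minimal sketch in Python (store names and figures are invented, not drawn from any vendor) of the kind of roll-up an OLAP extract performs: raw transactions are pre-aggregated into a small store-by-month summary so that reports read the summary instead of rescanning every row.

    from collections import defaultdict

    # Toy point-of-sale transactions: (store, date, amount).
    # All values here are illustrative.
    transactions = [
        ("store_12", "2010-09-03", 19.99),
        ("store_12", "2010-09-17", 5.49),
        ("store_07", "2010-10-02", 112.00),
    ]

    # Pre-aggregate into a (store, month) summary, the way an OLAP extract
    # anticipates analytical needs, so reports never scan the raw rows.
    cube = defaultdict(float)
    for store, date, amount in transactions:
        month = date[:7]  # roll the day up to the month level
        cube[(store, month)] += amount

    print(cube[("store_12", "2010-09")])  # 25.48

The catch, as noted above, is that every new question tends to need a new extract, which is where the IT backlog comes from.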

This leads to an opportunity: Modern text search engines can extract data to index very quickly and efficiently. They can read spreadsheets and office files, Oracle, MySQL, SQL Server, and hundreds of other file formats and data sources, and generally have a pipeline that applies various processing steps to the data, depending on the source. Then, instead of a database of tables and relations with prespecified rows and columns, text search engines store their content in inverted indexes: a list of terms with links to the source of each. These are called “inverted” because they map each word to the items that contain it, rather than storing each item’s words in their original order. That’s how Google works: Each word of each page is stored in the index, and when a user sends a search, the engine looks up the words in the indexes and finds the associated pages. Very large search engines, whether on the web or indexing trillions of internal transactions, require a bit more logic to distribute queries among multiple index partitions and merge the results. But the basic structure is still the inverted index. While some information retrieval researchers consider other indexing forms more interesting algorithmically, it turns out that the inverted format is the most effective by far. (See “Inverted Files for Text Search Engines” by Justin Zobel and Alistair Moffat, ACM Computing Surveys, Vol. 38, No. 2, 2006, Article 6. DOI: 10.1145/1132956.1132959.)
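
As a rough illustration of the data structure (a toy Python sketch with made-up documents, not any vendor’s implementation), an inverted index maps each term to the set of items containing it, and a query simply looks up and intersects those sets:

    from collections import defaultdict

    docs = {
        1: "warranty claim brake pad wear",
        2: "brake rotor replaced under warranty",
        3: "customer reports engine noise",
    }

    # Inverted index: each term points to the documents that contain it,
    # rather than each document listing its terms in order.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    def search(query):
        # AND semantics: intersect the postings of every query term.
        postings = [index.get(term, set()) for term in query.split()]
        return set.intersection(*postings) if postings else set()

    print(search("warranty brake"))  # {1, 2}

Production engines add compression, term positions, and relevance scoring on top, but the lookup-and-merge shape stays the same, which is why the structure also works when queries are spread across many partitions.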

There is a bit more to it than that because the index has to track the structure of the original item: metadata tags, document sections, or database field names. Modern enterprise search engines already do that in order to improve retrieval and relevance ranking and to enable faceted metadata search and browse based on the source data structure. So the relatively low-overhead index, source flexibility, speed and quality of text search, and support for dynamic navigation mean that these vendors have a chance at the lucrative BI market.
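
A hypothetical extension of the same sketch shows why tracking structure matters: if terms are indexed per field, a text query can be combined with counts over the structured fields, which is all a basic facet display needs. (The field names and records below are invented for illustration.)

    from collections import defaultdict, Counter

    records = [
        {"id": 1, "text": "brake pad wear", "supplier": "Acme", "year": 2009},
        {"id": 2, "text": "brake noise", "supplier": "Acme", "year": 2010},
        {"id": 3, "text": "engine stall", "supplier": "Bolt", "year": 2010},
    ]

    # Index terms per field so queries can be scoped to the text while the
    # structured fields drive faceted counts.
    index = defaultdict(set)
    for rec in records:
        for term in rec["text"].split():
            index[("text", term)].add(rec["id"])

    def facet_counts(term, facet):
        hits = index.get(("text", term), set())
        return Counter(rec[facet] for rec in records if rec["id"] in hits)

    print(facet_counts("brake", "supplier"))  # Counter({'Acme': 2})
    print(facet_counts("brake", "year"))      # Counter({2009: 1, 2010: 1})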

Paul Sonderegger of Endeca describes how the company worked with an automobile manufacturer to analyze warranty claims on various auto parts. By indexing both the claims and the parts information, mixing in repair notes and connecting to suppliers, they were able to find the root causes of several parts defects and address them quickly, without the overhead of premodeling all combinations of parts on vehicles. Likewise, MarkLogic’s Walter Underwood describes a government intelligence office that digs through news, blogs, and other web content on a daily basis, writing simple XQuery commands for “ad hoc mining in the data midden.”

Attivio, Exorbyte, Coveo, Microsoft FAST Search, and others are leveraging their expertise with unstructured data, multiple data sources, speed of retrieval, and relevance ranking to provide a new and valuable approach to BI. Because there will never be less data.


Avi Rappoport is available for search engine consulting on both small and large projects. She is also the editor of www.searchtools.com.

Email Avi Rappoport
Comments
Posted By Charlie Hull, 2/16/2011 5:52:20 AM

Great article Avi. There's also a lot of open source search technology available, with the advantage that scaling to very large data volumes can be far more economical as there are no license fees to pay.
Posted By Dan Nicollet, Exorbyte Inc., 10/29/2010 3:40:24 PM

Great post Avi and thanks for the mention of Exorbyte. I have follow-up thoughts for you and your readers at:
http://blog.exorbyte.com/2010/10/now-we-can-start-building-business-intelligence/
Regards,
Dan Nicollet
MD - Exorbyte
Posted By Lindsey Niedzielski, 10/22/2010 7:05:40 PM

Great post Avi. This is a great resource for IM professionals. We have a community (www.openmethodology.org) and have bookmarked this post for our users. Look forward to reading your work in the future.
