The state of the art in text analytics applications

Started by CrackSmokeRepublican, March 08, 2011, 11:28:29 PM



December 1, 2010
The state of the art in text analytics applications

Text analytics application areas typically fall into one or more of three broad, often overlapping domains:

    * Understanding the opinions of customers, prospects, or other groups. This can be based on any combination of documents the user organization controls (email, surveys, warranty reports, call center logs, etc.) or of public-domain documents such as blogs, forum posts, and tweets. The former is usually called Voice of the Customer (VotC), while the latter is Voice of the Market (VotM).
    * Detecting and identifying problems. This can happen across many domains — VotC, VotM, diagnosing equipment malfunctions, identifying bad guys (from terrorists to fraudsters), or even getting early warnings of infectious disease outbreaks.
    * Aiding text search, custom publishing, and other electronic document-shuffling use cases, often via document augmentation.

For several years, I've been distressed at the lack of progress in text analytics or, as it used to be called, text mining. Yes, the rise of sentiment analysis has been impressive, and higher volumes of text data are being processed than ever before. But otherwise, there's been a lot of the same old, same old. Most actual deployed applications of text analytics or text mining go something like this:

    * A bunch of documents are analyzed to ascertain the ideas expressed in them.
    * A count is made as to how many times each idea turns up.
    * The application user notices any surprisingly large numbers, and as a result pays attention to the corresponding ideas.
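That baseline pattern can be sketched in a few lines. The sketch below is an illustrative toy under stated assumptions: the documents are made up, single words stand in for the extracted "ideas," and the surprise threshold is arbitrary; it is not any vendor's actual pipeline.

```python
from collections import Counter

# Hypothetical document sets: a baseline period and a current period.
baseline_docs = ["battery life is great", "battery died fast", "screen cracked"]
current_docs = ["battery swelling", "battery swelling reported", "battery died",
                "screen cracked", "battery swelling again"]

def idea_counts(docs):
    """Crudely treat each word as an 'idea'; real systems extract concepts."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

baseline = idea_counts(baseline_docs)
current = idea_counts(current_docs)

# Flag ideas whose frequency jumped versus the baseline (arbitrary thresholds).
for idea, n in current.most_common():
    prior = baseline.get(idea, 0)
    if n >= 3 and n > 2 * max(prior, 1):
        print(f"surprising: {idea!r} rose from {prior} to {n}")
```

Running this flags only "swelling" (0 mentions before, 3 now); everything hard, from concept extraction to choosing the surprise threshold, is hidden behind the toy word counts.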

Often, it seems desirable to integrate text analytics with business intelligence and/or predictive analytics tools that operate on tabular data. Even so, such integration is most commonly weak or nonexistent. Apart from the usual reasons for silos of automation, I blame this lack on a mismatch in precision, among other reasons. A 500% increase in mentions of a subject could be simple coincidence, or the result of a single identifiable press article. In comparison, a 5% increase in a conventional business metric might be much more important.
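The precision mismatch can be made concrete with a deliberately naive significance test. In the sketch below (made-up numbers, a crude assumption that events are independent Poisson draws), the buzz spike actually scores as more significant than the business metric; the comments note why that conclusion is untrustworthy for text.

```python
from math import sqrt

def poisson_z(observed, expected):
    """Naive z-score treating counts as independent Poisson events
    (stdev ~ sqrt(expected)). Reasonable for many business metrics,
    dubious for mentions: one press article can spawn many echoes."""
    return (observed - expected) / sqrt(expected)

# A 500% jump on a tiny base: 2 expected mentions -> 12 observed.
z_buzz = poisson_z(12, 2)

# A 5% rise on a large, well-behaved metric: 10,000 -> 10,500 events.
z_metric = poisson_z(10_500, 10_000)

print(round(z_buzz, 2), round(z_metric, 2))  # 7.07 5.0
```

The naive test rates the mention spike as the stronger signal, but because the ten extra mentions may all trace back to one article, the text-side number carries far less information than the 5% move in the tabular metric. That is exactly the mismatch that makes integration awkward.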

But in fairness, the text analytics innovation picture hasn't been quite as bleak as what I've been painting so far. While standalone, passively-reported text analytics is indeed the baseline, there are some interesting exceptions. For example:

    * I once confirmed that SPSS customer Cablecom's statistical models for churn and the like absolutely included text data; Cablecom even assigned different weights to the same apparent level of emotion depending on whether the text was in German, French, or Italian. Vertica recently told me of a Vertica/Hadoop customer doing something similar, except for the multilingual aspect. And the end of a 2008 SAS-based paper makes similar claims.
    * There long* have been some examples of fact extraction that don't really fit into my three buckets above. For example, researchers mine collections of articles to try to determine biochemical or biological pathways that would not be apparent from examining single research studies alone.
    * It also has long* been the case that some bad-guy-finding applications — especially in the anti-terrorism area — used text analytics to populate state-of-the-art graph-oriented data analysis tools.

*When it comes to text analytics, "long" means "at least for the past several years."
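Cablecom's language-dependent weighting could look something like the following sketch. The weight values, function name, and scaling scheme are entirely made up for illustration; the actual SPSS models are not public.

```python
# Hypothetical calibration: the same raw emotion score counts for more
# or less as a churn-model feature depending on the text's language.
LANGUAGE_WEIGHT = {"de": 1.0, "fr": 1.3, "it": 1.5}  # invented values

def churn_text_feature(sentiment: float, lang: str) -> float:
    """Scale a raw sentiment score by a per-language calibration weight."""
    return sentiment * LANGUAGE_WEIGHT.get(lang, 1.0)

# The same apparent level of emotion yields different feature values.
print(churn_text_feature(2.0, "it"), churn_text_feature(2.0, "de"))  # 3.0 2.0
```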

In more recent examples:

    * Greenplum built a document recommender for law firms that does hard-core statistical analysis to determine which 0.1% of a document set lawyers might actually want to see, and which then learns from users' feedback after they respond to initial result sets.
    * Information extracted from investment news gets included into automated trading algorithms. This was unusual technology a couple of years ago, but is more common today.
    * After a series of mergers, Attensity now uses marketing-oriented text analytics in at least three different ways:
          o Attensity text analytics feeds marketing dashboards just as it always did.
          o Attensity text analytics triggers alerts, as I wish dashboards and business intelligence tools more often did, the false positives problem notwithstanding.
          o Attensity text analytics triggers concrete workflows, for example routing specific social media hits for priority response.
          o And in one example that did not actually get into production, a very large social networking company correlated word usage (e.g., choice among different synonyms) against user characteristics such as age and gender.
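A workflow trigger of the third kind might be sketched like this. Everything here (the trigger terms, the follower threshold, and the queue names) is hypothetical, meant only to show the shape of text-analytics-driven routing rather than Attensity's actual logic.

```python
# Hypothetical routing of social media hits to response queues.
HIGH_PRIORITY_TERMS = {"lawsuit", "outage", "refund", "broken"}

def route(hit: dict) -> str:
    """Route a social media hit based on crude text and author signals."""
    text = hit["text"].lower()
    urgent = any(term in text for term in HIGH_PRIORITY_TERMS)
    influential = hit.get("followers", 0) > 10_000
    if urgent and influential:
        return "priority_response"
    if urgent:
        return "standard_response"
    return "dashboard_only"  # just feed the metrics, as before

print(route({"text": "Total outage again!?", "followers": 50_000}))  # priority_response
```

The point of the pattern is that the text analytics output triggers a concrete action (a queue assignment) instead of merely incrementing a dashboard counter.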

Finally there are some applications that, while fitting the standard template, just strike me as getting to unusually sophisticated levels of analysis. For example, Vertica told me of another Vertica/Hadoop case where VotM document analysis is carried out to the level of observing which order brand names appear in, and adjusting that for whether or not it was just an alphabetical list.
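The ordering adjustment described above can be made concrete with a small sketch. The brand names and the alphabetical heuristic are illustrative assumptions, not the actual Vertica/Hadoop implementation.

```python
from typing import Optional

BRANDS = ["acme", "bolt", "zenith"]  # hypothetical brand set

def brand_order(text: str) -> list:
    """Brands in order of first appearance (lowercased match; crude on purpose)."""
    lower = text.lower()
    positions = [(lower.find(b), b) for b in BRANDS if b in lower]
    return [b for _, b in sorted(positions)]

def informative_order(text: str) -> Optional[list]:
    """Return the observed order, unless it is just alphabetical (likely a list)."""
    order = brand_order(text)
    if len(order) >= 2 and order == sorted(order):
        return None  # alphabetical order: treat as uninformative
    return order

print(informative_order("We compared Acme, Bolt and Zenith."))  # None
print(informative_order("Zenith beats Acme hands down."))       # ['zenith', 'acme']
```

Discarding alphabetical sequences keeps trivially ordered brand lists from polluting the signal that mention order might otherwise carry.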

I suspect text analytics is about to become more interesting again.

http://www.texttechnologies.com/2010/12 ... /#more-443