Andy Flint of CloudTech recently wrote, “Analytics depends on data — the more, the merrier. If we’re trying to model, say, the behaviour of customers responding to marketing offers or clicking through a website, we can build a far stronger model with 10,000 samples than with 100. You would think, then, that the rise of Big Data and its seemingly inexhaustible supply of data would be every analyst’s dream. But Big Data poses its own challenges for modeling. Much of Big Data isn’t what we have historically thought of as ‘data’ at all. In fact, 80% of Big Data is raw, unstructured information, such as text, and doesn’t neatly fit into the columns and rows that feed most modeling programs.”

He continues, “Using text mining algorithms, we can scour large text data sources to find associations between words and outcomes. It’s perhaps easiest to look at a case of supervised modeling, where the outcome we are modeling for is known — for example, credit fraud or customer profitability. We use that outcome to direct the search for repeating terms (words or phrases) that have real signal strength — that is, they are often present in records associated with one side of our possible outcomes. We then build up a term-document matrix, which lists all the unique terms in the corpus of text we are examining, across all the cases (or documents) in the analysis. Among this often enormous list of terms, which terms are used most frequently? More pointedly, which terms appear most reliably in connection to one outcome (e.g., purchases of product X) versus the other (no such purchases)?”
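To make the workflow Flint describes concrete, here is a minimal sketch: build a term-document matrix from a small labelled corpus, then rank terms by how reliably they appear in documents tied to one outcome versus the other. The toy corpus, the labels, the use of scikit-learn's CountVectorizer, and the simple difference-in-frequency score are all illustrative assumptions, not details from the original article.

```python
# Minimal sketch: term-document matrix plus a simple term/outcome association score.
# Corpus, labels, and scoring rule are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy labelled corpus: 1 = customer purchased product X, 0 = no purchase.
documents = [
    "asked about discount and shipping options",
    "complained about shipping delay, requested refund",
    "loved the demo, asked for a discount code",
    "general enquiry about opening hours",
]
outcomes = np.array([1, 0, 1, 0])

# Term-document matrix: one row per document, one column per unique term (or bigram).
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
tdm = vectorizer.fit_transform(documents)          # sparse matrix of term counts
terms = vectorizer.get_feature_names_out()

# For each term, the share of documents in each outcome group that contain it.
present = (tdm > 0).toarray().astype(float)
freq_purchase = present[outcomes == 1].mean(axis=0)
freq_no_purchase = present[outcomes == 0].mean(axis=0)

# Crude "signal strength": how much more often a term shows up with one outcome
# than with the other. Real projects would use a chi-squared test, information
# gain, or similar, but the idea is the same.
signal = freq_purchase - freq_no_purchase
ranked = sorted(zip(terms, signal), key=lambda pair: abs(pair[1]), reverse=True)

for term, score in ranked[:10]:
    print(f"{term:25s} {score:+.2f}")
```

On a corpus this small the scores are meaningless, but on thousands of records the same ranking surfaces the terms that most reliably separate, say, purchasers of product X from non-purchasers, which is exactly the input a downstream predictive model can use.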

Read more here.

Image: Courtesy of Flickr/Paul L Dineen