This session covers data organization and data preparation methods that support analytics. Source data organizations presented include: normalized, structured, semi-structured and unstructured. Analytical data organizations presented include: network/graph, tree, dimensional, normalized and flattened. Each of these data organizations is suited to particular types of analysis, such as text mining, visual analytics, regression analysis and market basket analysis.
Data structures support an overall methodology for analytics. This talk builds on CRISP-DM (the Cross-Industry Standard Process for Data Mining) and focuses on its data understanding, data preparation and modeling phases. Data is prepared from its current forms, such as third normal form databases and flat files, into formats suitable for analytics.
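As a rough illustration of that preparation step, the sketch below flattens two normalized tables into one row per customer. The table names, column names and values (`customers`, `orders`, `amount` and so on) are hypothetical and chosen only for the example; a real preparation job would read from the actual source database or files.

```python
from collections import defaultdict

# Hypothetical normalized source tables (illustrative data only).
customers = [
    {"customer_id": 1, "region": "East"},
    {"customer_id": 2, "region": "West"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 50.0},
    {"order_id": 11, "customer_id": 1, "amount": 30.0},
    {"order_id": 12, "customer_id": 2, "amount": 20.0},
]


def flatten(customers, orders):
    """Return one flat row per customer with aggregated order columns."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    # Collapse the one-to-many orders table into per-customer aggregates.
    for order in orders:
        totals[order["customer_id"]] += order["amount"]
        counts[order["customer_id"]] += 1
    # Join the aggregates back onto the customer attributes.
    return [
        {
            "customer_id": c["customer_id"],
            "region": c["region"],
            "order_count": counts[c["customer_id"]],
            "total_amount": totals[c["customer_id"]],
        }
        for c in customers
    ]


flat_rows = flatten(customers, orders)
for row in flat_rows:
    print(row)
```

Each output row is a fixed-width record, which is the shape most modeling tools expect. Aggregating the child table (here a count and a sum) is one design choice among several; pivoting order categories into separate columns is another.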
The flattened data organization is specially featured. Today’s predictive analytics models, such as regression, decision trees and neural networks, are geared toward flat inputs. This means that other structures, such as third normal form databases, semi-structured messages and unstructured data, must be prepared to support analysis. Further preparation of the data also improves the analysis. For example, dimension reduction is used to find the most predictive data to feed into predictive analytics. This session explains methods for finding how strongly input variables correlate with each other and with the predicted value.
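One simple way to act on those correlations is a correlation filter: keep inputs that correlate well with the predicted value and drop inputs that largely duplicate one already kept. The sketch below is a minimal version of that idea, not necessarily the method presented in the session. The column names, the sample values and the two thresholds (`min_target_corr`, `max_pair_corr`) are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_inputs(columns, target, min_target_corr=0.5, max_pair_corr=0.9):
    """Keep inputs correlated with the target but not redundant with each other."""
    # Consider the inputs most correlated with the target first.
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    selected = []
    for name in ranked:
        if abs(pearson(columns[name], target)) < min_target_corr:
            continue  # too weakly related to the predicted value
        if any(
            abs(pearson(columns[name], columns[kept])) > max_pair_corr
            for kept in selected
        ):
            continue  # nearly duplicates an input that is already kept
        selected.append(name)
    return selected


# Hypothetical flat inputs and predicted value (made-up numbers).
columns = {
    "ad_spend": [1, 2, 3, 4, 5],
    "ad_spend_alt": [1, 2, 3, 4, 6],  # near-duplicate of ad_spend
    "store_visits": [1, 3, 2, 5, 4],
    "noise": [5, 1, 4, 2, 3],
}
target = [2, 4, 6, 8, 10]

selected = select_inputs(columns, target)
print(selected)
```

With this sample data the filter keeps `ad_spend` and `store_visits`, drops `ad_spend_alt` as redundant and drops `noise` as weakly predictive. Pearson correlation only captures linear relationships, so other dimension reduction techniques may be a better fit when the relationships are non-linear.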
When data is categorized as unstructured, it means that its structure has not yet been discovered and documented. For example, we may think of images of people's faces as unstructured. However, advances in facial recognition have identified approximately 99 data points that describe the face and its expressions. These data points can be placed in a flat structure, which machine learning algorithms can then process.
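To make the flat structure concrete, the sketch below turns a list of extracted points into a single row with one column per coordinate. It assumes each data point is an (x, y) pair that some recognition step has already produced; the function name, the column naming scheme and the three sample points are hypothetical and stand in for the full set of points.

```python
def flatten_landmarks(image_id, landmarks):
    """Turn a list of (x, y) points into one flat row keyed by column name."""
    row = {"image_id": image_id}
    for index, (x, y) in enumerate(landmarks):
        row[f"p{index}_x"] = x
        row[f"p{index}_y"] = y
    return row


# Three made-up points stand in for the full set of facial data points.
example_row = flatten_landmarks(
    "img_001", [(0.31, 0.42), (0.69, 0.41), (0.50, 0.63)]
)
print(example_row)
```

Because every image yields the same columns in the same order, rows built this way can be stacked into a table and passed to the same algorithms as any other flat input.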
Thank you to TDWI, the sponsors and the audience. Microsoft provided an excellent meeting venue, and the audience asked good questions about data structures and data science.