Big Data Strategy: A pragmatic approach

Developing a Big Data Strategy

Is a big data strategy a business initiative or an IT initiative?  It is clearly a technical solution that addresses not only the new data sets available to the business but, more importantly, supports new business approaches to leveraging data as an information source.  For decades, businesses have been trying to make sense of disorganized transaction data sprinkled throughout the enterprise, relying heavily on human analysts working against data warehouses.  The traditional steps include collecting, cleaning, conforming, consolidating and organizing the data so business analysts can run ad hoc queries to answer key business questions such as “How are my sales, by region?” or “How is my inventory, by product?”  Of course, to query data, it must be structured in a data warehouse and browsed via business intelligence tools, correct?  Well, not anymore.  Big data has changed the rules.

Big data can be defined as a system that collects and analyzes data either already structured by corporate IT or still in its raw, unstructured format.  The key advantage of big data is that there are no rules on how the data must be organized; its primary value is the ability to correlate and analyze any piece of data against any other and detect trends in those correlations, known as behavior patterns.  Now that data nirvana has been introduced, let’s discuss reality:  it’s absolutely true that the big data paradigm does not require preconceived, modeled structures when storing data. However, revealing correlations and behavior patterns across multiple disparate data sources still requires technical preparation and discipline.  Before embarking on a new big data project, a policy for handling this data must be in place.  There are four steps to a big data strategy:

1. Initiate

  • Determine business objective and success criteria – The business must have a use case with a measurable requirement, e.g., increase sales by recommending targeted products in real time based on customer peer behavior.
  • Planning – A big data project requires sophisticated orchestration because it introduces new hardware, software, resources and data sets all at once.  It will involve technical toolsets the business and IT have never used before, and bring together data sets never before integrated. New policies, procedures, training and project planning must be carefully provisioned.
  • Address compliance, privacy, and security – Data privacy is a growing concern, often driven by fear among those lacking data literacy.  The fact is, corporations already know who their customers are and understand their behavior patterns; the big data paradigm merely allows companies to determine customer behavior with better accuracy.  As data is propagated between companies, e.g., information from social media, regulatory compliance requirements must be strictly adhered to. For example, the data must be anonymized, meaning all identifiable information such as name, street and phone number must be removed (a field-level anonymization sketch follows).
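
What anonymization means in practice should be spelled out in the policy. Below is a minimal sketch, assuming hypothetical field names and a salted hash as the surrogate key so anonymized records can still be correlated:

```python
import hashlib

# Hypothetical policy: these direct identifiers must never leave the platform.
PII_FIELDS = {"name", "street", "phone"}

def anonymize(record: dict, salt: str) -> dict:
    """Drop direct identifiers and replace the customer id with a salted
    hash, so records can still be correlated without exposing identity."""
    out = {k: v for k, v in record.items() if k not in PII_FIELDS}
    out["customer_key"] = hashlib.sha256(
        (salt + record["customer_id"]).encode()).hexdigest()
    del out["customer_id"]
    return out

print(anonymize({"customer_id": "C-1001", "name": "Jane Doe",
                 "street": "1 Main St", "phone": "555-0100",
                 "last_purchase": "2014-03-02"}, salt="per-project-secret"))
```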

2. Integrate

  • Collect all relevant data points – Big data solutions ingest data warehouse data, raw transaction data, and unstructured log data; a financial big data solution may include trade data, market data, position data, news feeds, customer reference data, weblogs and system logs.  Repeatable processes must be established for consuming each data source, and techniques inherited from traditional data warehousing such as change data capture, micro-batch processing and real-time data streaming still apply (a micro-batch ingestion sketch follows this list).
  • Build infrastructure and interfaces – Big data has become synonymous with Hadoop.  Regardless of specific technologies – which are changing even as this paper is being written – a new set of technologies will need to be installed and configured.  A decision must be made whether to build your big data solution in the cloud or on premises, and the accountants will have as much to say about it as the technologists: at the time of this writing, only on-premises solutions can be treated as a capital expense (CapEx) and depreciated as an asset, while the cost of cloud solutions is recorded as an operating expense (OpEx), which may weigh on your income statement. Of course, building in the cloud reduces, if not eliminates, the initial start-up investment, provides almost immediate readiness to begin the project and allows nearly unlimited scalability.
  • Prepare data for analysis – The power of the big data paradigm is its intrinsic ability to run machine-learning algorithms over data drawn from any source, structured or not. However, as with any mathematical equation, specific variables must be provided.  As in traditional data warehouse ETL (extract, transform and load) processing, data within the big data platform usually must be read, transformed and written back to storage before it can be consumed for statistical and algorithmic analysis (see the preparation sketch after this list).
  • Data cleansing and data mastering – Contrary to some beliefs, this requirement does not go away!  If the big data paradigm is to become the new corporate analytics platform, it must be able to align customers, products, employees, locations, etc. regardless of the data source.  Moreover, known data quality issues that jeopardized the credibility of analyses before big data will have the same impact on big data analytics if not properly addressed (a record-matching sketch follows this list).
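
To make the collection step concrete, here is a minimal micro-batch ingestion sketch in Python; the directory paths, file pattern and one-minute cadence are all assumptions for illustration:

```python
import shutil
import time
from pathlib import Path

SOURCE = Path("/data/incoming")   # hypothetical drop zone for raw log files
LANDING = Path("/data/landing")   # hypothetical landing zone feeding the cluster
seen = set()                      # names of files already ingested

def ingest_batch():
    """Copy any not-yet-seen files into the landing zone."""
    for f in sorted(SOURCE.glob("*.log")):
        if f.name not in seen:
            shutil.copy(f, LANDING / f.name)
            seen.add(f.name)

while True:
    ingest_batch()
    time.sleep(60)  # one micro-batch per minute
```

A production pipeline would persist the ledger of processed files and handle partial writes, but the shape – poll, batch, land – is the same.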
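
For the preparation step, a read-transform-write pass might look like the following PySpark sketch; the input path, column names and output location are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("prepare-weblogs").getOrCreate()

# Read raw, semi-structured weblog events (hypothetical path and fields).
raw = spark.read.json("/data/landing/weblogs/")

# Derive typed columns and drop events that cannot be tied to a customer.
prepared = (raw
            .withColumn("event_date", F.to_date("timestamp"))
            .filter(F.col("customer_id").isNotNull())
            .select("customer_id", "event_date", "url", "duration_ms"))

# Write back to storage in a columnar format ready for analysis.
prepared.write.mode("overwrite").parquet("/data/prepared/weblogs/")
```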
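
And for cleansing and mastering, the simplest form of cross-source alignment is matching on a canonicalized key, sketched here with made-up records:

```python
import re

def canonical(name: str) -> str:
    """Crude canonical form: lowercase with punctuation and spaces removed.
    Real mastering adds fuzzy matching and survivorship rules."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

crm = {"C-1001": "Acme Corp."}    # customer names keyed by CRM id
weblog = {"W-77": "ACME CORP"}    # company names parsed out of weblogs

master = {canonical(n): cid for cid, n in crm.items()}
for wid, name in weblog.items():
    print(wid, "->", master.get(canonical(name), "unmatched"))
```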

3. Optimize

  • Data scientists – The paradigm shift to big data introduces a new role in the corporate organization: the data scientist.  This role requires a deep understanding of advanced mathematics, systems engineering, data engineering and domain (business) expertise. In practice, it’s common to field a data science team in which statisticians, technologists and business subject matter experts collectively solve problems and deliver solutions.
  • Perform analysis – Because of the variety of data now available and the volume we can now process, the approach to analysis in a big data environment contrasts sharply with traditional methods.  Techniques such as machine learning, A/B testing, cluster analysis, natural language processing, predictive modeling, sentiment analysis, time series analysis and spatial data analysis are all now available via algorithms provided in open-source libraries. It is the data scientist’s responsibility to determine the appropriate techniques and algorithms for a specific business question (a clustering sketch follows this list).
  • System refinement and tuning – Every big data strategy must include continuous monitoring and maintenance of the technical solution.  As data volume and analytic requirements grow, the configuration of the solution must evolve with them: the distributed system will need nodes added, data redistributed and rebalanced, replication adjusted, and the configuration for all of the above continuously fine-tuned for optimal performance.
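
As one concrete instance of the techniques above, a cluster analysis over a few hypothetical customer features can be sketched with scikit-learn’s KMeans; the feature choices and values are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature matrix, one row per customer:
# visits per month, average order value, days since last purchase.
X = np.array([[12, 80.0,   3],
              [ 1,  5.0, 200],
              [ 9, 60.0,  10],
              [ 2, 10.0, 150]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)  # cluster assignment per customer, e.g. [0 1 0 1]
```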

4. Leverage

  • Prepare for culture shock – Before a big data project is launched, a strategic readiness assessment should gauge how well the organization will adopt the new paradigm.  Business analysts will need to be retrained or repurposed.  The goal of shifting to a big data platform may include moving from reactive analysis (did that campaign work?) to proactive analysis (what should our next campaign offer?), because we can now influence non-buyers to follow the behavior patterns of loyal customers, or re-engage active customers when their behavior pattern begins to look like that of a lost customer.
  • NoSQL – The big data store (HDFS) is batch oriented and not designed for interactive access the way traditional relational databases and business intelligence tools are.  To accommodate near real-time interaction, a new database class has emerged: NoSQL (short for ‘Not Only SQL’). These databases usually live within the big data platform and come in a few varieties: key-value, column-oriented, document and graph databases.  NoSQL product and technology assessment and selection should be part of any big data strategy (a key-value sketch follows this list).
  • Disseminate analyses – Notice we’re disseminating analyses, not data.  Again, the shift from humans inspecting atomic-level data in search of patterns to algorithms that detect patterns and predict outcomes must be communicated to the users – some will adapt, some won’t, and for this reason traditional data warehousing and business intelligence will persist. End-user big data analysis tools must natively produce MapReduce jobs, utilize Pig or Hive, or interact with their own in-memory databases (a minimal streaming mapper follows this list).
  • Make informed decisions – Now armed with a complete big data ecosystem, including recommendations created by data scientists, it’s possible to close the loop: feed the results of the analysis into the engines that create the customer experience – your website, marketing department, sales force, product development and customer service.  Moreover, the big data machine can now consume the recommendations produced by its own analytics, correlate them to new customer behavior patterns, and quantify their effectiveness.
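
To illustrate the near real-time serving side, a key-value store such as Redis can hold the latest model output per customer for fast lookups; the key scheme and fields below are assumptions for the sketch:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store the latest recommendation for a customer as a hash
# (hypothetical key scheme "reco:<customer_id>" and fields).
r.hset("reco:C-1001", mapping={"product": "SKU-42", "score": "0.93"})

# A web tier can now serve this directly, bypassing batch HDFS jobs.
print(r.hgetall("reco:C-1001"))
```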
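
And to show the simplest form of “natively produce MapReduce jobs”, here is a Hadoop Streaming style mapper in Python that emits one count per requested page; the weblog layout is an assumption, and a companion reducer would sum the counts per key:

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming pipes each input line in on stdin and
# collects tab-separated key/value pairs from stdout.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 6:            # crude Apache common-log parsing
        print(f"{fields[6]}\t1")   # field 7 is the requested path
```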

As with any new initiative, there is risk in implementing a big data strategy, and a tool, a language or a platform alone does not make a solution. Methodical adherence to the four steps – Initiate, Integrate, Optimize and Leverage – in pursuit of a clear business objective is what turns a big data project into a success.