Scoping a Big Data Implementation
Posted on: 26-09-2012 | By: richard.gale | In: Data
Tags: ACID, BASE, Big Data, CAP Theorem, Complexity, Financial Services, NoSQL, Operations Management, SQL, Stonebraker
Big Data, or, “Too Much Data?”
As Big Data starts to peak in the hype cycle, the scramble to capitalize on the trend begins in earnest. There are some basic frameworks and checklists that can be used to get an overview of what “Big Data” is, what it could mean in your case and whether it should mean as much as it currently does. Is Big Data getting too big for its own boots?
Big Data, or, “What problem are you trying to solve?”
Implementing a Big Data Strategy is no different from any other technology implementation – regardless of what the proponents of data management technologies will tell you. As before, you will have a set of business use cases that will determine the technology implementation.
The only real difference is a different view of the data. Before “Big Data”, the assumption was that most accessible data was essentially relational and that the only pertinent questions were “Which vendor will we choose?” and “Do we really need the Enterprise version?”
With the growth of data, both human-generated content (Web 2.0) and machine-generated data, new technologies have emerged that can cope with these new challenges.
This paradigm, which travels under the moniker of Big Data, includes a method for framing your data, known variously as the “Three V’s”, the “Four V’s” or the “Five V’s”.
It is rarely acknowledged that these terms were mostly lifted from the field of operations management. In that discipline the four “V’s” of Volume, Variety, Variation and Visibility help describe the differences between operational activities:
- Volume: Hamburger production or limited-edition mugs?
- Variety: Bespoke taxi journeys or LED televisions?
- Variation (or elasticity of demand): Customer demand in ski lodges is clearly seasonal; cigarette demand less so.
- Visibility: The level of customer contact.
Modified for the challenges of Big Data, the V’s ask us the following questions:
a. Volume:
How much data does your current data set contain? Against how many hundreds of gigabytes, terabytes or petabytes will you be querying? How will you determine the volumes you expect: by calculation, or by measurement and extrapolation?
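As a rough illustration, that extrapolation can be sketched in a few lines of Python; the record sizes, daily volumes and growth rate below are purely illustrative assumptions, not figures from any real system:

```python
# Back-of-envelope volume estimate: measure today's footprint, then extrapolate.
# All figures below are illustrative assumptions.
avg_record_bytes = 2_000        # assumed average size of one record
records_per_day = 5_000_000     # assumed current daily record count
retention_years = 3             # assumed retention period
annual_growth = 0.40            # assumed year-on-year growth in daily volume

daily_gb = avg_record_bytes * records_per_day / 1e9

total_tb = 0.0
daily = daily_gb
for year in range(retention_years):
    total_tb += daily * 365 / 1e3   # one year's worth at this year's daily rate
    daily *= 1 + annual_growth

print(f"~{daily_gb:.1f} GB/day today; ~{total_tb:.1f} TB over "
      f"{retention_years} years, before replication or indexing overhead")
```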
b. Velocity:
What ingest rates do we expect? How quickly must we absorb the data to make it ready for use? How many calls to the data management system can we expect? How will you measure current demand, and what are the assumptions behind your future expectations?
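One minimal way to answer the “measure current demand” question is to derive the ingest rate from timestamps in an existing log sample. The sketch below assumes you can already parse those timestamps; the sample data is invented:

```python
# Measure current demand: derive the ingest rate from timestamps in a log
# sample. The sample here is invented; in practice parse your own logs.
from datetime import datetime

def ingest_rate(timestamps):
    """Events per second across a sample of event timestamps."""
    if len(timestamps) < 2:
        return 0.0
    span = (max(timestamps) - min(timestamps)).total_seconds()
    return len(timestamps) / span if span else float("inf")

sample = [datetime(2012, 9, 26, 9, 0, s) for s in range(50)]  # illustrative
print(f"~{ingest_rate(sample):.1f} events/s in this sample")
```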
c. Variety:
What kind of data do we have? Do we have fully structured, relational data? Do we just have a scalability problem with our current RDBMS? Or do we have sparse, semi-structured data? Or are we planning something like distributed file storage? How are we going to model the data? Do we want to enforce a schema prior to persistence, or enforce structure on read?
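The “schema prior to persistence versus structure on read” question can be made concrete with a small sketch; the field names and validation rules below are illustrative assumptions only:

```python
# Schema on write vs schema on read, sketched with plain dictionaries.
# Field names and rules are illustrative assumptions.
import json

REQUIRED = {"account_id": int, "amount": float}

def write_with_schema(record):
    """Schema on write: reject anything that does not match the model up front."""
    for field, ftype in REQUIRED.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"bad or missing field: {field}")
    return json.dumps(record)   # persisted only if it conforms

def read_with_schema(raw):
    """Schema on read: persist anything, impose structure only when queried."""
    record = json.loads(raw)
    return {
        "account_id": int(record.get("account_id", -1)),
        "amount": float(record.get("amount", 0.0)),
    }

print(write_with_schema({"account_id": 42, "amount": 10.5}))
print(read_with_schema('{"account_id": "42", "note": "no amount field"}'))
```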
d. Value:
How valuable is the data? Is it high-value transactional data (my bank account) that is needed to run the business? Or is it something we need to store to satisfy regulatory requirements? This is the value of the data prior to ingestion – what about digested data? How valuable is the data post-processing? What are you going to do with the data?
e. Variation:
Another “V” that operations management offers, but one rarely seen in the Big Data discussion, is Variation: will our data have seasonal variability, or will it change based on specific events? In other words, will we have a requirement for “burstable”, elastic scalability, and will we therefore need a cloud-based capacity provider?
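A crude way to judge whether elasticity matters is to compare capacity sized for the busiest period with capacity that tracks demand; the monthly figures below are invented for illustration:

```python
# Does elasticity matter? Compare capacity sized for the peak month with
# capacity that tracks demand. The monthly figures are invented.
monthly_demand = [40, 42, 45, 50, 60, 95, 160, 170, 90, 55, 45, 42]

peak_sized = max(monthly_demand) * len(monthly_demand)  # provision for the peak, all year
demand_sized = sum(monthly_demand)                      # pay roughly for what is used

print(f"peak-sized capacity: {peak_sized} units, demand-sized: {demand_sized} units, "
      f"ratio {peak_sized / demand_sized:.1f}x")
```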
Where your problem sits within this frame will be a direct outcome of a detailed understanding of your specific use case. Michael Stonebraker has argued that there could be a data management solution for each class of use case; like Facebook and Google, you might find that your data and data requirements are so specific that you need to develop your own data management solution…
BASE vs ACID, or, “What are your consistency requirements?”
A key debate that emerged with the growing use of distributed data management systems is the CAP theorem. The CAP theorem, also known as “Brewer’s Theorem”, states that a distributed data management system cannot guarantee Consistency, Availability and Partition tolerance at the same time. If you wish to scale out significantly you must forfeit one of the three; in practice, because network partitions cannot be designed away, this usually means trading consistency against availability:
- Consistent: Every node sees the same view of the data
- Available: Every request receives a response, whether success or failure
- Partition Tolerant: The system continues to operate despite messages being lost between nodes (a network partition).
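As a toy illustration of the choice a partition forces, the Python sketch below shows a replica that either rejects writes it cannot coordinate (keeping consistency, losing availability) or accepts them locally and risks divergence (keeping availability, losing consistency). It is a simplification, not a model of any particular product:

```python
# Toy illustration only: during a partition a replica either rejects writes it
# cannot coordinate (consistent but unavailable) or accepts them locally and
# risks divergence (available but inconsistent).
class Replica:
    def __init__(self, name, prefer_consistency):
        self.name = name
        self.prefer_consistency = prefer_consistency
        self.data = {}

    def write(self, key, value, partitioned):
        if partitioned and self.prefer_consistency:
            return f"{self.name}: write rejected (unavailable during partition)"
        self.data[key] = value   # accepted locally; replicas may now differ
        return f"{self.name}: write accepted"

cp_node = Replica("cp-node", prefer_consistency=True)
ap_node = Replica("ap-node", prefer_consistency=False)
for node in (cp_node, ap_node):
    print(node.write("likes:post42", 10, partitioned=True))
```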
The more familiar ACID, on the other hand, states that any transaction in a database system must be:
- Atomic: Either all of the transaction succeeds or none of it does
- Consistent: The database must be in a consistent state after the transaction
- Isolated: Transactions do not interfere with one another
- Durable: Committed transactions are persisted.
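Atomicity can be seen in miniature with Python’s built-in sqlite3 module: a transfer between two accounts either commits as a whole or rolls back entirely. The table and account names here are illustrative:

```python
# Atomicity in miniature with Python's built-in sqlite3: a transfer between
# two accounts either commits as a whole or rolls back entirely.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # transaction: commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # forces a rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except ValueError:
        pass

transfer(conn, "alice", "bob", 150)  # rolled back: alice cannot go negative
transfer(conn, "alice", "bob", 40)   # commits
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 60), ('bob', 40)]
```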
The NoSQL movement has made heavy use of the BASE consistency model:
- Basically Available: The system is, more or less, there when you need it.
- Soft state: The state of the system may change over time, even without new input (think of the time-to-live of a datum).
- Eventually Consistent: Not all nodes are necessarily consistent at the same moment; this may be tolerable for social media “likes”, perhaps less so for debit and credit transactions…
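Eventual consistency can be sketched with two replicas that accept writes independently and later reconcile on a last-write-wins basis. Real systems use vector clocks, CRDTs or similar, so treat this only as an illustration:

```python
# Eventual consistency sketched with two replicas reconciled on a
# last-write-wins basis; real systems use vector clocks, CRDTs or similar.
import time

class EventualReplica:
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value):
        self.store[key] = (time.time(), value)

    def merge(self, other):
        """Anti-entropy pass: keep the newest write for each key on both sides."""
        for key in set(self.store) | set(other.store):
            candidates = [r.store[key] for r in (self, other) if key in r.store]
            newest = max(candidates)  # last write wins
            self.store[key] = other.store[key] = newest

a, b = EventualReplica(), EventualReplica()
a.write("likes:post42", 10)   # accepted on replica A during a partition
b.write("likes:post42", 12)   # a later write lands on replica B
a.merge(b)                    # the partition heals; the replicas converge
print(a.store["likes:post42"], b.store["likes:post42"])
```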
Your design will therefore have to take a position on BASE vs ACID and on where it sits along that continuum; this will mainly be a function of the Value of the data (at both ingestion and digestion time). Essentially, the questions to ask are: how much ACID can be sacrificed, and at what cost? Which of Consistency, Availability or Partition Tolerance will you forfeit?
Understanding the Families, or, “What is your data model?”
A more detailed understanding of the data, the Variety, will indicate what our management system will house and therefore what data models it must support. It may well be that our data is suitably modeled as Key-Value pairs; it may be that a document store is more suitable, since it provides a repository that allows for flexible modeling and later adjustment (i.e. we are not totally clear on the data). Perhaps our use case is best modeled with graph technologies? Or it may be that we are still wedded to the relational paradigm and cannot sacrifice any part of ACID, in which case, subject to our scalability requirements, we may need to be looking at “NewSQL” offerings.
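To make the families concrete, the sketch below shows the same simple fact expressed in key-value, document, graph and relational shapes; the names and structures are illustrative and not tied to any product:

```python
# The same fact ("alice follows bob") in the model families discussed above.
# Names and shapes are illustrative, not tied to any product.

# Key-value: an opaque value behind a key; fast lookups, no ad hoc queries.
kv_store = {"follows:alice": ["bob"]}

# Document: a flexible nested record; the shape can evolve per document.
doc_store = {"user:alice": {"name": "Alice",
                            "follows": [{"user": "bob", "since": "2012-09-26"}]}}

# Graph: nodes and labelled edges; suited to traversals (friends of friends).
graph_nodes = {"alice", "bob"}
graph_edges = [("alice", "FOLLOWS", "bob")]

# Relational, for contrast: rows in a join table with an enforced schema.
follows_table = [("alice", "bob")]  # (follower, followee)

print(kv_store, doc_store, graph_edges, follows_table, sep="\n")
```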
The future, or, “When is Big too Big?”
Beyond asking whether we have a big data problem, we might consider whether we want one.
Do we have policies on use of private data? Do we archive too much business data, beyond regulatory requirements? Do we archive too much machine-generated data? Do we purge and expire datasets? Do we really expect to derive value from large datasets? What do we have to store to satisfy legal requirements, what do we need to store to operate the business, what do we want to store to understand our business and finally what do we think we need?
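One way to turn these questions into something actionable is a simple retention rule: classify each dataset by the reason it is kept and flag anything past its window. The categories, periods and inventory below are assumptions for illustration, not a recommendation:

```python
# Classify each dataset by the reason it is kept and flag anything past its
# retention window. Categories, periods and inventory are assumptions.
from datetime import date, timedelta

RETENTION = {
    "regulatory":  timedelta(days=365 * 7),
    "operational": timedelta(days=365),
    "analytical":  timedelta(days=180),
    "speculative": timedelta(days=90),   # "we think we might need it"
}

datasets = [  # illustrative inventory: (name, reason kept, last used)
    ("trade_archive",   "regulatory",  date(2008, 1, 1)),
    ("web_clickstream", "speculative", date(2012, 3, 1)),
]

today = date(2012, 9, 26)
for name, reason, last_used in datasets:
    if today - last_used > RETENTION[reason]:
        print(f"{name}: candidate for purge ({reason}, last used {last_used})")
```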
While the argument that commoditization allows cheaper storage and analysis of large datasets is sound, it should not be carte blanche to store anything and everything. In all the hype around Big Data, it is worth bearing in mind that a high percentage of the data stored will never be looked at again, let alone used productively. The impact of these unused, untapped and stagnant pools of data on the bottom line, and on the environment, should not be overlooked.
Guest author: Damian Spendel – Damian has spent his professional life bringing value to organisations with new technology. He is currently working for a global bank helping them implement big data technologies. You can contact Damian at damian.spendel@gmail.com