Scoping a Big Data Implementation

Posted on : 26-09-2012 | By : richard.gale | In : Data

Big Data or, “Too Much Data?”

As Big Data starts to peak in the hype cycle, the scramble to capitalize on the trend begins in earnest. There are some basic frameworks and checklists that can be used to get an overview of what “Big Data” is, what it could mean in your case and whether it should mean as much as it does at the moment. Is Big Data getting too big for its own boots?

Big Data, or, “What problem are you trying to solve?”

Implementing a Big Data Strategy is no different from any other technology implementation – regardless of what the proponents of data management technologies will tell you. As before, you will have a set of business use cases that will determine the technology implementation.

The only real difference is a different view on the data. Before “Big Data”, the assumption was that most accessible data was essentially relational and that the only pertinent questions were “Which vendor will we choose?” and “Do we really need the Enterprise version?”

With the growth of data, both human generated content (Web 2.0) and machine generated, new technologies that can cope with the new data challenges have emerged.

This paradigm – under the moniker of Big Data – includes a method for understanding how to frame your data.  This is known as either the “Three V’s”, “Four V’s”, or “Five V’s”.

It is rarely acknowledged that these terms were mostly lifted from the field of operations management.  In this discipline the four “V’s” of Volume, Variety, Variation and Visibility help understand differences between operational activities:

  • Volume: hamburger production or limited edition mugs?
  • Variety: Bespoke taxi journeys or LED televisions?
  • Variation (or elasticity of demand): Customer demand in ski lodges is clearly seasonal; cigarette demand less so.
  • Visibility: the level of customer contact.

Modified for the challenges of Big Data the V’s ask us the following questions:

a. Volumes:

How much data will your current data set consist of? How many hundreds of gigabytes, terabytes or petabytes will you be querying against? How will you determine the volumes you expect: by calculation, or by measurement and extrapolation?
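The "measurement and extrapolation" route can be sketched in a few lines. This is a minimal illustration, assuming compound monthly growth; the function name, the 500 GB starting point and the 5% growth rate are hypothetical figures, not taken from any real system.

```python
def project_volume_gb(current_gb: float, monthly_growth_rate: float, months: int) -> float:
    """Extrapolate a measured data volume forward, assuming compound monthly growth."""
    if current_gb < 0 or months < 0:
        raise ValueError("current_gb and months must be non-negative")
    return current_gb * (1 + monthly_growth_rate) ** months


# Hypothetical inputs: 500 GB measured today, growing 5% per month.
projected = project_volume_gb(500, 0.05, 24)
print(f"Projected volume after 24 months: {projected:.0f} GB")
```

The value of the exercise lies less in the number than in forcing the growth assumption into the open, where it can be challenged.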

b. Velocity:

What ingest rates do we expect? How quickly must we absorb the data to make it ready to use? How many calls to the data management system (DMS) can we expect? How will you measure current demand and what are the assumptions behind future expectations?

c. Variety:

What kind of data do we have? Do we have fully structured, relational data? Do we just have a scalability problem with our current RDBMS? Or do we have sparse/semi-structured data? Or are we planning something like distributed file storage? How are we going to model the data? Do we want to enforce a schema prior to persistence or enforce structure on read?
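The last question, schema before persistence versus structure on read, can be made concrete with a small sketch. Everything here is illustrative: the two-field schema, the function names and the in-memory list standing in for a store are all assumptions.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"user_id": int, "event": str}


def validate(record):
    """Raise ValueError if the record does not match the schema."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return record


def write_with_schema(store, record):
    """Schema on write: reject bad records before they are persisted."""
    store.append(validate(record))


def write_raw(store, raw_line):
    """Schema on read: persist anything, interpret it later."""
    store.append(raw_line)


def read_with_schema(store):
    """Apply structure at query time, skipping lines that do not fit."""
    results = []
    for raw_line in store:
        try:
            results.append(validate(json.loads(raw_line)))
        except (ValueError, TypeError):
            continue  # malformed or non-conforming line: ignore at read time
    return results
```

Schema on write pushes the cost of bad data onto the producer; schema on read keeps everything and pushes that cost onto every query.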

d. Value

How valuable is the data?  Is it high-value transactional data (my bank account) that is needed to run the business?  Or is it something we need to store to satisfy regulatory requirements?  This is the value of the data prior to ingestion – what about digested data?  How valuable is the data post-processing?  What are you going to do with the data?

e. Variation:

Another “V” that operations management offers, but which is rarely seen in the Big Data discussion, is Variation: will our data have seasonal variability, or will it change based on specific events? That is, will we have a requirement for “burstable” capacity, and will we therefore require elastic scalability? Will we need a cloud-based capacity provider?
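One rough way to test for a “burstable” requirement is to compare peak demand with average demand. The sketch below assumes daily request counts are available; the threshold of 2.0 is an arbitrary illustrative choice, not an industry rule.

```python
from statistics import mean


def peak_to_mean_ratio(daily_requests):
    """Return peak demand divided by average demand over the period."""
    if not daily_requests:
        raise ValueError("need at least one observation")
    average = mean(daily_requests)
    if average == 0:
        raise ValueError("average demand is zero")
    return max(daily_requests) / average


def needs_elastic_capacity(daily_requests, threshold=2.0):
    """Flag workloads whose peak is at least `threshold` times the average (hypothetical rule)."""
    return peak_to_mean_ratio(daily_requests) >= threshold
```

A ratio close to 1 suggests steady demand that fixed capacity can serve; a high ratio suggests paying for peak capacity all year would be wasteful.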

Understanding where the problem sits by using this frame will be a direct outcome of a detailed understanding of your specific use case. Michael Stonebraker has claimed there could be a data management solution for each use case; like Facebook and Google, it might be that your data and data requirements are so specific you need to develop your own data management solution…

BASE vs ACID or, “What are your consistency requirements?”

A key debate that emerged with the growing use of distributed data management systems is the CAP theorem.  The CAP theorem, also known as “Brewer’s Theorem”, states that a distributed data management system cannot be Consistent, Available, and Partition tolerant at the same time.  You must forfeit one – and therefore accept a weaker consistency model – if you wish to significantly scale out:

  • Consistent: Every node has a consistent view of the data.
  • Available: Every request receives an acknowledgement.
  • Partition Tolerant: The system continues to operate despite message or data loss.

The more familiar ACID, on the other hand, states that any transaction in a database system must be:

  • Atomic: Either the whole transaction succeeds or none of it does.
  • Consistent: The database must be in a consistent state after the transaction.
  • Isolated: Transactions do not interfere with one another.
  • Durable: Committed transactions are persisted.
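Atomicity is the easiest of these properties to demonstrate. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the accounts table, the names and the balances are hypothetical. Using the connection as a context manager commits on success and rolls back if an exception is raised.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn


def transfer(conn, source, target, amount):
    """Move funds atomically: both updates are committed, or neither is."""
    with conn:  # commits on success, rolls back on any exception
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, source)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (source,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # triggers the rollback
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, target)
        )


def balances(conn):
    """Return all balances as a dict for inspection."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A failed transfer leaves both balances exactly as they were, which is the guarantee that high-value transactional data usually cannot do without.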

The NoSQL movement has made heavy use of the BASE consistency model:

  • Basically Available: more or less there.
  • Soft-state: effectively the Time-to-Live of a datum.
  • Eventually Consistent: not all nodes are necessarily immediately consistent; this is maybe tolerable for social media “likes”, perhaps less for debit and credit transactions…
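Eventual consistency can be illustrated with two toy replicas that accept writes independently and reconcile later. This sketch uses last-write-wins on a caller-supplied timestamp, which is a deliberate simplification; the class and function names are hypothetical and no real product's API is implied.

```python
class Replica:
    """A toy replica storing key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        """Accept a write only if it is newer than what we already hold."""
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        """Return the locally known value, which may be stale."""
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange state in both directions so each replica keeps the newer version."""
    for key, (timestamp, value) in list(a.data.items()):
        b.write(key, value, timestamp)
    for key, (timestamp, value) in list(b.data.items()):
        a.write(key, value, timestamp)
```

Between the two writes and the call to `sync`, the replicas give different answers for the same key. That window is what you accept for a “like” counter and probably reject for an account balance.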

You will therefore need to take a position on BASE vs ACID in your design and decide where you want to sit on the continuum; this will mainly be a function of the Value of the data (at both ingestion and digestion time). Essentially, the questions to ask are how much ACID can be sacrificed, and at what cost. Which of Consistency, Availability, or Partition Tolerance will you forfeit?

Understanding the Families, or, “What is your data model?”

A more detailed understanding of the data – the Variety – will give us an indication of the data our management system will house, and therefore of any requirements for data models. It may well be that our data is suitably modeled via Key-Value; it may well be that a document store is more suitable, as it provides a repository that allows for flexible modeling and adjustments later (i.e. we are not totally clear on the data). Perhaps our use case is best modeled by graph technologies? It may well be that we are still subject to the relational paradigm and cannot sacrifice any form of ACID, in which case, subject to our scalability requirements, we may need to be looking at “NewSQL” offerings.
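To make the families a little more tangible, here is the same small piece of information held in three shapes using plain Python structures. The users, ids and helper functions are invented for illustration and do not represent any particular product.

```python
# Key-value: an opaque value fetched by a single key; the store knows nothing about its contents.
kv_store = {"user:42": '{"name": "Ada", "follows": [7]}'}

# Document: nested records whose fields can be queried, with no fixed schema.
documents = [
    {"_id": 42, "name": "Ada", "follows": [7]},
    {"_id": 7, "name": "Bob", "follows": []},
]

# Graph: relationships are first-class; here a simple adjacency mapping of "follows" edges.
follows_edges = {42: {7}, 7: set()}


def find_by_name(docs, name):
    """Document-style query: filter on a field inside the record."""
    return [doc for doc in docs if doc.get("name") == name]


def followers_of(edges, target):
    """Graph-style query: who has an edge pointing at the target?"""
    return {source for source, targets in edges.items() if target in targets}
```

Which shape fits best depends on the questions you expect to ask: lookups by key, filters on fields, or traversals of relationships.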

The future, or, “When is Big too Big?”

Beyond asking whether we have a big data problem, it might be worth considering whether we want one.

Do we have policies on use of private data?  Do we archive too much business data, beyond regulatory requirements?  Do we archive too much machine-generated data?  Do we purge and expire datasets?  Do we really expect to derive value from large datasets?   What do we have to store to satisfy legal requirements, what do we need to store to operate the business, what do we want to store to understand our business and finally what do we think we need?
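The “purge and expire” question lends itself to a simple policy check. The sketch below assumes each dataset carries a category and a creation date; the categories and retention periods are hypothetical placeholders, not legal guidance.

```python
from datetime import datetime, timedelta

# Hypothetical retention periods per category of data.
RETENTION = {
    "regulatory": timedelta(days=7 * 365),
    "operational": timedelta(days=365),
    "machine_log": timedelta(days=30),
}


def purge_expired(datasets, now):
    """Return only the datasets still within their retention period.

    Datasets with an unknown category are kept, so that nothing is
    deleted without an explicit policy.
    """
    kept = []
    for dataset in datasets:
        limit = RETENTION.get(dataset["category"])
        if limit is None or now - dataset["created"] <= limit:
            kept.append(dataset)
    return kept
```

Writing the policy down as data makes the final question in the paragraph above, what do we merely think we need, much harder to dodge.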

While the argument that commoditization allows for cheaper storage and analysis of large datasets is valid and sound, this should not be a carte blanche to store anything and everything. In all the hype around Big Data, it’s probably worth bearing in mind that a high percentage of the data stored will never be looked at again, let alone used productively. The impact of these unused, untapped and stagnant pools of data on the bottom line – and the environment – should not be overlooked.

Guest author: Damian Spendel – Damian has spent his professional life bringing value to organisations with new technology. He is currently working for a global bank helping them implement big data technologies. You can contact Damian at damian.spendel@gmail.com

 

And I still haven’t found what I’m looking for…

Posted on : 26-09-2012 | By : richard.gale | In : General News

Google powers the web and helps us find what we want. It has transformed our lives and made finding information so much easier.

We now need to plan only over shorter timeframes, or not at all.

How many times have you left the house or office without much of an idea of where you are going, what it looks like or even how to get there, apart from your starting point? Google searches allow us to get on with our business without having to think ahead.

How many times have we heard ‘what would we do without Google’ – there are even doomsday sites with end of the world scenarios if Google stops working. In fact if you type the phrase ‘what would happen without’ into Google, the first auto suggestion is ‘Google’ and ‘the Sun’ comes in second…

So Google is a good thing – it can ‘do no evil’ but it can sometimes be very annoying with no real alternative:

Google is now benign but obviously commercial – it needs to make (a lot of) money, so it needs to ensure that (a) people use it and (b) advertisers pay for it.

Whilst in a monopoly position it will ensure both of these things, and Google is trying every trick in the book to keep that position (from major acquisitions such as YouTube to bringing out a multitude of new products and services in ‘beta’ mode to see which ones fly).

Monopolies are generally not good in the long term – whether they be an all-powerful government, state or organisation. Competition, or at least a choice, generally improves both the entity and the person. Taken to the extreme, dictatorships always fail: once your clients, citizens or staff stop asking questions, you either lose them or the ability of the organisation to grow and succeed diminishes (you’ve got to be very, very good to be right all of the time…). We’ve always found that collaborative teams work better than ‘command & control’ organisations in the businesses we’ve worked with.

 

So Google does power the web and helps us find what we want… unless you want to find the site for a hotel and not the top 50 pages trying to sell you cheap rooms at the same hotel.

If you want to access information from a small organisation that has a marketable product – like a hotel room – then there is little or no chance that the bed & breakfast or family hotel can compete with the large web comparison sites. They have many more marketing dollars combined with a much higher level of traffic which pushes them to the top of the Google search tree.

There are other sites which provide alternative searching methods:

Facebook – this is now an amazingly popular search engine – the concept of qualified, context sensitive searches that rely on your connections must be the future of information location

Twitter – almost instant feedback, if you have enough followers – the power of the crowd to help you choose and decide

Amazon – OK, the focus is on products, but we’ve been surprised (and scared) by how often the ‘people also searched for’ selections are things that you are interested in – some quite bizarre!

Bing – no one uses it – do you?

Yahoo – ditto

Wolfram Alpha – the perfect homework answer site (it’s not only good for this – it is really good!)

Homepageseek.com –  great at finding smaller organisations (homepage doesn’t look nice but gets good results)

Infospace – consolidated searches from Yahoo, Google, Bing – in fact from most of the above engines

 

Google is dominant in the search world – it would be interesting to see how many home pages are set to Google.com. But the results are too many and if you’re not on the first (maybe second) page of search results then you don’t exist.

Search has got to change towards giving you the information you want, and weaving in feedback, interests and searches from your friends can do this. Google Circles is starting to do this, but does it have a wide enough user base?

Let us know what you think about the future of search – we’ve only scratched the surface on this (and I’ve given up trying to find that hotel).

Technology Innovation – “Life moves pretty fast…”

Posted on : 25-09-2012 | By : john.vincent | In : Cloud, Data, Innovation

We recently held an event with senior technology leaders where we discussed the current innovation landscape and had some new technology companies present in the areas of Social Media, Data Science and Big Data Analytics. Whilst putting together the schedule and material, I was reminded of a quote from that classic 80s film, Ferris Bueller’s Day Off:

“Life moves pretty fast. If you don’t stop and look around once in a while, you could miss it”

When you look at the challenges facing technology leadership today, this does seem very relevant. Organisations are fighting hard just to stand still (or survive)… trying to do more with less, in both staff and budget. And whilst they deal with this prevailing climate, the world around them is changing at an ever increasing rate. Where does Technology Innovation fit in then? Well, for many, it doesn’t. There’s no time and certainly no budget to look at new ways of doing things. However, it does depend a little on definition.

  • Is switching to more of a consumption based/utility model, be that cloud or whatever makes it more palatable to communicate, classified as innovation?
  • Is using any of the “big data” technologies to consolidate the many pools of unstructured and structured data into a single query-able infrastructure innovation?
  • Is providing a BYOD service for staff, or providing iPads for executives or sales staff to do presentations or interface with clients, innovation?

No, not really. This is simply the evolution of technology. The question is, can some technology organisations themselves even keep up with this? We were interested in the results of the 2012 Gartner CIO Agenda Report. The 3 technology areas that CIOs ranked highest in terms of priority were:

  1. Analytics and Business Intelligence
  2. Mobile Technologies
  3. Cloud Computing (SaaS, IaaS, PaaS)

That in itself isn’t overly surprising. What we found more interesting was looking at how these CIOs saw the technologies evolving from Emerging, through Developing and to Mainstream. We work a lot with Financial Services companies, so have picked that vertical for the graphic below;

The first area around Big Data/Analytics is largely in line with our view of the market. We see a lot of activity in this space (and some significant hype as well). However, we do concur that by 2015 we expect to see this become Mainstream, with an increased focus on Data Science as a practice.

Mobile has certainly emerged already and we would expect this to be more in line with the first category. On the device side, technology is moving at a fast pace (in the mobile handset space, look at the VIRTUS chipset, which transmits large volumes of data at ultra-high speeds of a reported 2 Gigabits per second. That’s 1,000 times faster than Bluetooth!).

In the area of corporate device support, business application delivery and BYOD, we already see a lot of traction in some organisations. Alongside this, new entrants are disrupting the market in terms of mobile payments (such as Monitise).

Lastly, and most surprisingly, whilst financial services see Cloud delivery as a top priority they also see it as Emerging from now through the next 5 years. That can’t be right, can it? (Btw – if you look at the Retail vertical for the same questions, they see all three priorities as Mainstream in the same period).

That brings us back to the question… what do CIOs consider as Innovation? Reading between the lines of the Gartner survey, it clearly differs by vertical. Are financial services organisations less innovative? I’m not sure they are… more conservative, perhaps, but that is to be understood to some degree (see the recently launched Fintech Innovation Lab sponsored by Accenture and many FS firms).

No, what would worry me as a leader within FS is the opening comment from Mr Bueller. Technology and Innovation are certainly moving fast, and perhaps the pressure on operational efficiencies, whilst undoubtedly needed, could ultimately detract from bringing innovation to benefit the business and drive competitive value?

There is also a risk that in this climate and with barriers to entry reducing, new entrants could actually gain market share with more agile, functionally rich products and services. We wrote before about the rise of new technology entrepreneurs…there is certainly a danger that this talent pool completely by-passes the financial services technology sector.

Perhaps we do need to “take a moment to stop and look around”. Who in our organisation is responsible for Innovation? Do we have effective Process and Governance? Do we nurture ideas from Concept through to Commercialisation? Some food for thought…