Do you believe that your legacy systems are preventing digital transformation?

Posted on : 14-03-2019 | By : richard.gale | In : Data, Finance, FinTech, Innovation, Uncategorized



According to the results of our recent Broadgate Futures Survey, more than half of our clients agreed that digital transformation within their organisation was being hampered by legacy systems. Indeed, no one “strongly disagreed”, confirming the extent of the problem.

Many comments suggested that this was not simply a case of budget constraints: the sheer size, scale and complexity of the transition had deterred organisations, which feared they were not adequately equipped to deliver successful change.

Legacy systems have a heritage going back many years, to the days of the mega mainframes of the ’70s and ’80s. This was a time when banks were the masters of technological innovation: we saw the birth of ATMs, BACS and international card payments. It was an exciting time of intense modernisation. Many of the core systems that run the finance sector today are the same ones that were built back then. The only problem is that, although these systems were built to last, they were not built for change.

The new millennium brought another significant development with the arrival of the internet, an opportunity the banks could have seized to develop new, simpler, more versatile systems. Instead they adopted a different strategy and modified their existing systems; in their eyes there was no need to reinvent the wheel. They made additions and modifications as and when required. As a result, most financial organisations have evolved over the decades into complex networks of myriad applications running on an overloaded IT infrastructure.

The Bank of England itself has recently been severely reprimanded by a Commons Select Committee review, which found the Bank to be drowning in out-of-date processes in dire need of modernisation. Its legacy systems are overly complicated and inefficient and, following the merger with the PRA in 2014, its IT estate comprises duplicated systems and extensive data overload.

Budget, as stated earlier, is not the only factor preventing digital transformation, although there is no doubt that these projects are expensive and extremely time consuming. The complexity of the task and the fear of failure are other reasons why companies hold on to their legacy systems. Better the devil you know! Think back to the TSB outage (there were a few…): systems were down for hours and customers were unable to access their accounts following a system upgrade. The incident ultimately led to an investigation by the Financial Conduct Authority and the resignation of the Chief Executive.

For most organisations, abandoning their legacy systems is simply not an option, so they need to find ways to update them in order to connect to digital platforms and plug into new technologies.

Many of our clients believe that it is not the legacy systems themselves which are the barrier, but the inability to access the vast amount of data stored in their infrastructure. It is the data that is the key to digital transformation, so accessing it is a crucial piece of the puzzle.

“It’s more about legacy architecture and lack of active management of data than specifically systems”

By finding a way to unlock the data inside these out-of-date systems, banks can decentralise their data, making it available to the new digital world.

With the advent of the cloud and APIs, it is possible to place an agility layer between existing legacy systems and newly adopted applications. HSBC has successfully adopted this approach, using an API strategy to expand its digital and mobile services without needing to replace its legacy systems.
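The shape of such an agility layer is easy to sketch. In this illustrative Python example (all class, record and field names are hypothetical, not any bank's actual design), a thin adapter exposes a modern JSON view over a legacy fixed-width record store without modifying the legacy code at all:

```python
import json

class LegacyCoreBanking:
    """Stand-in for a legacy system that stores accounts as fixed-width records."""
    def __init__(self):
        # Record layout: 8-char account id, 20-char name, 12-char balance in pence.
        name_field = "Alice Smith".ljust(20)
        self._records = {"00000001": "00000001" + name_field + "000000012550"}

    def fetch_record(self, account_id: str) -> str:
        return self._records[account_id]

class AgilityLayer:
    """Adapter ('agility layer') translating legacy records into modern JSON."""
    def __init__(self, core: LegacyCoreBanking):
        self._core = core

    def get_account(self, account_id: str) -> str:
        raw = self._core.fetch_record(account_id)
        return json.dumps({
            "accountId": raw[0:8],
            "name": raw[8:28].strip(),
            "balancePence": int(raw[28:40]),
        })

api = AgilityLayer(LegacyCoreBanking())
print(api.get_account("00000001"))
```

New digital channels talk only to the adapter, so the legacy record format can stay untouched until it is eventually retired.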

Legacy systems are no longer the barrier to digital innovation that they once were. With some creative thinking and the adoption of new technologies legacy can continue to be part of your IT infrastructure in 2019!

How is Alternative Data Giving Investment Managers the Edge?

Posted on : 29-03-2018 | By : richard.gale | In : Consumer behaviour, Data, data security, Finance, FinTech, Innovation



Alternative data (or ‘alt-data’) refers to data derived from non-traditional sources covering a whole array of platforms such as social media, newsfeeds, satellite tracking and web traffic. There is a vast amount of data in cyberspace which, until recently, remained untouched. Here we shall look at the role of these unstructured data sets.

Information is the key to the success of any investment manager, and information that can give the investor the edge is by no means a new phenomenon. Traditional financial data, such as stock price history and fundamentals, has been the standard for determining the health of a stock. However, alternative data has the potential to reveal insights about a stock’s health before traditional financial data does. This has major implications for investors.

If information is power, then unique information from sources not yet tapped gives those players the edge in a highly competitive market. Given that we are in what we like to call a data revolution, where nearly every move we make can be digitised, tracked and analysed, every company is now a data company. Everyone is both producing and consuming immense amounts of data in the race to make more money. People are well connected on social media platforms, and information is available to them in many different forms. Add geographical data into the mix and that is a lot of data about who is doing what, and why. Take Twitter: it is a great tool for showing what is happening in the world and what is being talked about. Being able to capture sentiment as well as data is a major advance in the world of data analytics.

Advanced analytical procedures can pull all this data together using machine learning and cognitive computing. Using this technology, we can take unstructured data and transform it into usable data sets at rapid speed.
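As a toy illustration of that transformation, the sketch below turns free-text posts into a structured data set. A real pipeline would use trained machine-learning models; the hand-made word lexicon here is purely an assumption to keep the example self-contained:

```python
# Toy sketch: unstructured posts -> structured records with a sentiment label.
# Real systems use ML/NLP models; this lexicon is invented for illustration.
POSITIVE = {"beat", "growth", "strong", "up"}
NEGATIVE = {"miss", "loss", "weak", "down"}

def to_structured(post: str) -> dict:
    words = post.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return {
        "text": post,
        "tokens": len(words),
        "sentiment": "positive" if score > 0 else "negative" if score < 0 else "neutral",
    }

posts = ["Earnings beat forecasts, strong quarter", "Shares down after loss"]
dataset = [to_structured(p) for p in posts]
```

The output rows are uniform records that can be loaded straight into a conventional analytics store, which is the whole point of the exercise.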

Hedge funds have been the early adopters, and investment managers, having now seen the light, are expected to spend $7bn on alternative data by 2020. All asset managers realise that this data can produce valuable insight and give them the edge in a highly competitive marketplace.

However, it could be said that if all investment managers research data in this way, they will all be put on the same footing and the competitive advantage will be lost. Commentators have suggested that, given the data pool is so vast and the combinations and permutations of analysis so complex, it is still highly likely that data can be uncovered that no one else has found. It all depends on the data scientist and where they decide to look. Far from creating a level playing field, where more readily available information simply leads to greater market efficiency, the impact of the information revolution is the opposite: it is creating hard-to-access pockets of long-term alpha generation for those players with the scale and resources to take advantage of it.

Which leads us to our next point. A huge amount of money and resource is required to research this data, and this will mean only the strong survive. A report last year by S&P found that 80% of asset managers plan to increase their investments in big data over the next 12 months. Only 6% of asset managers argue that it is not important. Where does this leave the 6%?

Leading hedge fund bosses have warned fund managers they will not survive if they ignore the explosion of big data that is changing the way investors beat the markets. They are investing a lot of time and money to develop machine learning in areas of their business where humans can no longer keep up.

There is, however, one crucial issue which all investors should be aware of, and that is privacy. Do you know where that data originates from? Did that vendor have the right to sell the information in the first place? We have seen this illustrated over the last few weeks with the Facebook “data breach”, in which some of Facebook’s users’ data was passed to Cambridge Analytica without the users’ knowledge. This has wiped $100bn off Facebook’s value, so we can see the negative impact of using data without the owner’s permission.

The key question in the use of alternative data ultimately is, does it add value? Perhaps too early to tell. Watch this space!

Data is like Oil… Sort Of

Posted on : 30-09-2015 | By : Jack.Rawden | In : Data



  • We are completely dependent upon it to go about our daily lives
  • It is difficult and expensive to locate and extract, and vast tracts of it are currently inaccessible.
  • As technology improves we are able to obtain more of it but the demand constantly outpaces supply.
  • The raw material is not worth much and it is the processing which provides the value, fuels & plastics in the case of oil and business intelligence from data.
  • It lubricates the running of an organisation in the same way as oil does for a car.
  • The key difference between oil and data is that the supply of data is increasing at an ever faster rate whilst the amount of oil is fixed.

So how can data be valued and what exploration mechanisms are available to exploit this asset?

The recent prediction that Google will be the first company to hit a $1 trillion market cap is a good place to start in identifying the value of data. Yes, they have multiple investments in other markets, but the backbone of the organisation is its ability to capture and utilise data effectively. Another example is the valuation of Facebook at $86 a share and a market cap of roughly $230bn, against tangible (accounts-friendly) assets of around $45bn. The added value is data.
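Using the figures quoted above, the implied data value is a simple back-of-envelope subtraction. It is a crude proxy, since not all of the gap between market cap and tangible assets is data, but it makes the scale of the premium concrete:

```python
# Rough proxy for 'data equity' using the figures quoted above (USD billions).
market_cap = 230      # Facebook market capitalisation
tangible_assets = 45  # tangible (accounts-friendly) assets

data_equity = market_cap - tangible_assets
share = data_equity / market_cap
print(f"Implied intangible/data value: ${data_equity}B ({share:.0%} of market cap)")
```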

This highlights that calculating a company’s data worth is now integral to working out the valuation of an organisation. The economic value of a firm’s information assets has recently been termed ‘data equity’, and a new economics discipline, infonomics, is emerging to provide a structure and foundation for measuring the value of data.


The value, and so the price, of organisations could radically alter as the value of their data becomes more transparent. Data equity will at some point be added to the balance sheets of established firms, potentially significantly affecting their share price. Think about Dun & Bradstreet, the business intelligence service: they have vast amounts of information on businesses and individuals which is sold to help organisations make decisions on creditworthiness. Does the price of D&B reflect the value of that data? Probably not.

Organisations are starting to appreciate the value locked up in their data and are utilising technologies to process and analyse the Big Data both within and external to them. These Big Data tools are the geological maps and exploration platforms of the information world.


  • The volume of data is rising at an ever increasing rate
  • The velocity of that data rushing into and past organisations is accelerating
  • The variety of data has overwhelmed conventional indexing systems


Innovative technology and methods are improving the odds of finding and getting value from that data.

How can an organisation gain value from its data? What are forward-thinking firms doing to invest in and protect their data?

1. Agree a Common Language

Data means many things to different firms, departments and people. If there is no common understanding of what a ‘client’, a ‘sale’ or an ‘asset’ is, then at the very least confusion will reign, and most likely poor business decisions will be made from poor data.

This task is not to be underestimated. As organisations grow they build new functions with different thinking; they acquire, or are bought themselves, and the ‘standard’ definitions of what data means can change and blur. Getting a handle on organisation-wide data definitions is a critical and complex set of tasks that needs leadership and buy-in. Building a data fabric into an organisation is a thankless but necessary activity in order to achieve longer-term value from the firm’s data.


2. Quality, Quality, Quality

The old adage of rubbish in, rubbish out still rings true. All organisations have multiple ‘golden sources’ of data, often with legacy transformation and translation rules shunting the data between systems. If a new delivery mechanism is built, it is often implemented by reverse engineering the existing feeds to replicate them, rather than examining the underlying data quality and logic: the potential for issues with one of the many consuming systems makes it too risky to do anything else. An alternative is to build a new feed for each new consumer system, which de-risks the issue in one sense but builds a bewildering array of pipes crossing an organisation. For any organisation of size it is worth accepting that there will be multiple golden copies of data; the challenge is to make sure they are consistent and have quality checks built in. Reconciling sets of data across systems is great, but it doesn’t actually check that the data is correct, just that it matches another system…
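The distinction between reconciliation and genuine quality checking can be shown in a few lines. In this hypothetical sketch (invented systems and rules), the two golden copies match perfectly, yet an independent validation rule still catches bad data that reconciliation alone would wave through:

```python
system_a = {"C001": {"name": "Acme Ltd", "credit_limit": 50_000}}
system_b = {"C001": {"name": "Acme Ltd", "credit_limit": 50_000}}

def reconcile(a: dict, b: dict) -> list:
    """Find keys whose records differ between two 'golden copies'."""
    return [k for k in a.keys() | b.keys() if a.get(k) != b.get(k)]

def validate(record: dict) -> list:
    """Independent quality rules: checks plausibility, not just consistency."""
    issues = []
    if not record.get("name", "").strip():
        issues.append("missing name")
    if not (0 < record.get("credit_limit", -1) <= 1_000_000):
        issues.append("credit limit out of plausible range")
    return issues

breaks = reconcile(system_a, system_b)  # [] -- the two copies match...
# ...but a matching pair of systems can still both hold nonsense:
issues = validate({"name": "", "credit_limit": 50_000_000})
```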

3. Timeliness

Like most things, data has a time value. As one Chief Data Officer of a large bank recently commented, ‘data has a half-life’: the value decays over time, so ensuring the right data is in the right place at the right time is essential, and out-of-date, valueless data needs to be identified as such. For example, a correct prediction of tomorrow’s weather is useful, today’s weather is interesting, and a report of yesterday’s weather has little value.
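The half-life metaphor maps directly onto exponential decay. A minimal sketch, assuming (purely for illustration) a value that halves with every half-life period:

```python
def data_value(initial_value: float, age_days: float, half_life_days: float) -> float:
    """Value of a data item that loses half its worth every half-life period."""
    return initial_value * 0.5 ** (age_days / half_life_days)

# A weather forecast worth 100 units when fresh, with a one-day half-life:
fresh = data_value(100, 0, 1)  # tomorrow's forecast, full value
aging = data_value(100, 1, 1)  # today's weather, half the value
stale = data_value(100, 3, 1)  # a three-day-old report, little value left
```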

4. Organisational Culture

Large organisations are always ‘dealing’ with data problems and providing new solutions to improve data quality, and many large, expensive programmes have been started to solve ‘data’. But thinking about data needs to be more pervasive than that: it needs to be part of the culture and fabric of the organisation. Concern for data accuracy, ownership, consistency and time value should be woven into how the organisation works, and articulating the value of data can help immensely with this.


5. Understand What Matters

Understanding what is important, rather than having a blanket way of dealing with data, matters. For some data it doesn’t matter if it is wrong or out of date, because it is either not consumed (the obvious question then being: why have it?) or irrelevant to the process. Other data is critical for a business to survive, so a risk-based approach to data quality needs to be used, with data graded and classified on its value.

6. Data ownership

Someone needs to be accountable for and owner of data and data governance within an organisation. It does not mean that they have to manage each piece but they need to set the strategy and vision for data. More large organisations are now creating a Chief Data Officer role to ensure there is this ownership, strategy and discipline with regard to their data.

Data is the core of a business and there is a growing acknowledgement of its potential value.

As the ability to extract information and intelligence from data improves, there will be some disruptive changes in the market value of firms holding the sort of data that can improve an organisation’s market share or profitability, or that can potentially be traded.

Companies that hold huge amounts of information about their customers – banks, shops, telecoms firms – will be well positioned to take advantage of this information if they can manage to organise and exploit it.


Calling the General Election

Posted on : 31-03-2015 | By : richard.gale | In : General News



With the General Election now only 37 days away, can we harness new technology and the “wisdom” of the social media crowd to help call it?


Big Data and Sentiment Analysis


In the run up to the election you will hear a lot about the power of big data and sentiment analysis, for example in this announcement from TCS.

The technology behind this and similar projects is undoubtedly clever, although I do think the state of the art in sentiment analysis is being overplayed – I recently attended a talk by a professor of computational linguistics who revealed that their currently not-very-effective sentiment analysis of Twitter works better if people include emoticons in their messages!

However, this attempt to benefit from the “Wisdom of the Crowd” (often cited as a rationale) is doomed for a reason as old as computing itself:


“Garbage in, Garbage out”


The sample set of political tweets will largely encompass two analytically unhelpful groups: political activists, who are the least likely to change their vote, and the young, who are the least likely to vote in the first place!

Indeed, at the close of the Scottish referendum campaign, the SNP were convinced that they had won by a good margin off the back of analysis based partly on trends in social media.

(This is different to the US Republicans in 2012 who, at the close of the presidential campaign, also thought they had won but based on no data at all from the non-working system that they had built. This is not relevant to the current discussion except that both instances led to a narrative of the vote being “stolen” developing amongst true believers but I include the aside as I think this article is required reading on how not to roll-out a project to non-technical users)


Opinion Polls


What about the more familiar opinion polls?

As the polling companies are always keen to tell us, polls provide a snapshot and not a prediction. This is true of course, but the snapshot itself is not really a snapshot of how people will vote at that moment in time. It is a snapshot of how they answer the question “which party would you vote for if a General Election were held tomorrow?”

They subsequently give different answers when asked specifically about their constituency and different answers again when the local candidates are named. Liberal Democrats have found in their private polling of their stronghold seats (a fast dwindling set!) that naming the local candidate can make a 10-point difference in their favour (the incumbency effect) and indeed this is the glimmer of hope that they cling to this time round.

What opinion polls also cannot factor in is anything more than a rudimentary use of past knowledge of voter behaviour. Polling firms do differ on how they allocate “don’t know”s, either ignoring them altogether or reallocating a proportion of them based on declared past voting.
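One common reallocation scheme can be sketched with entirely made-up numbers (these are illustrative, not any pollster's actual methodology): a fraction of the don't-knows is handed back to each party in proportion to respondents' declared past vote:

```python
# Illustrative 'don't know' reallocation -- all figures invented.
headline = {"Party A": 36.0, "Party B": 34.0}    # % of decided respondents
dont_know = 10.0                                 # % undecided
past_vote_split = {"Party A": 0.5, "Party B": 0.5}  # declared past vote shares
reallocation_rate = 0.5  # assume half the don't-knows return 'home'

adjusted = {
    party: share + dont_know * reallocation_rate * past_vote_split[party]
    for party, share in headline.items()
}
# Party A: 36 + 10 * 0.5 * 0.5 = 38.5 ; Party B: 34 + 2.5 = 36.5
```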

Of course this brings its own problems, as people are inherently unreliable and misremember. For instance, when asked, far more people claim to have voted Liberal Democrat in 2010 than the total number of votes the party actually received.


Betting markets


Of course the one source that can take all the above and more into account including past knowledge and contemporary analysis is the betting markets.

In pre-internet days, the above was still true but easy access to information and on-line betting has supercharged this in terms of both numbers and overall quality.

In this case, the “Wisdom of the crowd”, often touted for things like sentiment analysis, does actually hold true because the crowd in this case are actually wise (unlike the twitterati), both individually and collectively.

Political betting is a niche pursuit and as such attracts both amateur and professional psephologists along with those with “inside” knowledge. This means that the weight of the money in the market is quite well informed.


Past-it First Past the Post


The importance of inside knowledge is magnified by a creaking voting system that means that national polls and sentiments are all well and good, but the real result lies in the hands of a hundred thousand odd voters in a handful of marginal constituencies.

This means that those with real insight are those on the ground, and this time around the proliferation of smaller parties eating into each of the main parties’ votes makes the situation even more volatile and local knowledge even more important. The constituency betting markets will be made by political activists on the ground with access to detailed internal canvass data.

So my advice would be to ignore the siren call of the new (social media) and the reassurance of the old (opinion polls) and just follow the money, the informed betting money that is.


This article was authored by Andrew Porrer of Heathwest Systems and represents his personal opinions. Andrew can be contacted at

Broadgate Predictions for 2015

Posted on : 29-12-2014 | By : richard.gale | In : Innovation



We’ve had a number of lively discussions in the office and here are our condensed predictions for the coming year.  Most of our clients work with the financial services sector so we have focused on predictions in these areas.  It would be good to know your thoughts on these and your own predictions.


Cloud becomes the default

There has been widespread resistance to the cloud in the FS world. We’ve been promoting the advantages of demand based or utility computing for years and in 2014 there seemed to be acceptance that cloud (whether external applications such as SalesForce or on demand platforms such as Azure) can provide advantages over traditional ‘build and deploy’ set-ups. Our prediction is that cloud will become the ‘norm’ for FS companies in 2015 and building in-house will become the exception and then mostly for integration.

‘Intrapreneur’ becomes widely used (again)

We first came across the term intrapreneur in the late ’80s in the Economist. It highlighted some forward-thinking organisations’ attempts to change culture: to foster, employ and grow internal entrepreneurs, people who think differently and have a start-up mentality within large firms, making them more dynamic and fast moving. The term came back into fashion in the tech boom of the late ’90s, mainly through large consulting firms desperate to hold on to their young, smart workforce that was being snapped up by Silicon Valley. We are seeing a resurgence of that movement, with banks competing with tech for the top talent and the consultancies trying to find enough people to fulfil their client projects.

Bitcoins or similar become mainstream

Crypto-currencies are fascinating. Their emergence in the last few years has only really touched the periphery of finance: starting as an academic exercise, they have been used by the underground and cyber-criminals, and adopted by tech-savvy consumers and firms. We think there is a chance a form of electronic currency may become more widely used in the coming year. There may be a trigger event – such as rapid inflation combined with currency controls in Russia – or a significant payment firm, such as MasterCard or PayPal, may start accepting it.

Bitcoins or similar gets hacked so causing massive volatility

This is almost inevitable. The algorithms and technology mean that Bitcoin will be hacked at some point. This will cause massive volatility and loss of confidence, and then its demise, but a stronger currency will emerge. The reason it is inevitable is that the tech used to create bitcoins relies on the speed of computer hardware to slow their creation. If someone works around this or utilises an as-yet-undeveloped approach such as quantum computing, then all bets are off. Also, and perhaps more likely, someone will discover a flaw or bug in the creation process, short-cut the process or just up the numbers in their account, and become (virtually) very rich very quickly.
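The ‘speed of computer hardware’ point refers to proof-of-work: creating a coin means searching for a number whose hash meets a difficulty target, and the only way to search faster is faster hardware. A toy sketch of the idea follows, with a trivially low difficulty so it runs instantly (real Bitcoin difficulty is vastly higher, and the protocol details differ):

```python
import hashlib

def mine(block_data: str, difficulty: int) -> int:
    """Find a nonce whose SHA-256 hash of (data + nonce) starts with
    `difficulty` zero hex digits -- a miniature proof-of-work search."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

nonce = mine("block-42", difficulty=2)
proof = hashlib.sha256(f"block-42{nonce}".encode()).hexdigest()
```

Each extra zero of difficulty multiplies the expected search time by sixteen, which is why hardware speed (or a shortcut around the search) is the whole game.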

Mobile payments, via a tech company, become mainstream

This will be one of the strongest growth areas in 2015. Apple, Google, PayPal, Amazon, the card companies and most of the global banks are desperate to get a bit of the action. Whoever gets it right – with trust and easy-to-use, great products – will make a huge amount of money, tie consumers to their brand and also learn a heck of a lot more about them and their spending habits. Payments will only be the start; banking accounts and lifestyle finance will follow. This one product could transform technology companies (as they are the ones most likely to succeed) beyond recognition and make existing valuations seem minuscule compared to their future worth.

Mobile payments get hacked

Almost as inevitable as Bitcoin getting hacked. Who knows when or how, but it will happen – though the impact will not be as great as it will be on the early crypto-currencies.

Firms wake up to the value of Data Science over Big Data

Like cloud, big data has had many firms talking up its advantages over the last couple of years. We still see situations where people are missing the point: loading large amounts of disparate information into a central store is all well and good, but asking the right questions of it and understanding the outputs is what it’s all about. If you don’t think about what you need the information for, then it will not provide value or insight to your business. We welcome the change in thinking from Big Data to Data Science.

The monetisation of an individual’s personal data results in a multi-billion dollar valuation for an unknown start-up

A long headline, but the value of people’s data is high while the price firms currently pay for it is low to nothing. If someone can start to monetise that data, it will transform the information industry. There are companies and research projects out there working on approaches and products. One or more will emerge in 2015, either to be bought by one of the existing tech players or to become that multi-billion dollar firm. They will have the converse effect on Facebook, Google and the others that rely on that free information to power their advertising engines.

Cyber Insurance becomes mandatory for firms holding personal data (OK maybe 2016)

It wouldn’t be too far-fetched to assume that all financial services firms are currently compromised, either internally or externally. Most firms have encountered either direct financial or indirect losses in the last few years, and cyber or internet security protection measures now form part of most companies’ annual reports. We think that, in addition to physical, virtual and procedural protection, there will be huge growth in cyber-insurance, and it may well become mandatory in some jurisdictions, especially where personal data is held. Insurance companies will make sure there are levels of protection in place before they insure, forcing companies to improve their security further.

Regulation continues to absorb the majority of budgets….

No change then.

We think 2015 is going to be another exciting year in technology and financial services and are really looking forward to it!


Highlights of 2014 and some Predictions for 2015 in Financial Technology

Posted on : 22-12-2014 | By : richard.gale | In : Innovation



A number of emerging technology trends have impacted financial services in 2014. Some of these will continue to grow and enjoy wider adoption through 2015 whilst additional new concepts and products will also appear.

Financial Services embrace the Start-up community

What has been apparent, in London at least, is the increasing connection between tech and FS. We have been pursuing this for a number of years by introducing great start-up products and people to our clients, and the growing influence of TechMeetups, Level39 etc. within the financial sector follows this trend. We have also seen some interesting innovation with seemingly legacy technology – our old friend Lubo from L3C offers mainframe ‘on demand’ and cut-price, secure Oracle databases and IBM S3 in the cloud! Innovation and digital departments are the norm in most firms now, staffed with clever, creative people encouraging often slow-moving, cumbersome organisations to think and (sometimes) act differently and embrace new ways of thinking. Will FS fall out of love with tech in 2015? We don’t think so. There will be a few bumps along the way, but the potential, upside and energy of start-ups will start to move deeper into large organisations.

Cloud Adoption

FS firms are finally facing up to the cloud. Over the last five years we have bored too many people within financial services talking about its advantages. Our question, ‘why have you just built a £200m datacentre when you are a bank, not an IT company?’, was met with many answers, but two themes were ‘security’ and ‘we are an IT company’… Finally, driven by user empowerment (see our previous article on user frustration vs. empowerment), banks and other financial organisations are embracing the cloud, mainly with SaaS products and IaaS on private and public clouds. The march to the cloud will accelerate over the coming years. Looking back from 2020, we see massively different IT organisations within banks: the vast majority of infrastructure will be elsewhere, development will be done by the business users, and the ‘IT department’ will be a combination of rocket-scientist data gurus and procurement experts managing and tuning contracts with vendors and partners.

Mobile Payments

Mobile payments have been one of the most discussed subjects of the past year. Not only do mobile payments let customers pay without getting their wallets out, but paying by phone or wearable will be the norm in the future. With new entrants coming online every day, offering mobile payment solutions that are faster and cheaper than competitors’ is on every bank’s agenda. Labelled ‘disruptors’ due to the disruptive impact they are having on businesses within the financial services industry (in particular banks), many of these new entrants are either large non-financial brands with a big customer base or start-up companies with fresh new solutions to existing issues.

One of the biggest non-financial companies to enter the payments sector in 2014 was Apple. Some experts believe that Apple Pay has the power to disrupt the entire sector. Although Apple Pay has 500 banks signed up, and there is competition among card issuers to get their card set as the default on Apple devices, some banks are still worried that Apple Pay and other similar services will make their branches less important. If Apple chose to go into retail banking seriously by offering current accounts, the banks would have plenty more to worry about.


DevOps

The fusion of development, operations and business teams to provide agile, focussed solutions has been one of the growth areas of 2014. The ‘DevOps’ approach has transformed many otherwise slow, ponderous IT departments, getting them talking to the business and operational consumers of their systems and providing better, faster and closer-fit applications and processes. This trend is only going to grow, and 2015 may be the year it really takes off. The repercussion for 2016 is that too many projects will become ‘DevOpped’ and start failing through focussing on short-term solutions rather than long-term strategy.


Cyber Security

Obviously the Sony Pictures hack is on everyone’s mind at the moment, but attack from countries with virtually unlimited will, if not resources, is a threat that most firms cannot protect against. Most organisations have had a breach of some type this year (and the others probably don’t know it has happened). Security has risen to the boardroom, and threat mitigation is now published in most firms’ annual reports. We see three themes emerging to combat this:

– More of the same, more budget and resource is focussed on organisational protection (both technology and people/process)
– Companies start to mitigate with the purchase of Cyber Insurance
– Governments start to move from defence/inform to attacking the main criminal or politically motivated culprits

We hope you’ve enjoyed our posts over the last few years and we’re looking forward to more in 2015.



Broadgate Big Data Dictionary

Posted on : 28-10-2014 | By : richard.gale | In : Data

Tags: , , , , , , , , , , ,


A couple of years back we were getting to grips with big data and thought it would be worthwhile putting a couple of articles together to help explain what the fuss was all about. Big Data is still here and its adoption is growing, so we thought it would be worthwhile updating and re-publishing. Let us know what you think.

We have been interested in Big Data concepts and technology for a while. There is a great deal of interest and discussion with our clients and associates on the subject of obtaining additional knowledge & value from data.

As with most emerging ideas there are different interpretations and meanings for some of the terms and technologies (including the thinking that ‘big data’ isn’t new at all but just a new name for existing methods and techniques).

With this in mind we thought it would be useful to put together a few terms and definitions that people have asked us about recently to help frame Big Data.

We would really like to get feedback, useful articles & different views on these to help build a more definitive library of Big Data resources.


Big Data Analytics

Big Data Analytics is the processing and searching through large volumes of unstructured and structured data to find hidden patterns and value. The results can be used to further scientific or commercial research, identify customer spending habits or find exceptions in financial, telemetric or risk data to indicate hidden issues or fraudulent activity.

Big Data Analytics is often carried out with software tools designed to sift and analyse large amounts of diverse information being produced at enormous velocity. Statistical tools used for predictive analysis and data mining are utilised to search and build algorithms.
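The exception-finding described above can be sketched in miniature. This is a toy Python illustration, not any particular analytics product: it flags values that sit unusually far from the mean, a crude stand-in for the statistical sifting mentioned above, and the payment figures are invented.

```python
from statistics import mean, stdev

def flag_exceptions(amounts, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean --
    a crude stand-in for the statistical sifting described above."""
    m, s = mean(amounts), stdev(amounts)
    return [a for a in amounts if s > 0 and abs(a - m) / s > threshold]

# Mostly routine payments with one suspicious spike
payments = [52.0, 48.5, 51.2, 49.9, 50.4, 47.8, 5000.0, 50.1]
print(flag_exceptions(payments, threshold=2.0))  # prints [5000.0]
```

In practice the interesting part is doing this at scale and with richer models, but the shape of the task is the same: sift a large set, surface the outliers.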

Big Data

The term Big Data describes amounts of data that are too big for conventional data management systems to handle. The volume, velocity and variety of data overwhelm databases and storage. The result is that either data is discarded or unable to be analysed and mined for value.

Gartner has coined the term ‘Extreme Information Processing’ to describe Big Data – we think that’s a pretty good term to describe the limits of capability of existing infrastructure.

There has always been “big data” in the sense that data volumes have always exceeded the ability for systems to process it. The tool sets to store & analyse and make sense of the data generally lag behind the quantity and diversity of information sources.

The actual amounts and types of data that Big Data relates to are constantly being redefined, as database and hardware manufacturers keep moving those limits forward.

Several technologies have emerged to manage the Big Data challenge. Hadoop has become a favourite tool to store and manage the data, traditional database manufacturers have extended their products to deal with the volumes, variety and velocity and new database firms such as ParAccel, Sand & Vectorwise have emerged offering ultra-fast columnar data management systems. Some firms, such as Hadapt, have a hybrid solution utilising tools from both the relational and unstructured world with an intelligent query optimiser and loader which places data in the optimum storage engine.

Business Intelligence

The term Business Intelligence (BI) has been around for a long time, and the growth of data, and then Big Data, has focused more attention on this space. The essence of BI is to obtain value from data to help build business benefits. Big Data itself could be seen as BI – it is a set of applications, techniques and technologies that are applied to an entity’s data to help produce insight and value from it.

There are a multitude of products that help build Business Intelligence solutions – ranging from the humble Excel to sophisticated (aka expensive) solutions requiring complex and extensive infrastructure to support them. In the last few years a number of user-friendly tools such as QlikView and Tableau have emerged, allowing tech-savvy business people to exploit and re-cut their data without the need for input from the IT department.

Data Science

This is, perhaps, the most exciting area of Big Data. This is where the Big Value is extracted from the data. One of our data scientist friends described it as follows: "Big Data is plumbing and Data Science is the value driver…"

Data Science is a mixture of scientific research techniques, advanced programming and statistical skills (or hacking), philosophical thinking (perhaps previously known as ‘thinking outside the box’) and business insight. Basically, it is being able to think of new or different questions to ask, be technically able to interpret them into a machine-based format, process the result, interpret the output and then ask new questions based on the results of the previous set…

A diagram by blogger Drew Conway describes some of the skills needed – which maybe explains the lack of skills in this space!


In addition, Pete Warden (creator of the Data Science Toolkit) and others have raised caution about the term Data Science – "Anything that needs science in the name is not a real science" – but confirm the need for a definition of what Data Scientists do.


Databases

Databases can generally be divided into structured and unstructured.

Structured databases are the traditional relational database management systems such as Oracle, DB2 and SQL Server. They are fantastic at organising large volumes of transactional and other data, with the ability to load and query the data at speed, and with integrity in the transactional process to ensure data quality.

Unstructured databases are technologies that can deal with any form of data thrown at them and distribute it across a highly scalable platform. Hadoop is a good example of such a product, and a number of firms now produce, package and support this open-source product.

Feedback Loops

Feedback loops are systems where the output of the system is fed back into it to adjust or improve its processing. Feedback loops exist widely in nature and in engineering systems. Think of an oven: heat is applied to warm it to a specific temperature, which is measured by a thermostat; once the correct temperature is reached the thermostat tells the heating element to shut down, until feedback from the thermostat says it is getting too cold and the element turns on again… and so on.

Feedback loops are an essential part of extracting value from Big Data. Building in feedback and then incorporating Machine Learning methods starts to allow systems to become semi-autonomous. This lets the Data Scientists focus on new and more complex questions whilst testing and tweaking the feedback from their previous systems.
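The oven example above can be sketched as a few lines of Python. This is a toy simulation with invented temperatures and heating rates, purely to show the loop: the measured output (temperature) feeds back to control the input (heater state).

```python
def thermostat_step(temp, target, heater_on, band=1.0):
    """One iteration of the oven feedback loop described above: the thermostat
    reading (output) is fed back to decide the heater state (input)."""
    if temp < target - band:
        heater_on = True      # too cold: feedback turns the element on
    elif temp > target + band:
        heater_on = False     # too hot: feedback shuts the element down
    return heater_on

def simulate(start_temp, target, steps):
    """Run the loop for a number of steps and record the temperature trace."""
    temp, heater_on, history = start_temp, False, []
    for _ in range(steps):
        heater_on = thermostat_step(temp, target, heater_on)
        temp += 2.0 if heater_on else -1.0   # heating vs ambient cooling
        history.append(round(temp, 1))
    return history

# The trace climbs to the target and then oscillates around it
print(simulate(15.0, 20.0, 10))
```

The Big Data version of this replaces the thermostat with measured outcomes and the heating element with model parameters, but the control structure is the same.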


Hadoop

Hadoop is one of the key technologies to support the storage and processing of Big Data. Hadoop was inspired by Google’s distributed Google File System and MapReduce processing tools. It is an open-source product under the Apache banner but, like Linux, is distributed by a number of commercial vendors that add support, consultancy and advice on top of the product.

Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.

So Hadoop could almost be seen as a (big) bucket into which you can throw any form and quantity of data; it will organise the data, know where it resides, and be able to retrieve and process it. It also accepts that there may be holes in the bucket and can patch them by using additional resources – all in all, a very clever bucket!

Hadoop runs on a scheduling basis: when a question is asked, it breaks the query up, shoots the parts out to different nodes of the distributed network in parallel, then waits for and collates the answers.


Hive

Hive provides a high-level, simple, SQL-type language to enable processing of and access to data stored in Hadoop files. Hive can provide analytical and business intelligence capability on top of Hadoop. Hive queries are translated into a set of MapReduce jobs to run against the data. The technology is used in the products of many large technology firms, including Facebook and Last.FM. The latency/batch-related limitations of MapReduce are present in Hive too, but the language allows non-Java programmers to access and manipulate large data sets in Hadoop.

Machine Learning

Machine learning is one of the most exciting concepts in the world of data. The idea is not new at all, but the focus on utilising feedback loops of information, and algorithms that take actions and change depending on the data without manual intervention, could improve numerous business functions. The aim is to find new or previously unknown patterns and linkages between data items to obtain additional value and insight. An example of machine learning in action is Netflix, which is constantly trying to improve its movie recommendation system based on a user’s previous viewing, their characteristics, and the features of other customers with a similar set of attributes.
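A minimal sketch of the recommendation idea above, in Python: find the most similar other user and suggest what they liked. The users, titles and ratings are invented, and real systems use far richer similarity measures, but the "customers with a similar set of attributes" logic is the same.

```python
# Hypothetical ratings: user -> {title: score out of 5} (all names invented)
ratings = {
    "alice": {"Heat": 5, "Alien": 4, "Up": 1},
    "bob":   {"Heat": 4, "Alien": 5, "Brazil": 4},
    "carol": {"Up": 5, "Frozen": 4, "Heat": 1},
}

def similarity(a, b):
    """Count of shared titles two users rated within 1 point of each other --
    a toy stand-in for matching customers with similar attributes."""
    shared = set(ratings[a]) & set(ratings[b])
    return sum(1 for t in shared if abs(ratings[a][t] - ratings[b][t]) <= 1)

def recommend(user):
    """Suggest unseen titles from the most similar other user."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: similarity(user, u))
    return sorted(t for t in ratings[nearest] if t not in ratings[user])

print(recommend("alice"))  # prints ['Brazil']
```

The "learning" in a production system comes from feeding viewing outcomes back in to refine the similarity measure, which is where the feedback loops described above come in.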


MapReduce

MapReduce is a framework for processing large amounts of data across a large number of nodes or machines.
Map Reduce diagram (courtesy of Google)

MapReduce works by splitting out (or mapping) requests into multiple separate tasks to be performed on many nodes of the system, and then collating and summarising (or reducing) the results back into the outputs.

MapReduce in Hadoop is based on the Java language and is the basis of a number of the higher-level tools (Hive, Pig) used to access and manipulate large data sets.

Google (amongst others) developed and use this technology to process large amounts of data (such as documents and web pages trawled by its web crawling robots). It allows the complexity of parallel processing, data location and distribution and also system failures to be hidden or abstracted from the requester running the query.
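The split/collate flow described above can be sketched in miniature with the classic word-count example. This is a single-process Python toy, not Hadoop's actual API: in a real cluster each fragment would be mapped on a different node, and the framework would shuffle the emitted pairs to reducer nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit (word, 1) pairs for one document fragment."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: collate the mapped pairs and sum the counts per word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each fragment could be mapped on a different node; here we just chain
# the mapped outputs together in place of the framework's shuffle step.
fragments = ["big data is big", "data about data"]
mapped = [map_phase(f) for f in fragments]
print(reduce_phase(chain.from_iterable(mapped)))
# prints {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Because each map call touches only its own fragment, the work parallelises naturally, and a failed fragment can simply be re-mapped elsewhere, which is exactly the fault tolerance described above.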


MPP

MPP stands for massively parallel processing, and it is the concept that gives the ability to process the volume (and velocity and variety) of data flowing through systems. Chip processing capabilities are always increasing, but to cope with the even faster-increasing amounts of data, processing needs to be split across multiple engines. Technology that can split requests into equal(ish) chunks of work, manage the processing and then join the results has been difficult to develop. MPP can be centralised, with a cluster of chips or machines in a single or closely coupled cluster, or distributed, where the power of many distributed machines is used (think of ‘idle’ desktop PCs being used overnight as an example). Hadoop utilises many distributed systems for data storage and processing, and also has fault tolerance built in, which enables processing to continue with the loss of some of those machines.


NoSQL

NoSQL really means ‘not only SQL’; it is the term used for database management systems that do not conform to the traditional RDBMS model (transaction-oriented data management systems based on the ACID principles). These systems were developed by technology companies in response to the challenges raised by high volumes of data. Amazon, Google and Yahoo built NoSQL systems to cope with the tidal wave of data generated by their users.


Pig

Apache Pig is a platform for analysing huge data sets. It has a high-level language called Pig Latin, which is combined with a data management infrastructure that allows high levels of parallel processing. Again, like Hive, Pig Latin is compiled into MapReduce requests. Pig is also flexible, so additional functions and processing can be added by users for their own specific needs.

Real Time

The challenges in processing the "V"s of big data (volume, velocity and variety) have meant that some requirements have been compromised. In the case of Hadoop and MapReduce, this has been the interactive or instant availability of results. MapReduce is batch-oriented in the sense that requests are sent for processing, where they are scheduled to be run and the output then summarised. This works fine for the original purposes, but the demand to become more real-time or interactive is growing. With a ‘traditional’ database or application, users expect the results to be available instantly or pretty close to instant. Google and others are developing more interactive interfaces to Hadoop: Apache Drill (inspired by Google’s Dremel) and Twitter’s Storm are examples. We see this as one of the most interesting areas of development in the Big Data space at the moment.


Over the next few months we have some guest contributors penning their thoughts on the future for big data, analytics and data science. Also don’t miss Tim Seears’ (TheBigDataPartnership) article on maximising value from your data, "Feedback Loops", published here in June 2012.

For the technically minded Damian Spendel also published some worked examples using ‘R’ language on Data Analysis and Value at Risk calculations.

These are our thoughts on the products and technologies – we would welcome any challenges or corrections and will work them into the articles.


Calculating Value at Risk using R

Posted on : 30-09-2014 | By : richard.gale | In : Data

Tags: , , , , , ,



My recent article focused on using R to perform some basic exploratory data analysis[1].

The focus of this article will be to highlight some packages that focus on financial analytics (TTR, quantmod and PerformanceAnalytics) and a package that will allow us to build an interactive UI with a package called Shiny.

For this article we will focus on Value at Risk[2], a common market risk measure developed by JP Morgan and most recently criticized by Nassim Taleb[3].

Historical Simulation – Methodology

For the first part of this article I will walk through the methodology of calculating VaR for a single stock using the historical simulation method (as opposed to the Monte Carlo or parametric methods)[4].

VaR allows a risk manager to make a statement about a maximum loss over a specified horizon at a certain confidence level.

Here, VaR will be the Value at Risk for a one-day horizon at a 95% confidence level.

Briefly, the method is: retrieve and sort a returns timeseries from a specified period (usually 501 days), take the appropriate quantile, and you have the Value at Risk for that position.

Note, however, that this only applies to a single stock; I will cover multiple stocks in a later article. Normally a portfolio will include not only multiple stocks, but forwards, futures and other derivative positions.

In R, we would proceed as follows.

 ##pre-requisite packages 
library(TTR) 
library(quantmod) 
library(PerformanceAnalytics) 
library(shiny)

With the packages loaded we can now run through the algorithm:

 X <- c(0.95) 
stock <- c("AA") ##American Airlines 
## define the historical timeseries 
begin <- Sys.Date() - 501 
end <- Sys.Date() 
## first use of quantmod to get the ticker and populate our dataset with 
## the timeseries of Adjusted closing price 
tickers <- getSymbols(stock, from = begin, to = end, auto.assign = TRUE) 
dataset <- Ad(get(tickers[1])) 
## now we need to convert the closing prices into a daily returns 
## timeseries - we will use the PerformanceAnalytics package 
returns_AA <- Return.calculate(dataset, method=c("simple"))

We now have the dataset and can start to do some elementary plotting, firstly the returns timeseries to have a quick look:



Now, we’ll convert the timeseries into a sorted list and apply the quantile function

 ##convert to matrix datatype as zoo datatypes can't be sorted, then sort ascending 
returns_AA.m <- as.matrix(returns_AA) 
sorted <- returns_AA.m[order(returns_AA.m[,1])] 
##calculate the 5th percentile, 
##na.rm=TRUE tells the function to ignore NA values (not available values) 
100*round(quantile(sorted, c(1-X), na.rm=TRUE), 4) 
## 5% 
## -2.14

This shows us that the 5% one-day value at risk for a position in American Airlines is -2.14%; that is, for $100 of position, once every 20 days you would expect to lose more than $2.14.

Building a UI

A worthwhile guide to using Shiny is available on the Shiny website.

In essence, we will need to define two files in one directory: server.R and UI.R.

We’ll start with the UI code; note that I have used the "Telephones by Region" example from the Shiny gallery as a template.

The basic requirements are:

  1. A drop-down box to choose the stock.
  2. A function that plots a histogram of the returns time-series and shows the VaR as a quantile on the histogram.
##get the dataset for the drop-down box, 
##we'll use the TTR package for downloading a vector of stocks, 
##and load this into the variable SYMs 
suppressWarnings(SYMs <- TTR::stockSymbols()) 
##use the handy sqldf package to query dataframes using SQL syntax
##we'll focus on Banking stocks on the NYSE. 
SYMs <- sqldf("select Symbol from SYMs where Exchange='NYSE' and Industry like '%Banks%'") 
# Define the overall UI, shamelessly stolen from the shiny gallery 
# Use a fluid Bootstrap layout 
shinyUI(fluidPage( 
 # Give the page a title 
 titlePanel("NYSE Banking Stocks - VaR Calculator"), 
 # Generate a row with a sidebar, calling the input "Instrument" and populating the choices with the vector SYMs 
 sidebarLayout( 
  sidebarPanel(selectInput("Instrument", "Instrument:", choices=SYMs)), 
  # Create a spot for the histogram 
  mainPanel(plotOutput("varPlot")) 
 ) 
))

With the UI layout defined, we can now define the functions in the Server.R code:

shinyServer(function(input, output){ 
 # Fill in the spot we created in UI.R using the code under "renderPlot" 
 output$varPlot <- renderPlot({ 
  ##use the code shown above to get the data for the chosen instrument captured in input$Instrument 
  begin <- Sys.Date() - 501 
  end <- Sys.Date() 
  tickers <- getSymbols(input$Instrument, from = begin, to = end, 
  auto.assign = TRUE) 
  dataset <- Ad(get(tickers[1])) 
  dataset <- dataset[,1]
  returns <- Return.calculate(dataset, method=c("simple")) 
  ##use the PerformanceAnalytics function chart.Histogram to draw the 
  ##histogram and add the 95% VaR using the add.risk method 
  chart.Histogram(returns, methods = c("add.risk")) 
 }) 
})

In RStudio, you will then see the button “Run App”, which after clicking will run your new and Shiny app.


Guest author: Damian Spendel – Damian has spent his professional life bringing value to organisations with new technology. He is currently working for a global bank helping them implement big data technologies. You can contact Damian at


Data Analysis – An example using R

Posted on : 31-08-2014 | By : richard.gale | In : Data

Tags: , , , , , ,


With the growth of Big Data and Big Data Analytics, the programming language R has become a staple tool for data analysis. Based on modular packages (4,000 are available), it offers sophisticated statistical analysis and visualization capabilities. It is well supported by a strong user community. It is also Open Source.

For the current article I am assuming an installation of R 3.1.1[1] and RStudio[2].

The article will cover the steps taken to provide a simple analysis of highly structured data using R.

The dataset I will use in this brief demonstration of some basic R capabilities is from a peer-to-peer Lending Club in the United States[3]. The dataset is well structured with a data dictionary and covers both loans made and loans rejected. I will use R to try and find answers to the following questions:

  • Is there a relationship between Loan amounts and funded amounts?
  • Is there a relationship between the above and the number of previous public bankruptcies?
  • What can we find out about rejections?
  • What can we find out about the geographic distribution of the Lending Club loans?

During the course of this analysis we will use basic R commands to:

  • Import data
  • Plot data using scatterplots and regression lines
  • Use functions to perform heavy lifting
  • Summarize data using simple aggregation functions
  • Plot data using choropleths

Having downloaded the data to our working directory we’ll import the three files using read.csv and merge them together using rbind() (row bind):

>data_lending0 <- read.csv("LoanStats3a.csv", header=TRUE)

>data_lending1 <- read.csv("LoanStats3b.csv", header=TRUE)

>data_lending2 <- read.csv("LoanStats3c.csv", header=TRUE)

>data_full <- rbind(data_lending0, data_lending1, data_lending2)

We can now explore the data using some of R’s in-built functions for metadata – str (structure), names (column names), unique (unique values of a variable).

The first thing I will do is use ggplot to build a simple scatter plot showing the relationship between the funded amount and the loan amount.



>install.packages("ggplot2")

>library(ggplot2)

>ggplot(data_full, aes(x=loan_amnt, y=funded_amnt)) + geom_point(shape=1) + geom_smooth(method=lm)

The above three lines install the package ggplot2 from a CRAN mirror, load the library into the R environment, and then use the "Grammar of Graphics" to build a plot using the data_full dataset, with the x-axis showing the principal (loan_amnt) and the y-axis showing the lender’s contribution (funded_amnt). With geom_smooth we add a line to help see patterns – in this case a line of best fit.

In R Studio we’ll now see the following plot:



This shows us clearly that the Lending Club clusters loans at the lower end of the spectrum, and that there is a clear positive correlation between loan_amnt and funded_amnt – for every dollar you bring you can borrow a dollar; there is little scope for leverage here. Other ggplot functions will allow us to tidy up the labeling and colours, but I’ll leave that as an exercise for the interested reader.



The next step is to add an additional dimension – investigating the link between principals and contributions under the aspect of known public bankruptcies of the applicants.

>ggplot(data_full, aes(x=loan_amnt, y=funded_amnt, color=pub_rec_bankruptcies)) + geom_point(shape=19,alpha=0.25) + geom_smooth(method=lm)

Here, I’ve used the color element to add the additional dimension and attempted to improve legibility of the visualization by making the points more transparent.



Not very successfully, it doesn’t help us further – maybe sampling could improve the visualization, or a more focused view….





Let’s have a quick look at the rejection statistics:

> rejections <- rbind(read.csv("RejectStatsA.csv"), read.csv("RejectStatsB.csv"))

> nrow(rejections)/nrow(data_full)

[1] 6.077912

For every application – six rejections.

Another popular method of visualization is using choropleth (“many places”) visualizations.   In this case, we’ll build a map showing outstanding loans by State.

The bad news is that the Lending Club data uses two-letter codes, while the state data we’ll use from the maps package (install.packages, library etc….) uses the full name. Fortunately, a quick search provides a function "stateFromLower"[4] that will perform the conversion for us. So, I run the code that creates the function and then add a new column called state ("Wyoming") to the data_full dataset, using stateFromLower to convert the addr_state column ("WY"):

> data_full$state <- stateFromLower(data_full$addr_state)

Then, I aggregate the principles by state:

> loan_amnts <- aggregate(data_full$funded_amnt, by=list(data_full$state), FUN=sum)

Load the state data:

> data(state)

The next code leans heavily on a succinct tutorial provided elsewhere[5]:

> map("state", fill=FALSE, boundary=TRUE, col="red")

> mapnames <- map("state", plot=FALSE)$names

Remove regions:

> region_list <- strsplit(mapnames, ":")

> mapnames2 <- sapply(region_list, "[", 1)

Match the region-less mapnames to the loan amounts:

> m <- match(mapnames2, tolower(loan_amnts$Group.1))

> loans <- loan_amnts$x[m]

Bucketize the aggregated loans:

> loan.buckets <- cut(loans, c(0, 500000, 1000000, 5000000, 10000000, 15000000, 20000000, 30000000, 40000000, 90000000, 100000000))

Define the colour schema:

> clr <- rev(heat.colors(13))

Draw the map and add the title:

> map("state", fill=TRUE, col=clr[loan.buckets], projection = "polyconic")

> title("Lending Club Loan Amounts by State")

Create a legend:

> leg.txt <- c("$500000", "$1000000", "$5000000", "$10000000", "$15000000", "$20000000", "$30000000", "$40000000", "$90000000", "$100000000")

> legend("topright", leg.txt, horiz = FALSE, fill = clr)

With a few simple lines of code R has demonstrated that it is quite a powerful tool for generating visualizations that aid in understanding and analyzing data. We were able to understand something about the Lending Club – it almost seems like we have clear KPIs in terms of rejections and in terms of loan amounts to funding amounts. I think we can also see a link between the Lending Club’s presence and poverty[6].

This could be a starting point for a more detailed analysis into the Lending Club, to support investing decisions, or internal business decisions (reduce rejections, move into Wyoming etc.).

Guest author: Damian Spendel – Damian has spent his professional life bringing value to organisations with new technology. He is currently working for a global bank helping them implement big data technologies. You can contact Damian at

“People Analytics” – Can robots replace the recruiters?

Posted on : 28-07-2014 | By : john.vincent | In : Innovation

Tags: , , , , , , , ,


The recruitment industry has been largely unchanged for many years. Technology has, of course, changed the way that companies and individuals interact in the process: online job and candidate postings with companies like Jobserve and Monster, company recruitment portals to engage with and measure preferred suppliers, and online screening of candidates prior to onboarding.

However, we are now at a point where technology can really disrupt the industry through the use of Big Data. The ability not only to hire but, equally importantly, to retain better talent through the use of what is being called "people analytics" is now a reality. Mining the huge amounts of data that potential candidates leave, either willingly or otherwise, in their daily digital lives allows companies to assess the value of existing and future employees.

We won’t get into the whole privacy thing…that’s for another day.

According to Prof Peter Capelli at the Centre for Human Resources at Wharton, big data can predict successful hires better than a company’s HR department.

While HR researchers have been kicking around small and simple sets of data, much of it collected decades ago, the big-data people have fresh information on hundreds of thousands of people — in some cases, millions of people — and the information includes all kinds of performance measures, attributes of the individual employers, their experience and so forth. There are a lot of new things to look at.

Now, I’m sure there are a lot of HR professionals who would argue with this! However, like all industries where technology advancements have enabled new business practices and efficiencies, recruitment is no different.

Let’s look at the evolution in one specific area: recruitment of technology professionals themselves. During the technology boom years, agencies specialising in finding talent for companies sprang up at a fast pace, armed with a collection of job board subscriptions and an expense account. The game was simple…it was all about speed: how quickly could a CV hit the desk of a hiring manager?

When demand outstripped supply the question of selecting the absolute best fit candidate could often be secondary. Get someone quick…in fact, if they’ve only got 50% of the role requirements then get two!… Demand was high, margins were high and everybody was happy.

Things have changed dramatically since 2008. As demand tailed off so did margins for recruitment firms, with in-house managed services firms putting the final nail in for many new entrants.

So, now with "people analytics" in full swing, are we entering a phase where the recruitment industry will fade away completely? Of course not. For certain roles, or levels of seniority, human interaction throughout the whole process, from role requirements through search and selection, is a necessity.

However, for some roles, such as developers, software engineers or analysts, the use of algorithms rather than traditional routes can uncover a whole new talent pool, through techniques such as mining open source code. According to Dr Vivienne Ming of Gild, a specialist tech recruiter:

There are about 100 times as many qualified but un-credentialed candidates out there, at every level of ability. Organizations are creating their own blind spots, which leads to companies paying too much for their hires and to talent being squandered

Indeed, when the University of Minnesota analysed 17 studies evaluating job applicants, they actually found that human decisions were outperformed by a simple equation by at least 25%.

So, the days of the CV may be numbered. Smart companies are not waiting to advertise a role and harvest applications through their traditional channels, but are instead sourcing candidates directly by casting the net into the social media waters, looking at blogs and the like. A recent survey showed that some 44% of companies looked at these platforms before hiring, and candidates are now much more aware of their social media brand.

The use of people analytics continues post hire to further develop, nurture and retain talent. An example of this is actually in the world of recruitment itself where Social Talent has developed a data tool which it is testing on 2000 individuals. By analysing their daily activity, from emails, phone calls, browsing, candidate key word searching etc… it is able to build a profile of the most successful techniques and provide constructive advice through popup messages in real time.

So where does that leave the recruiters on both sides of the fence? Well, some of the smart providers are developing their own platforms to provide their customers with advanced people analytics whilst on the client side, we see the focus shifting to a smaller subset of organisational roles.

As for the traditional HR role in the talent process, we’ll leave the last word to Peter Capelli;

My bet is that the CIO offices in most big companies will soon start using all the data they have (which is virtually everything) to build models of different aspects of employee performance, because that’s where the costs are in companies and it’s also the unexamined turf in business