Data Analysis – An example using R

Posted on : 31-08-2014 | By : richard.gale | In : Data

With the growth of Big Data and Big Data Analytics, the programming language R has become a staple tool for data analysis. Based on modular packages (4000 are available), it offers sophisticated statistical analysis and visualization capabilities. It is well supported by a strong user community. It is also open source.

For the current article I am assuming an installation of R 3.1.1[1] and RStudio[2].

The article will cover the steps taken to provide a simple analysis of highly structured data using R.

The dataset I will use in this brief demonstration of some basic R capabilities is from a peer-to-peer Lending Club in the United States[3].  The dataset is well structured with a data dictionary and covers both loans made and loans rejected.   I will use R to try and find answers to the following questions:

  • Is there a relationship between loan amounts and funded amounts?
  • Is there a relationship between the above and the number of previous public bankruptcies?
  • What can we find out about rejections?
  • What can we find out about the geographic distribution of the Lending Club loans?

During the course of this analysis we will use basic R commands to:

  • Import data
  • Plot data using scatterplots and regression lines
  • Use functions to perform heavy lifting
  • Summarize data using simple aggregation functions
  • Plot data using choropleths

Having downloaded the data to our working directory we’ll import the three files using read.csv and merge them together using rbind() (row bind):

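> # header = TRUE keeps the column names (loan_amnt, funded_amnt, ...) used below.

> # If a file starts with a one-line notice above the column names, add skip = 1.

> data_lending0 <- read.csv("LoanStats3a.csv", header = TRUE)

> data_lending1 <- read.csv("LoanStats3b.csv", header = TRUE)

> data_lending2 <- read.csv("LoanStats3c.csv", header = TRUE)

> data_full <- rbind(data_lending0, data_lending1, data_lending2)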
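When there are more than a handful of files, the same read-and-stack pattern can be written as a loop. The following is only a sketch: it writes two tiny stand-in CSV files to a temporary directory (the file names and values are invented, not Lending Club data), reads them with lapply() and stacks them with do.call(rbind, ...).

```r
# Stand-in files for illustration; names and values are made up.
tmp_dir <- tempfile("loans_")
dir.create(tmp_dir)
write.csv(data.frame(loan_amnt = c(5000, 2400), funded_amnt = c(5000, 2400)),
          file.path(tmp_dir, "part_a.csv"), row.names = FALSE)
write.csv(data.frame(loan_amnt = 10000, funded_amnt = 9500),
          file.path(tmp_dir, "part_b.csv"), row.names = FALSE)

# Find every CSV in the directory and read each one into a data frame.
csv_files <- sort(list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE))
parts <- lapply(csv_files, read.csv, header = TRUE)

# rbind() requires the same column names in every part.
data_demo <- do.call(rbind, parts)
print(dim(data_demo))
```

For the three Lending Club files, explicit rbind() calls as above are just as readable; the loop mainly pays off as the number of files grows.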

We can now explore the data using some of R’s in-built functions for metadata – str (structure), names (column names), unique (unique values of a variable).
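As a quick illustration of these metadata functions, here is a sketch on a toy data frame. The column names mirror two of the Lending Club columns, but the values are invented for the example.

```r
# Toy data frame standing in for data_full; the values are invented.
demo <- data.frame(
  loan_amnt = c(5000, 2400, 10000, 5000),
  addr_state = c("CA", "NY", "CA", "TX"),
  stringsAsFactors = FALSE
)

str(demo)                               # column types and a preview of the values
col_names <- names(demo)                # character vector of column names
states_seen <- unique(demo$addr_state)  # distinct values, in order of first appearance

print(col_names)
print(states_seen)
```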

The first thing I will do is use ggplot to build a simple scatter plot showing the relationship between the funded amount and the loan amount.

> install.packages("ggplot2")

> library(ggplot2)

> ggplot(data_full, aes(x=loan_amnt, y=funded_amnt)) + geom_point(shape=1) + geom_smooth(method=lm)

The above three lines install the package ggplot2 from a CRAN mirror, load the library into the R environment, and then use the “Grammar of Graphics” to build a plot using the data_full dataset, with the x-axis showing the principal (loan_amnt) and the y-axis showing the lender’s contribution (funded_amnt). With geom_smooth we add a line to help see patterns – in this case a line of best fit.

In RStudio we’ll now see the resulting plot.

This shows us clearly that the Lending Club clusters loans at the lower end of the spectrum and that there is a clear positive correlation between loan_amnt and funded_amnt: for every dollar requested, roughly a dollar is funded, so there is little scope for leverage here. Other ggplot functions will allow us to tidy up the labeling and colours, but I’ll leave that as an exercise for the interested reader.
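The visual impression can also be checked numerically. The sketch below uses invented values in place of the real columns and computes the Pearson correlation with cor() and the slope of the fitted line with lm(), which is the model that geom_smooth(method=lm) draws.

```r
# Invented values standing in for data_full$loan_amnt and data_full$funded_amnt.
demo <- data.frame(
  loan_amnt   = c(1000, 5000, 10000, 15000, 20000, 25000),
  funded_amnt = c(1000, 5000,  9500, 15000, 19000, 25000)
)

# Pearson correlation: values near 1 indicate a strong positive linear relationship.
r <- cor(demo$loan_amnt, demo$funded_amnt)

# Line of best fit; a slope near 1 means about one dollar funded per dollar requested.
fit <- lm(funded_amnt ~ loan_amnt, data = demo)
slope <- unname(coef(fit)["loan_amnt"])

print(r)
print(slope)
```

On the real data, rows with missing values would need handling first, for example with use = "complete.obs" in cor().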

 

 

The next step is to add an additional dimension and investigate the link between principals and contributions in light of the applicants’ known public bankruptcies.

>ggplot(data_full, aes(x=loan_amnt, y=funded_amnt, color=pub_rec_bankruptcies)) + geom_point(shape=19,alpha=0.25) + geom_smooth(method=lm)

Here, I’ve used the color element to add the additional dimension and attempted to improve legibility of the visualization by making the points more transparent.

 

 

This was not very successful and doesn’t tell us anything more. Sampling the data, or taking a more focused view, might improve the visualization.
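One way the sampling idea might look is sketched below. The data frame is an invented stand-in for data_full, the 5% sampling rate is an arbitrary choice, and the plot is only drawn if ggplot2 is installed. Treating the bankruptcy count as a factor is also an assumption of this sketch: it gives one discrete colour and one fitted line per count.

```r
set.seed(42)  # fixed seed so the sample is reproducible

# Invented stand-in for data_full with many overlapping points.
n <- 10000
demo <- data.frame(loan_amnt = sample(seq(1000, 35000, by = 500), n, replace = TRUE))
demo$funded_amnt <- demo$loan_amnt
demo$pub_rec_bankruptcies <- sample(0:2, n, replace = TRUE, prob = c(0.9, 0.08, 0.02))

# Keep a random 5% of the rows to reduce overplotting.
keep <- sample(nrow(demo), size = round(0.05 * nrow(demo)))
demo_sample <- demo[keep, ]

# Draw the plot only when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(demo_sample,
              aes(x = loan_amnt, y = funded_amnt,
                  color = factor(pub_rec_bankruptcies))) +
    geom_point(alpha = 0.25) +
    geom_smooth(method = "lm", se = FALSE)
  print(p)
}
```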

 

 

 

 

Let’s have a quick look at the rejection statistics:

> rejections <- rbind(read.csv("RejectStatsA.csv"), read.csv("RejectStatsB.csv"))

> nrow(rejections)/nrow(data_full)

[1] 6.077912

For every loan made there are roughly six rejected applications.

Another popular method of visualization is the choropleth map (from the Greek for “area” and “multitude”). In this case, we’ll build a map showing outstanding loans by State.

The bad news is that the Lending Club data uses two-letter state codes, while the state data we’ll use from the maps package (install.packages, library, etc.) uses full names. Fortunately, a quick search provides a function, “stateFromLower”[4], that will perform the conversion for us. So, I run the code that creates the function, add a new column called state (“wyoming”) to the data_full dataset, and use stateFromLower to convert the addr_state column (“WY”):

> data_full$state <- stateFromLower(data_full$addr_state)
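If the referenced function is not to hand, base R’s built-in state.abb and state.name vectors can do a similar lookup. The helper below, abb_to_name, is a hypothetical name made up for this sketch and is not the stateFromLower function cited above. Note that the built-in vectors cover the 50 states only, so a code such as “DC” comes back as NA.

```r
# Hypothetical helper: two-letter code -> lower-case full state name.
# Relies on state.abb and state.name from R's built-in datasets package.
abb_to_name <- function(abb) {
  tolower(state.name[match(toupper(abb), state.abb)])
}

codes <- c("WY", "CA", "ny", "DC")   # "DC" is not among the 50 built-in states
full <- abb_to_name(codes)
print(full)
```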

Then, I aggregate the funded amounts by state:

> loan_amnts <- aggregate(data_full$funded_amnt, by=list(data_full$state), FUN=sum)
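To see what aggregate() returns, here is a small sketch on invented values. Called with by=list(...), it names the grouping column Group.1 and the summed column x, which is why the later code refers to loan_amnts$x. The formula interface shown second is an alternative that keeps the original column names.

```r
# Invented values; the state column is assumed to hold lower-case full names.
demo <- data.frame(
  state = c("wyoming", "california", "california", "texas"),
  funded_amnt = c(1000, 5000, 7000, 3000)
)

# Same call shape as in the article: result columns are Group.1 and x.
totals <- aggregate(demo$funded_amnt, by = list(demo$state), FUN = sum)
print(totals)

# Formula interface: result columns are state and funded_amnt.
totals_named <- aggregate(funded_amnt ~ state, data = demo, FUN = sum)
print(totals_named)
```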

Load the state data:

> data(state)

The next code leans heavily on a succinct tutorial provided elsewhere[5]:

> library(maps)

> map("state", fill=FALSE, boundary=TRUE, col="red")

> mapnames <- map("state", plot=FALSE)$names

Remove regions:

> region_list <- strsplit(mapnames, ":")

> mapnames2 <- sapply(region_list, "[", 1)
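The effect of these two lines is easiest to see on a few sample names. The vector below is a hand-written stand-in in the “state:subregion” style of the map names, not the actual output of map().

```r
# Hand-written sample in the "state:subregion" naming style.
mapnames_demo <- c("alabama", "new york:manhattan", "new york:main",
                   "washington:san juan island")

# Split each name at the colon, then keep only the first piece (the state).
region_list_demo <- strsplit(mapnames_demo, ":")
states_only <- sapply(region_list_demo, "[", 1)
print(states_only)
```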

Match the region-less mapnames to the loan amounts:

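> # Assumes loan_amnts$Group.1 holds full state names; m gives one row index per map region.

> m <- match(mapnames2, tolower(loan_amnts$Group.1))

> loans <- loan_amnts$x[m]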
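The reason for indexing through match() is that the map has more polygons than there are states, and in a different order from the aggregated table. A sketch with invented values shows the alignment: each map region gets the total for its state, and regions with no data get NA.

```r
# Invented example: four map regions, two states with loan totals.
map_regions <- c("california", "new york", "new york", "wyoming")
totals <- data.frame(Group.1 = c("wyoming", "california"), x = c(1000, 12000))

# For each map region, find the matching row in the totals table.
idx <- match(map_regions, totals$Group.1)

# Indexing with idx yields one value per map region, in map order.
aligned <- totals$x[idx]
print(aligned)
```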

Bucketize the aggregated loans:

> loan.buckets <- cut(loans, c(500000, 1000000, 5000000, 10000000, 15000000, 20000000, 30000000, 40000000, 90000000, 100000000))
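A short sketch of how cut() assigns buckets, using invented totals and a shortened break vector. By default the intervals are open on the left and closed on the right, and values outside the outermost breaks become NA, so a state with less than the lowest break would be left uncoloured on the map.

```r
# Invented totals and a shortened set of breaks for illustration.
loans_demo <- c(250000, 750000, 1000000, 12000000, 95000000)
breaks_demo <- c(500000, 1000000, 5000000, 10000000, 15000000)

# dig.lab = 10 stops the interval labels from switching to scientific notation.
buckets <- cut(loans_demo, breaks = breaks_demo, dig.lab = 10)
print(buckets)

# The integer codes are what an expression like clr[loan.buckets] indexes with.
bucket_codes <- as.integer(buckets)
print(bucket_codes)
```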

Define the colour schema:

> clr <- rev(heat.colors(13))

Draw the map and add the title:

> # The projection argument requires the mapproj package to be installed.

> map("state", fill=TRUE, col=clr[loan.buckets], projection = "polyconic")

> title("Lending Club Loan Amounts by State")

Create a legend:

> leg.txt <- c("$500000", "$1000000", "$5000000", "$10000000", "$15000000", "$20000000", "$30000000", "$40000000", "$90000000", "$100000000")

> legend("topright", leg.txt, horiz = FALSE, fill = clr)

With a few simple lines of code, R has demonstrated that it is a powerful tool for generating visualizations that aid in understanding and analyzing data. We were able to learn something about the Lending Club: it almost seems as if there are clear KPIs for rejections and for the ratio of loan amounts to funded amounts. I think we can also see a link between the Lending Club’s presence and poverty[6].

This could be a starting point for a more detailed analysis into the Lending Club, to support investing decisions, or internal business decisions (reduce rejections, move into Wyoming etc.).

Guest author: Damian Spendel – Damian has spent his professional life bringing value to organisations with new technology. He is currently working for a global bank helping them implement big data technologies. You can contact Damian at damian.spendel@gmail.com

An independent Scotland: “Yes” or “No”, technology needs to prepare

Posted on : 31-08-2014 | By : jo.rose | In : General News

Things are hotting up on the Scottish independence referendum, with only a few weeks to go until we finally find out the decision of the people on September 18th. In my view, last week’s live TV debate between Salmond and Darling did little of substance to steer the undecided in a particular direction. Like a lot of politics, it was more about debating style and posturing.

However, outside of the key issues of North Sea oil, Scottish currency plans, the health service and Trident, what about the impact of a “Yes” vote in terms of technology, both for corporates and commercial tech companies?

Sir Martin Sorrell, chief executive of WPP, has warned against the impact the uncertainty is having on businesses.

“It heightens the level of uncertainty, so it’s not good. There is a vote and we don’t know what the result is going to be. Whether it’s a yes or a no, any business is having to think about what the implications of yes are.”

As we are finally coming out of a long recession this debate is, for many UK bosses, an unwanted distraction. A recent Business Instincts survey (a poll of bosses at businesses throughout the UK) found that under-investment in technology has left many firms with skills shortages and ageing information technology infrastructure.

“Getting the best out of IT investment topped the list of technical issues for firms saying under-investment during the downturn had left them needing to catch up with fast paced changes.”

This is a key issue and a theme that we see in discussions with many of our clients. With capital expenditure on technology dramatically scaled back or cut completely during the recession, many companies now have significant investment programmes under way.

Technology change is also faster paced today than it has ever been, both in terms of independent software vendor upgrades and in terms of innovative or disruptive technologies. Organisations need to catch up on both fronts: from a risk perspective, to avoid being non-current or non-compliant in areas such as security vulnerability management, and from a competitive perspective. It’s a difficult environment, and one where the further implications of a potential separation of Scotland are unwelcome.

Financial Services is of course one of the areas where the impact would be felt most. In an industry already struggling to catch up and deal with the “tsunami of new regulation”, a “Yes” vote would bring in a whole new set of problems with financial reporting structures, tax, risk management and potentially a new currency.

Naturally, all of these have underpinning technology implications. A good comparison is the acquisition and subsequent dismantling of ABN AMRO bank by a consortium back in 2008. Each party “carved out” their respective businesses from what was a tightly coupled, highly complex business and technology infrastructure. The scale of the change was completely under-estimated by the acquirers.

End-to-end services are rarely loosely coupled, either at the application or infrastructure level, and are often managed by multiple vendors as well as internally. ABN AMRO was no different. Separation plans that were over-ambitiously tabled in “numbers of months” were rebaselined many times, and months quickly became years, with technology costs for people and infrastructure following suit.

The problem is not limited to large corporates. Technology SMEs operating in the UK have similar issues. One such is AAISP, a small niche provider that prides itself on providing specialist technical support to its customers. While it sells a range of ADSL and fiber-based broadband services throughout the UK, it has expressed doubts about whether it will be able to offer services in two countries, citing issues such as having to pay corporation tax in two countries, different VAT rates and having to deal with another Ofcom-type regulator in Scotland.

Its owner, Adrian Kennard, states:

“I wonder how many ISPs and other service businesses based in England would simply cut off Scotland, just because of simple commercial common sense.”

He also speculates on BT’s reaction if Scotland broke away from the rest of the UK, arguing that the cost of connecting remote communities might be passed on to customers without regulatory oversight.

Of course, no one really knows. It’s speculation. What we do know, however, is that a “Yes” vote will bring many changes in business processes and structure, all of which will have an impact on technology (arguably, all good for contractors!).

The real issue is what happens if the vote is a “No” but inconclusive. If the margin of victory is too close, then the noise will persist. Maybe we need to start planning for a protracted and bitter divorce settlement.