Tuesday, January 14, 2014

Applications of Graph Theory

In December I had the opportunity to attend a seminar by a very interesting mathematician. Persi Diaconis was talking about Graph Theory at ISI Bangalore, and I attended the session with one of my work colleagues. There are a couple of interesting things about Diaconis worth knowing. He is a mathematician who is into magic, and he was featured in Alex Stone's book "Fooling Houdini", which I managed to read during one of the rare moments I get with books.

The talk itself was interesting in the sense that I got to hear about something at the edges of mathematics. Graph theory has interested me for more than a couple of years now, but I have not been able to get beyond high-level information. In this case Persi identified some interesting properties that he believed graphs should possess, and showed how some of these properties might not actually sit well with each other when you need to prove them.

In all honesty, I think this (in the world of Graph Theory) is where there will be significant development from an analytics perspective. Graphs have obvious applications in the world of social networks and are seeing more usage in consumer marketing, which has long fascinated itself with understanding customer referenceability. Product basket analyses have been looking at these ideas for a while, with significant interest in what consumers shop for and how you can increase sales of non-essential items through bundling. We have all probably used one of the most important applications of graph theory: the PageRank algorithm from Google, which ranks the relevance of web pages.
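As a small illustration (my own toy example, not something from the talk), the PageRank idea is easy to play with in R using the igraph package on a made-up graph of pages linking to each other:

# Minimal sketch: PageRank on a toy "web" graph using the igraph package.
# The graph below is invented purely for illustration.
library(igraph)

# Directed edges read as "page A links to page B", and so on
g <- graph_from_literal(A -+ B, A -+ C, B -+ C, C -+ A, D -+ C)

# page_rank() returns a score per vertex; a higher score means a more "relevant" page
pr <- page_rank(g, damping = 0.85)
round(sort(pr$vector, decreasing = TRUE), 3)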

In recent times, there have been courses that showcase how to use these powerful techniques. Coursera has a class on "Social Network Analysis" that is good enough to get you going on this journey, and there was another class on "Probabilistic Graphical Models" aimed at more advanced users of the idea. Given all the applications in real life, there is significant potential for these tools to move people away from a single-minded focus on regression paradigms in a nice way.

Sunday, October 20, 2013

Different types of regression

I have always felt that regression is a very versatile tool. It can be used for measurement (to explain what happened), for analysis (to understand drivers) and for forecasting. It has a long history and still has relevance in our analytical suite of tools. 

Some of the evolution of regression is very interesting from the perspective of how its shortcomings have been addressed. The main arguments against regression are that it does not handle multicollinearity well (especially when you need driver analysis) and that some of its assumptions (like the independence of the errors and the explanatory variables) never seem to be satisfied. Research on these dimensions has led to methods that can handle these issues. There are three interesting ideas that I want to highlight in this week's blog post.

There are many ways to handle multicollinearity in analysis. Its importance comes from the fact that when one needs to measure the impact of key variables, the measurement needs to be independent of other variables that could bias it. Principal component analysis and factor analysis are options for handling multicollinearity, but there are significant challenges in interpreting the results afterwards. Latent class models are another good way of handling this (and I will be discussing them in the future). Ridge (and lasso) regression is a simple idea for handling multicollinearity in regression. Conceptually, ridge regression introduces a small amount of bias into the coefficient estimates by penalizing large coefficients. This has the effect of reducing the variance of the estimates, which leads to better results from an analysis perspective.
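As a rough sketch (a toy example of my own, not from any particular project), ridge regression is straightforward to try in R with the glmnet package, where alpha = 0 gives ridge and alpha = 1 gives the lasso:

# Toy sketch of ridge regression with glmnet on deliberately collinear data
library(glmnet)

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + 0.1 * rnorm(n)   # nearly a copy of x1, so x1 and x2 are collinear
x3 <- rnorm(n)
y  <- 2 * x1 + 1 * x2 + 0.5 * x3 + rnorm(n)
X  <- cbind(x1, x2, x3)

# Cross-validation picks the penalty (lambda) that trades a little bias for less variance
cv_fit <- cv.glmnet(X, y, alpha = 0)
coef(cv_fit, s = "lambda.min")    # shrunken, more stable coefficients

# Compare with plain least squares, where the x1 / x2 estimates are unstable
coef(lm(y ~ x1 + x2 + x3))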

One other disadvantage of least squares regression is its lack of flexibility. Variable transformations and interactions do add a fair amount of flexibility, but there is one technique that adds a lot more. Local regression (also known as LOESS, or LOWESS for locally weighted scatterplot smoothing) adds the kind of flexibility that many machine learning techniques have. It is more computationally intensive, but it can deliver flexible yet interpretable results. Local regression essentially builds models on local subsets of the data and can hence handle very non-linear relationships well.
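Here is a quick sketch (made-up data) of local regression using base R's loess(), with an ordinary straight-line fit added for comparison:

# Local regression with loess() on a made-up non-linear relationship
set.seed(2)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(length(x), sd = 0.3)

# span controls how "local" each fit is: a smaller span gives a wigglier, more flexible curve
fit <- loess(y ~ x, span = 0.3)

plot(x, y, pch = 16, col = "grey60", main = "Local regression (loess)")
lines(x, predict(fit), lwd = 2)
lines(x, predict(lm(y ~ x)), lty = 2)   # straight-line least squares fit for comparison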


One interesting issue in regression usage has been the difficulty of dealing with counter-intuitive results. Bayesian regression provides an approach for formulating hypotheses (prior beliefs) that can be incorporated into the regression analysis. This lets prior knowledge play an important role in the analysis while minimizing very counter-intuitive results. Of course, as with all regression techniques, the modeler will need to use their judgment to get to the best models.
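To make the idea concrete, here is a toy sketch of my own (assuming a normal prior on the coefficients and a known noise variance, which keeps the math in closed form): the posterior estimate is simply a compromise between the data and the prior.

# Toy sketch: Bayesian linear regression with a normal prior on the coefficients.
# Assumes a known noise variance for simplicity; data and prior are made up.
set.seed(3)
n <- 50
x <- rnorm(n)
y <- 1.5 * x + rnorm(n)

X      <- cbind(1, x)   # design matrix with an intercept
sigma2 <- 1             # assumed (known) noise variance
mu0    <- c(0, 1)       # prior mean: "we believe the slope is around 1"
tau2   <- 0.25          # prior variance: how strongly we believe it

# The posterior mean shrinks the least squares estimate towards the prior mean
V_post <- solve(t(X) %*% X / sigma2 + diag(2) / tau2)
b_post <- V_post %*% (t(X) %*% y / sigma2 + mu0 / tau2)

cbind(least_squares = coef(lm(y ~ x)), posterior_mean = as.vector(b_post))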

In any case, there is a lot more to regression than meets the eye! 

Tuesday, October 8, 2013

R and Shiny

After my previous post on Julia, I wanted to get back to R to make sure I have explored everything it has to offer. In an attempt to learn something new, I decided to take on the worlds of HTML5, JavaScript, visualization and teaching. All of this came together in one single R package called Shiny. This package is quite neat, as it allows you to create web applications for statistical analysis. In the interest of learning something new and being able to teach something that I like, I decided to create a web application for power analysis. It might be a simple thing for many folks, but I wanted to showcase the power of Shiny along with some new-found knowledge that I gained about R.

First, Shiny. It has two sides: a UI side, which defines the front end, and a server side, which hosts the R program running in the background. The UI side holds the layout and input elements of the webpage, and the server side generates the output that is needed; the statistical work typically happens on the server side. On my localhost it was quite fast, and you do not notice any lag between changing inputs and observing outputs.
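To show the shape of the two sides, here is a minimal sketch (not my actual power-analysis app, just an illustrative layout with made-up inputs):

# ui.R - the layout and input elements
library(shiny)
shinyUI(pageWithSidebar(
  headerPanel("Power analysis demo"),
  sidebarPanel(
    sliderInput("n",     "Sample size per group",    min = 10,  max = 500, value = 50),
    sliderInput("delta", "True difference in means", min = 0.1, max = 2,   value = 0.5)
  ),
  mainPanel(plotOutput("powerPlot"))
))

# server.R - the R code that reacts to the inputs and produces the output
library(shiny)
shinyServer(function(input, output) {
  output$powerPlot <- renderPlot({
    pw <- power.t.test(n = input$n, delta = input$delta, sd = 1, sig.level = 0.05)
    barplot(pw$power, ylim = c(0, 1), ylab = "Power",
            main = sprintf("Power = %.2f", pw$power))
  })
})

Dropping the two files into a folder and calling runApp() on that folder launches the app in the browser.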

A few things that I learned in this process which are slightly ancillary to Shiny! 

1. How to plot multiple elements in a single graph. The one that I have here has about 6 elements in it.
2. How to get Greek letters to work in R. I did not know it could be done, but figured it had to be possible, since this is, at the end of the day, a statistical package.
3. How to actually demo the impact of sample size and significance level to students and show them that it is not always the case that alpha = 0.01 makes things better (see the sketch right after this list).
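Here is a rough sketch of those three points on made-up numbers (toy code, not the exact app code): the power of a two-sample t-test as the sample size grows, at two significance levels, with several elements layered onto one graph and Greek letters added via R's plotmath expressions.

# Power versus sample size at two significance levels
n_seq  <- seq(10, 200, by = 5)
pow_05 <- sapply(n_seq, function(n)
  power.t.test(n = n, delta = 0.4, sd = 1, sig.level = 0.05)$power)
pow_01 <- sapply(n_seq, function(n)
  power.t.test(n = n, delta = 0.4, sd = 1, sig.level = 0.01)$power)

# Multiple elements in one graph: two curves, a reference line and a legend
plot(n_seq, pow_05, type = "l", lwd = 2, ylim = c(0, 1),
     xlab = "Sample size per group", ylab = "Power",
     main = expression("Power vs. sample size for different " * alpha))  # Greek via plotmath
lines(n_seq, pow_01, lwd = 2, lty = 2)
abline(h = 0.8, col = "grey50")
legend("bottomright", lty = c(1, 2),
       legend = c(expression(alpha == 0.05), expression(alpha == 0.01)))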
The big deal about Shiny is that it enables you to have discussions around analysis with your clients / partners in a very interactive manner. This allows you to explore the full dimensions of the analysis with the business partner and hence get to better decisions in the long run. For now, it has helped me build something that I thought needed to be built, and in the near future I am going to do more with it to showcase regression results. There are a few examples here that are worth exploring.

Wednesday, October 2, 2013

Another Statistical Language

At a talk that I recently attended, I learnt about a new statistical language. The initial question I had before attending the talk was why I would need a new language, and even after the talk I could not really get a good handle on the answer. Even though a lot of analytics professionals do not look at SAS as a statistical language, there are those among us who are quite comfortable with that idea and can live with R and SAS. So why do we need another language?

The talk itself was fairly interesting. The language was Julia, and the speaker was Viral Shah, one of the founders of the language. Since the perspective of the founder of anything is about why they did something, it usually makes for an interesting talk. I learned interesting things about the different elements of rating (or evaluating) a programming language, and how these elements can change as hardware and technology improve. (Hence you could always see new languages arriving in the future.)

The first thing of interest in the talk was the fact that there are a million (well, maybe not that many) languages out there. They differ from each other in some fashion or other, making them the preferred language for some and not so preferred for others. Some, like C and Fortran, have history (and speed) associated with them. Others, like Matlab (and Octave from the open source world), have a mathematical flavor in their workings. Still others, like S and R, have a stats flavor and their own followers. It makes for a very fragmented world, and at some level it does not enable people to talk to each other. This is apart from the typical stat analysis packages like SAS, TREENET, SYSSOFT etc. that people use for day-to-day analysis and data manipulation.

Anyway, Julia is supposed to be a new paradigm in technical computing. It does have some noteworthy features, including beating the crap out of other packages on key speed benchmarks, and it is open source, but I just lost track of some of the other features. It is faster because it has a JIT compiler (to be honest, I am not sure how this helps, and I am not even sure this is why it is faster), so it does not run into the interpretation overhead that R has (at least as I understand it). It is optimized for parallel computing (and I thought even R had that, but now I am not sure!). There are other features, I am sure, but what is interesting is how the community around it is growing. They already have more than 175 packages, as far as I understand, in a span of less than a year of being public!

It looks like multi-core support is coming soon, as well as some level of support for GPU computing. The question is whether the world will have moved on by then! I want to think that this is the day of everything happening online, so there will soon be a world where you do not have to download anything: you just work in your browser and you are set (which makes trying new software a cinch!). I am not sure where that leaves me, though. I am still playing with R to the extent that every day feels like I have discovered new features of a toy (wait for my next post!). Not sure how to make the switch!

Saturday, September 21, 2013

Analytics education... What is the best way to get ideas across?

While trying to figure out what to do in life, I am in the process of exploring the analytics education space. Given my background in this space, and the goal of retiring into this kind of job, I have been thinking about it for a while. However, I had not done a whole lot in this space, other than some training conducted in a haphazard manner, until the beginning of this year.

I am a firm believer in the idea that if you want to learn something you need to teach it (hopefully to a bunch of interested folks). I have developed some perspectives on what it takes to be a good business analyst, and there is a significant element of business input needed for that. However, the quantitative element is very important, as business intuition takes time to develop. Until recently, though, the focus on this element has been minimal in most education programs. MBA programs are now introducing more rigorous quant subjects and could actually make a full-time quant MBA a possibility soon. Is it possible to make it better now? It feels like it should be available in college as well and not just in specialized MBA programs.

In the interest of my personal journey, I have decided to do a couple of things. The first is to take a step back and do some learning: I have signed up for a couple of topics on Coursera to see how it feels to learn something in this new world. I might be challenged from a pure discipline perspective, but I will need to try. The second thing I am trying to do is evaluate different mediums from the perspective of analytics learning. My particular interest is in quantitative subjects, but this will be an interesting experience for me to explore other areas too, as there will be significant learnings that I am hoping to get. These mediums span the breadth of technology, from the plain classroom to Android / Windows / iPhone apps. There seems to have been a sea change in this world from the time I studied many of these subjects.


I have been talking to the director of a leading MBA education institution in Bangalore about conducting training at their location. This seemed like a good place to start looking at how outsiders take to analytical education as compared to insiders. (My perspective is that when you pay for it, you are more than willing to learn, but if it is free then who cares - I am a shining example of this.) A colleague of mine also introduced me to a company that does coaching. I need to see how that will work out, and I am still struggling to get my thoughts about my future in order, but it looks like there is some potential to do some interesting stuff there too.

Friday, September 13, 2013

Treenet and Stochastic Gradient Boosting

While I am someone who likes to go deep into techniques, I rarely get the time (intellectual honesty would probably require me to say that I get distracted easily by what is happening around me) to understand something technical. However, there is an expectation that I "get it faster!" So for this week, I wanted to actually go deep into one such topic and bring the concepts out as simple, intuitive ideas.

This week I want to get into Stochastic Gradient Boosting. Partly because I understand it well enough to explain, I guess, but another reason that is a little more personal is the fact that I once had a dinner conversation with the inventor of this methodology. Jerome Friedman gave a presentation at an event at my previous job, and since I was organizing the event I was able to meet up with him for dinner along with a few other colleagues. It always feels good to be with statistical royalty, and these folks are quite down to earth. While the dinner was good, the conversation was better, as we learned about his Princeton days when he was a colleague of John Nash, the famous Nobel Prize-winning economist.

Anyway, coming to the key idea of this blog!!! Stochastic gradient boosting is an approach used to improve supervised learning methods. In a typical classification problem, accuracy needs to be improved without overfitting the data. With any single algorithm, all one can typically do is come up with better features to improve the model. There is significant learning to be had from the classification error, though. Wherever the error is high, there is an opportunity for improvement: modeling the error of the model you have built so far (using whatever algorithm got you this far) allows you to reduce it further. However, there is one problem to watch out for. The errors also contain noise, so chasing them too aggressively picks up spurious relationships. Penalizing (shrinking) each error-correcting step reduces the chance of a variable significantly impacting the analysis unless there is real value coming from it being in the model. This, in a nutshell, is SGB, and Treenet is a commercial implementation of the idea using decision trees.
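To make the "model the error, then add a small correction" idea concrete, here is a hand-rolled sketch for squared-error loss on made-up data (toy code of my own, not how Treenet itself is implemented):

# Repeatedly fit a small tree to the current residuals and add a shrunken
# version of its predictions to the model
library(rpart)

set.seed(4)
n  <- 500
df <- data.frame(x1 = runif(n), x2 = runif(n))
df$y <- sin(4 * df$x1) + (df$x2 > 0.5) + rnorm(n, sd = 0.2)

pred   <- rep(mean(df$y), n)   # start from a constant model
shrink <- 0.1                  # the penalty / learning rate on each correction

for (m in 1:200) {
  df$resid <- df$y - pred                      # the current errors
  tree <- rpart(resid ~ x1 + x2, data = df,
                control = rpart.control(maxdepth = 2))
  pred <- pred + shrink * predict(tree, df)    # add a small correction
}

mean((df$y - pred)^2)   # the training error shrinks as trees are added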

R also has implementations of stochastic gradient boosting. Actually it has many. The gbm package is a good place to start, as it has a straightforward implementation of boosting, and from there one can explore more advanced packages that implement boosting for other algorithms, including regression (l2boost) and SVM (wSVM).
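A minimal sketch of gbm usage on made-up binary data (my own toy example) would look something like this; the shrinkage and bag.fraction arguments map directly to the penalization and the "stochastic" subsampling discussed above:

# Boosted classification trees with the gbm package on simulated data
library(gbm)

set.seed(5)
n  <- 1000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- rbinom(n, 1, plogis(1.5 * df$x1 - df$x2))

fit <- gbm(y ~ x1 + x2 + x3, data = df,
           distribution      = "bernoulli",  # classification via logistic loss
           n.trees           = 1000,
           interaction.depth = 2,
           shrinkage         = 0.05,         # penalize each tree's contribution
           bag.fraction      = 0.5,          # the "stochastic" subsampling part
           cv.folds          = 5)

best <- gbm.perf(fit, method = "cv")         # number of trees before overfitting sets in
summary(fit, n.trees = best)                 # relative influence of each variable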

I guess as a next step I should read Jerome Friedman's paper and synthesize this! 

Monday, September 9, 2013

Analytical software for analysts - is it way too complex?


Is there analytical software out there that actually makes learning from data intuitive? I have experience with quite a few of these packages, but none of them is intuitive for the average business analyst without becoming useless once you go beyond looking at data in one or two dimensions. While this is good for business, I must admit that it makes life difficult, as the problems one has to tackle get quite mundane when responding to queries from the not-so-statistically-literate.

What would be the ideal requirements for someone to actually be able to get ideas from data? Let us assume that the average user has a sense of the business he / she is dealing in. At the end of the analysis, he should be able to get a sense of how to drive the business forward, or at least a good sense of which drivers to explore further. Let us further assume that the average business user also has the ability to understand counter-intuitive results, can comfortably handle two-dimensional analysis, can possibly handle three-dimensional analysis, but will be unable to move beyond that.

Ideally, when my business problems are well-defined (in the sense that I at least know what I want to solve initially, even though I might later realize that I need to solve something much larger), these tools should be able to drive at least some initial value for the analysts by incorporating those business requirements. But when I am sifting through data without a clue as to what I am looking for, how do I identify patterns that are meaningful without being required to live in that business domain forever?

Regression analysis requires significant understanding of the statistics to be able to confidently drive the analysis. CART / CHAID type algorithms are relatively easier to understand, but I am not sure there are decent software implementations that make the learnings from CHAID / CART intuitive. Bayesian networks or topological data analysis might be an answer, but I have not worked enough with these to have a viewpoint from an implementation perspective. These are good at identifying patterns but do not necessarily make it easier for the business to read the results well.
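As a small illustration of why tree-based methods feel more intuitive, here is a toy sketch using R's rpart package and its built-in kyphosis data set; the output reads as plain if / then rules rather than coefficients.

# A CART-style tree whose output reads as simple splitting rules
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

print(fit)               # the rules in text form
plot(fit, margin = 0.1)  # the same rules as a picture
text(fit, use.n = TRUE)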

Ultimately, I believe business problems need to be solved with the business context in mind, and there is no general software that will enable that. Is it time for one to be created?