Tuesday, August 27, 2013

Machine learning ... will it tell me the future?

I recently got pulled into a conversation with one of our really smart analysts. He has a dreamy vision of getting into bureaucratic India and at the same time is really smart when it comes to coding. We got talking and he started describing a competition on Kaggle. While I heard him out and got a bit excited about doing something hands-on, I realized that the industry has really grown around me and I have not had the chance to appreciate that growth.

Kaggle is one of many websites that offer competitions; KD Nuggets, Analytic Bridge etc. are other sites that run them. What is interesting is where the solution approaches seem to be heading. A few years ago (or maybe many years ago, and that can be a separate blog topic), we discussed the coming of age of machine learning at grad school. Given large amounts of data, how can you get accurate predictions for different problems? With the focus being only on predictions, these algorithms were able to match many statistical techniques purely because they lack the constraints that a data-generating model would impose on a statistician. Why are we constraining ourselves from a hypothesis perspective? Has statistics lost out on the chance to be the next cool thing in the world, and will machine learning take over? It makes sense to understand why either one is even relevant in this day and age.

Many machine learning algorithms are inherently black boxes purely because of the way the characteristics relate to the object that needs to be predicted. While there are ways of understanding which characteristics are important and how sensitive the prediction is to them, these can be misleading if not diagnosed properly. Most machine learning techniques have a significant validation component to ensure that the algorithms are robust and can handle exception cases.

Where does this lead us? One of the most interesting expectations from machine learning is that we can live in an I, Robot kind of environment where machines can predict survival rates based on their learning. Google has designed an algorithm that can identify cats (even though I am not sure what it would call them) and there is potential for machines to get smarter with time.

BTW here is a plug for one more analytics competition. Should be fun if you are in college!!! (It is quite rewarding from a financial perspective!)

Monday, August 19, 2013

Statistical paradoxes and their paradoxes!!!!

This happened at work. One of our brighter analysts posed a question about Freedman's paradox. We see a lot of work going on in the predictive modeling space across industries where a bunch of variables are thrown together to get some predictive power on a dependent variable. Sometimes we will get variables that are unrelated to the dependent variable but still appear to have some predictive power (hence the paradox). This made me look at what other paradoxes are around in statistics.
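To get a feel for how unrelated variables can appear predictive, here is a minimal simulation sketch. It is a simplification and not Freedman's original setup (which screens variables in a regression and then refits): it just correlates many pure-noise variables with a pure-noise target and counts how many pass a rough 5% significance screen. The function names, sample sizes and the large-sample threshold of 1.96 / sqrt(n) are my own assumptions for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_obs=100, n_vars=1000, seed=0):
    """Count pure-noise predictors that look 'significant' at roughly the 5% level."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    # Large-sample approximation: with no real relationship, r is roughly N(0, 1/n).
    threshold = 1.96 / math.sqrt(n_obs)
    hits = 0
    for _ in range(n_vars):
        # Each candidate predictor is generated independently of y.
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson_r(x, y)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    hits = count_spurious()
    print(f"{hits} of 1000 noise variables pass the screen")
```

With 1000 candidate variables we should expect somewhere around 50 to pass purely by chance, and a model built only on those survivors would look far better than it deserves to.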

A lot of us are familiar with Simpson's paradox (I will write about this later, as it is one of my favorite topics). Freedman's paradox is related to it in the sense that we get strong relationships when we should expect none.

A bunch of other paradoxes are a lot more interesting, possibly due to my relative ignorance of them. Wikipedia has a link to some of these and I will probably do some in-depth research on them once I am at a better place.

Lindley's paradox - Very interesting paradox that shows how we can reach different conclusions from the same data and the same hypothesis, depending on whether we take a frequentist or a Bayesian approach
False positive paradox - When the infection rate is about 1%, the test is 90% accurate in detecting the infection, and you test 20% of the population, the test will tell about 2% of the population that they have the infection when in reality they do not, and will correctly tell only about 0.18% of the population that they have it. That means more than 90% of the people you told have the infection actually do not have it!!!
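The arithmetic behind the false positive paradox is easy to check with a few lines. This is a sketch under one assumption of mine: "90% accurate" is read as the test being right 90% of the time for both infected and healthy people. The function name and its parameters are made up for illustration.

```python
def false_positive_breakdown(prevalence=0.01, accuracy=0.90, tested_share=0.20):
    """Return (true positives, false positives) as shares of the whole
    population, plus the fraction of positive results that are false."""
    infected_tested = tested_share * prevalence        # 0.2% of the population
    healthy_tested = tested_share * (1 - prevalence)   # 19.8% of the population
    true_pos = accuracy * infected_tested              # infected and flagged
    false_pos = (1 - accuracy) * healthy_tested        # healthy but flagged
    false_share = false_pos / (true_pos + false_pos)
    return true_pos, false_pos, false_share


if __name__ == "__main__":
    tp, fp, share = false_positive_breakdown()
    print(f"true positives:  {tp:.2%} of the population")
    print(f"false positives: {fp:.2%} of the population")
    print(f"share of positives that are false: {share:.1%}")
```

Under these assumptions about 0.18% of the population is correctly flagged and about 1.98% is wrongly flagged, so roughly 92% of the positive results are false.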

Another link that I did some reading on was also good. There are a lot more paradoxes, and the mathematical / probability literature has plenty of them. It would make fun reading for the analytically curious, because you need the ability to recognize these issues when you see them in your day-to-day work.

Monday, August 12, 2013

Can regression measure everything?

For a long time, I have been pushing for a better understanding of regression. Regression gives us insight into the multivariate relationships that exist in the world. It is difficult to visualize these relationships as the number of dimensions can exceed human imagination. For all that complexity though, regression is an ancient concept (measured by the speed at which new techniques come into the analytics industry). Why has it not been adopted more widely to understand the world?

I think the complexity of visualizing these relationships creates resistance to using these ideas. I am adamant that people who show me any analysis think along these lines. There is a chance that some of these insights can be developed slowly from individual analyses. It will be a challenge to ensure that you can highlight everything.

While people look at regression to tell us what will happen, I believe regression is a tool that is best used for measurement. The more complex the relationship, the more important it is to ensure that we get the measurement framework right. The measurement of the impact of engine size on mileage is straightforward, but the measurement of the impact of TV marketing spend on sales is not so easy. Due to significant relationships between many contributing factors, teasing out the impact of TV marketing spend is a challenge that marketers have tried to solve with no easy solution. While these aspects might pose challenges to using a regression framework, there are a couple of other places where regression may be misleading.

1. When we have non-linear temporal relationships, a straightforward regression approach will not measure the relationships accurately, leading to misleading diagnoses.
2. When there is a feedback loop, regression usage might even lead to counter-intuitive relationships. While these relationships may not be difficult to recognize, they need to be measured with other techniques to get the right perspective.
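A toy illustration of the first point, as a sketch only: it ignores the temporal aspect and simply fits a straight line to a perfectly deterministic but non-linear relationship (y = x squared on a symmetric range). The data and the helper function are hypothetical, made up purely for illustration.

```python
def ols_slope_intercept(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    xs = list(range(-5, 6))      # -5, -4, ..., 5
    ys = [x ** 2 for x in xs]    # y is fully determined by x
    slope, intercept = ols_slope_intercept(xs, ys)
    print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The fitted slope comes out as zero even though y is completely determined by x, so a straight-line regression would report "no relationship" here. That is the kind of misleading diagnosis the first point warns about.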

Sunday, August 11, 2013

Why another blog on analytics?

I have been meaning to start this blog for a long time. The analytics industry has matured significantly since I got into my undergrad at Loyola College in Chennai to study statistics. While how I got into this industry is a story by itself, I feel like we are at the cusp of a new revolution in the world which will change the way we (as human beings) operate in the world.

As I read up on the small steps that contribute to this revolution, I want to document them for myself. Since I am aiming to retire into academia, hopefully this will enable me to build up enough tidbits of information to enhance my own learning process (and that of others!).