
Find signal in noise

How to find value in messy data?

Sometimes during my projects I hear from clients: “We’re collecting data. Let’s do something cool with it!” There are two popular ways to deal with that.

Scenario 1: start the analysis by following intuition. But to be honest, from the outside it can look like “black magic” and it raises questions: Why do you use this method? Why not that one? The effect is poor trust in the results.

The better solution is Scenario 2: a more formal approach.

CRISP-DM – the Data Science framework

My framework of choice is CRISP-DM (Cross-Industry Standard Process for Data Mining). It dates from 1996 but still works, and it is still the most popular Data Science framework (according to a KDnuggets survey).

Why do I use it?

  • This approach keeps the business goal in mind during the whole analysis process.
  • By following every step I create a complete, reproducible and well-documented analysis.
  • I get an answer to the initial question.

6 steps to find value in data

This framework defines 6 phases of the data analysis process:

1. Business Understanding

  • Don’t be afraid to ask questions. Talk with business team members.
  • Check the business context and set the goal of the analysis.
  • What do you want to achieve with this analysis?

2. Data Understanding

  • Collect the dataset.
  • Check what kind of data you have. What does every variable mean? Are there any technical limitations of the data source? (E.g. in the web analytics world most tracking tools rely on cookies and JavaScript. If a user has disabled them in the browser, that user won’t be tracked and won’t appear in the dataset.)
  • Do some exploratory data analysis. Check distributions: some methods assume normally distributed variables.
  • What does a missing value mean? Is it an error in data collection or in data processing?
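
As a rough illustration of the checks above, here is a minimal sketch in Python that uses only the standard library. The sample values, the variable `session_seconds` and the helper `summarize` are made up for this example; a real project would more likely use a dedicated data analysis library.

```python
import statistics

# Hypothetical sample: session durations in seconds; None marks a missing value.
session_seconds = [12.0, 15.5, None, 14.2, 90.0, 13.1, None, 16.4]

def summarize(values):
    """Return basic descriptive statistics and the number of missing values."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.mean(present)
    stdev = statistics.stdev(present)
    # Rough moment-based skewness estimate: values far from 0 suggest
    # that the distribution is not symmetric (so probably not normal).
    skewness = sum(((v - mean) / stdev) ** 3 for v in present) / n
    return {
        "n": len(values),
        "missing": len(values) - n,
        "mean": mean,
        "median": statistics.median(present),
        "stdev": stdev,
        "skewness": skewness,
    }

summary = summarize(session_seconds)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A mean far above the median together with a clearly positive skewness hints at outliers or a long right tail. The count of missing values is the starting point for asking where they come from.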

3. Data Preparation

  • Check variable types (factor / continuous).
  • Do some data cleaning.
  • Create new variables if necessary (e.g. convert a numeric variable to a flag).
  • The great book R for Data Science by Garrett Grolemund and Hadley Wickham can be very helpful for data wrangling in R.
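
The preparation steps can be sketched like this in plain Python. The records, the column names and the `prepare` helper are hypothetical and only show the idea: cast types, keep missing values explicit and derive a flag from a numeric variable.

```python
# Hypothetical raw records, e.g. exported from a tracking tool as strings.
raw_rows = [
    {"user_id": "u1", "channel": "organic", "purchases": "0"},
    {"user_id": "u2", "channel": "paid", "purchases": "3"},
    {"user_id": "u3", "channel": "organic", "purchases": ""},
]

def prepare(row):
    """Cast types, keep missing values explicit and derive a flag variable."""
    text = row["purchases"].strip()
    purchases = int(text) if text else None  # empty string means missing, not 0
    return {
        "user_id": row["user_id"],
        "channel": row["channel"],  # factor (categorical) variable
        "purchases": purchases,     # numeric variable
        # Flag derived from the numeric variable; stays None when unknown.
        "has_purchased": None if purchases is None else purchases > 0,
    }

clean_rows = [prepare(row) for row in raw_rows]
for row in clean_rows:
    print(row)
```

Treating an empty value as missing rather than as zero is a deliberate choice here. Which one is right depends on what a missing value means in the data source.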

4. Modeling

  • Divide the dataset into training and test subsets.
  • Analyze the data with methods that fit the goal.
  • A good practice is to train a few models and then decide which method gives the best (most accurate) output.
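
A minimal sketch of this workflow, assuming a toy dataset with one numeric feature and a binary label. The data is generated on the spot and the two "models" (a majority-class baseline and a single cut-off rule) are deliberately simple stand-ins for real algorithms; all function names are made up for this example.

```python
import random

# Hypothetical labelled data: (feature value, class label).
rng = random.Random(42)
data = [(x, int(x > 0.6)) for x in (rng.random() for _ in range(200))]

def train_test_split(rows, test_share=0.25, seed=0):
    """Shuffle a copy of the rows and split it into training and test subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_share))
    return shuffled[:cut], shuffled[cut:]

def fit_majority(train):
    """Baseline model: always predict the most frequent training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_threshold(train):
    """Simple model: pick the cut-off on x with the best training accuracy."""
    candidates = sorted(x for x, _ in train)
    best = max(candidates, key=lambda t: sum(int(x > t) == y for x, y in train))
    return lambda x: int(x > best)

def accuracy(model, rows):
    """Share of rows for which the model predicts the true label."""
    return sum(model(x) == y for x, y in rows) / len(rows)

train, test = train_test_split(data)
models = {"majority": fit_majority(train), "threshold": fit_threshold(train)}
# Models are fitted on the training subset and compared on the test subset only.
scores = {name: accuracy(model, test) for name, model in models.items()}
best_name = max(scores, key=scores.get)
print(scores, "->", best_name)
```

The baseline is useful as a sanity check: a model that cannot beat "always predict the most common class" on the test subset is not worth deploying.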

5. Evaluation

  • Build an error matrix (confusion matrix) to compare the error rate of every model. Then either choose the single best model or use all models together with voting (e.g. if 3 of 4 models assign an observation to a group, take that result).
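
Both ideas, the error matrix and voting, can be sketched in a few lines of Python. The labels and the predictions of the four models below are invented for illustration, and the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical true labels and predictions of four already-trained models.
actual = ["buy", "buy", "skip", "skip", "buy", "skip"]
predictions = {
    "model_a": ["buy", "skip", "skip", "skip", "buy", "skip"],
    "model_b": ["buy", "buy", "skip", "buy", "buy", "skip"],
    "model_c": ["buy", "buy", "buy", "skip", "skip", "skip"],
    "model_d": ["skip", "buy", "skip", "skip", "buy", "skip"],
}

def error_matrix(actual, predicted):
    """Count (actual, predicted) pairs; pairs with different labels are errors."""
    return Counter(zip(actual, predicted))

def error_rate(actual, predicted):
    """Share of observations where the prediction differs from the true label."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)

def majority_vote(predictions):
    """For each observation, take the label that most models agree on.

    With an even number of models a tie is possible; here the label seen
    first wins, which is an arbitrary choice made for this sketch.
    """
    per_observation = zip(*predictions.values())
    return [Counter(votes).most_common(1)[0][0] for votes in per_observation]

rates = {name: error_rate(actual, preds) for name, preds in predictions.items()}
voted = majority_vote(predictions)
rates["vote"] = error_rate(actual, voted)

print(error_matrix(actual, predictions["model_c"]))
print(rates)
```

In this made-up example every model makes at least one mistake, but the mistakes fall on different observations, so the vote recovers all true labels. With real models the errors are often correlated and the gain from voting is smaller.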

6. Deployment

  • A very important step. Take the results of the analysis and act on them. Write and share a report. Build a data-driven feature in your app. Include the results in the business model. Sometimes new questions come up; if so, start the next CRISP-DM iteration.

Putting it all together, this chart illustrates the whole process:




I gave a session on this topic at Measure Camp London #9. Slides from my session are now available on SlideShare:


And some thoughts from my audience 🙂 Thanks!