Sometimes during my projects I hear from clients: “We’re collecting data. Let’s do something cool with it!” There are two popular ways to deal with this.
Scenario 1: start the analysis by following intuition. But to be honest, from the outside it can seem like “black magic” and raise questions: Why did you use this method? Why not that one? The effect is poor trust in the results.
The better solution is Scenario 2: a more formal approach.
CRISP-DM – the Data Science framework
My framework of choice is CRISP-DM (Cross-Industry Standard Process for Data Mining). It dates from 1996, but it still works, and it’s still the most popular Data Science framework (according to a KDnuggets survey).
Why do I use it?
This approach keeps the business goal in mind during the whole analysis process.
By following every step, I create a complete, reproducible and well-documented analysis.
I get an answer to the initial question.
6 steps to find value in data
This framework defines 6 phases of the data analysis process:
1. Business Understanding
Don’t be afraid to ask questions. Talk with business team members.
Check the business context and set the goal of the analysis.
What do you want to achieve with this analysis?
2. Data Understanding
Do some exploratory data analysis. Check distributions: some methods assume normally distributed variables.
What does a missing value mean? Is it an error in data collection or in data processing?
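As an illustration of these checks, here is a minimal Python sketch using only the standard library. The column `order_value` and the helper `summarize` are hypothetical names made up for this example. It counts missing values and computes the skewness of the values that are present; a skewness far from 0 hints that the variable is not normally distributed.

```python
import statistics

def summarize(values):
    """Return missing count, mean, standard deviation and skewness for one column."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    mean = statistics.fmean(present)
    sd = statistics.pstdev(present)
    # Population skewness: close to 0 for a symmetric distribution,
    # clearly positive when there is a long right tail.
    skew = sum((v - mean) ** 3 for v in present) / (len(present) * sd ** 3)
    return {"missing": missing, "mean": mean, "sd": sd, "skewness": skew}

# Hypothetical column: order values with two missing entries and one large outlier.
order_value = [12.0, 15.0, None, 14.0, 13.0, 90.0, None, 16.0]
report = summarize(order_value)
print(report)
```

Skewness is only a rough first signal. A histogram or a Q-Q plot gives a better picture, and the missing entries still need the question above answered before you decide how to treat them.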
3. Data Preparation
Check variable types (factor / continuous).
Do some data cleaning.
Create new variables if necessary (e.g. convert a numeric variable to a flag).
The great book R for Data Science by Garrett Grolemund and Hadley Wickham can be very helpful for data wrangling in R.
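The preparation steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `customers` records, the median imputation rule and the `has_ordered` flag are all hypothetical, and the right cleaning rule always depends on what a missing value means in your data.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
customers = [
    {"id": 1, "orders": 0, "age": 34},
    {"id": 2, "orders": 5, "age": None},
    {"id": 3, "orders": 2, "age": 51},
]

def prepare(rows):
    """Impute missing ages with the median and add a binary flag."""
    known_ages = [r["age"] for r in rows if r["age"] is not None]
    fallback_age = median(known_ages)
    prepared = []
    for r in rows:
        clean = dict(r)  # copy, so the raw data stays untouched
        if clean["age"] is None:
            clean["age"] = fallback_age
        # Numeric -> flag: has this customer ordered at all?
        clean["has_ordered"] = 1 if clean["orders"] > 0 else 0
        prepared.append(clean)
    return prepared

prepared = prepare(customers)
print(prepared)
```

Working on a copy keeps the raw data available, which helps keep the analysis reproducible.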
4. Modeling
Divide the dataset into training and test subsets.
Analyze the data with methods appropriate to the problem.
A good practice is to train a few models and then decide which method gives the best (most accurate) output.
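A minimal Python sketch of this workflow, under stated assumptions: the data is a made-up list of `(feature, label)` pairs, and the two "models" (a majority-class baseline and a 1-nearest-neighbour rule) are deliberately tiny stand-ins for whatever methods fit your problem. All function names are hypothetical.

```python
import random

def train_test_split(rows, test_share=0.25, seed=42):
    """Shuffle a copy of the rows and split it into training and test subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * (1 - test_share))
    return shuffled[:cut], shuffled[cut:]

def fit_majority(train):
    """Baseline: always predict the most common label in the training data."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_nearest_neighbour(train):
    """1-nearest-neighbour on a single numeric feature."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]

def accuracy(model, test):
    """Share of test observations that the model labels correctly."""
    return sum(model(x) == y for x, y in test) / len(test)

# Hypothetical data: the label is 1 when the feature is at least 50.
data = [(x, int(x >= 50)) for x in range(0, 100, 5)]
train, test = train_test_split(data)

scores = {
    "majority": accuracy(fit_majority(train), test),
    "nearest_neighbour": accuracy(fit_nearest_neighbour(train), test),
}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

The models are fitted on the training subset only and scored on the test subset, so the comparison reflects how each method handles data it has not seen.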
5. Evaluation
Build a confusion matrix (error matrix) for every model to compare their error rates. Then either choose the single best model or combine all models by voting (e.g. if 3 of 4 models assign an observation to a group, take that result).
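Both ideas, the error matrix and voting, can be sketched in Python with the standard library. The labels and the predictions of the three models below are invented for illustration. Three models are used instead of four so that a binary vote can never end in a tie.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs; pairs with different values are errors."""
    return Counter(zip(actual, predicted))

def error_rate(actual, predicted):
    """Share of observations where the prediction differs from the actual label."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

def majority_vote(predictions_per_model):
    """For each observation, take the label predicted by most models."""
    voted = []
    for votes in zip(*predictions_per_model):
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted

# Hypothetical test labels and predictions from three models.
actual  = [1, 0, 1, 1, 0, 0]
model_a = [1, 0, 0, 1, 0, 0]
model_b = [1, 1, 1, 1, 0, 0]
model_c = [0, 0, 1, 1, 0, 1]

models = {"a": model_a, "b": model_b, "c": model_c}
rates = {name: error_rate(actual, preds) for name, preds in models.items()}
ensemble = majority_vote(list(models.values()))

print(rates)
print(confusion_matrix(actual, model_c))
print("ensemble error rate:", error_rate(actual, ensemble))
```

In this toy example each model makes its mistakes on different observations, so the vote corrects them. With real models the errors are often correlated, and the ensemble should be evaluated on the test set like any other candidate.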
6. Deployment
A very important step. Take the result of the analysis and act on it. Write and share a report. Build a data-driven feature into your app. Include the results in the business model. Sometimes the analysis raises new questions; if so, start the next CRISP-DM iteration.
Putting it all together, this chart illustrates the whole process: