How to find value in messy data?
Sometimes during my projects I hear from clients: “We’re collecting data. Let’s do something cool with it!” There are two popular ways to deal with that.
Scenario 1: start the analysis by following intuition. But to be honest, from the outside it can look like “black magic” and it raises questions: Why did you use this method? Why not that one? The result is poor trust in the findings.
The better solution is Scenario 2: a more formal approach.
CRISP-DM – the Data Science framework
My framework of choice is CRISP-DM (Cross Industry Standard Process for Data Mining). It dates from 1996 but still works, and it is still the most popular Data Science framework (according to a KDnuggets survey).
Why do I use it?
- This approach keeps the business goal in mind during the whole analysis process.
- By following every step I create a complete, reproducible and well-documented analysis.
- I get an answer to the initial question.
6 steps to find value in data
This framework defines 6 phases of the data analysis process:
1. Business Understanding
- Don’t be afraid to ask questions. Talk with the business team members.
- Check the business context and set the goal of the analysis.
- What do you want to achieve with this analysis?
2. Data Understanding
- Collect the dataset.
- Do some exploratory data analysis. Check the distributions: some methods assume that variables are normally distributed.
- What does a missing value mean? Is it an error in data collection or in data processing?
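The checks above can be sketched in a few lines. This is only a minimal illustration in plain Python, not part of CRISP-DM itself: the toy dataset, the column names and the helper functions `missing_share`, `skewness` and `profile` are all hypothetical, and skewness is used here as one rough indicator of how far a variable is from a symmetric, normal-like shape.

```python
import statistics

# Hypothetical toy dataset; None marks a missing value.
rows = [
    {"age": 34, "income": 4200.0},
    {"age": 29, "income": None},
    {"age": None, "income": 3900.0},
    {"age": 41, "income": 25000.0},
    {"age": 38, "income": 4100.0},
]

def missing_share(rows, column):
    """Fraction of rows in which the column is missing."""
    return sum(1 for r in rows if r[column] is None) / len(rows)

def skewness(values):
    """Population skewness; values near 0 suggest a roughly symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def profile(rows, column):
    """Summarise one column: how much is missing and how skewed the rest is."""
    present = [r[column] for r in rows if r[column] is not None]
    return {
        "missing_share": missing_share(rows, column),
        "skewness": skewness(present),
    }

report = {col: profile(rows, col) for col in ("age", "income")}
for col, stats in report.items():
    print(col, stats)
```

A strongly skewed column (here the income column, because of one very large value) is a hint to look closer before using a method that assumes normality. The numbers alone do not say why a value is missing; that still has to be asked of the people who collect and process the data.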
3. Data Preparation
- Check the variable types (factor / continuous).
- Do some data cleaning.
- Create new variables if necessary (e.g. convert a numeric variable to a flag).
- The great book R for Data Science by Garrett Grolemund and Hadley Wickham can be very helpful for data wrangling in R.
4. Modeling
- Divide the dataset into training and test subsets.
- Conduct the data analysis with proper methods.
- It is good practice to train a few models and then decide which method gives the best (most accurate) output.
5. Evaluation
- Build an error matrix to compare the error rate of every model. Then either choose the single best model or combine all models by voting (e.g. if 3 of 4 models assign an observation to a group, take that result).
6. Deployment
- A very important step. Take the results of the analysis and act on them. Write and share a report. Build a data-driven feature in your app. Include the results in your business model. If new questions come up, start the next CRISP-DM iteration.
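The split, compare and vote steps described above can be sketched like this. It is a hedged, plain-Python illustration under my own assumptions: the synthetic data, the helper names (`train_test_split`, `fit_threshold`, `error_rate`, `majority_vote`) and the very simple one-feature threshold "models" are all hypothetical stand-ins for whatever methods fit the real problem, and it uses three models with a simple majority rather than the 3-of-4 rule mentioned above.

```python
import random
from collections import Counter

def train_test_split(rows, test_share=0.25, seed=42):
    """Shuffle a copy of the rows and cut off a hold-out test subset."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_share))
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train_rows, feature):
    """'Train' a one-feature rule: split at the midpoint of the two class means."""
    mean = {}
    for cls in (0, 1):
        vals = [x[feature] for x, label in train_rows if label == cls]
        mean[cls] = sum(vals) / len(vals)
    cut = (mean[0] + mean[1]) / 2
    if mean[1] >= mean[0]:
        return lambda x: int(x[feature] > cut)
    return lambda x: int(x[feature] <= cut)

def error_rate(predict, test_rows):
    """Share of test rows where the prediction differs from the true label."""
    wrong = sum(1 for x, label in test_rows if predict(x) != label)
    return wrong / len(test_rows)

def majority_vote(models, x):
    """Return the class predicted by most of the models."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Synthetic data: three noisy features, each shifted upwards for class 1.
rng = random.Random(0)
data = []
for i in range(40):
    label = i % 2
    features = tuple(rng.gauss(label, 0.7) for _ in range(3))
    data.append((features, label))

train, test = train_test_split(data)
models = [fit_threshold(train, f) for f in range(3)]

# Error rate of every single model, measured on the test subset only.
errors = {f"feature_{f}": error_rate(m, test) for f, m in enumerate(models)}
ensemble_error = error_rate(lambda x: majority_vote(models, x), test)

print(errors)
print("voting:", ensemble_error)
```

The key design point is that the error rates are computed on the test subset, which the models never saw during training. Whether the single best model or the voting ensemble wins depends on the data, so it is worth comparing both.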
Putting it all together, this chart illustrates the whole process:
I ran a session on this topic at MeasureCamp London #9. The slides from my session are now available on SlideShare:
And some thoughts from my audience 🙂 Thanks!
Richard Fergie (@RichardFergie), September 10, 2016