Published: 31 July 2015
Why not five days? Or thirty days? What is so magical about the number ten? Well, it boils down to two work weeks, or a third of a month, and we could go on, but to be honest we just like the sound of ten. It comes across as manageable and does not require a long-term commitment, but there is a catch. Ten days will most likely represent one cycle, with each day or couple of days representing a phase, and there is no limit to the number of iterations of the cycle. You may need to repeat the cycle three times, for example, for a total of thirty days, but this depends on your business objectives and your data.
So what exactly are we training our data for? To be more obedient? Are we training for responsiveness to our many requests? The simple answer is we are training our data to ensure maximum ROI for analysis, reporting, mining and visualisation.
Day 1: Prepare for your big, bad data
Get some rest, go on vacation if you can, but ensure you are mentally and physically prepared for the task that lies ahead. Working with data depends as much on the right tools as on the right person. Ensure that your mind is clear of preconceived biases and expectations. Allow the data to direct your steps; do not try to force the data into your mould, but allow it to reveal itself to you. I know this may sound like a data voodoo exercise, but people sometimes underestimate the prep work needed to tackle their data and end up overwhelmed. You will also need to prepare the right tools to get the job done. Ensure you have a good laptop, identify BI tools, and carry out a database selection process, for example MySQL, Oracle, or Microsoft SQL Server.
Day 2: Determine rules of engagement
Here you will lay down the foundation for data discovery. This is usually driven by business user requirements, or at the very least answers the question “what is the business problem you are trying to solve?” Do you have KPIs and properly defined business metrics? You will need to ensure you have these handy and assign priorities to help guide the process and avoid wasting time on data that is of little or no value to the business. What if a financial institution wants to predict which customers will default on their loans or pose potential fraud risks? You would need access to payment history for credit cards, loans, etc. Data regarding their credit profile, spending patterns, income, age group, address and marital status would also be useful in creating a holistic view of each customer, and would add to the credibility of any model you create.
Day 3: Assemble the tools
You have already identified the data you need and the tools you will use to manipulate it. Next, you will begin loading your database and prepping for querying and analysis. This step can sometimes be the most painful and may take longer than expected because data may be dirty, inconsistently formatted, or sometimes just absolute garbage. We can try our best to cleanse data before loading, but sometimes there will be classic “GIGO” (Garbage In, Garbage Out) cases, and there is not much we can do about those except discard the data or keep it for possible future analysis.
Day 4: Solicit help
This is not to be taken lightly: a second pair of eyes can prove very useful when you have been looking at the same data for the last 24 hours. Maybe you are not savvy with SQL or analysis, but you know someone who is and who is willing to help. Get their help; it should be an easy sell, as you have already done the heavy lifting and they missed the fun part. They will thank you later.
Day 5: Deliver Quick Wins
By now you are fifty percent of the way along this data journey, and you should be well aware that you will need to segment your data and focus on specific subject areas to ensure you meet your deliverables. If we go back to our example of building a predictive model for fraud and delinquency, we already know where to focus our efforts for quick wins, and we would not spend time analysing data that is not relevant to our objective. Remember, this should be an agile, iterative process rather than an attempt to answer all the business problems in ten days; we are working to maximise stakeholder buy-in and user adoption.
Day 6: Document your data’s habits
I know this sounds weird, and it may even be grammatically incorrect, but you have to spend some time understanding the nuances of your data.
Are there NULL values? How will NULL values be treated?
Are the tables normalised?
Are the data types valid? Will you need to strip characters from what should be numeric fields?
Are products, services and customers uniquely identified? What about their attributes?
Be sure to document the good and the bad and devise a plan to handle them as best as possible. Real world data is not perfect data and this process helps to improve data consistency while giving you the opportunity to learn more about your data.
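To make this concrete, here is a minimal profiling sketch in Python that answers some of the questions above, assuming your extract arrives as CSV. The column names and sample rows are entirely hypothetical, and a real project would profile far more than three checks:

```python
import csv
from io import StringIO

# Hypothetical extract: column names and values are made up for illustration.
raw = StringIO(
    'customer_id,age,income\n'
    'C001,34,52000\n'
    'C002,,"48,500"\n'
    'C001,34,52000\n'
)
rows = list(csv.DictReader(raw))

def profile(rows, numeric_cols, key_col):
    """Document the data's habits: NULLs, invalid numerics, duplicate keys."""
    report = {"nulls": {}, "non_numeric": {}, "duplicate_keys": 0}
    seen = set()
    for row in rows:
        # Are there NULL values?
        for col, val in row.items():
            if val is None or val.strip() == "":
                report["nulls"][col] = report["nulls"].get(col, 0) + 1
        # Will you need to strip characters from what should be numeric fields?
        for col in numeric_cols:
            val = row.get(col) or ""
            if val and not val.replace(".", "", 1).isdigit():
                report["non_numeric"][col] = report["non_numeric"].get(col, 0) + 1
        # Are customers uniquely identified?
        if row[key_col] in seen:
            report["duplicate_keys"] += 1
        seen.add(row[key_col])
    return report

print(profile(rows, numeric_cols=["age", "income"], key_col="customer_id"))
# → {'nulls': {'age': 1}, 'non_numeric': {'income': 1}, 'duplicate_keys': 1}
```

A report like this becomes the written record of the good and the bad, and each flagged column feeds directly into the cleansing plan on Day 7.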
Day 7: If your data makes a mess be sure to clean up
By now you should realise that your data is not perfect and there will be a need to cleanse it. There is no way to ensure your data will be one hundred percent cleansed, and we will certainly not accomplish this in ten days; however, we can work assiduously to cleanse our data and transform it into a better version of itself. For NULL values, you may want to ignore them or replace them with a default value, for example 0 for numeric fields. Normalised tables can be combined into larger tables to avoid multiple query joins, creating a more OLAP-like (Online Analytical Processing) structure. For invalid data types, database platforms have many built-in functions that can be used to strip characters from what should be numeric or date fields, determine the length of a field, or simply check for a specific string.
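The same cleansing rules can be sketched outside the database as well. Here is a minimal Python version of the “replace NULL with a default, strip stray characters from numeric fields” approach; the default value and the character set kept are assumptions you should tune to your own data:

```python
import re

def cleanse_numeric(value, default=0.0):
    """Strip non-numeric characters from a raw field and cast to float.

    NULL-ish inputs (None or an empty string) fall back to a default,
    mirroring the 'replace NULL with 0' rule described above.
    """
    if value is None or str(value).strip() == "":
        return default
    # Keep digits, a decimal point and a minus sign only (an assumption;
    # locales that use ',' as the decimal separator need different handling).
    stripped = re.sub(r"[^0-9.\-]", "", str(value))
    try:
        return float(stripped)
    except ValueError:
        return default  # classic GIGO: nothing numeric could be recovered

print(cleanse_numeric("$1,250.75"))  # strips the currency symbol and comma
print(cleanse_numeric(None))         # NULL replaced with the default 0.0
```

In practice you would push this logic into the load or transformation step, whether that is a SQL function, an ETL tool, or a script like this one.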
Day 8: Be gentle
Let me reiterate: do not rush the process, and do not let your preconceived ideas cloud reality. We all have biases, and based on our experience we may lean towards a particular result; however, when working with data we have to be patient and allow the hidden truths to be revealed. It is important to note that a correlation between two things does not always mean that one causes the other, as a third factor can be involved, or simply random chance. Focus your efforts on understanding your data and identifying useful correlations, patterns and trends.
Day 9: Analysis, analysis and more analysis
The main goal of analysis is to discover useful information, suggest possible conclusions, and support decision-making. Training our data is the foundation to accomplishing this goal and now that we have spent the last eight days understanding and cleansing the data, we want to convert that data into information, information into knowledge, and knowledge into wisdom (DIKW Pyramid). Your brain will be bursting with information that will then need to be transformed into knowledge. You can start by asking yourself the following questions:
What do we know and understand about our data?
What patterns and trends have we detected?
Do we understand what is causing these patterns? What are the driving forces behind the trends?
How can we strategically use this information to accomplish our business objectives?
How can this knowledge solve our business problems?
How can we exploit this knowledge of the past to understand and predict the future?
Day 10: Show what new tricks your data can do
It’s the last day and now we get to have some fun. Time to show off your new tricks with stunning visualisations and statistical and predictive models. Data visualisation helps us to quickly and easily understand information presented in a pictorial or graphical format, as opposed to basic columns, rows, and text. You can use open source tools such as Dygraphs and D3.js, free trials of commercial tools like Tableau and Lumira, or use Excel and Google Charts for visualisations. Tools like R and SPSS can be used for statistical and predictive modelling, helping companies become more proactive and providing a huge competitive advantage. Over the past ten days you have developed a relationship with your data and you understand its anomalies; now you will need to decide what information you want to communicate, and develop an appreciation of how your audience understands and processes visual information, for maximum impact.
Training your data is not an overnight process, and although we are ambitiously proposing ten days, we are aware that the ten-day cycle may have to be repeated several times over. The main aim of this post is to help you develop an appreciation for what is involved and pick up some pointers on how to approach your very own big data project.
About the author: Raquel Seville [@quelzseville] is a Business Intelligence Professional, SAP Mentor, BI Evangelist, Founder: exportBI | Co-Founder: eatoutjamaica. To find out more, please visit her about me page.