Welcome to the video tutorial for the AI Starter Kit on data exploration. In this first video, we will discuss why data exploration is the first step in answering any data science question.
Before starting to use complex algorithms to solve a data science problem, you need to gain a thorough understanding of the corresponding data and the underlying business challenge you are trying to address. Data exploration is the process through which you gain such an understanding. This will eventually enable you to derive viable working hypotheses related to the problem at hand, which is useful for the further analysis and modelling of the data.
Throughout this process you will be faced with different types of variables, for example numerical or categorical, which require different types of treatment. In addition, looking at individual variables in isolation is not sufficient, as this ignores the often complex interactions between variables. Furthermore, you will need to identify which variables in your dataset are relevant and which ones degrade the quality of your analysis rather than improve it. Finally, data exploration will help you to pinpoint the data quality issues in your dataset and the actions you should take to mitigate them.
Among other things, data exploration is useful for identifying usual and unusual values of a variable and for studying how a variable evolves over time. It can also help to uncover significant relationships between variables and to assess the quality and suitability of the data.
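To make the idea of usual versus unusual values a little more concrete, here is a minimal sketch using only the Python standard library. The readings, the `summarize` helper and the choice of the 1.5 * IQR rule are all assumptions made for illustration; they are not taken from the Starter Kit, and other outlier rules may suit your data better.

```python
import statistics

# Hypothetical sensor readings; None marks a missing measurement.
readings = [20.1, 19.8, 20.5, None, 21.0, 20.3, 35.2, 19.9, 20.7, 20.2]


def summarize(values):
    """Count missing entries and flag values outside the 1.5 * IQR fences."""
    present = [v for v in values if v is not None]
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "missing": len(values) - len(present),  # a simple data quality signal
        "median": statistics.median(present),   # a robust "usual" value
        "unusual": [v for v in present if v < low or v > high],
    }


summary = summarize(readings)
print(summary)
```

In this made-up series, the summary reports one missing reading and flags 35.2 as unusual. Whether such a value is an error or a genuine event is a question for the domain expert, not for the rule itself.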
The goal behind this Starter Kit is to lay out a series of analyses that will teach you how to explore individual variables, look at how pairs of variables are related and study the complex interactions between groups of variables. You will learn how to conduct a quantitative and visual inspection of your dataset and you will be prepared for the next step in your advanced analytics workflow.
In the tutorial of this AI Starter Kit, we follow a systematic approach to explore a dataset, based on the number of variables under consideration. We start by exploring individual variables in isolation, which is the simplest form of analysis and is called univariate analysis. We continue by studying pairs of variables, which is called bivariate analysis. And finally, we study the relationships between several variables, called multivariate analysis, before we draw conclusions from our findings. But before diving into this, in the next video, we will first introduce you to the dataset that we will use in this analysis and perform some basic preprocessing on it.
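The three levels of analysis described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the column names, the values and the `pearson` helper are hypothetical and do not come from the dataset used in this Starter Kit, and a pairwise correlation table is just one simple first look at several variables together.

```python
import statistics
from collections import Counter
from itertools import combinations

# Hypothetical dataset: three numerical variables and one categorical variable.
columns = {
    "temperature": [5.0, 8.0, 12.0, 18.0, 24.0, 27.0],
    "heating_hours": [10.0, 8.0, 6.0, 3.0, 1.0, 0.0],
    "humidity": [80.0, 75.0, 70.0, 60.0, 55.0, 50.0],
}
season = ["winter", "winter", "spring", "spring", "summer", "summer"]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Univariate: each variable on its own, treated according to its type.
numeric_summary = {
    name: (statistics.fmean(vals), statistics.stdev(vals))
    for name, vals in columns.items()
}
season_counts = Counter(season)

# Bivariate: how one pair of variables moves together.
r_temp_heating = pearson(columns["temperature"], columns["heating_hours"])

# Multivariate (first look only): correlations for every pair of variables.
correlations = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}

print(numeric_summary)
print(season_counts)
print(round(r_temp_heating, 3))
print(correlations)
```

Note that the numerical variables are summarized with a mean and standard deviation, while the categorical variable is summarized with counts, which reflects the different treatment that different variable types require.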
Authors: EluciDATA Lab