EDA for Time Series
Exploratory data analysis (EDA) works differently for time series. This blog post aims to guide anyone who wants to gain insight into their data before performing time series analysis.
Data Exploration and Data Cleaning
To guide you through the process, I will be using an Excel file of data from a superstore.
First, the data needs to be loaded.
Now it is time to explore the data. We will do that using info() and isnull().sum().
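A minimal sketch of the loading step with pandas. The file name is a placeholder (the real one is not given here), and read_excel needs an Excel engine such as openpyxl or xlrd installed.

```python
from pathlib import Path

import pandas as pd

# Placeholder path: replace it with the location of your own Excel file.
DATA_PATH = Path("superstore_sales.xlsx")


def load_sales(path):
    """Read the first sheet of an Excel workbook into a DataFrame."""
    return pd.read_excel(path)


if DATA_PATH.exists():
    df = load_sales(DATA_PATH)
    print(df.head())
else:
    print(f"{DATA_PATH} not found; point DATA_PATH at your own file")
```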
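As a rough illustration, this is what those two calls look like on a tiny stand-in dataframe. The rows and the column names (Order Date, Category, Sales) are made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for the superstore data; values are invented.
df = pd.DataFrame({
    "Order Date": pd.to_datetime(["2014-01-06", "2014-03-14", "2015-11-02", "2016-12-20"]),
    "Category": ["Furniture", "Technology", "Office Supplies", "Furniture"],
    "Sales": [261.96, 731.94, 14.62, 957.58],
})

# Column names, dtypes and non-null counts.
df.info()

# Number of missing values per column.
missing = df.isnull().sum()
print(missing)
```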
There isn’t any missing data in the dataframe, which means we can skip handling missing values. For time series you only need the date and the column you want to explore. In our case, we will use the category column.
Grouping the data by category, we can see that there are three categories. I will be using furniture.
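A small sketch of the grouping step, again on invented rows. The column names and category labels are assumptions about how the file is laid out.

```python
import pandas as pd

# Hypothetical stand-in rows; names and values are invented.
df = pd.DataFrame({
    "Category": ["Furniture", "Technology", "Office Supplies", "Furniture", "Technology"],
    "Sales": [261.96, 731.94, 14.62, 957.58, 22.37],
})

# Number of rows and total sales for each category.
summary = df.groupby("Category")["Sales"].agg(["count", "sum"])
print(summary)
```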
Furniture
First, we need to create a new dataframe with only the rows that have furniture as their category.
We now need to check how much data we have. This should be measured in terms of time, for example 1 or 2 years.
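Filtering can be sketched with a boolean mask. The label "Furniture" and the column names are assumptions, and the rows are invented.

```python
import pandas as pd

# Hypothetical stand-in rows.
df = pd.DataFrame({
    "Order Date": pd.to_datetime(["2014-01-06", "2014-03-14", "2015-11-02", "2016-12-20"]),
    "Category": ["Furniture", "Technology", "Office Supplies", "Furniture"],
    "Sales": [261.96, 731.94, 14.62, 957.58],
})

# Keep only the furniture rows; .copy() gives an independent dataframe
# so that later edits do not touch the original.
furniture = df.loc[df["Category"] == "Furniture"].copy()
print(furniture)
```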
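One way to measure the time span is to compare the earliest and latest dates. The dates below are invented and the column name is an assumption.

```python
import pandas as pd

# Hypothetical furniture rows with invented dates.
furniture = pd.DataFrame({
    "Order Date": pd.to_datetime(["2014-01-06", "2015-06-12", "2016-09-30", "2017-12-30"]),
    "Sales": [261.96, 48.86, 957.58, 323.14],
})

first = furniture["Order Date"].min()
last = furniture["Order Date"].max()

# Approximate span in years (365.25 days per year on average).
span_years = (last - first).days / 365.25
print(first.date(), last.date(), round(span_years, 1))
```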
Here we can clearly see that we have 4 years of data available.
As stated before, for time series we don’t need any more data than the date and the column we want to explore, in this case Sales.
Let’s remove all the other columns from the dataframe:
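A sketch of this step: selecting the two columns to keep is often simpler than listing every column to drop. The extra columns here are invented for the example.

```python
import pandas as pd

# Hypothetical furniture rows with a couple of extra columns.
furniture = pd.DataFrame({
    "Order Date": pd.to_datetime(["2016-12-20", "2014-01-06"]),
    "Category": ["Furniture", "Furniture"],
    "Quantity": [2, 3],
    "Sales": [957.58, 261.96],
})

# Keep only the date and the column to explore.
furniture = furniture[["Order Date", "Sales"]]
print(furniture.head())
```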
Checking the first five rows of the dataframe, we can see that the index is all over the place. We need to fix this.
This step is not strictly necessary, but getting into the habit of keeping your data organized throughout any project is a good skill.
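One way to tidy the index is to sort by date and rebuild a clean 0, 1, 2, ... index. This is a sketch, and the exact fix used for this dataset is an assumption.

```python
import pandas as pd

# Hypothetical rows whose index is out of order after filtering.
furniture = pd.DataFrame(
    {
        "Order Date": pd.to_datetime(["2016-12-20", "2014-01-06", "2015-06-12"]),
        "Sales": [957.58, 261.96, 48.86],
    },
    index=[7474, 0, 3512],
)

# Sort chronologically, then discard the old index labels.
furniture = furniture.sort_values("Order Date").reset_index(drop=True)
print(furniture.head())
```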
Indexing for Time Series
Time series analysis requires the index to be a date. In our example, we will use Order Date as the index.
My index is daily. However, I want to use months instead of days, as this will give me a better understanding of how sales vary within the year.
Let’s check that this worked by printing all the data for the year 2015.
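Setting the index can be sketched as follows. The to_datetime call is a precaution in case the dates were read in as text, and the rows are invented.

```python
import pandas as pd

# Hypothetical rows with the dates stored as strings.
furniture = pd.DataFrame({
    "Order Date": ["2014-01-06", "2014-03-14", "2014-02-02"],
    "Sales": [261.96, 48.86, 957.58],
})

# Make sure the column holds real datetimes, then use it as the index.
furniture["Order Date"] = pd.to_datetime(furniture["Order Date"])
furniture = furniture.set_index("Order Date").sort_index()
print(furniture.index)
```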
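Moving from daily to monthly data can be done with resample. Whether to sum or average within each month is a modelling choice; this sketch assumes monthly totals, and the numbers are invented.

```python
import pandas as pd

# Hypothetical daily sales indexed by order date.
daily = pd.DataFrame(
    {"Sales": [100.0, 50.0, 30.0, 20.0]},
    index=pd.to_datetime(["2015-01-05", "2015-01-20", "2015-02-10", "2015-04-01"]),
)
daily.index.name = "Order Date"

# "MS" labels each bin with the first day of its month.
# Swap .sum() for .mean() to get the average order value per month instead.
monthly = daily["Sales"].resample("MS").sum()
print(monthly)
```

Months with no orders (March in this toy data) show up as 0 when summing, so it is worth checking for them before modelling.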
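With a date index, a whole year can be selected by passing the year as a string to .loc. The series below is invented.

```python
import pandas as pd

# Hypothetical monthly series covering 2014 and 2015.
idx = pd.date_range("2014-01-01", periods=24, freq="MS")
y = pd.Series(range(24), index=idx, name="Sales", dtype="float64")

# Partial string indexing: every row whose date falls in 2015.
sales_2015 = y.loc["2015"]
print(sales_2015)
```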
Now that the dataframe is ready for time series, we are going to dive into the EDA.
EDA for Time Series
While working on different projects, I realized that I kept using the same pieces of code, so I created some functions that produce the plots I need.
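Reusable plotting helpers could look something like the sketch below. The function names and the choice of plots are hypothetical, the data is invented, and matplotlib is assumed to be installed.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_series(y, title="Monthly sales"):
    """Line plot of a date-indexed Series; returns the Axes for further tweaks."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(y.index, y.values, label=y.name or "value")
    ax.set_title(title)
    ax.set_xlabel("Date")
    ax.set_ylabel(y.name or "value")
    return ax


def plot_with_rolling_mean(y, window=12):
    """Same plot with a rolling mean on top, which helps to separate trend from noise."""
    ax = plot_series(y, title=f"Sales with {window}-month rolling mean")
    ax.plot(y.index, y.rolling(window).mean().values, label="rolling mean")
    ax.legend()
    return ax


# Invented monthly data with higher values in November and December.
idx = pd.date_range("2014-01-01", periods=36, freq="MS")
y = pd.Series([100.0 + 50.0 * (m >= 11) for m in idx.month], index=idx, name="Sales")

ax = plot_with_rolling_mean(y, window=3)
```

In a script, plt.show() displays the figure; in a notebook it appears automatically.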
Calling the above functions, we get the following:
Here we can see that there is some kind of seasonality towards the end of the year. Seasonality refers to cycles that repeat regularly over time.
Let’s check this further by looking at sales by year.
Here we can see that during 2014, 2016 and 2017 there is an increase in furniture sales towards the end of the year.
Using the heatmap, we can also see that in months 10 and 11 there is a massive increase in sales.
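To compare years side by side, the monthly series can be reshaped so that each year becomes a column. A sketch with invented numbers:

```python
import pandas as pd

# Hypothetical monthly series covering 2014 and 2015.
idx = pd.date_range("2014-01-01", periods=24, freq="MS")
y = pd.Series(range(24), index=idx, name="Sales", dtype="float64")

# Rows are months 1-12, columns are years.
by_year = (
    pd.DataFrame({"year": y.index.year, "month": y.index.month, "Sales": y.values})
    .pivot(index="month", columns="year", values="Sales")
)
print(by_year)
```

With matplotlib installed, by_year.plot() draws one line per year, which makes repeated end-of-year peaks easy to spot.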
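A year-by-month heatmap can be sketched with plain matplotlib. Using imshow is a choice made for this example (seaborn's heatmap is a common alternative), and the values are invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented monthly series: values simply grow with the month number.
idx = pd.date_range("2014-01-01", periods=48, freq="MS")
y = pd.Series([100.0 + 10 * m for m in idx.month], index=idx, name="Sales")

# Rows are years, columns are months.
grid = (
    pd.DataFrame({"year": y.index.year, "month": y.index.month, "Sales": y.values})
    .pivot(index="year", columns="month", values="Sales")
)

fig, ax = plt.subplots(figsize=(10, 3))
image = ax.imshow(grid.values, aspect="auto", cmap="viridis")
ax.set_xticks(range(grid.shape[1]))
ax.set_xticklabels([str(m) for m in grid.columns])
ax.set_yticks(range(grid.shape[0]))
ax.set_yticklabels([str(year) for year in grid.index])
ax.set_xlabel("Month")
ax.set_ylabel("Year")
fig.colorbar(image, ax=ax, label="Sales")
```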
Conclusion
EDA for time series is relatively short. However, preparing the data can be hard, depending on how the data you have been given is shaped. Using pre-prepared functions can reduce the amount of time you spend writing code if you need to do time series modelling on more than one category (as could be done in this example).