Linear regression is a basic and commonly used type of predictive analysis. Linear Regression is a method to help us understand the relationship between two (Simple Linear Regression) or more variables (Multiple Linear Regression).
In this article, I am going to explore the data from Iowa Retail Liquor Sales which is available to access on Google BigQuery public datasets. BigQuery is a fully-managed, serverless data warehouse that enables scalable analysis over petabytes of data. It is a Platform as a Service that supports querying using ANSI SQL.
This is the query that I use to retrieve the dataset of Iowa Retail Liquor Sales 2012–2020 from Google BigQuery public datasets.
SELECT *, EXTRACT(ISOYEAR FROM iowa_liquor.date) as Year,
FROM `bigquery-public-data.iowa_liquor_sales.sales` AS iowa_liquor
WHERE EXTRACT(ISOYEAR FROM iowa_liquor.date) = 2020
AND (iowa_liquor.store_location IS NULL) = FALSE
AND (iowa_liquor.category IS NULL) = FALSE
Before we go on to analyze data, we have to understand the business in order for us to understand what is included in the dataset.
1. Business & Data Understanding
The Iowa Alcoholic Beverages Division is the alcoholic beverage control authority for the U.S. state of Iowa. Since March 8, 1934, it has regulated the traffic in, and maintained a monopoly on the wholesaling of, alcoholic beverages in the state, thus making Iowa an alcoholic beverage control state.
Alcoholic beverage control states generally called control states, are 17 states in the United States that, as of 2016, have state monopoly over the wholesaling or retailing of some or all categories of alcoholic beverages, such as beer, wine, and distilled spirits.
1.1 Iowa Liquor Retail Sales Dataset
This dataset contains every wholesale purchase of liquor in the State of Iowa by retailers for sale to individuals since January 1, 2012. The State of Iowa controls the wholesale distribution of liquor intended for retail sale, which means this dataset offers a complete view of retail liquor sales in the entire state. The dataset contains every wholesale order of liquor by all grocery stores, liquor stores, convenience stores, etc., with details about the store and location, the exact liquor brand and size, and the number of bottles ordered.
2. Exploratory Data Analysis
In this section, I am going to explore the dataset to collect as many insights as possible that can be used to improve the performance in the following year. For example, collecting information regarding which brand is the most popular, that allows the business to strategize or plan based on that insight for the following year.
2.1 Iowa Liquor Retail Sales per Year
Based on this line chart, it looks like the amount of liquor sold is increasing every year with the highest incline between 2016 to 2017. However, it starts to decline a little from 2019 to 2020. As seen on this graph, the amount of liquor sold in 2020 is lower than in previous years, it is due to the fact that there was a global pandemic around 2020 and still is up to this day.
2.2 Iowa Liquor Retail Monthly Sales from 2012–2020
Based on this multi-line chart, we found out that in October and December every year liquor sales are above 250000 USD, with only exceptions in October 2014, October 2016, and December 2016 where liquor sales are below 250000 USD.
The reason behind the huge amount of liquor sales occurred in October and December is because there are multiple holidays and festivals in the US during these months. One of the most popular in October is Halloween, but there is also National Hispanic Heritage Month, Columbus Day, and Indigenous People Day Celebration. In December, there is Christmas Day and also New Year Celebration.
2.3 Top 5 Types of Liquor Sold per Year
I have created 3 plots to visualize the top 5 types of liquor sold per year, but because of the size of the plot, and I don’t want to make this post to be very long, I will display the third plot which is the 2018–2020 top 5 types of liquor sold per year. The most common type of liquor sold from 2012 to 2017 are as follows:
- Straight Rye Whiskies (2012)
- Irish Whiskies (2013)
- Irish Whiskies (2014)
- Irish Whiskies (2015)
- Tequila (2016)
- Spiced Rum (2017)
For 4 years straight, from 2012 to 2015, whiskies were the most popular in Iowa and followed by Tequila in 2016. Then from 2017 onwards, both White Rum and Spiced Rum dominates the market in Iowa with the highest sales went above 1 million USD in 2017. From these insights, we can see that the demand in the market was shifting from Whisky to Rum.
2.4 Top 3 Brands From Most Common Types of Liquor Sold per Year
Same as the previous one (section 2.3), I also made 3 plots, but due to the size of it, I will only visualize the plot from 2018 to 2020.
- Straight Rye Whiskey (2012)
1st: Templeton Rye
2nd: Windmill Rye Whiskey
3rd: Cody Road Rye
- Irish Whiskey (2013 to 2015)
2nd: Cedar Ridge Single Malt Whiskey
3rd: Tullamore Dew Irish Whiskey
- Tequila (2016)
1st: Patron Tequila Silver
2nd: Jose Cuervo Especial Reposado Tequila
3rd: Juarez Tequila Gold
- Spiced Rum (2017)
1st: Captain Morgan Spiced Rum
2nd: Captain Morgan Spiced Barrel
3rd: Captain Morgan Original Spiced
In summary, it is certain that from each type of liquor, no matter what year it is, the brand is always the same. For example, the top brand for Irish Whiskey every year is always Jameson, Spiced Rum is always Captain Morgan, and for White Rum it appears that Bacardi is...Superior. (Pun intended.)
2.5 Top 100 Big Spenders in Iowa in 2020
The Capital city for Iowa state is Des Moines, and it is also a metropolitan city. Based on this information alone, I have no doubt that the huge amount of sales occurred in this county, but I am still curious whether that statement is true or not. So, I will use 100 rows of data from the ‘store_location’ column in the dataset to pinpoint the location of each store that has the most amount of transaction value.
To help me with this analysis, I am using an API called
Nominatim is an API that uses data to find locations on Earth by name and address. It can also do the reverse, find an address for any location on the planet. This method is called Geocoding.
As we can see based on the Folium map, Des Moines has the most frequent purchase in the top 100 sales data, so that also means the statement is true. The other cities that have a lot of purchases are Sioux City, Iowa City, Omaha, and Davenport
2.6 Correlation (Heatmap)
In the Iowa Retail Liquor Sales dataset, there is a column called ‘bottle_volume_ml’ which contains information regarding the volume of each bottle. This column has a categorical value ranging from 50 to 1750 ml. Before I create a Heatmap, I have to check what type of bottle volume is commonly used.
# Python code to find type of bottle volume commonly used
iowa_sales_df['bottle_volume_ml'].value_counts()[0:10] # Top 10
The output of the code above shows that the 3 most commonly used bottle volumes are 750ml, 1000ml, and 1750ml. I am going to include these 3 variables in the heatmap.
To find the correlation between each column, I am going to use the Pearson method.
# Create corr variable to save correlation of each column results
corr = iowa_sales_df[iowa_sales_df.columns[2:]].corr(method='pearson')
Based on the correlation value generated by the heatmap, we can see that in sale_dollars columns, the lowest value is 0.54 which is the correlation between ‘sale_dollars’ and ‘other_vol_sale_dollars’. I am going to use this dataset to build a Machine Learning — Linear Regression Model to predict sales in 2020.
3. Machine Learning — Linear Regression
To build a Machine Learning model, the dataset should be split into two parts which are the train & test set. The train set will be used for training the Machine Learning model, while the test set will be used to calculate the accuracy of the previously trained model. Our goal is to predict sales, so that means our target/dependent variable is the ‘sale_dollars’ column, while the other columns are our independent variables. Below is the python code to build the Linear Regression model.
The Linear Regression model predicts that in 2020 the liquor sales amount is USD 2,626,510.06, which is slightly higher than the real value (USD 2,588,159.34).
The Linear Regression model that we’ve build has a variance score of 98%. In other words, this machine learning model is able to predict the ‘sale_dollars’ based on variables from bottle volume sales of 750ml, 1000ml, and 1750ml. Based on this result, I am pretty sure that this model also able to predict the sales amount in 2021 by only using 750ml, 1000ml, and 1750ml bottle volume sales.
The reason why this model has such a high score is that the dataset is clean, meaning there is no noise (randomness) or even outlier contained in the dataset. Noise and outlier can affect the score and accuracy of the model. Also, the accuracy (variance score) will be lower when it is applied to the population. This data is only a sample that represents the whole records of Iowa Retail Liquor Sales data.
It is more reliable to predict sales based on bottle volumes rather than types of liquor (e.g. Vodka, Whiskey, Rum, etc.). Because, based on our findings in section 2.3 Top 5 Types of Liquor Sold per Year, it is clear that the most common types of liquor sold are changing every year.
Unfortunately in this dataset, invoice and item number are combined to create a unique ID. So that means, each row in the dataset has its own unique ID. There is no information in this dataset that only includes an invoice number. I would love to try and build Machine Learning — Recommender System to find paired items or bundles that people like to buy. For example, whenever a customer purchased a bottle of vodka, he/she also bought a pack of beer.
Thank you for reading this article!