What is Data Science?
Data Science is the process and method for extracting knowledge and insights from large volumes of disparate data. It’s an interdisciplinary field involving mathematics, statistical analysis, data visualization, machine learning, and more.
Clustering in Data Science
Clustering is an unsupervised machine learning technique used to group similar data points together and discover underlying patterns. Among clustering algorithms, k-means is one of the most popular and easiest to use.
A cluster is a group of objects that are similar to the other objects in the same cluster and dissimilar to the data points in other clusters. Clustering algorithms are commonly divided into three types:
- Partition-based Clustering. (e.g. K-Means, K-Median, Fuzzy C-Means)
- Hierarchical Clustering. (e.g. Agglomerative, Divisive)
- Density-based Clustering. (e.g. DBSCAN)
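To make the three families concrete, here is a minimal sketch that runs one algorithm from each family on the same toy data. The data points and the parameter values (n_clusters, eps, min_samples) are made up for illustration and are not taken from the Bali dataset used later.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Toy data (made up for illustration): two tight groups of 2-D points.
X = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [5.0, 5.0], [5.1, 4.9], [4.8, 5.2],
])

# Partition-based: the number of clusters is chosen up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative): merges the closest groups until 2 remain.
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Density-based: groups points that have enough neighbours within eps.
dbscan_labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)

print(kmeans_labels, agglo_labels, dbscan_labels)
```

On data this cleanly separated, all three approaches should agree on the grouping. They tend to differ on noisy or irregularly shaped data.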
Nowadays, people use clustering algorithms to improve their business in many different fields. For example, they build recommender systems that suggest products or services to customers, or models that detect fraud by finding specific or irregular patterns.
In this article, I am going to walk through an example of applying one of these algorithms, K-Means Clustering, to a business idea. I am using Python to demonstrate the steps and show the results.
Introduction & Business Idea
Indonesia is a country in Southeast Asia and Oceania, between the Indian and Pacific oceans. It consists of more than seventeen thousand islands, including Sumatra, Java, Sulawesi, and parts of Borneo (Kalimantan) and New Guinea (Papua). Indonesia is the world’s largest island country and the 14th-largest country by land area, at 1,904,569 square kilometres (735,358 square miles).
Indonesia is centrally-located along ancient trading routes between the Far East, South Asia and the Middle East, resulting in many cultural practices being strongly influenced by a multitude of religions. This is why Indonesia is rich in culture.
Indonesian cuisine is one of the most diverse, vibrant, and colourful in the world, full of intense flavour. Many regional cuisines exist, often based upon indigenous culture and foreign influences such as Chinese, European, Middle Eastern, and Indian precedents. Rice is the leading staple food and is served with side dishes of meat and vegetables. Spices (notably chilli), coconut milk, fish and chicken are fundamental ingredients. Some popular dishes such as nasi goreng, gado-gado, sate, and soto are ubiquitous and considered as national dishes.
2. Business Idea / Business Understanding
The business idea is to build an Indonesian restaurant that serves all kinds of Indonesian dishes. The aim is to introduce Indonesian culture to the world, using Indonesian cuisine as the medium. For this business to be successful and in line with that aim, we have to find a location that allows our restaurant to gain recognition and exposure among foreign visitors to the country.
Before we go further into the Data Preparation section, we have to know which city in Indonesia receives the most foreign visitors in total. To do that, we have to find information on foreign visitors to Indonesia by point of entry.
I found this data set on the website provided by Statistics Indonesia.
As shown above, we can safely say that the main destination for tourists is Bali. To continue the discussion from our Business Idea section: for the business to thrive, it needs to be recognized. The solution is to find a strategic location in Bali to build our restaurant, one surrounded by multiple venues such as hotels, villas, other restaurants, etc.
In this Data Preparation section, my goal is to collect all the information regarding:
1. Cities/Regencies in Bali province.
2. Sub-districts from every city/regency.
3. Coordinates (latitude, longitude) from each sub-district.
4. Venue categories in 500m radius from each sub-district coordinate.
For goals number 1 & 2, I can gather the information from the same website provided by Statistics Indonesia.
Next, to collect the coordinates of each sub-district, I am going to use geopy to access the Nominatim geocoder API.
# Import Libraries
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Using Nominatim geocoder API
geolocator = Nominatim(user_agent='Chris_P_Bacon_') # Crispy Bacon!!
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
bali_df['Location'] = bali_df['Sub-District'].apply(geocode)
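The geocoder returns one result object per sub-district, or None when a name cannot be resolved, so the latitude and longitude still have to be pulled out into their own columns. The sketch below shows one way to do that. FakeLocation is a hypothetical stand-in that lets the example run offline, split_coordinates is an illustrative helper name, and the coordinates are made-up values rather than real geocoder output.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class FakeLocation:
    """Hypothetical stand-in for a geocoder result, so the sketch runs offline."""
    latitude: float
    longitude: float


def split_coordinates(df: pd.DataFrame) -> pd.DataFrame:
    """Add 'Latitude'/'Longitude' columns and drop rows that were not geocoded."""
    out = df.copy()
    nan = float("nan")
    # getattr falls back to NaN when the geocoder returned None for a row.
    out["Latitude"] = [getattr(loc, "latitude", nan) for loc in out["Location"]]
    out["Longitude"] = [getattr(loc, "longitude", nan) for loc in out["Location"]]
    return out.dropna(subset=["Latitude", "Longitude"]).reset_index(drop=True)


# Made-up rows: illustrative coordinates, and a second lookup that "failed".
bali_df = pd.DataFrame({
    "Sub-District": ["Banjar", "Unknown Place"],
    "Location": [FakeLocation(-8.19, 114.97), None],
})
bali_coords = split_coordinates(bali_df)
print(bali_coords[["Sub-District", "Latitude", "Longitude"]])
```

Dropping the rows that failed to geocode is only one option. Depending on how many there are, looking them up manually may be preferable.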
The geocoded coordinates can then be plotted on a map.
Once we’ve collected the coordinates, we can use them to collect information about the venues located within a 500 m radius of each sub-district.
Foursquare is a good fit for collecting information about venue locations. It lets users search for restaurants, bars, shops and other places in a location. The app displays personalized recommendations based on factors that include the time of day, a user’s check-in history, their “Tastes” and their venue ratings. I am going to create a function that uses the Foursquare API to collect the venues, along with their categories and coordinates, around the latitude and longitude of each sub-district.
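A venue service would normally apply the radius on its side, so the following is not the Foursquare function itself. It is only a small sketch of what "within 500 m of a sub-district coordinate" means, using the haversine great-circle distance. The function names, the centre point, and the venues are all made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def venues_within(center, venues, radius_m=500):
    """Keep the venues whose coordinates lie within radius_m of center."""
    lat0, lon0 = center
    return [v for v in venues
            if haversine_m(lat0, lon0, v["lat"], v["lon"]) <= radius_m]


# Made-up sub-district centre and venues, for illustration only.
center = (-8.6500, 115.2167)
venues = [
    {"name": "Venue A", "category": "Hotel", "lat": -8.6470, "lon": 115.2167},       # roughly 330 m north
    {"name": "Venue B", "category": "Restaurant", "lat": -8.6400, "lon": 115.2167},  # roughly 1.1 km north
]
nearby = venues_within(center, venues)
print([v["name"] for v in nearby])
```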
Now the dataset is complete and has all the information required for the next section, which is Exploratory Data Analysis.
Exploratory Data Analysis
In this exploratory data analysis section, I want to explore the dataset to see the frequency of occurrence of each venue category and find the 1st to 3rd most common venues in each sub-district. The ‘Venue Category’ column in our dataset contains information that is useful for our Machine Learning model. Unfortunately, the column’s datatype is string, which can’t be used directly in the clustering algorithm.
There are two common methods for working with categorical data: Label/Integer Encoding and One Hot Encoding (you can find further explanation regarding these two methods in this article). In our case, to avoid introducing an artificial ordering between categories, I convert the category column from string to binary (0/1) columns using One Hot Encoding.
This is the .head() of the dataframe after applying the one hot encoding method.
For example, if one or more ATMs are located within a 500 m radius of Banjar, the value in row ‘Banjar’, column ‘ATM’ is 1 instead of 0.
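As a rough sketch of the encoding step, the snippet below contrasts integer encoding with one-hot encoding on a tiny table. It assumes pandas' get_dummies is used for the one-hot step, which the article does not confirm. The bali_venues name and its rows are made up for illustration, with one row per venue.

```python
import pandas as pd

# Made-up venue rows: one row per venue found near a sub-district.
bali_venues = pd.DataFrame({
    "Sub-District": ["Banjar", "Banjar", "Kuta"],
    "Venue Category": ["ATM", "Hotel", "Hotel"],
})

# Integer encoding would impose an order (ATM=0 < Hotel=1) that has no meaning.
integer_codes, categories = pd.factorize(bali_venues["Venue Category"])

# One-hot encoding: one 0/1 column per category instead.
bali_onehot = pd.get_dummies(bali_venues["Venue Category"], dtype=int)
bali_onehot.insert(0, "Sub-District", bali_venues["Sub-District"])
print(bali_onehot)
```

Each row ends up with a single 1 in the column of its own category, so no category is treated as "larger" than another.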
After the venue categories have been converted, we can explore the dataset further and find the frequency of occurrence of each category by grouping the rows by sub-district and applying the .mean() function to our dataframe. This computes, for each sub-district, the share of its venues that fall into each category. For example, suppose Banjar has 3 venues within the radius: McDonalds, KFC, and a hotel. McDonalds and KFC are categorized as Fast Food Restaurant, so the frequency of occurrence would be 0.67 (2/3) for Fast Food Restaurant and 0.33 (1/3) for Hotel.
# Groupby 'Sub-District'
bali_grouped = bali_onehot.groupby('Sub-District').mean().reset_index()
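The Banjar example can be checked with a few lines. The three rows below are the hypothetical venues from the example (McDonalds, KFC, and a hotel), already one-hot encoded.

```python
import pandas as pd

# One row per venue around Banjar, already one-hot encoded (illustrative data).
bali_onehot = pd.DataFrame({
    "Sub-District": ["Banjar", "Banjar", "Banjar"],
    "Fast Food Restaurant": [1, 1, 0],  # McDonalds, KFC
    "Hotel": [0, 0, 1],                 # the hotel
})

# The mean of a 0/1 column is the share of venues in that category.
bali_grouped = bali_onehot.groupby("Sub-District").mean().reset_index()
print(bali_grouped.round(2))
```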
To make the frequencies easier to review, I created new columns that store the 1st, 2nd, and 3rd most common venues in each sub-district. These columns will be useful later, once the K-Means clustering algorithm has assigned cluster labels.
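One possible way to build those columns is to sort each sub-district's frequencies and keep the top three category names. This is a sketch rather than the exact code behind the article. The column names, the most_common_venues helper, and the frequency values are assumptions made for illustration.

```python
import pandas as pd

# Made-up frequency table in the same shape as bali_grouped.
bali_grouped = pd.DataFrame({
    "Sub-District": ["Banjar", "Kuta"],
    "ATM": [0.2, 0.1],
    "Farm": [0.0, 0.0],
    "Fast Food Restaurant": [0.5, 0.3],
    "Hotel": [0.3, 0.6],
})


def most_common_venues(row, top_n=3):
    """Return the top_n category names of one row, most frequent first."""
    freqs = row.drop("Sub-District").astype(float)
    return freqs.sort_values(ascending=False).index[:top_n].tolist()


columns = ["1st Most Common Venue", "2nd Most Common Venue", "3rd Most Common Venue"]
rows = []
for _, row in bali_grouped.iterrows():
    rows.append([row["Sub-District"], *most_common_venues(row)])

bali_top_venues = pd.DataFrame(rows, columns=["Sub-District", *columns])
print(bali_top_venues)
```

When two categories have the same frequency, their relative order in this sketch is arbitrary.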
Now we have collected all the important information that will be used to build the ML clustering model.
Machine Learning — Clustering Model (K-Means Clustering)
The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares (see here). This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a large range of application areas in many different fields.
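To make the inertia criterion concrete, the sketch below computes it by hand: for each point, take the squared distance to its nearest centroid, then sum over all points. The points and centroids are made up for illustration.

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
    return total


# Four made-up points forming two pairs that are far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_cluster = inertia(points, [(5.0, 1.0)])                # a single centroid in the middle
two_clusters = inertia(points, [(0.0, 1.0), (10.0, 1.0)])  # one centroid per pair
print(one_cluster, two_clusters)  # prints 104.0 4.0
```

Giving each pair its own centroid reduces the inertia sharply, which is the behaviour K-Means tries to exploit when it places centroids.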
In an ML clustering model, the first step is to determine the optimal number of clusters (K) that the model will generate. One method for finding it is the elbow method. It consists of plotting the explained variation (here, the inertia) as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.
from sklearn.cluster import KMeans

# Initiate object for clustering dataset
bali_clustering = bali_grouped.drop(columns='Sub-District')

# Build clustering model to find the appropriate number of K
K = range(1, 6)
inertias = []  # Empty list for inertia
for k in K:
    # Building and fitting the model
    KMeansModel = KMeans(n_clusters=k, random_state=4)
    KMeansModel.fit(bali_clustering)
    inertias.append(KMeansModel.inertia_)
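The elbow can also be read from the numbers rather than from a chart. The sketch below runs the same kind of loop on synthetic data with four well-separated groups and prints how much the inertia drops with each additional cluster. The data, the range of K, and the n_init value are assumptions for illustration, not the Bali dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): four well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

K = range(1, 7)
inertias = []
for k in K:
    model = KMeans(n_clusters=k, n_init=10, random_state=4).fit(X)
    inertias.append(model.inertia_)

# Relative improvement gained by each additional cluster.
for k, prev, curr in zip(list(K)[1:], inertias, inertias[1:]):
    print(f"K={k}: inertia {curr:.1f} (drop of {(prev - curr) / prev:.0%})")
```

On this synthetic data the drop should be large up to K = 4 and small afterwards, which is the elbow. Real data rarely gives such a sharp bend, so the choice of K usually involves some judgment.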
Elbow Method using Inertia
As shown above, the elbow of the curve suggests that the optimal number of clusters (K) is 4. Now I am going to build the clustering model with K = 4 and apply the resulting cluster labels to each sub-district in the dataframe. Below is the result of the clustering model displayed on a Folium map, with sub-district markers superimposed on top.
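Fitting the final model and attaching its labels could look roughly like the sketch below. The frequency table is made up, and the 'Cluster Labels' column name and the bali_clustered variable are assumptions for illustration. Only K = 4 and random_state = 4 come from the article.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up frequency table standing in for bali_grouped.
bali_grouped = pd.DataFrame({
    "Sub-District": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "Convenience Store": [0.9, 0.8, 0.1, 0.0, 0.0, 0.1, 0.0, 0.1],
    "Restaurant":        [0.1, 0.2, 0.9, 0.8, 0.0, 0.1, 0.1, 0.0],
    "Mountain":          [0.0, 0.0, 0.0, 0.1, 0.9, 0.8, 0.0, 0.1],
    "Farm":              [0.0, 0.0, 0.0, 0.1, 0.1, 0.0, 0.9, 0.8],
})
bali_clustering = bali_grouped.drop(columns="Sub-District")

# Fit K-Means with the number of clusters chosen from the elbow chart.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=4).fit(bali_clustering)

# Attach one cluster label per sub-district.
bali_clustered = bali_grouped.copy()
bali_clustered.insert(1, "Cluster Labels", kmeans.labels_)
print(bali_clustered[["Sub-District", "Cluster Labels"]])
```

The label numbers themselves are arbitrary. Names such as ‘Urban’ or ‘Rural’ only come from inspecting which venue categories dominate each cluster.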
The markers on the map do not form geographic clusters because the clustering model was built from the frequency of occurrence of venue categories in each sub-district, not from the coordinates (latitude, longitude) of each sub-district.
Let’s examine venue categories from each cluster label.
Comments on cluster 0 labeled as ‘Suburb’.
- Bali is a main tourist destination in Indonesia, so it is common to see lots of venues even in suburban areas.
- This cluster is distinguishable by the number of sub-districts where convenience stores are the most common venue. It is also common for suburban and rural areas across Indonesia, not just in Bali, to have lots of convenience stores.
Comments on cluster 1 labeled as ‘Urban’.
- More sub-districts are labeled as urban than as any other cluster because of the number of different venue categories they contain.
- Based on my observation of the Folium map result, cluster 1 mostly consists of restaurants and some other types of venue categories. There are also many suburban areas labeled as cluster 1 (e.g. look at the 5 sub-districts near ‘Gilimanuk’ on the top left side of the map), because restaurants exist even in suburban areas.
Comments on cluster 2 labeled as ‘Recreation Site’.
- Bali is a small island compared to other islands in Indonesia (e.g. Sumatra, Kalimantan, Sulawesi and Java) but it has several mountains perfect for recreation sites.
- “Beaches can also be labeled as recreation sites!” True, but most seashore areas in Bali are owned by hotels, villas, and restaurants. That is also why seashores are labeled as cluster 1 all over the map.
Comments on cluster 3 labeled as ‘Rural’.
- This can be seen from how many sub-districts have ‘farm’ as their most common venue.
Based on the results, I can conclude that the restaurant should be built in one of Bali’s urban areas. Bali is a small island and a main tourist destination in Indonesia. There are no “uncrowded” areas in Bali; every city/regency is densely populated.
But let us go back to the Business Idea of this project. The goal is to find a strategic location for the restaurant, one where there is a lot of activity going on so that the restaurant can gain exposure and recognition. To meet that goal, the best location to build the restaurant is Denpasar.
Denpasar is the only city in Bali; the others are regencies. It is easy to predict or assume that Denpasar would be the answer to the Business Idea’s question of where the restaurant can best gain exposure and recognition. But that is the beauty of Data Science. As Data Scientists, we gain insights from data and draw our conclusions from the results of our analysis. Sometimes the insights tell us something we already know; other times they show us something new and different that we can learn from.
This model is not perfect, and there is still a lot to improve. We could also use a different algorithm such as DBSCAN and compare the results between the models. I think DBSCAN is a preferable approach for this type of business idea since it was built for density-based data, but K-Means is still a viable approach.
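As a rough idea of what the DBSCAN alternative might look like, the sketch below clusters venue coordinates directly and flags isolated venues as noise. Applying DBSCAN to raw coordinates is only one possible reading of the suggestion above, and it is not part of this article's analysis. The coordinates, the 500 m eps, and min_samples=3 are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0

# Made-up venue coordinates (latitude, longitude) in degrees.
coords = np.array([
    [-8.650, 115.210], [-8.651, 115.210], [-8.650, 115.211], [-8.651, 115.211],
    [-8.500, 115.260], [-8.501, 115.260], [-8.500, 115.261], [-8.501, 115.261],
    [-8.300, 115.000],  # an isolated venue
])

db = DBSCAN(
    eps=0.5 / EARTH_RADIUS_KM,  # 500 m expressed in radians
    min_samples=3,
    metric="haversine",         # expects [lat, lon] in radians
    algorithm="ball_tree",
).fit(np.radians(coords))
print(db.labels_)
```

Unlike K-Means, DBSCAN does not need the number of clusters in advance, and it labels points that belong to no dense area as -1.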
In this article, I just wanted to show an implementation of machine learning and how it works in a real-life case.
Hopefully I can show the implementation of different types of ML models in future articles, perhaps by building a DBSCAN clustering model for the same business idea and comparing its performance with K-Means.