Data Science: Finding the Strategic Location to Build a Restaurant in Bali Using K-Means Clustering

What is Data Science?

Data Science is the process and method for extracting knowledge and insights from large volumes of disparate data. It’s an interdisciplinary field involving mathematics, statistical analysis, data visualization, machine learning, and more.

Clustering in Data Science

Clustering is an unsupervised machine learning technique used to group similar data points together and discover underlying patterns. Among clustering algorithms, k-means is one of the most popular and easiest to use.

Temple in Bali | Photo by Harry Kessell on Unsplash

Introduction & Business Idea

1. Introduction

Indonesia is a country in Southeast Asia and Oceania, between the Indian and Pacific oceans. It consists of more than seventeen thousand islands, including Sumatra, Java, Sulawesi, and parts of Borneo (Kalimantan) and New Guinea (Papua). Indonesia is the world’s largest island country and the 14th-largest country by land area, at 1,904,569 square kilometres (735,358 square miles).

2. Business Idea / Business Understanding

The business idea is to build an Indonesian restaurant that serves all kinds of Indonesian dishes. The aim is to introduce Indonesian culture to the world, using Indonesian cuisine as the medium. For this business to succeed and stay in line with that aim, we have to find a location that lets our restaurant gain recognition and exposure among foreign visitors to the country.

Data Preparation

Before going further into the Data Preparation section, we have to know which city in Indonesia receives the most foreign visitors. To do that, we need data on foreign visitors to Indonesia by point of entry.

Foreign Visitors to Indonesia by Point of Entry (Data)
Foreign Visitors to Indonesia by Point of Entry (Matplotlib Plot)
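As a rough sketch of this step, the table can be loaded into pandas, sorted by total visitors, and the top point of entry read off. The column names and every row below are placeholders I made up for illustration; they are not the real statistics.

```python
import pandas as pd

# Placeholder rows: names and counts are invented, not real visitor data.
visitors_df = pd.DataFrame({
    "Point of Entry": ["Entry A", "Entry B", "Entry C"],
    "Total Visitors": [1200, 4500, 800],
})

# Sort from most to fewest visitors and read the first row.
ranked = visitors_df.sort_values("Total Visitors", ascending=False).reset_index(drop=True)
top_entry = ranked.loc[0, "Point of Entry"]
print(top_entry)
```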
# Import Libraries
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
# Using Nominatim geocoder API
geolocator = Nominatim(user_agent='Chris_P_Bacon_') # Crispy Bacon!!
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
bali_df['Location'] = bali_df['Sub-District'].apply(geocode)
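Each entry in the 'Location' column is a geopy Location object, or None when the geocoder finds no match. Here is a minimal sketch of pulling the coordinates out of that column. The helper name `split_coordinates` and the 'Latitude' and 'Longitude' column names are my assumptions, and stand-in objects with made-up coordinates replace real geocoder results so the snippet runs without network access.

```python
from types import SimpleNamespace

import pandas as pd


def split_coordinates(df, location_col="Location"):
    """Return a copy of df with 'Latitude' and 'Longitude' columns added."""
    out = df.copy()
    # getattr falls back to None when a lookup failed and no Location is present.
    out["Latitude"] = out[location_col].apply(lambda loc: getattr(loc, "latitude", None))
    out["Longitude"] = out[location_col].apply(lambda loc: getattr(loc, "longitude", None))
    return out


# Stand-ins for geocoder results (hypothetical values, not real lookups).
demo_df = pd.DataFrame({
    "Sub-District": ["Known Place", "Unknown Place"],
    "Location": [SimpleNamespace(latitude=-8.5, longitude=115.2), None],
})
coords_df = split_coordinates(demo_df)
print(coords_df[["Sub-District", "Latitude", "Longitude"]])
```

Rows with a missing latitude can then be reviewed by hand or dropped before plotting.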
Map of Bali (Folium)
Function to create DataFrame to store venue latitude and longitude

Exploratory Data Analysis

In this exploratory data analysis section, I want to explore the dataset to see how frequently each venue category occurs and rank the 1st to 3rd most common venues in each sub-district. The ‘Venue Category’ column in our dataset contains information that is useful for our machine learning model. Unfortunately, the column holds strings, which the clustering algorithm can’t use directly, so the categories are first converted to numeric columns with one-hot encoding.

One Hot Encoding output dataframe
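One common way to one-hot encode a string column is `pd.get_dummies`. The sketch below uses toy rows and the name `venues_df`, both my own assumptions; the original encoding code is not shown, so this may differ from how `bali_onehot` was actually built.

```python
import pandas as pd

# Toy rows, invented for illustration.
venues_df = pd.DataFrame({
    "Sub-District": ["A", "A", "B"],
    "Venue Category": ["Restaurant", "Beach", "Restaurant"],
})

# One 0/1 indicator column per venue category.
onehot = pd.get_dummies(venues_df["Venue Category"], dtype=int)

# Put the sub-district name back so rows can be grouped later.
onehot.insert(0, "Sub-District", venues_df["Sub-District"])
print(onehot)
```

Taking the mean of these indicator columns per sub-district then gives the share of venues in each category.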
# Groupby 'Sub-District'
bali_grouped = bali_onehot.groupby('Sub-District').mean().reset_index()
Adding 1st, 2nd, 3rd Most common columns
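Ranking the categories per sub-district can be sketched as follows. The function name `most_common_venues`, the output column names, and the toy frequencies are assumptions for illustration, not the original code.

```python
import pandas as pd

ORDINALS = ["1st", "2nd", "3rd"]


def most_common_venues(grouped, key="Sub-District"):
    """For each row, keep the three category columns with the highest frequency."""
    rows = []
    for _, row in grouped.iterrows():
        # Drop the name column, then sort frequencies from high to low.
        freqs = row.drop(labels=key).astype(float).sort_values(ascending=False)
        record = {key: row[key]}
        for ordinal, category in zip(ORDINALS, freqs.index):
            record[f"{ordinal} Most Common Venue"] = category
        rows.append(record)
    return pd.DataFrame(rows)


# Toy mean frequencies per sub-district (invented values).
grouped_demo = pd.DataFrame({
    "Sub-District": ["A", "B"],
    "Beach": [0.1, 0.6],
    "Restaurant": [0.7, 0.3],
    "Convenience Store": [0.2, 0.1],
})
top3 = most_common_venues(grouped_demo)
print(top3)
```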

Machine Learning — Clustering Model (K-Means Clustering)

The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to large numbers of samples and has been used across a large range of application areas in many different fields.
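To make the inertia concrete, here is a small NumPy sketch that computes it by hand for a fixed assignment: the sum of squared distances from each point to the centroid of its cluster. The points, labels, and the `inertia` helper are toy assumptions, not part of the Bali dataset.

```python
import numpy as np


def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))


# Four toy points forming two obvious groups.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

# Each centroid is the mean of the points assigned to that cluster.
centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

wcss = inertia(points, labels, centroids)
print(wcss)  # 4.0: every point sits exactly 1 unit from its centroid
```

K-means searches for the assignment and centroids that make this number as small as possible for a given number of clusters.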

# Import KMeans from scikit-learn
from sklearn.cluster import KMeans
# Initiate object for clustering dataset
bali_clustering = bali_grouped.drop(columns='Sub-District')
# Build clustering model to find the appropriate number of K
K = range(1, 6)
inertias = [] # Empty list for inertia
for k in K:
    # Building and fitting the model
    KMeansModel = KMeans(n_clusters=k, random_state=4)
    KMeansModel.fit(bali_clustering)
    inertias.append(KMeansModel.inertia_)
Elbow Method Plot
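The cluster list in this section runs from 0 to 3, so the final model appears to use four clusters. A sketch of fitting that model and attaching the labels to each sub-district is shown here on toy data. The frame `toy_grouped`, its values, and the 'Cluster Labels' column name are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frequency table with four clearly separated pairs of rows (invented values).
toy_grouped = pd.DataFrame({
    "Sub-District": list("ABCDEFGH"),
    "Beach":      [0.9, 0.8, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0],
    "Restaurant": [0.1, 0.2, 0.9, 1.0, 0.0, 0.0, 0.1, 0.0],
    "Store":      [0.0, 0.0, 0.0, 0.0, 1.0, 0.9, 0.0, 0.1],
    "Hotel":      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9, 0.9],
})

# Cluster on the numeric columns only.
features = toy_grouped.drop(columns="Sub-District")
model = KMeans(n_clusters=4, random_state=4, n_init=10).fit(features)

# Attach one label per sub-district so it can be merged with coordinates and mapped.
labeled = toy_grouped.copy()
labeled.insert(0, "Cluster Labels", model.labels_)
print(labeled[["Cluster Labels", "Sub-District"]])
```

The label numbers themselves are arbitrary; only which rows share a label carries meaning.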
Map of Bali with Cluster Labels (Folium Map)
Cluster 0 — Suburban
Cluster 1 — Urban
Cluster 2 — Recreation Site
Cluster 3 — Rural
  • Distinguishable by the number of convenience stores, which appear as the most common venue in almost every sub-district. Having lots of convenience stores is common in suburban and rural areas across Indonesia, not just in Bali.
  • Based on my observation of the Folium map, cluster 1 consists mostly of restaurants, plus some other venue categories. Many suburban areas are also labeled as cluster 1 (e.g. the 5 sub-districts near ‘Gilimanuk’ on the top left side of the map) because restaurants exist even in suburban areas.
  • “Beaches can also be labeled as recreation sites!” True, but most seashore areas in Bali are owned by hotels, villas, and restaurants. That is also why seashores are labeled as cluster 1 all over the map.
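Observations like the ones above can be checked with a quick tally of which category is the top venue within each cluster. This is a sketch on toy rows; the column names and values are assumptions, not the real results.

```python
import pandas as pd

# Toy labeled rows (invented values).
labeled_demo = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 1, 1, 1],
    "1st Most Common Venue": [
        "Convenience Store", "Convenience Store", "Restaurant",
        "Restaurant", "Restaurant", "Restaurant",
    ],
})

# Rows are clusters, columns are categories, cells count the sub-districts.
counts = pd.crosstab(labeled_demo["Cluster Labels"], labeled_demo["1st Most Common Venue"])

# The dominant category per cluster is the column with the largest count.
dominant = counts.idxmax(axis=1)
print(counts)
print(dominant)
```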

Conclusion

Based on the results, it is safe to say that I can build a restaurant in the urban areas of Bali. Bali is a small island and a major tourist destination in Indonesia. There are no “uncrowded” areas in Bali; every city/regency is densely populated.

Data Analyst/Scientist | Github: jonando93