In this project, we will use the Foursquare API to explore neighborhoods in San Francisco, find the most common venue categories in each neighborhood, use the k-means clustering algorithm to group similar neighborhoods, and use the Folium library to visualize the neighborhoods and their emerging clusters.
This is the clustering map we will get by the end of this project:
This piece is inspired by the Applied Data Science Capstone course on Coursera.
Prerequisite: Sign Up for Foursquare Developer Account
Foursquare is a technology company that has built a massive dataset of location data. By making calls to its API, we can search for specific types of venues or stores around a given location.
A free account is enough for this project. Once you have created an account, create a new app. Note your Client ID and Client Secret for later use.
Step 1: Scrape Zipcodes and Neighborhoods from Website
Import libraries:
We scrape the page http://www.healthysf.org/bdi/outcomes/zipmap.htm to obtain the table of postal codes and transform it into a pandas dataframe.
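As a rough sketch of the parsing step, the snippet below turns an HTML table into a pandas dataframe using only the standard-library parser. The `TableParser` class, the `table_to_dataframe` helper, the column names, and the sample rows are all assumptions made for illustration; they are not the page's actual contents. In practice the HTML string would come from downloading the page above.

```python
from html.parser import HTMLParser

import pandas as pd


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so cells compare cleanly.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def table_to_dataframe(html):
    """Treat the first table row as the header and the rest as data."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return pd.DataFrame(body, columns=header)


# Placeholder HTML standing in for the downloaded page.
sample_html = """
<table>
  <tr><th>Zip Code</th><th>Neighborhood</th><th>Other</th></tr>
  <tr><td>94102</td><td>Example Neighborhood A</td><td>x</td></tr>
  <tr><td>94103</td><td>Example Neighborhood B</td><td>y</td></tr>
</table>
"""
sf_data = table_to_dataframe(sample_html)[["Zip Code", "Neighborhood"]]
print(sf_data)
```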
We drop unnecessary columns and rows and assign the data to a new dataframe, sf_data.
We have 21 rows in our dataframe.
Step 2: Convert Addresses into Latitude and Longitude
In order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.
We will use uszipcode, an easy-to-use zipcode database for Python.
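One way this lookup might look is sketched below. The helper names `uszipcode_lookup` and `add_coordinates`, and the `Zip Code` column name, are assumptions for illustration. The uszipcode import is done lazily because the library downloads its database the first time it is used.

```python
import pandas as pd


def uszipcode_lookup():
    """Return a zipcode -> (lat, lng) function backed by uszipcode (assumed installed)."""
    from uszipcode import SearchEngine  # lazy import; needs a database download on first use

    search = SearchEngine()

    def lookup(zipcode):
        result = search.by_zipcode(zipcode)
        if result is None:
            return (None, None)
        return (result.lat, result.lng)

    return lookup


def add_coordinates(df, lookup, zip_column="Zip Code"):
    """Return a copy of df with Latitude and Longitude columns filled by `lookup`."""
    coords = df[zip_column].map(lookup)
    out = df.copy()
    out["Latitude"] = [c[0] for c in coords]
    out["Longitude"] = [c[1] for c in coords]
    return out
```

With the library installed, `add_coordinates(sf_data, uszipcode_lookup())` would add the two coordinate columns. Passing the lookup as an argument also makes it easy to swap in another geocoding source.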
Step 3: Explore Neighborhoods in San Francisco
1. Create a map of San Francisco with neighborhoods superimposed on top.
Use the geopy library to get the latitude and longitude values of San Francisco.
The geographical coordinates of San Francisco are 37.7792808, -122.4192363.
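A minimal sketch of the geocoding call follows, assuming geopy's Nominatim geocoder. The function name `city_coordinates` and the `user_agent` string are made up for this example, and the geocoder can be passed in so the helper can be tried without a network connection.

```python
def city_coordinates(address, geolocator=None):
    """Return (latitude, longitude) for an address string."""
    if geolocator is None:
        # Lazy import so the helper can be exercised offline with a stub geocoder.
        from geopy.geocoders import Nominatim

        geolocator = Nominatim(user_agent="sf_explorer")  # arbitrary label (assumption)
    location = geolocator.geocode(address)
    if location is None:
        raise ValueError(f"Could not geocode {address!r}")
    return location.latitude, location.longitude
```

Calling `city_coordinates("San Francisco, CA")` performs a live lookup, so the exact coordinates returned may differ slightly from the ones quoted above.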
We then use the folium library to plot the map. folium supports both binding data to a map and passing rich vector/raster/HTML visualizations as markers on the map.
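The sketch below shows one way to superimpose the neighborhoods on the map. The helper names `marker_specs` and `render_map`, the marker styling, and the zoom level are assumptions, not necessarily the values behind the map in this article. Data preparation is kept separate from rendering so it can be checked without folium installed.

```python
def marker_specs(rows, color="blue"):
    """Turn (neighborhood, lat, lng) tuples into plain marker descriptions."""
    return [
        {"label": name, "location": [lat, lng], "color": color}
        for name, lat, lng in rows
        if lat is not None and lng is not None  # skip rows without coordinates
    ]


def render_map(center, specs, zoom_start=12):
    """Draw one circle marker per spec on a folium map centred on `center`."""
    import folium  # assumed installed; imported lazily

    sf_map = folium.Map(location=list(center), zoom_start=zoom_start)
    for spec in specs:
        folium.CircleMarker(
            location=spec["location"],
            radius=5,
            popup=spec["label"],
            color=spec["color"],
            fill=True,
            fill_color=spec["color"],
            fill_opacity=0.7,
        ).add_to(sf_map)
    return sf_map
```

In a notebook, `render_map((37.7792808, -122.4192363), marker_specs(rows))` would display the map, where `rows` holds one `(neighborhood, latitude, longitude)` tuple per neighborhood.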
You can find other more beautiful Matplotlib Colormap palettes here.
Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.
2. Define Foursquare Credentials and Version
3. Get the top 100 venues that are in each neighborhood within a radius of 500 meters.
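A possible shape for this step is sketched below. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its response layout, as used in the Coursera lab; Foursquare's current API may differ. The `explore_url` and `parse_venues` helpers, the `credentials` tuple of (client ID, client secret, version), and the output column names are illustrative choices.

```python
from urllib.parse import urlencode

import pandas as pd

# Legacy v2 endpoint (assumption; check the current Foursquare documentation).
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the request URL for venues within `radius` meters of (lat, lng)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,  # API version date, e.g. "20180605" (placeholder)
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


def parse_venues(payload, neighborhood):
    """Flatten one explore response into a list of row dictionaries."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories") or []
        rows.append(
            {
                "Neighborhood": neighborhood,
                "Venue": venue["name"],
                "Venue Latitude": venue["location"]["lat"],
                "Venue Longitude": venue["location"]["lng"],
                "Venue Category": categories[0]["name"] if categories else None,
            }
        )
    return rows


def getNearbyVenues(names, latitudes, longitudes, credentials, radius=500, limit=100):
    """Query the API once per neighborhood and stack the results in a dataframe."""
    import requests  # assumed installed; imported lazily

    rows = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = explore_url(*credentials, lat, lng, radius, limit)
        rows.extend(parse_venues(requests.get(url, timeout=30).json(), name))
    return pd.DataFrame(rows)
```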
Run the above getNearbyVenues function on each neighborhood and create a new dataframe called sf_venues.
1084 venues were returned by Foursquare.
Let's check how many venues were returned for each neighborhood:
Let's find out how many unique categories can be curated from all the returned venues.
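Both counts can be computed with a pandas one-liner each. The sketch below uses a tiny invented stand-in for sf_venues, and the column names are assumptions, so the numbers it prints are not the real results.

```python
import pandas as pd

# Toy stand-in for sf_venues; rows are invented for illustration only.
sf_venues = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "B"],
        "Venue": ["v1", "v2", "v3"],
        "Venue Category": ["Coffee Shop", "Park", "Coffee Shop"],
    }
)

# Number of venues returned for each neighborhood.
venues_per_neighborhood = sf_venues.groupby("Neighborhood")["Venue"].count()

# Number of distinct venue categories across all neighborhoods.
n_categories = sf_venues["Venue Category"].nunique()

print(venues_per_neighborhood)
print(f"Unique categories: {n_categories}")
```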
There are 214 unique categories.
Step 4: Analyze Each Neighborhood
We first convert the Venue Category variable into dummy variables.
Group rows by neighborhood, taking the mean of the frequency of occurrence of each venue category.
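The encoding and grouping steps can be sketched as follows. The toy sf_venues rows and the names `sf_onehot` and `sf_grouped` are assumptions for illustration. After grouping, each value is the share of a neighborhood's venues that fall into that category.

```python
import pandas as pd

# Toy stand-in for sf_venues; rows are invented for illustration only.
sf_venues = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "B", "B"],
        "Venue Category": ["Coffee Shop", "Park", "Coffee Shop", "Coffee Shop"],
    }
)

# One indicator column per category (1 if the venue has that category, else 0).
sf_onehot = pd.get_dummies(sf_venues["Venue Category"], dtype=int)
sf_onehot.insert(0, "Neighborhood", sf_venues["Neighborhood"])

# The mean of the indicator columns is the frequency of each category per neighborhood.
sf_grouped = sf_onehot.groupby("Neighborhood").mean().reset_index()
print(sf_grouped)
```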
Create a new dataframe return_most_common_venues and display the top 10 venues for each neighborhood.
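One way to rank the categories per neighborhood is sketched below. The helper name `top_categories` and the output column names are hypothetical, and the toy frequencies are invented.

```python
import pandas as pd


def top_categories(sf_grouped, num_top_venues=10):
    """For each neighborhood, list its categories from most to least frequent."""
    features = sf_grouped.set_index("Neighborhood")
    rows = []
    for neighborhood, freqs in features.iterrows():
        ranked = freqs.sort_values(ascending=False).index[:num_top_venues]
        row = {"Neighborhood": neighborhood}
        for rank, category in enumerate(ranked, start=1):
            row[f"Most Common Venue {rank}"] = category
        rows.append(row)
    return pd.DataFrame(rows)


# Toy frequency table; values are invented for illustration only.
sf_grouped = pd.DataFrame(
    {
        "Neighborhood": ["A", "B"],
        "Coffee Shop": [0.2, 0.7],
        "Park": [0.8, 0.3],
    }
)
print(top_categories(sf_grouped, num_top_venues=2))
```

Categories with tied frequencies may come out in either order, which is worth keeping in mind when comparing runs.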
Step 5: Cluster Neighborhoods
k-means is especially useful if you need to quickly discover insights from unlabeled data.
Run k-means to cluster the neighborhoods into 5 clusters. k-means partitions the neighborhoods into 5 groups so that neighborhoods in the same cluster are similar to each other in terms of the features included in the dataset.
A brief note on the KMeans pieces used here:
1. random_state: KMeans is stochastic (the results may vary between runs even with the same input values). To make the results reproducible, pass an int to fix the random seed.
2. KMeans.fit(): fits the KMeans model to the feature matrix.
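A minimal sketch of the clustering call with scikit-learn is shown below. The toy frequency table is invented and uses 2 clusters so the result is easy to read; the article's run uses 5. The explicit `n_init=10` is an assumption added to keep behavior consistent across scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frequency table; values are invented for illustration only.
sf_grouped = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C", "D"],
        "Coffee Shop": [0.9, 0.8, 0.1, 0.2],
        "Park": [0.1, 0.2, 0.9, 0.8],
    }
)

kclusters = 2  # the article uses 5

# The feature matrix is every column except the neighborhood name.
features = sf_grouped.drop(columns="Neighborhood")

# A fixed random_state makes repeated runs return the same labels.
kmeans = KMeans(n_clusters=kclusters, random_state=0, n_init=10).fit(features)
labels = kmeans.labels_
print(dict(zip(sf_grouped["Neighborhood"], labels)))
```

The label numbers themselves are arbitrary; only the grouping they induce is meaningful.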
Merge the original sf_data dataframe with the clustered neighborhoods_venues_sorted one.
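The merge might look like the sketch below. The toy rows, the `Cluster Labels` column name, and the result name `sf_merged` are assumptions. An inner merge is used here so that a neighborhood without clustered venues is dropped; a left merge would keep it with missing values instead.

```python
import pandas as pd

# Toy stand-ins; rows are invented for illustration only.
sf_data = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C"],
        "Latitude": [37.78, 37.77, 37.76],
        "Longitude": [-122.41, -122.42, -122.43],
    }
)
neighborhoods_venues_sorted = pd.DataFrame(
    {
        "Neighborhood": ["A", "B"],
        "Cluster Labels": [0, 1],
        "Most Common Venue 1": ["Coffee Shop", "Park"],
    }
)

# Keep only neighborhoods present in both dataframes.
sf_merged = sf_data.merge(neighborhoods_venues_sorted, on="Neighborhood", how="inner")
print(sf_merged)
```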
Finally, let's visualize the resulting clusters:
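A sketch of the cluster map is given below, with one color per cluster. The helpers `cluster_colors` and `render_cluster_map` are hypothetical, and the palette is built with the standard-library `colorsys` module as a plain stand-in for a Matplotlib colormap, so the colors will not match the map shown in this article.

```python
import colorsys


def cluster_colors(n_clusters):
    """Return n_clusters evenly spaced hues as hex color strings."""
    colors = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return colors


def render_cluster_map(center, rows, n_clusters, zoom_start=12):
    """Plot (name, lat, lng, cluster) rows, colored by zero-based cluster label."""
    import folium  # assumed installed; imported lazily

    palette = cluster_colors(n_clusters)
    cluster_map = folium.Map(location=list(center), zoom_start=zoom_start)
    for name, lat, lng, cluster in rows:
        folium.CircleMarker(
            location=[lat, lng],
            radius=5,
            popup=f"{name} (Cluster {cluster + 1})",
            color=palette[cluster],
            fill=True,
            fill_color=palette[cluster],
            fill_opacity=0.7,
        ).add_to(cluster_map)
    return cluster_map
```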
Step 6: Examine Clusters
We can now examine each cluster and determine the discriminating venue categories that distinguish each cluster.
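Selecting the rows of a single cluster could be done as in the sketch below. The helper name `cluster_members`, the `Cluster Labels` column, and the toy rows are assumptions. Labels produced by KMeans start at 0, so label 0 corresponds to Cluster 1 in the headings that follow.

```python
import pandas as pd


def cluster_members(sf_merged, label, label_column="Cluster Labels"):
    """Return the rows assigned to one cluster, without the label column."""
    mask = sf_merged[label_column] == label
    return sf_merged.loc[mask].drop(columns=[label_column])


# Toy stand-in for the merged dataframe; rows are invented for illustration only.
sf_merged = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C"],
        "Cluster Labels": [0, 0, 1],
        "Most Common Venue 1": ["Coffee Shop", "Coffee Shop", "Park"],
    }
)

# Number of neighborhoods in each cluster, ordered by label.
cluster_sizes = sf_merged["Cluster Labels"].value_counts().sort_index()
print(cluster_sizes)
print(cluster_members(sf_merged, 0))
```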
Cluster 1
There are 16 neighborhoods in Cluster 1. Most neighborhoods in Cluster 1 have Coffee Shop in their top 10 venues.
Cluster 2
There is only 1 neighborhood in Cluster 2.
Cluster 3
There is only 1 neighborhood in Cluster 3.
Cluster 4
There is only 1 neighborhood in Cluster 4.
Cluster 5
There is 1 neighborhood in Cluster 5.
Conclusion
Because our dataset contains only a few San Francisco neighborhoods, the resulting clusters are not very insightful. However, you can try the same approach on datasets for bigger cities, such as Toronto or New York.