Data Clustering in San Francisco Neighborhoods

In this project, we will use the Foursquare API to explore neighborhoods in San Francisco.

Vanessa Leung

Towards Data Science

~5 min read · September 29, 2019 (Updated: December 12, 2021)

In this project, we will use the Foursquare API to explore neighborhoods in San Francisco, get the most common venue categories in each neighborhood, use the k-means clustering algorithm to find similar neighborhoods, and use the Folium library to visualize the neighborhoods and their emerging clusters.

Project Flow

This is the clustering map we will get by the end of this project:

This piece is inspired by the Applied Data Science Capstone course on Coursera.

Prerequisite: Sign Up for Foursquare Developer Account

Foursquare is a technology company that has built a massive dataset of location data. By making calls to the Foursquare API, we can search for a specific type of venue or store around a given location.

A free account is enough for this project. Once you have created an account, click to create a new App. Note your Client ID and Client Secret for later use.

Step 1: Scrape Zipcodes and Neighborhoods from Website

Import libraries:

We scrape the table of postal codes from the following page, http://www.healthysf.org/bdi/outcomes/zipmap.htm, and transform the data into a pandas dataframe.

We drop unnecessary columns and rows, and assign the data to a new dataframe, sf_data.
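The scraping code itself is not reproduced here. A minimal sketch of the two steps above, assuming the zip codes sit in an ordinary HTML table, could look like this. The TableRows class, the html_table_to_frame helper, the column names, and the inline sample_html are all illustrative stand-ins, not the real page's markup.

```python
from html.parser import HTMLParser

import pandas as pd


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so cell text is clean.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def html_table_to_frame(html):
    """Treat the first row as the header and the rest as data."""
    parser = TableRows()
    parser.feed(html)
    header, *body = parser.rows
    return pd.DataFrame(body, columns=header)


# Made-up sample standing in for the downloaded page.
sample_html = """
<table>
<tr><th>Zipcode</th><th>Neighborhood</th><th>Population</th></tr>
<tr><td>00001</td><td>Alpha</td><td>100</td></tr>
<tr><td>00002</td><td>Beta</td><td>200</td></tr>
<tr><td>Total</td><td></td><td>300</td></tr>
</table>
"""

raw = html_table_to_frame(sample_html)
# Keep only real zip code rows and the two columns we need.
sf_data_demo = raw.loc[
    raw["Zipcode"].str.isdigit(), ["Zipcode", "Neighborhood"]
].reset_index(drop=True)
```

For the live page, the HTML string would come from an HTTP client, and nested or irregular tables would need extra handling.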

sf_data.head()

We have 21 rows in our dataframe.

Step 2: Convert Addresses into Latitude and Longitude

In order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

We will use uszipcode, an easy-to-use zipcode database for Python.
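As a rough sketch of this step, the lookup can be wrapped in a small function and applied to every row. The helper names, the Zipcode, Latitude and Longitude column names, and the demo values are assumptions for illustration. uszipcode is imported lazily because it fetches its database on first use, and the exact return value for an unknown zip code may differ between versions.

```python
import pandas as pd


def make_uszipcode_lookup():
    """Return a function mapping a zip code to (lat, lng) via uszipcode."""
    from uszipcode import SearchEngine  # imported lazily; needs a DB download

    engine = SearchEngine()

    def lookup(zipcode):
        result = engine.by_zipcode(zipcode)
        return result.lat, result.lng

    return lookup


def add_coordinates(df, lookup, zip_col="Zipcode"):
    """Return a copy of df with Latitude and Longitude columns added."""
    coords = [lookup(z) for z in df[zip_col]]
    out = df.copy()
    out["Latitude"] = [lat for lat, _ in coords]
    out["Longitude"] = [lng for _, lng in coords]
    return out


# Offline demo with a made-up lookup table instead of the real database.
fake_table = {"00001": (10.0, 20.0), "00002": (11.0, 21.0)}
demo = add_coordinates(
    pd.DataFrame({"Zipcode": ["00001", "00002"],
                  "Neighborhood": ["Alpha", "Beta"]}),
    fake_table.get,
)
```

With the real database, we would pass make_uszipcode_lookup() as the lookup argument instead of fake_table.get.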

Step 3: Explore Neighborhoods in San Francisco

1. Create a map of San Francisco with neighborhoods superimposed on top.

Use the geopy library to get the latitude and longitude values of San Francisco.

The geographical coordinates of San Francisco are 37.7792808, -122.4192363.
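A lookup like this can be sketched with geopy's Nominatim geocoder. The city_coordinates helper and the user_agent string are hypothetical names chosen for this example, and the call needs network access, so it is wrapped in a function rather than run directly.

```python
def city_coordinates(address, geocoder=None):
    """Return (latitude, longitude) for an address string.

    If no geocoder is passed in, fall back to geopy's Nominatim,
    which queries OpenStreetMap over the network.
    """
    if geocoder is None:
        from geopy.geocoders import Nominatim

        # user_agent is an arbitrary application name (assumed here).
        geocoder = Nominatim(user_agent="sf_explorer")
    location = geocoder.geocode(address)
    return location.latitude, location.longitude
```

Calling city_coordinates("San Francisco, CA") should give coordinates close to the ones quoted above, although the exact digits depend on the geocoding service.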

We then use the folium library to plot the map. folium supports both binding data to a map and passing rich vector/raster/HTML visualizations as markers on the map.

Map of San Francisco with neighborhoods

You can find other Matplotlib colormap palettes here.

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

2. Define Foursquare Credentials and Version
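A sketch of this step, assuming the credentials are stored in environment variables rather than pasted into the notebook. The variable names and the VERSION date are example values, not the ones used for the results in this article.

```python
import os

# Hypothetical environment variable names; the fallbacks are placeholders.
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "your-client-id")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "your-client-secret")

VERSION = "20180605"  # API version date in YYYYMMDD form (example value)
LIMIT = 100           # maximum number of venues requested per call
RADIUS = 500          # search radius in meters
```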

3. Get the top 100 venues that are in each neighborhood within a radius of 500 meters.
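One way to sketch getNearbyVenues is to separate the HTTP call from the parsing. The version below assumes the legacy Foursquare v2 venues/explore endpoint and its response layout as they were at the time of writing; Foursquare has changed its API since, so check the current documentation. The fetch argument, the explore_params and venues_from_response helpers, and the column names are assumptions for illustration.

```python
import pandas as pd

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"  # legacy v2 endpoint

COLUMNS = [
    "Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude",
    "Venue", "Venue Latitude", "Venue Longitude", "Venue Category",
]


def explore_params(lat, lng, client_id, client_secret, version,
                   radius=500, limit=100):
    """Query parameters for one explore request around (lat, lng)."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }


def venues_from_response(payload, neighborhood, lat, lng):
    """Flatten one JSON response into rows matching COLUMNS."""
    items = payload["response"]["groups"][0]["items"]
    return [
        (
            neighborhood, lat, lng,
            item["venue"]["name"],
            item["venue"]["location"]["lat"],
            item["venue"]["location"]["lng"],
            # Assumes every venue has at least one category.
            item["venue"]["categories"][0]["name"],
        )
        for item in items
    ]


def getNearbyVenues(names, latitudes, longitudes, fetch):
    """Call fetch(lat, lng) per neighborhood and stack the results."""
    rows = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        rows.extend(venues_from_response(fetch(lat, lng), name, lat, lng))
    return pd.DataFrame(rows, columns=COLUMNS)
```

Here fetch is any callable that returns the decoded JSON for one location, for example a small wrapper that sends explore_params to EXPLORE_URL with the requests library and calls .json() on the response.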

Run the getNearbyVenues function on each neighborhood and create a new dataframe called sf_venues.

sf_venues

1084 venues were returned by Foursquare.

Let's check how many venues were returned for each neighborhood:

Let's find out how many unique categories can be curated from all the returned venues.
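Both checks are short pandas operations. The sketch below runs them on a made-up toy_venues table, assuming the venue dataframe has Neighborhood, Venue and Venue Category columns.

```python
import pandas as pd

# Toy stand-in for sf_venues (values are made up).
toy_venues = pd.DataFrame({
    "Neighborhood": ["Alpha", "Alpha", "Alpha", "Beta"],
    "Venue": ["Cafe X", "Cafe Y", "Green Park", "Cafe Z"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Coffee Shop"],
})

# Number of venues returned for each neighborhood.
venues_per_neighborhood = toy_venues.groupby("Neighborhood")["Venue"].count()

# Number of distinct venue categories across all neighborhoods.
n_categories = toy_venues["Venue Category"].nunique()
```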

There are 214 unique categories.

Step 4: Analyze Each Neighborhood

We first convert the Venue Category variable into dummy variables.

sf_onehot

Group rows by neighborhood, taking the mean of the frequency of occurrence of each Venue Category.
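The one-hot encoding and grouping steps can be sketched on a small made-up table as follows. The column names are assumptions, and the toy data is only there to make the arithmetic easy to follow.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    "Neighborhood": ["Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Bakery",
                       "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(
    toy_venues["Venue Category"], prefix="", prefix_sep="", dtype=float
)
# Put the neighborhood name back as the first column. This would fail if a
# category were itself called "Neighborhood", so rename that column first.
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# Mean of each dummy column = share of the neighborhood's venues in that category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
```

In this toy table, Alpha has two coffee shops out of four venues, so its Coffee Shop value is 0.5.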

Define a helper function, return_most_common_venues, and use it to build a new dataframe, neighborhoods_venues_sorted, that displays the top 10 venues for each neighborhood.
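A sketch of this step is shown below. Here return_most_common_venues is written as a helper function and the resulting table is called neighborhoods_venues_sorted, the name used later in this article; that split, the top_venues_table helper, and the column labels are assumptions. The demo uses the top 2 instead of the top 10 to keep it small.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    """Return the category names with the highest frequencies in one row."""
    frequencies = row.drop("Neighborhood").astype(float)
    ranked = frequencies.sort_values(ascending=False)
    return ranked.index[:num_top_venues].tolist()


def top_venues_table(grouped, num_top_venues):
    """Build one row per neighborhood listing its most common categories."""
    columns = ["Neighborhood"] + [
        f"Most Common Venue {i + 1}" for i in range(num_top_venues)
    ]
    rows = [
        [row["Neighborhood"], *return_most_common_venues(row, num_top_venues)]
        for _, row in grouped.iterrows()
    ]
    return pd.DataFrame(rows, columns=columns)


# Made-up frequencies standing in for the grouped dataframe.
grouped_demo = pd.DataFrame({
    "Neighborhood": ["Alpha", "Beta"],
    "Bakery": [0.1, 0.5],
    "Coffee Shop": [0.6, 0.2],
    "Park": [0.3, 0.3],
})
neighborhoods_venues_sorted = top_venues_table(grouped_demo, num_top_venues=2)
```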

Step 5: Cluster Neighborhoods

k-means is especially useful if you need to quickly discover insights from unlabeled data.

Run k-means to partition the neighborhoods into 5 clusters. The neighborhoods in each cluster are similar to each other in terms of the features included in the dataset.

A brief rundown of the KMeans parameter and method used here:

1. random_state: KMeans is stochastic (the results may vary even if you run it with the same input values). To make the results reproducible, you can specify an int to fix the random seed.

2. KMeans.fit(): fit the KMeans model with the feature matrix.
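The two points above can be seen in a small sketch with scikit-learn. The toy feature table and the choice of 2 clusters are made up to keep the example readable; the article's own run uses 5 clusters on the grouped venue frequencies.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up category frequencies for four neighborhoods.
sf_grouped_demo = pd.DataFrame({
    "Neighborhood": ["Alpha", "Beta", "Gamma", "Delta"],
    "Coffee Shop": [0.9, 0.8, 0.1, 0.0],
    "Park": [0.1, 0.2, 0.9, 1.0],
})

# The feature matrix is everything except the neighborhood name.
features = sf_grouped_demo.drop(columns="Neighborhood")

# random_state fixes the seed; n_init is set explicitly because its
# default differs between scikit-learn versions.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(features)

# One integer label per neighborhood, in row order.
labels = kmeans.labels_
```

The label numbers themselves are arbitrary; what matters is which neighborhoods share a label.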

Merge the original sf_data dataframe with the clustered neighborhoods_venues_sorted one.
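A sketch of the merge on toy data, assuming both dataframes share a Neighborhood column and the cluster labels are stored in a column called Cluster Labels (both names are assumptions).

```python
import pandas as pd

# Toy stand-ins for the two dataframes (values are made up).
sf_data_demo = pd.DataFrame({
    "Zipcode": ["00001", "00002", "00003"],
    "Neighborhood": ["Alpha", "Beta", "Gamma"],
})
venues_sorted_demo = pd.DataFrame({
    "Neighborhood": ["Alpha", "Beta"],
    "Most Common Venue 1": ["Coffee Shop", "Park"],
})
labels = [0, 1]  # one k-means label per row of venues_sorted_demo

# Attach the labels, then left-join onto the original table by neighborhood.
venues_sorted_demo.insert(0, "Cluster Labels", labels)
sf_merged_demo = sf_data_demo.join(
    venues_sorted_demo.set_index("Neighborhood"), on="Neighborhood"
)

# Neighborhoods with no venues get NaN labels; drop them before plotting.
sf_clustered_demo = sf_merged_demo.dropna(subset=["Cluster Labels"]).copy()
sf_clustered_demo["Cluster Labels"] = sf_clustered_demo["Cluster Labels"].astype(int)
```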

sf_merged

Finally, let's visualize the resulting clusters:
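To color markers by cluster, each label needs its own color. A small sketch, assuming a Matplotlib colormap is sampled once per cluster; the cluster_palette helper and the choice of the rainbow colormap are illustrative.

```python
import numpy as np
from matplotlib import colors
from matplotlib import pyplot as plt


def cluster_palette(k, cmap_name="rainbow"):
    """Return k hex colors sampled evenly from a Matplotlib colormap."""
    cmap = plt.get_cmap(cmap_name)
    return [colors.rgb2hex(rgba) for rgba in cmap(np.linspace(0, 1, k))]


palette = cluster_palette(5)
# The color for a neighborhood with cluster label 2:
color_for_label_2 = palette[2]
```

Each hex string can then be passed as the color and fill_color of a folium CircleMarker when plotting the merged dataframe.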

Step 6: Examine Clusters

We can now examine each cluster and determine the discriminating venue categories that distinguish each cluster.
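A sketch of how one cluster can be inspected, assuming the merged dataframe has a Cluster Labels column and that Cluster 1 corresponds to label 0. The cluster_members and category_counts helpers and the toy data are made up for illustration.

```python
import pandas as pd


def cluster_members(merged, label):
    """Rows of the merged dataframe that belong to one cluster."""
    return merged.loc[merged["Cluster Labels"] == label].drop(
        columns="Cluster Labels"
    )


def category_counts(members, venue_cols):
    """How often each category appears among the top-venue columns."""
    return members[venue_cols].melt()["value"].value_counts()


sf_merged_demo = pd.DataFrame({
    "Neighborhood": ["Alpha", "Beta", "Gamma"],
    "Cluster Labels": [0, 0, 1],
    "Most Common Venue 1": ["Coffee Shop", "Coffee Shop", "Park"],
    "Most Common Venue 2": ["Park", "Bakery", "Trail"],
})

cluster_1 = cluster_members(sf_merged_demo, 0)
counts = category_counts(
    cluster_1, ["Most Common Venue 1", "Most Common Venue 2"]
)
```

A category that dominates counts for one cluster but not the others is a good candidate for what distinguishes that cluster.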

Cluster 1

There are 16 neighborhoods in Cluster 1. We can easily notice that most neighborhoods in Cluster 1 have Coffee Shop in their top 10 venues.

Cluster 2

There is only 1 neighborhood in Cluster 2.

Cluster 3

There is only 1 neighborhood in Cluster 3.

Cluster 4

There is only 1 neighborhood in Cluster 4.

Cluster 5

There is 1 neighborhood in Cluster 5.

Conclusion

Because there are only a few neighborhoods in San Francisco, we can't really get insightful clusters. However, you can try the same approach on datasets for bigger cities, such as Toronto or New York.
