In this project, we will use the Foursquare API to explore neighborhoods in San Francisco, find the most common venue categories in each neighborhood, group similar neighborhoods with the k-means clustering algorithm, and use the Folium library to visualize the neighborhoods and their emerging clusters.

[Figure: Project Flow]

This is the clustering map we will get by the end of this project:

[Figure: clustering map of San Francisco neighborhoods]

This piece is inspired by the Applied Data Science Capstone course on Coursera.

Prerequisite: Sign Up for Foursquare Developer Account

Foursquare is a technology company that has built a massive dataset of location data. By making calls to its API, we can search for a specific type of venue or store around a given location.


A free account is enough for this project. Once you have created an account, click Create a new App. Note your Client ID and Client Secret for later use.

Step 1: Scrape Zipcodes and Neighborhoods from Website

Import libraries:

We scrape the table of postal codes on http://www.healthysf.org/bdi/outcomes/zipmap.htm and transform it into a pandas dataframe.
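A minimal sketch of the scraping step, assuming the page holds a plain HTML table whose first row is the header. It uses the standard library's `html.parser` plus pandas. The helper names `TableParser` and `table_to_dataframe` are illustrative rather than the original code, and the sample HTML is a hypothetical stand-in for the real page.

```python
from html.parser import HTMLParser

import pandas as pd


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the cell's text fragments and normalize whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def table_to_dataframe(html):
    """Parse HTML and return a DataFrame, using the first row as the header."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return pd.DataFrame(body, columns=header)


# Hypothetical sample that mimics the layout of a zip code table.
sample_html = """
<table>
  <tr><th>Zip Code</th><th>Neighborhood</th></tr>
  <tr><td>94102</td><td>Hayes Valley/Tenderloin/North of Market</td></tr>
  <tr><td>94103</td><td>South of Market</td></tr>
</table>
"""
sf_raw = table_to_dataframe(sample_html)
print(sf_raw)
```

For the live page, fetch the HTML first (for example with `urllib.request`) and pass it to `table_to_dataframe`. The real page may contain nested tables or extra rows that need more cleanup than this sketch performs.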


We drop unnecessary columns and rows and assign the result to a new dataframe, sf_data.

[Figure: output of sf_data.head()]

We have 21 rows in our dataframe.

Step 2: Convert Addresses into Latitude and Longitude

In order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

We will use uszipcode, an easy-to-use zip code database for Python.
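A rough sketch of the lookup, assuming the dataframe has a `Zip Code` column (the column name is a guess). `make_uszipcode_lookup` wraps uszipcode's `SearchEngine.by_zipcode`. The helper names are illustrative, and the demo uses a stand-in lookup with placeholder coordinates so that it runs offline.

```python
import pandas as pd


def make_uszipcode_lookup():
    """Return a function mapping a zip code string to (lat, lng)."""
    # Imported lazily: uszipcode downloads its database on first use.
    from uszipcode import SearchEngine

    search = SearchEngine()

    def lookup(zipcode):
        result = search.by_zipcode(zipcode)
        return result.lat, result.lng

    return lookup


def add_coordinates(df, lookup, zip_col="Zip Code"):
    """Return a copy of df with Latitude and Longitude columns added."""
    coords = [lookup(str(z)) for z in df[zip_col]]
    out = df.copy()
    out["Latitude"] = [lat for lat, _ in coords]
    out["Longitude"] = [lng for _, lng in coords]
    return out


# Stand-in lookup with rounded placeholder values, not uszipcode output.
fake_coords = {"94102": (37.78, -122.42), "94103": (37.77, -122.41)}
demo = add_coordinates(pd.DataFrame({"Zip Code": ["94102", "94103"]}), fake_coords.get)
print(demo)
```

With uszipcode installed, the real call would be `add_coordinates(sf_data, make_uszipcode_lookup())`.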


Step 3: Explore Neighborhoods in San Francisco

1. Create a map of San Francisco with neighborhoods superimposed on top.

Use the geopy library to get the latitude and longitude values of San Francisco.

The geographical coordinates of San Francisco are 37.7792808, -122.4192363.
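One way this lookup might be written, assuming geopy's `Nominatim` geocoder. The function name `geocode_city` and the user agent string are hypothetical. The demo swaps in an offline stub that returns the coordinates quoted above, because a real call needs network access.

```python
from types import SimpleNamespace


def geocode_city(address, geolocator=None):
    """Return (latitude, longitude) for an address via a geopy-style geocoder."""
    if geolocator is None:
        # Imported lazily; Nominatim needs network access and a user agent.
        from geopy.geocoders import Nominatim

        geolocator = Nominatim(user_agent="sf_explorer")  # hypothetical agent name
    location = geolocator.geocode(address)
    if location is None:
        raise ValueError(f"Could not geocode {address!r}")
    return location.latitude, location.longitude


class StubGeocoder:
    """Offline stand-in that returns the coordinates quoted in this article."""

    def geocode(self, address):
        return SimpleNamespace(latitude=37.7792808, longitude=-122.4192363)


latitude, longitude = geocode_city("San Francisco, CA", StubGeocoder())
print(f"The geographical coordinates of San Francisco are {latitude}, {longitude}.")
```

Calling `geocode_city("San Francisco, CA")` without the stub would query Nominatim directly.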

We then use the folium library to plot the map. folium enables both the binding of data to a map and the passing of rich vector/raster/HTML visualizations as markers on the map.

[Figure: Map of San Francisco with neighborhoods]

Matplotlib provides many other colormap palettes if you prefer different marker colors.

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

2. Define Foursquare Credentials and Version

3. Get the top 100 venues that are in each neighborhood within a radius of 500 meters.
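A sketch of how the credentials and the venue query could fit together. It assumes Foursquare's legacy v2 `venues/explore` endpoint and its response layout (`response -> groups -> items -> venue`). The environment variable names, the version date, and the `fetch` parameter are assumptions added for illustration. Foursquare has released newer API versions since, so check the current documentation before relying on this.

```python
import json
import os
import urllib.parse
import urllib.request

import pandas as pd

# Credentials from the Foursquare developer console (hypothetical variable names).
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "")
VERSION = "20180605"  # assumed API version date
LIMIT = 100           # top 100 venues
RADIUS = 500          # meters

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"  # assumed legacy endpoint


def build_url(lat, lng, radius=RADIUS, limit=LIMIT):
    """Build the explore request URL for one location."""
    params = {
        "client_id": CLIENT_ID, "client_secret": CLIENT_SECRET, "v": VERSION,
        "ll": f"{lat},{lng}", "radius": radius, "limit": limit,
    }
    return f"{EXPLORE_URL}?{urllib.parse.urlencode(params)}"


def parse_venues(name, lat, lng, payload):
    """Flatten one explore response into rows of venue details."""
    items = payload["response"]["groups"][0]["items"]
    return [
        (name, lat, lng,
         item["venue"]["name"],
         item["venue"]["location"]["lat"],
         item["venue"]["location"]["lng"],
         item["venue"]["categories"][0]["name"])
        for item in items
    ]


def getNearbyVenues(names, latitudes, longitudes, fetch=None):
    """Query each neighborhood and return all venues in one DataFrame."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
    rows = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        rows.extend(parse_venues(name, lat, lng, fetch(build_url(lat, lng))))
    return pd.DataFrame(rows, columns=[
        "Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude",
        "Venue", "Venue Latitude", "Venue Longitude", "Venue Category"])


def fake_fetch(url):
    """Offline stand-in that returns one made-up venue."""
    return {"response": {"groups": [{"items": [
        {"venue": {"name": "Example Cafe",
                   "location": {"lat": 37.78, "lng": -122.42},
                   "categories": [{"name": "Coffee Shop"}]}}]}]}}


sf_venues_demo = getNearbyVenues(["Demo Neighborhood"], [37.78], [-122.42], fetch=fake_fetch)
print(sf_venues_demo)
```

Passing `fetch` as a parameter keeps the parsing logic testable without network access. Omitting it sends real HTTP requests.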

Run the above getNearbyVenues function on each neighborhood and create a new dataframe called sf_venues.

[Figure: the sf_venues dataframe]

1084 venues were returned by Foursquare.

Let's check how many venues were returned for each neighborhood:


Let's find out how many unique categories can be curated from all the returned venues.

There are 214 unique categories.
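Both checks can be expressed with standard pandas calls. The sketch below uses a tiny made-up `sf_venues` frame, and the `Neighborhood` and `Venue` column names are assumptions.

```python
import pandas as pd

# Toy stand-in for the real sf_venues dataframe.
sf_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "Venue": ["Cafe One", "Park One", "Cafe Two"],
    "Venue Category": ["Coffee Shop", "Park", "Coffee Shop"],
})

# Number of venues returned for each neighborhood.
venues_per_neighborhood = sf_venues.groupby("Neighborhood")["Venue"].count()

# Number of distinct venue categories across all neighborhoods.
n_categories = sf_venues["Venue Category"].nunique()

print(venues_per_neighborhood)
print(f"There are {n_categories} unique categories.")
```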

Step 4: Analyze Each Neighborhood

We first convert the Venue Category variable into dummy variables.

[Figure: the sf_onehot dataframe]

Next, we group the rows by neighborhood and take the mean of the frequency of occurrence of each venue category.
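The one-hot encoding and the grouping step might look like this on a toy dataframe. The name `sf_grouped` and the `Neighborhood` column are assumptions. Passing `dtype=float` to `pd.get_dummies` makes the group means plain frequencies.

```python
import pandas as pd

# Toy stand-in for the real sf_venues dataframe.
sf_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Park", "Coffee Shop", "Coffee Shop"],
})

# One column per category, with 1.0 where the venue belongs to it.
sf_onehot = pd.get_dummies(sf_venues["Venue Category"], dtype=float)
sf_onehot.insert(0, "Neighborhood", sf_venues["Neighborhood"])

# The mean of each dummy column is the share of venues in that category.
sf_grouped = sf_onehot.groupby("Neighborhood").mean().reset_index()
print(sf_grouped)
```

In the toy data, half of neighborhood A's venues are coffee shops, so its `Coffee Shop` frequency is 0.5.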


Create a new dataframe, neighborhoods_venues_sorted, using return_most_common_venues, and display the top 10 venues for each neighborhood.
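A possible shape for this step, assuming `return_most_common_venues` is a helper function applied to each row and that its output is collected in the `neighborhoods_venues_sorted` dataframe referenced later. The toy frequencies and the column labels are made up, and the sketch keeps only the top 2 categories instead of 10.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    """Return the category names with the highest frequency in one row."""
    frequencies = row.iloc[1:].astype(float)  # skip the Neighborhood column
    ranked = frequencies.sort_values(ascending=False)
    return ranked.index[:num_top_venues].tolist()


# Toy stand-in for the grouped frequency table.
sf_grouped = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Coffee Shop": [0.5, 0.2],
    "Park": [0.3, 0.7],
    "Bakery": [0.2, 0.1],
})

num_top_venues = 2  # the article uses 10
columns = ["1st Most Common Venue", "2nd Most Common Venue"]

neighborhoods_venues_sorted = pd.DataFrame(
    [return_most_common_venues(row, num_top_venues) for _, row in sf_grouped.iterrows()],
    columns=columns,
)
neighborhoods_venues_sorted.insert(0, "Neighborhood", sf_grouped["Neighborhood"])
print(neighborhoods_venues_sorted)
```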


Step 5: Cluster Neighborhoods

k-means is especially useful if you need to quickly discover insights from unlabeled data.

Run k-means to partition the neighborhoods into 5 clusters. The neighborhoods in each cluster are similar to each other in terms of the features included in the dataset.

A brief note on the KMeans parameter and method used here:

1. random_state: KMeans is stochastic (the results may vary even if you run it with the same inputs). To make the results reproducible, you can pass an int so that the randomness becomes deterministic.

2. KMeans.fit(): fit the KMeans model with the feature matrix.
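The two points above can be seen in a small sketch with scikit-learn's `KMeans`. The toy feature matrix is made up and uses 2 clusters instead of 5. `n_init` is set explicitly because its default differs across scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the grouped frequency table.
sf_grouped = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Coffee Shop": [0.9, 0.8, 0.1, 0.0],
    "Park": [0.1, 0.2, 0.9, 1.0],
})

kclusters = 2  # the article uses 5
features = sf_grouped.drop(columns="Neighborhood")

# Fit the model on the feature matrix; labels_ holds one cluster id per row.
kmeans = KMeans(n_clusters=kclusters, random_state=0, n_init=10).fit(features)
labels = kmeans.labels_

# Refitting with the same random_state reproduces the same labels.
labels_again = KMeans(n_clusters=kclusters, random_state=0, n_init=10).fit(features).labels_
print(labels, labels_again)
```

The coffee-heavy neighborhoods A and B land in one cluster and the park-heavy C and D in the other.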

Merge the original sf_data dataframe with the clustered neighborhoods_venues_sorted one.
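A sketch of the merge on toy data, assuming the cluster labels are first added to `neighborhoods_venues_sorted` as a `Cluster Labels` column and that both frames share a `Neighborhood` column. All column names and values here are illustrative.

```python
import pandas as pd

# Toy stand-ins for the real dataframes and k-means labels.
sf_data = pd.DataFrame({
    "Zip Code": ["00001", "00002"],
    "Neighborhood": ["A", "B"],
})
neighborhoods_venues_sorted = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "1st Most Common Venue": ["Coffee Shop", "Park"],
})
labels = [0, 1]

# Attach the cluster label of each neighborhood.
neighborhoods_venues_sorted.insert(0, "Cluster Labels", labels)

# Join on Neighborhood so every row of sf_data gains its cluster and top venues.
sf_merged = sf_data.join(
    neighborhoods_venues_sorted.set_index("Neighborhood"), on="Neighborhood"
)
print(sf_merged)
```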

[Figure: the sf_merged dataframe]

Finally, let's visualize the resulting clusters:
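A sketch of how such a cluster map could be drawn with folium's `Map` and `CircleMarker`. It assumes the merged dataframe has `Neighborhood`, `Latitude`, `Longitude`, and `Cluster Labels` columns. The palette here comes from the standard library's `colorsys`, swapped in for a Matplotlib colormap to keep the example dependency-light. Both helper names are hypothetical.

```python
import colorsys


def cluster_palette(kclusters):
    """Return one evenly spaced hex color per cluster."""
    colors = []
    for i in range(kclusters):
        r, g, b = colorsys.hsv_to_rgb(i / kclusters, 0.8, 0.9)
        colors.append("#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255)))
    return colors


def build_cluster_map(df, center, kclusters, zoom_start=12):
    """Draw one circle marker per neighborhood, colored by cluster label."""
    import folium  # imported lazily so the palette helper works without folium

    palette = cluster_palette(kclusters)
    cluster_map = folium.Map(location=list(center), zoom_start=zoom_start)
    for _, row in df.iterrows():
        cluster = int(row["Cluster Labels"])
        folium.CircleMarker(
            [row["Latitude"], row["Longitude"]],
            radius=5,
            popup=f"{row['Neighborhood']} (Cluster {cluster + 1})",
            color=palette[cluster],
            fill=True,
            fill_color=palette[cluster],
            fill_opacity=0.7,
        ).add_to(cluster_map)
    return cluster_map


palette = cluster_palette(5)
print(palette)
```

With folium installed, `build_cluster_map(sf_merged, (37.7792808, -122.4192363), 5)` returns a map that renders in a notebook or can be written out with `.save("map.html")`.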

[Figure: map of San Francisco neighborhoods colored by cluster]

Step 6: Examine Clusters

We can now examine each cluster and determine the discriminating venue categories that distinguish each cluster.
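Filtering by label is one way to inspect a cluster. The sketch below uses a toy `sf_merged` frame and assumes that Cluster 1 corresponds to label 0 and that the labels live in a `Cluster Labels` column.

```python
import pandas as pd

# Toy stand-in for the real sf_merged dataframe.
sf_merged = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Cluster Labels": [0, 0, 1],
    "1st Most Common Venue": ["Coffee Shop", "Coffee Shop", "Park"],
})

# Rows that belong to Cluster 1 (label 0).
cluster_1 = sf_merged.loc[sf_merged["Cluster Labels"] == 0]

# How often each category appears as the top venue within the cluster.
top_in_cluster_1 = cluster_1["1st Most Common Venue"].value_counts()

print(cluster_1)
print(top_in_cluster_1)
```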

Cluster 1

There are 16 neighborhoods in Cluster 1. Most neighborhoods in Cluster 1 have Coffee Shop among their top 10 venues.


Cluster 2

There is only 1 neighborhood in Cluster 2.


Cluster 3

There is only 1 neighborhood in Cluster 3.


Cluster 4

There is only 1 neighborhood in Cluster 4.


Cluster 5

There is 1 neighborhood in Cluster 5.


Conclusion

Because San Francisco has only a few neighborhoods, the clusters are not very insightful: one cluster holds 16 neighborhoods, while each of the other four holds a single one. You can try the same approach on datasets for bigger cities such as Toronto or New York.

Reference

  1. Applied Data Science Capstone
  2. uszipcode
  3. k-means
  4. What is meant by the term random-state in KMeans
  5. folium