Disclaimer: this article was written as part of the final submission for the IBM Data Science Professional Certificate course.
Then we can use BeautifulSoup to scrape the boroughs from Wikipedia. Next, we collect every link given in the neighborhood column of the table. For each link, we use requests to visit the corresponding Wikipedia page and scrape the population data from the table on the right-hand side.
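The link-collection step can be sketched roughly as follows. This is a minimal sketch, not the original notebook code: the inline HTML, the two-column layout, and the helper name `parse_neighborhood_table` are assumptions made for illustration. A real run would first download the page, for example with `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page; the real table layout may differ.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>Bronx</td><td><a href="/wiki/Melrose,_Bronx">Melrose</a></td></tr>
  <tr><td>Queens</td><td><a href="/wiki/Astoria,_Queens">Astoria</a></td></tr>
</table>
"""


def parse_neighborhood_table(html):
    """Return (borough, neighborhood, link) tuples from the first wikitable."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        cells = tr.find_all("td")
        if len(cells) < 2:
            continue
        link = cells[1].find("a")
        if link is None:
            continue
        rows.append((
            cells[0].get_text(strip=True),
            link.get_text(strip=True),
            "https://en.wikipedia.org" + link["href"],
        ))
    return rows


rows = parse_neighborhood_table(SAMPLE_HTML)
for borough, neighborhood, link in rows:
    print(borough, neighborhood, link)
```

Each collected link can then be visited in a loop to scrape the population figure from that neighborhood's own page.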
We can combine data frames from previous steps into one based on “neighborhood” and “borough”:
Here is a box plot of “Population” per “borough”:
Here is another box plot, of “neighborhood” per “borough”:
After collecting the population data, it is time to collect the hospital data. We can use the Foursquare API to fetch hospital data for the latitude and longitude of each neighborhood in the previous dataset.
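As a rough illustration, a hospital search request for one neighborhood could be built like this. The sketch assumes the legacy Foursquare v2 `venues/search` endpoint with client ID and secret credentials; newer Foursquare accounts may need a different endpoint and authentication scheme. The helper name `build_hospital_search_url`, the radius, the limit, and the version date are all hypothetical choices.

```python
from urllib.parse import urlencode

# Assumption: legacy Foursquare v2 search endpoint.
SEARCH_URL = "https://api.foursquare.com/v2/venues/search"


def build_hospital_search_url(lat, lng, client_id, client_secret,
                              radius=1000, limit=50, version="20200401"):
    """Build a search URL for hospitals around one latitude/longitude pair."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date (placeholder value)
        "ll": f"{lat},{lng}",    # latitude,longitude of the neighborhood
        "query": "hospital",
        "radius": radius,        # search radius in meters (placeholder value)
        "limit": limit,
    }
    return SEARCH_URL + "?" + urlencode(params)


url = build_hospital_search_url(40.7128, -74.0060, "YOUR_ID", "YOUR_SECRET")
print(url)
```

The resulting URL would be passed to `requests.get` once per neighborhood, and the JSON responses collected into a data frame.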
We can also collect hospital bed data from the NYS Health Profile website, again scraping it with BeautifulSoup. We collected the IDs of the hospitals in NYC manually and, based on those IDs, scraped the data from the NYS Health Profile website. The data frame looks like this:
Now we are going to combine the data from step four and step five. We are going to inner join the data frames on “neighborhood” and “borough”.
We are going to clean up the data a little and sum the bed count and ICU bed count, grouping by “neighborhood” and “borough”:
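The join and aggregation could look roughly like the following in pandas. This is a sketch under stated assumptions: the sample rows and the column names `bed_count` and `icu_bed_count` are made up for illustration and are not the article's actual data.

```python
import pandas as pd

# Hypothetical hospital locations per neighborhood (step four).
locations = pd.DataFrame({
    "borough": ["Bronx", "Queens", "Brooklyn"],
    "neighborhood": ["Melrose", "Astoria", "Bushwick"],
    "latitude": [40.82, 40.77, 40.69],
    "longitude": [-73.91, -73.92, -73.92],
})

# Hypothetical bed data, one row per hospital (step five).
bed_data = pd.DataFrame({
    "borough": ["Bronx", "Bronx", "Queens"],
    "neighborhood": ["Melrose", "Melrose", "Astoria"],
    "bed_count": [200, 150, 300],
    "icu_bed_count": [20, 10, 30],
})

# Inner join: only neighborhoods present in both frames survive.
merged = locations.merge(bed_data, on=["neighborhood", "borough"], how="inner")

# Sum the beds of all hospitals within each neighborhood.
totals = (
    merged.groupby(["neighborhood", "borough"], as_index=False)[
        ["bed_count", "icu_bed_count"]
    ].sum()
)
print(totals)
```

Note that an inner join silently drops neighborhoods with no matching bed data (Bushwick in this sample), which is why neighborhoods without hospitals have to be examined separately.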
Here is a box plot of “bed count” per “borough”:
Here is another box plot, of “ICU bed count” per “borough”:
Now we are going to combine the data from step three and step six. In other words, we are going to combine the population data with the hospital bed count data by merging the two data frames on “neighborhood” and “borough”. The new data frame looks like this:
Now we are going to calculate beds per hundred people from two columns: population and bed number. We then add this to the data frame. Similarly, we add the ICU beds per hundred people to the data frame:
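The calculation itself is a simple ratio: bed count divided by population, multiplied by 100. A minimal sketch, assuming hypothetical column names and made-up sample values:

```python
import pandas as pd

# Hypothetical merged data; the numbers are for illustration only.
df = pd.DataFrame({
    "neighborhood": ["Melrose", "Astoria"],
    "population": [25000, 80000],
    "bed_count": [350, 300],
    "icu_bed_count": [30, 30],
})

# Beds available per hundred residents.
df["beds_per_100"] = df["bed_count"] / df["population"] * 100
# ICU beds available per hundred residents.
df["icu_beds_per_100"] = df["icu_bed_count"] / df["population"] * 100

print(df)
```

Normalizing by population makes neighborhoods of very different sizes comparable, which matters for the clustering step that follows.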
Now we are going to use k-means clustering to partition the data into k groups. We will use the elbow method to find the optimal value of k. The “elbow” (the point of inflection on the curve) is a good indication that the underlying model fits best at that point. In the elbow visualizer, the value of k is 3.
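The elbow method can be sketched with scikit-learn by fitting k-means for several values of k and recording the inertia (the within-cluster sum of squared distances). This is a sketch on synthetic data, not the article's features: the three blobs below are generated only so that the elbow is easy to see, and the range of k is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic data: three well-separated blobs of 30 points each.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

# Fit k-means for k = 1..6 and record the inertia for each k.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

Plotting inertia against k gives the elbow curve: the inertia drops steeply until k reaches the natural number of groups and flattens afterwards. On this synthetic data the bend is at k = 3 by construction.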
After that, we are going to merge the cluster labels into the data frame. The data frame looks like this:
Let us see which boroughs belong to which clusters.
Here is the dataset for cluster 0:
Here is the dataset for cluster 1:
Here is the dataset for cluster 2:
So far, we have analyzed the dataset for neighborhoods with hospitals. Now, we should look at the neighborhoods without hospital data:
If we look at the indexes of the neighborhoods with and without hospitals, they look like this:
We can see that there are 100 neighborhoods that do not have any hospital.
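One way to find such neighborhoods is a left join with a merge indicator, keeping only the rows that appear in the population data alone. The sketch below uses made-up sample rows and hypothetical column names; the original notebook may have done this differently.

```python
import pandas as pd

# Hypothetical population data for all neighborhoods.
population = pd.DataFrame({
    "borough": ["Bronx", "Queens", "Brooklyn", "Queens"],
    "neighborhood": ["Melrose", "Astoria", "Bushwick", "Hollis"],
    "population": [25000, 80000, 110000, 20000],
})

# Hypothetical bed totals, present only for neighborhoods with hospitals.
hospital_beds = pd.DataFrame({
    "borough": ["Bronx", "Queens"],
    "neighborhood": ["Melrose", "Astoria"],
    "bed_count": [350, 300],
})

# indicator=True adds a "_merge" column telling where each row came from.
flagged = population.merge(
    hospital_beds, on=["neighborhood", "borough"], how="left", indicator=True
)

# "left_only" rows have population data but no hospital data.
without_hospital = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
print(without_hospital[["borough", "neighborhood"]])
```

Counting the rows of `without_hospital` gives the number of neighborhoods with no hospital.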
In the next article, we are going to visualize the data collected in the previous steps and discuss our results.
Last updated: Nov 20, 2023