Integrating into Pipelines for Clustering

Starting off, we import a few required Python resources. While there are quite a few in there, notice we are grabbing the make_pipeline function from Scikit-Learn. We are going to be building pipelines!

[1]:
import os
from pathlib import Path
import sys

from arcgis.features import GeoAccessor
from arcgis.geoenrichment import Country
from arcgis.gis import GIS
from dotenv import find_dotenv, load_dotenv
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline

# load the "autoreload" extension and reload modules automatically, so changes to code in src get picked up
%load_ext autoreload
%autoreload 2
[2]:
# load environment variables from .env
load_dotenv(find_dotenv())

# paths to input data
dir_prj = Path.cwd().parent
dir_data = dir_prj/'data'
dir_raw = dir_data/'raw'

# import the preprocessors from the examples
sys.path.append(str(dir_prj/'src'))
from ba_samples.preprocessing import EnrichStandardGeography, KeepOnlyEnrichColumns, ArrayToDataFrame

# specifically, the data being used for this example - pickled dataframes
postal_codes_pth = dir_raw/'pdx_postal_codes.pkl'
block_groups_pth = dir_raw/'pdx_block_groups.pkl'

Pipeline for ZIP Codes

We start by getting a list of standard geography codes from the demonstration postal code data - just a list of ZIP Codes.

[3]:
postal_code_df = pd.read_pickle(postal_codes_pth)
postal_code_lst = list(postal_code_df['ID'])

print(postal_code_lst)
['97038', '97101', '97304', '97347', '97362', '97375', '97396', '97002', '97003', '97004', '97005', '97006', '97007', '97008', '97009', '97013', '97015', '97016', '97017', '97018', '97022', '97024', '97027', '97030', '97032', '97034', '97035', '97036', '97042', '97045', '97048', '97051', '97053', '97054', '97056', '97060', '97062', '97064', '97068', '97070', '97071', '97078', '97080', '97086', '97089', '97106', '97109', '97111', '97113', '97114', '97115', '97116', '97117', '97119', '97123', '97124', '97125', '97127', '97128', '97132', '97133', '97140', '97144', '97148', '97201', '97202', '97203', '97204', '97205', '97206', '97209', '97210', '97211', '97212', '97213', '97214', '97215', '97216', '97217', '97218', '97219', '97220', '97221', '97222', '97223', '97224', '97225', '97227', '97229', '97230', '97231', '97232', '97233', '97236', '97239', '97266', '97267', '97378', '98601', '98604', '98606', '98607', '98629', '98642', '98649', '98660', '98661', '98662', '98663', '98664', '98665', '98674', '98675', '98682', '98683', '98684', '98685', '98686', '97023', '97010', '97011', '97014', '97019', '97028', '97049', '97055', '97067', '98605', '98610', '98639', '98648', '98651', '98671']

Get a Local Country

To enrich, we need a Country object instance. As part of the constructor, we need to tell the object what Business Analyst source to use in the gis parameter. In this case, we are telling the object to use ArcGIS Pro with Business Analyst and local data for the United States.

[4]:
usa_local = Country('usa', gis=GIS('pro'))

usa_local
[4]:
<Country - United States 2021 ('local')>

Select Variables

Next, we need some enrich variables to use. We can discover what is available using the enrich_variables property of the country object, which returns a Pandas DataFrame of the variables available for the country. From these tens of thousands of variables, we prune down to a manageable subset.

[5]:
ev = usa_local.enrich_variables
kv = ev[
    (ev.data_collection.str.lower().str.contains('key'))
    & (ev.name.str.lower().str.endswith('cy'))
].reset_index(drop=True)

kv
[5]:
name alias data_collection enrich_name enrich_field_name
0 TOTPOP_CY 2021 Total Population KeyUSFacts KeyUSFacts.TOTPOP_CY KeyUSFacts_TOTPOP_CY
1 GQPOP_CY 2021 Group Quarters Population KeyUSFacts KeyUSFacts.GQPOP_CY KeyUSFacts_GQPOP_CY
2 DIVINDX_CY 2021 Diversity Index KeyUSFacts KeyUSFacts.DIVINDX_CY KeyUSFacts_DIVINDX_CY
3 TOTHH_CY 2021 Total Households KeyUSFacts KeyUSFacts.TOTHH_CY KeyUSFacts_TOTHH_CY
4 AVGHHSZ_CY 2021 Average Household Size KeyUSFacts KeyUSFacts.AVGHHSZ_CY KeyUSFacts_AVGHHSZ_CY
5 MEDHINC_CY 2021 Median Household Income KeyUSFacts KeyUSFacts.MEDHINC_CY KeyUSFacts_MEDHINC_CY
6 AVGHINC_CY 2021 Average Household Income KeyUSFacts KeyUSFacts.AVGHINC_CY KeyUSFacts_AVGHINC_CY
7 PCI_CY 2021 Per Capita Income KeyUSFacts KeyUSFacts.PCI_CY KeyUSFacts_PCI_CY
8 TOTHU_CY 2021 Total Housing Units KeyUSFacts KeyUSFacts.TOTHU_CY KeyUSFacts_TOTHU_CY
9 OWNER_CY 2021 Owner Occupied HUs KeyUSFacts KeyUSFacts.OWNER_CY KeyUSFacts_OWNER_CY
10 RENTER_CY 2021 Renter Occupied HUs KeyUSFacts KeyUSFacts.RENTER_CY KeyUSFacts_RENTER_CY
11 VACANT_CY 2021 Vacant Housing Units KeyUSFacts KeyUSFacts.VACANT_CY KeyUSFacts_VACANT_CY
12 MEDVAL_CY 2021 Median Home Value KeyUSFacts KeyUSFacts.MEDVAL_CY KeyUSFacts_MEDVAL_CY
13 AVGVAL_CY 2021 Average Home Value KeyUSFacts KeyUSFacts.AVGVAL_CY KeyUSFacts_AVGVAL_CY
14 POPGRW10CY 2010-2021 Growth Rate: Population KeyUSFacts KeyUSFacts.POPGRW10CY KeyUSFacts_POPGRW10CY
15 HHGRW10CY 2010-2021 Growth Rate: Households KeyUSFacts KeyUSFacts.HHGRW10CY KeyUSFacts_HHGRW10CY
16 FAMGRW10CY 2010-2021 Growth Rate: Families KeyUSFacts KeyUSFacts.FAMGRW10CY KeyUSFacts_FAMGRW10CY
17 DPOP_CY 2021 Total Daytime Population KeyUSFacts KeyUSFacts.DPOP_CY KeyUSFacts_DPOP_CY
18 DPOPWRK_CY 2021 Daytime Pop: Workers KeyUSFacts KeyUSFacts.DPOPWRK_CY KeyUSFacts_DPOPWRK_CY
19 DPOPRES_CY 2021 Daytime Pop: Residents KeyUSFacts KeyUSFacts.DPOPRES_CY KeyUSFacts_DPOPRES_CY

Get Standard Geography Name

Using the levels property, we can see the value we need to use for specifying the specific ZIP Code level, zip5.

[6]:
usa_local.levels
[6]:
level_name alias level_id id_field name_field singular_name plural_name admin_level
0 block_groups Block Groups US.BlockGroups ID NAME Block Group Block Groups Admin11
1 tracts Census Tracts US.Tracts ID NAME Census Tract Census Tracts Admin10
2 places Cities and Towns (Places) US.Places ID NAME Place Places Admin9
3 zip5 ZIP Codes US.ZIP5 ID NAME ZIP Code ZIP Codes Admin4
4 csd County Subdivisions US.CSD ID NAME County Subdivision County Subdivisions Admin7
5 counties Counties US.Counties ID NAME County Counties Admin3
6 cbsa CBSAs US.CBSA ID NAME CBSA CBSAs Admin5
7 cd Congressional Districts US.CD ID NAME Congressional District Congressional Districts Admin8
8 dma DMAs US.DMA ID NAME DMA DMAs Admin6
9 states States US.States ID NAME State States Admin2
10 whole_usa Entire Country US.WholeUSA ID NAME United States of America United States of America Admin1
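
If you would rather look this up programmatically than eyeball the table, a quick filter on the levels data frame works. A small sketch using the columns shown above:

# find the row for ZIP Codes and pull the level name the preprocessor expects
zip_level = usa_local.levels[usa_local.levels['alias'] == 'ZIP Codes']
print(zip_level['level_name'].iloc[0])  # zip5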

Enrich Preprocessor

The enrich method can be wrapped into the transform method of a Scikit-Learn transformer, specifically a preprocessor, so it functions just like any other preprocessor. In this case, not only have I wrapped the enrich method, I have also built this preprocessor for a specific kind of input: standard geography identifiers.
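
The actual implementation lives in src/ba_samples/preprocessing and is not shown here, but the pattern is worth seeing. Below is a hypothetical minimal sketch of how enrich can be wrapped in the transformer interface. The Country.enrich parameter names are assumptions based on how this example is used; check the arcgis API reference for the exact signature in your version.

from sklearn.base import BaseEstimator, TransformerMixin

class EnrichStandardGeographySketch(BaseEstimator, TransformerMixin):
    """Illustrative (not the actual) version of EnrichStandardGeography."""

    def __init__(self, country, enrich_variables, standard_geography_level, return_geometry=True):
        self.country = country
        self.enrich_variables = enrich_variables
        self.standard_geography_level = standard_geography_level
        self.return_geometry = return_geometry

    def fit(self, X, y=None):
        # nothing to learn - enrichment is a data lookup, not a fitted model
        return self

    def transform(self, X, y=None):
        # X is an iterable of standard geography identifiers, e.g. ZIP Codes
        # NOTE: keyword names below are assumptions, not confirmed API
        return self.country.enrich(
            X,
            enrich_variables=self.enrich_variables,
            standard_geography_level=self.standard_geography_level,
            return_geometry=self.return_geometry,
        )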

[7]:
from ba_samples.preprocessing import EnrichStandardGeography

enrich_preprocessor = EnrichStandardGeography(usa_local, enrich_variables=kv, standard_geography_level='zip5', return_geometry=False)

enrich_df = enrich_preprocessor.fit_transform(postal_code_lst)

enrich_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133 entries, 0 to 132
Data columns (total 26 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   id_field            133 non-null    object
 1   area_desc           133 non-null    object
 2   ta_desc             133 non-null    object
 3   names               133 non-null    object
 4   has_data            133 non-null    int32
 5   aggregation_method  133 non-null    object
 6   totpop_cy           133 non-null    float64
 7   gqpop_cy            133 non-null    float64
 8   divindx_cy          133 non-null    float64
 9   tothh_cy            133 non-null    float64
 10  avghhsz_cy          133 non-null    float64
 11  medhinc_cy          133 non-null    float64
 12  avghinc_cy          133 non-null    float64
 13  pci_cy              133 non-null    float64
 14  tothu_cy            133 non-null    float64
 15  owner_cy            133 non-null    float64
 16  renter_cy           133 non-null    float64
 17  vacant_cy           133 non-null    float64
 18  medval_cy           133 non-null    float64
 19  avgval_cy           133 non-null    float64
 20  popgrw10_cy         133 non-null    float64
 21  hhgrw10_cy          133 non-null    float64
 22  famgrw10_cy         133 non-null    float64
 23  dpop_cy             133 non-null    float64
 24  dpopwrk_cy          133 non-null    float64
 25  dpopres_cy          133 non-null    float64
dtypes: float64(20), int32(1), object(5)
memory usage: 26.6+ KB
[8]:
enrich_df.head()
[8]:
id_field area_desc ta_desc names has_data aggregation_method totpop_cy gqpop_cy divindx_cy tothh_cy ... renter_cy vacant_cy medval_cy avgval_cy popgrw10_cy hhgrw10_cy famgrw10_cy dpop_cy dpopwrk_cy dpopres_cy
0 97038 97038 TA from geography Layer: 97038 Molalla 1 BlockApportionment:US.BlockGroups;PointsLayer:... 17305.0 56.0 42.4 6144.0 ... 1552.0 293.0 388300.0 442226.0 1.18 1.20 1.00 13803.0 4468.0 9335.0
1 97002 97002 TA from geography Layer: 97002 Aurora 1 BlockApportionment:US.BlockGroups;PointsLayer:... 5951.0 52.0 54.3 2285.0 ... 389.0 85.0 491246.0 565932.0 1.09 1.15 0.97 7040.0 3847.0 3193.0
2 97347 97347 TA from geography Layer: 97347 Grand Ronde 1 BlockApportionment:US.BlockGroups;PointsLayer:... 2163.0 0.0 53.5 850.0 ... 242.0 60.0 375287.0 446223.0 1.28 1.31 1.13 2296.0 907.0 1389.0
3 97013 97013 TA from geography Layer: 97013 Canby 1 BlockApportionment:US.BlockGroups;PointsLayer:... 25403.0 104.0 54.2 9139.0 ... 2458.0 340.0 392366.0 437940.0 0.96 0.98 0.85 21914.0 9136.0 12778.0
4 97017 97017 TA from geography Layer: 97017 Colton 1 BlockApportionment:US.BlockGroups;PointsLayer:... 3194.0 0.0 16.9 1133.0 ... 125.0 58.0 434483.0 480442.0 0.68 0.72 0.58 2405.0 687.0 1718.0

5 rows × 26 columns

Filtering Columns

The output from enrich includes a number of metadata columns. Before performing subsequent analysis, we need to prune these columns. We can use another preprocessor for this as well - one keeping only the columns with enriched values.
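
Again, the real KeepOnlyEnrichColumns is in src/ba_samples/preprocessing. A minimal sketch of the idea, assuming the enriched columns can be matched against the variable names in the country's enrich_variables table, might look like this:

from sklearn.base import BaseEstimator, TransformerMixin

class KeepOnlyEnrichColumnsSketch(BaseEstimator, TransformerMixin):
    """Illustrative (not the actual) version of KeepOnlyEnrichColumns."""

    def __init__(self, country, id_column=None, keep_geometry=True):
        self.country = country
        self.id_column = id_column
        self.keep_geometry = keep_geometry

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # enrich output columns are lowercased variable names, so match on those
        enrich_cols = set(self.country.enrich_variables['name'].str.lower())
        keep = [c for c in X.columns if c.lower() in enrich_cols]
        if self.keep_geometry and 'SHAPE' in X.columns:
            keep.append('SHAPE')
        out = X[keep]
        # optionally promote an identifier column to the index
        if self.id_column is not None:
            out = out.set_index(X[self.id_column])
        return out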

[9]:
from ba_samples.preprocessing import KeepOnlyEnrichColumns

enrich_zip_pipe =  make_pipeline(
    EnrichStandardGeography(usa_local, enrich_variables=kv, standard_geography_level='zip5', return_geometry=False),
    KeepOnlyEnrichColumns(usa_local, id_column='id_field', keep_geometry=False)
)

enrich_df = enrich_zip_pipe.fit_transform(postal_code_lst)

enrich_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 133 entries, 97038 to 98671
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   totpop_cy   133 non-null    float64
 1   gqpop_cy    133 non-null    float64
 2   divindx_cy  133 non-null    float64
 3   tothh_cy    133 non-null    float64
 4   avghhsz_cy  133 non-null    float64
 5   medhinc_cy  133 non-null    float64
 6   avghinc_cy  133 non-null    float64
 7   pci_cy      133 non-null    float64
 8   tothu_cy    133 non-null    float64
 9   owner_cy    133 non-null    float64
 10  renter_cy   133 non-null    float64
 11  vacant_cy   133 non-null    float64
 12  medval_cy   133 non-null    float64
 13  avgval_cy   133 non-null    float64
 14  dpop_cy     133 non-null    float64
 15  dpopwrk_cy  133 non-null    float64
 16  dpopres_cy  133 non-null    float64
dtypes: float64(17)
memory usage: 18.7+ KB
[10]:
enrich_df.head()
[10]:
keyusfacts_totpop_cy keyusfacts_gqpop_cy keyusfacts_divindx_cy keyusfacts_tothh_cy keyusfacts_avghhsz_cy keyusfacts_medhinc_cy keyusfacts_avghinc_cy keyusfacts_pci_cy keyusfacts_tothu_cy keyusfacts_owner_cy keyusfacts_renter_cy keyusfacts_vacant_cy keyusfacts_medval_cy keyusfacts_avgval_cy keyusfacts_popgrw10cy keyusfacts_hhgrw10cy keyusfacts_famgrw10cy keyusfacts_dpop_cy keyusfacts_dpopwrk_cy keyusfacts_dpopres_cy
id_field
83801 8497.0 0.0 13.6 3142.0 2.70 62413.0 78746.0 30156.0 3563.0 2660.0 482.0 421.0 382542.0 391184.0 2.07 2.09 1.92 6502.0 1962.0 4540.0
83803 656.0 0.0 10.5 341.0 1.92 43344.0 61732.0 29472.0 659.0 264.0 76.0 318.0 391333.0 403113.0 1.76 1.80 1.56 499.0 145.0 354.0
83810 1090.0 4.0 11.7 479.0 2.27 60088.0 89006.0 38340.0 579.0 406.0 72.0 100.0 437903.0 464383.0 0.94 0.96 0.76 762.0 246.0 516.0
83814 28546.0 609.0 21.2 12454.0 2.24 57487.0 80524.0 34960.0 15054.0 6855.0 5599.0 2600.0 377722.0 508589.0 1.86 1.91 1.72 38178.0 23011.0 15167.0
83815 38614.0 617.0 24.1 15247.0 2.49 56872.0 73431.0 29323.0 16393.0 9649.0 5598.0 1146.0 308198.0 374194.0 1.79 1.78 1.51 37345.0 15891.0 21454.0

Clustering Block Groups Using Demographics

Stepping up our game a little, we are now getting all the US Census Block Groups in Portland. We are interested in which demographic variables differ the most across Portland, and we are going to use a Scikit-Learn pipeline to streamline the process.

[11]:
block_groups_df = pd.read_pickle(block_groups_pth)
block_groups_lst = list(block_groups_df['ID'])

print(f'Block Group Count: {len(block_groups_lst):,}')
Block Group Count: 1,421

Source - ArcGIS Online

This time around, we are no longer using local resources. Now, we are switching to ArcGIS Online.

[12]:
gis_agol = GIS(url=os.getenv('ESRI_GIS_URL'), username=os.getenv('ESRI_GIS_USERNAME'), password=os.getenv('ESRI_GIS_PASSWORD'))
usa_agol = Country('usa', gis=gis_agol)

usa_agol
[12]:
<Country - United States (GIS @ https://baqa.mapsqa.arcgis.com version:9.4)>

Selecting Variables

This time around, we are going big. By grabbing a lot of data collections, we are selecting a wide swath of demographic variables. There was not a tremendously empirical method used to retrieve these categories; I just grabbed quite a few. Below, notice how I used some naming conventions to filter out values to exclude.

[13]:
collections = ['classofworker', 'commute', 'disability', 'disposableincome', 'educationalattainment',
               'financial', 'foodstampssnap', 'gender', 'generations', 'groupquarters', 'health', 'healthinsurancecoverage',
               'homevalue', 'householdincome', 'households', 'householdsbyageofhouseholder', 'householdsbysize',
               'householdtotals', 'householdtype', 'housingbyageofhouseholder', 'housingbysize', 'housingcosts',
               'housinghousehold', 'housingunittotals', 'industry', 'internetcomputerusage', 'keyusfacts',
               'language', 'network', 'occupation', 'wealth', 'veterans']


ev = usa_agol.enrich_variables

enrich_vars = ev[
    (ev.data_collection.isin(collections))
    & (~ev.alias.str.lower().str.contains('index'))  # exclude calculated indexes
    & (~ev.name.str.lower().str.endswith('_a'))      # exclude averages
    & (~ev.name.str.lower().str.contains('_fy'))     # exclude future year
    & (~ev.name.str.lower().str.contains('fy_'))     # exclude another future year pattern
    & (~ev.alias.str.contains('2010'))               # exclude 2010 census variables - we're not living in the past anymore
].drop_duplicates('name').reset_index(drop=True)

enrich_vars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1686 entries, 0 to 1685
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   name               1686 non-null   object
 1   alias              1686 non-null   object
 2   data_collection    1686 non-null   object
 3   enrich_name        1686 non-null   object
 4   enrich_field_name  1686 non-null   object
 5   description        1686 non-null   object
 6   vintage            1686 non-null   object
 7   units              1686 non-null   object
dtypes: object(8)
memory usage: 105.5+ KB
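
As a quick sanity check on the selection, it can help to see how the kept variables distribute across the data collections (the counts will vary with data vintage):

# how many selected variables came from each data collection
enrich_vars['data_collection'].value_counts().head(10)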

Just double-checking, we can quickly see that the correct value to use for United States Census Block Groups is block_groups.

[14]:
usa_agol.levels
[14]:
level_name singular_name plural_name alias level_id admin_level
0 block_groups Block Group Block Groups Block Groups US.BlockGroups
1 tracts Census Tract Census Tracts Census Tracts US.Tracts
2 places Place Places Cities and Towns (Places) US.Places
3 zip5 ZIP Code ZIP Codes ZIP Codes US.ZIP5 Admin4
4 csd County Subdivision County Subdivisions County Subdivisions US.CSD
5 counties County Counties Counties US.Counties Admin3
6 cbsa CBSA CBSAs CBSAs US.CBSA
7 cd Congressional District Congressional Districts Congressional Districts US.CD
8 dma DMA DMAs DMAs US.DMA
9 states State States States US.States Admin2
10 whole_usa United States of America United States of America Entire Country US.WholeUSA Admin1

Get Demographics

Just like before, we are going to enrich, but this time we are using a LOT of data - over 1,500 variables. We are also including the geometry this time, so we can use it later.

[15]:
# run the enrich steps as a standalone pipeline so we can access the enriched output (including geometry) in later steps
enrich_pipe = make_pipeline(
    EnrichStandardGeography(usa_agol, enrich_variables=enrich_vars, standard_geography_level='block_groups', return_geometry=True),
    KeepOnlyEnrichColumns(usa_agol, keep_geometry=True)
)
enrich_df = enrich_pipe.fit_transform(block_groups_lst)

enrich_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1421 entries, 0 to 1420
Columns: 1324 entries, acscivemp to SHAPE
dtypes: float64(67), geometry(1), int64(1256)
memory usage: 14.4 MB

Get Just the Enriched Columns

Following enrichment, we now get just the columns not containing the geometry, the polygon shapes of the Block Groups. Notice the geometry data type is no longer listed.

[16]:
demographic_cols = [c for c in enrich_df.columns if c != enrich_df.spatial.name]
training_df = enrich_df.loc[:,demographic_cols]

training_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1421 entries, 0 to 1420
Columns: 1323 entries, acscivemp to acstotpop
dtypes: float64(67), int64(1256)
memory usage: 14.3 MB

Data Cleanup

Since subsequent steps need scaled data and have difficulty with zero-variance features and missing values, we use tools from Scikit-Learn to clean up the data.

[17]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

scale_pipe = make_pipeline(
    MinMaxScaler(),                         # get on same scale - just using because it is simple
    VarianceThreshold(threshold=0.0),       # drop zero variance variables
    SimpleImputer()                         # using defaults, filling nulls with mean
)

scaled_arr = scale_pipe.fit_transform(training_df)

scaled_arr
[17]:
array([[0.11816762, 0.16923077, 0.66666667, ..., 0.        , 0.        ,
        0.19456258],
       [0.1832379 , 0.17342657, 0.33333333, ..., 0.        , 0.        ,
        0.23329283],
       [0.07105674, 0.18881119, 0.66666667, ..., 0.        , 0.        ,
        0.1240887 ],
       ...,
       [0.15955232, 0.12167832, 0.33333333, ..., 0.        , 0.        ,
        0.20990279],
       [0.0840708 , 0.23636364, 0.66666667, ..., 0.        , 0.        ,
        0.13456865],
       [0.26106195, 0.21818182, 0.33333333, ..., 0.        , 0.        ,
        0.30726002]])
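
Before moving on, it is worth verifying the cleanup did what we expect - for instance, that no nulls remain and that any zero-variance columns were dropped:

# sanity checks: no missing values remain, and the column count may have
# shrunk if VarianceThreshold dropped any zero-variance variables
assert not np.isnan(scaled_arr).any()
print(training_df.shape, '->', scaled_arr.shape)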

Dimensionality Reduction

Over 1,000 variables is a lot to process. To speed things up, we first use Principal Component Analysis (PCA) for dimensionality reduction, with a scree plot to ascertain how many components to keep. In this case, we are going with 200 components to retain roughly 90% of the variation in the data.

[18]:
from sklearn.decomposition import PCA

pca = PCA().fit(scaled_arr)

plt.style.use('ggplot')
plt.figure(figsize=(10,8))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.yticks([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
plt.show()
[Scree plot: cumulative explained variance by number of components]
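
If you would rather compute the cutoff than read it off the plot, the cumulative ratio gives it directly (the exact count depends on the data, so treat 200 as this example's result):

# smallest number of components retaining at least 90% of the variance
n_components_90 = int(np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.9)) + 1
print(n_components_90)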

Clustering

From the scree plot we know the number of components to use with PCA, so we now combine this with K-Means clustering in a succinct pipeline. Once the clusters are created, we will build a new spatially enabled Pandas DataFrame by combining the output cluster labels with the geometry returned from geoenrichment earlier.

[19]:
from sklearn.cluster import KMeans

cluster_pipe = make_pipeline(
    PCA(n_components=200),
    KMeans(n_clusters=5)
)
cluster_pipe.fit_transform(scaled_arr)

cluster_df = pd.DataFrame(zip(block_groups_lst, cluster_pipe.named_steps.kmeans.labels_, enrich_df['SHAPE']),
                          columns=['fips', 'cluster_id', 'SHAPE'])
cluster_df.spatial.set_geometry('SHAPE')

cluster_df.info()
cluster_df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1421 entries, 0 to 1420
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   fips        1421 non-null   object
 1   cluster_id  1421 non-null   int32
 2   SHAPE       1421 non-null   geometry
dtypes: geometry(1), int32(1), object(1)
memory usage: 27.9+ KB
[19]:
fips cluster_id SHAPE
0 410710305021 2 {"rings": [[[-123.56601700006303, 45.216389999...
1 410050237001 1 {"rings": [[[-122.61617099963205, 45.267457999...
2 410050237002 4 {"rings": [[[-122.5704800001157, 45.2375169997...
3 410050237003 1 {"rings": [[[-122.5111050007072, 45.2601139996...
4 410050237004 1 {"rings": [[[-122.50749899966338, 45.230098999...
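
A note on the choice of five clusters: it is a judgment call here. If you want something more defensible, a quick silhouette comparison across candidate cluster counts is one option (a sketch, not part of the original workflow, and potentially slow on large data):

from sklearn.metrics import silhouette_score

# compare candidate cluster counts on the PCA-reduced data
pca_arr = PCA(n_components=200).fit_transform(scaled_arr)
for k in range(3, 9):
    labels = KMeans(n_clusters=k).fit_predict(pca_arr)
    print(k, round(silhouette_score(pca_arr, labels), 3))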

Dissolve Clusters

For visualization on a map, and also for further investigation using Infographics, we need the geometries consolidated into a single geometry per cluster. The ArcGIS Online geometry engine can be used through the arcgis.geometry.union method to accomplish this.

[20]:
from arcgis.geometry import union

def dissolve_by_cluster_id(cluster_id):
    """Helper to dissolve geometries based on the cluster_id."""
    # pull all the geometries out of the cluster dataframe matching the cluster id as a list
    cluster_geom_lst = list(cluster_df[cluster_df['cluster_id'] == cluster_id]['SHAPE'])

    # use the ArcGIS Online geometry service to combine all the geometries into one
    dissolved_cluster_geom = union(cluster_geom_lst, gis=gis_agol)

    return dissolved_cluster_geom
[21]:
uniq_cluster_id_lst = cluster_df['cluster_id'].unique()
uniq_cluster_id_lst.sort()

cluster_geom_lst = []
for c_id in uniq_cluster_id_lst:
    c_geom = dissolve_by_cluster_id(c_id)
    cluster_geom_lst.append(c_geom)
    print(f'Created geometry for cluster id {c_id}')

dissolved_cluster_df = pd.DataFrame(zip(uniq_cluster_id_lst, cluster_geom_lst), columns=['cluster_id', 'SHAPE'])
dissolved_cluster_df.spatial.set_geometry('SHAPE')

dissolved_cluster_df
Created geometry for cluster id 0
Created geometry for cluster id 1
Created geometry for cluster id 2
Created geometry for cluster id 3
Created geometry for cluster id 4
[21]:
cluster_id SHAPE
0 0 {"rings": [[[-122.58948899999996, 45.181993000...
1 1 {"rings": [[[-122.61984599899995, 45.118054999...
2 2 {"rings": [[[-123.38401899999997, 45.110430000...
3 3 {"rings": [[[-123.45598799999999, 45.125803999...
4 4 {"rings": [[[-123.37559099899994, 45.094734000...

Visualizing, Sharing and Interrogating

Both outputs from the clustering are spatially enabled Pandas data frames and can easily be shared with ArcGIS Online. From there, the results of these analyses can be quickly visualized in maps and mapping applications, shared to communicate results, and further interrogated through Infographics.
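
For a quick look before publishing, the spatially enabled data frame can be drawn on a map widget right in the notebook (a minimal sketch; symbology options vary by arcgis version):

# draw the dissolved cluster polygons on a notebook map widget
cluster_map = gis_agol.map('Portland, OR')
dissolved_cluster_df.spatial.plot(map_widget=cluster_map)
cluster_map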

[22]:
bg_lyr = cluster_df.spatial.to_featurelayer('pdx_cluster_block_groups',
                                            gis=gis_agol,
                                            tags=['pdx', 'machine learning', 'clustering'],
                                            service_name='pdx_cluster_block_groups')

bg_lyr
[22]:
pdx_cluster_block_groups
Feature Layer Collection by jmccune_baqa
Last Modified: February 16, 2022
0 comments, 0 views
[23]:
clstr_lyr = dissolved_cluster_df.spatial.to_featurelayer('pdx_clusters',
                                                         gis=gis_agol,
                                                         tags=['pdx', 'machine learning', 'clustering'],
                                                         service_name='pdx_clusters')

clstr_lyr
[23]:
pdx_clusters
Feature Layer Collection by jmccune_baqa
Last Modified: February 16, 2022
0 comments, 0 views