{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Getting Data for Medical Fraud Detection\n", " \n", "The need, in this case, is to get potentially relevant variables to Excel with _intelligible_ column names for follow on analyses for medical fraud detection using Excel.\n", "\n", "This example will run with Python API version 1.9.1 _if_ you are using a Web GIS, but this will use credits (a LOT) if you are using ArcGIS Online. Once the `enrich` method is added, this will run locally as well." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from pathlib import Path\n", "import os\n", "\n", "from arcgis.features import GeoAccessor\n", "from arcgis.geoenrichment import Country, enrich\n", "from arcgis.gis import GIS\n", "from dotenv import load_dotenv, find_dotenv\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "GIS @ https://bateam.maps.arcgis.com" ], "text/plain": [ "GIS @ https://bateam.maps.arcgis.com version:9.4" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# paths to common data locations\n", "dir_prj = Path.cwd().parent\n", "dir_data = dir_prj/'data'\n", "dir_int = dir_data/'interim'\n", "\n", "# load environment variables from .env\n", "load_dotenv(find_dotenv())\n", "\n", "# create a GIS object instance; if you did not enter any information here, it defaults to anonymous access to ArcGIS Online\n", "gis = GIS(\n", " url=os.getenv('ESRI_GIS_URL'), \n", " username=os.getenv('ESRI_GIS_USERNAME'),\n", " password=None if len(os.getenv('ESRI_GIS_PASSWORD')) is 0 else os.getenv('ESRI_GIS_PASSWORD')\n", ")\n", "\n", "gis" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Create a Country Object Instance\n", "\n", "The starting point is creation of a `Country` object instance to work with." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cntry = Country('usa', gis=gis)\n", "\n", "cntry" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Retrieve Standard Geographies for Analysis\n", "\n", "Standard geographies can be retrieved from the country object. Especially for the CBSA's, the exact string can be difficult to figure out. Thankfully, the [`standard_geography_query` method](https://developers.arcgis.com/python/api-reference/arcgis.geoenrichment.html#standard-geography-query) can be used to search for the exact string to use for retrieving subgeographies." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'98501': ,\n", " '98502': ,\n", " '98503': ,\n", " '98506': ,\n", " '98512': ,\n", " '98513': ,\n", " '98516': ,\n", " '98530': ,\n", " '98531': ,\n", " '98576': ,\n", " '98579': ,\n", " '98589': ,\n", " '98597': }" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zip_dict = cntry.subgeographies.cbsa['Olympia-Lacey-Tumwater,_WA_Metropolitan_Statistical_Area'].zip5\n", "\n", "zip_dict" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Enrich" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get Variables for Enrichment\n", "\n", "Using the filtering capabilities for Pandas data frames, we can quickly create a list of variables to work with. It is useful to note, even though I do not take advantage of it below, the [Pandas `contains`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) method supports [Python regular expression](https://docs.python.org/3/howto/regex.html) string syntax facilitating very powerful filtering." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namealiasdata_collectionenrich_nameenrich_field_namedescriptionvintageunits
0X8001_X2021 Health CareHealthHealth.X8001_XHealth_X8001_X2021 Health Care2021currency
1X8002_X2021 Health InsuranceHealthHealth.X8002_XHealth_X8002_X2021 Health Insurance2021currency
2X8041_X2021 Fee for Svc Health Plan Excl Blue Cross/B...HealthHealth.X8041_XHealth_X8041_X2021 Fee for Service Health Plan Excluding Blu...2021currency
3X8012_X2021 HMO Excl Blue Cross/Blue ShieldHealthHealth.X8012_XHealth_X8012_X2021 HMO Excluding Blue Cross/Blue Shield2021currency
4X8015_X2021 Medicare Supplements Excl Blue Cross/Blue...HealthHealth.X8015_XHealth_X8015_X2021 Medicare Supplements Excluding Blue Cross...2021currency
...........................
63MEDAGE_CY2021 Median AgeHealthHealth.MEDAGE_CYHealth_MEDAGE_CY2021 Median Age (Esri)2021count
64MEDHINC_CY2021 Median Household IncomeHealthHealth.MEDHINC_CYHealth_MEDHINC_CY2021 Median Household Income (Esri)2021currency
65S27_BUS2021 Health Services - Businesses (SIC)HealthHealth.S27_BUSHealth_S27_BUS2021 Health Services (SIC80) Businesses2021count
66S27_SALES2021 Health Services - Sales ($000) (SIC)HealthHealth.S27_SALESHealth_S27_SALES2021 Health Services (SIC80) Sales ($000)2021currency
67S27_EMP2021 Health Services - Employees (SIC)HealthHealth.S27_EMPHealth_S27_EMP2021 Health Services (SIC80) Employees2021count
\n", "

68 rows × 8 columns

\n", "
" ], "text/plain": [ " name alias \\\n", "0 X8001_X 2021 Health Care \n", "1 X8002_X 2021 Health Insurance \n", "2 X8041_X 2021 Fee for Svc Health Plan Excl Blue Cross/B... \n", "3 X8012_X 2021 HMO Excl Blue Cross/Blue Shield \n", "4 X8015_X 2021 Medicare Supplements Excl Blue Cross/Blue... \n", ".. ... ... \n", "63 MEDAGE_CY 2021 Median Age \n", "64 MEDHINC_CY 2021 Median Household Income \n", "65 S27_BUS 2021 Health Services - Businesses (SIC) \n", "66 S27_SALES 2021 Health Services - Sales ($000) (SIC) \n", "67 S27_EMP 2021 Health Services - Employees (SIC) \n", "\n", " data_collection enrich_name enrich_field_name \\\n", "0 Health Health.X8001_X Health_X8001_X \n", "1 Health Health.X8002_X Health_X8002_X \n", "2 Health Health.X8041_X Health_X8041_X \n", "3 Health Health.X8012_X Health_X8012_X \n", "4 Health Health.X8015_X Health_X8015_X \n", ".. ... ... ... \n", "63 Health Health.MEDAGE_CY Health_MEDAGE_CY \n", "64 Health Health.MEDHINC_CY Health_MEDHINC_CY \n", "65 Health Health.S27_BUS Health_S27_BUS \n", "66 Health Health.S27_SALES Health_S27_SALES \n", "67 Health Health.S27_EMP Health_S27_EMP \n", "\n", " description vintage units \n", "0 2021 Health Care 2021 currency \n", "1 2021 Health Insurance 2021 currency \n", "2 2021 Fee for Service Health Plan Excluding Blu... 2021 currency \n", "3 2021 HMO Excluding Blue Cross/Blue Shield 2021 currency \n", "4 2021 Medicare Supplements Excluding Blue Cross... 2021 currency \n", ".. ... ... ... \n", "63 2021 Median Age (Esri) 2021 count \n", "64 2021 Median Household Income (Esri) 2021 currency \n", "65 2021 Health Services (SIC80) Businesses 2021 count \n", "66 2021 Health Services (SIC80) Sales ($000) 2021 currency \n", "67 2021 Health Services (SIC80) Employees 2021 count \n", "\n", "[68 rows x 8 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ev = cntry.enrich_variables\n", "\n", "sv = ev[\n", " (ev.data_collection == 'Health') # \"Health\" data collection\n", " & (ev.vintage.str.endswith('2021')) # 2021 variables\n", " & (~ev.alias.str.contains('Avg:')) # Exclude averages\n", " & (~ev.alias.str.contains('Index:')) # Exclude index variables\n", "].drop_duplicates('name').reset_index(drop=True)\n", "\n", "sv" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Perform Enrichment\n", "\n", "Enrichment is as straightforward as running the [`enrich` method](https://developers.arcgis.com/python/api-reference/arcgis.geoenrichment.html#enrich). Please notice the input for the `analysis_variables` parameter, a list of variable names. In the next release, you will be able to just input the filtered enrichment variables dataframe to make this easier, but for now, we still need to prepare the input for this parameter a bit." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDOBJECTIDStdGeographyLevelStdGeographyNameStdGeographyIDsourceCountryaggregationMethodpopulationToPolygonSizeRatingapportionmentConfidenceHasData...POP75_CYPOP80_CYPOP85_CYPOP18UP_CYPOP21UP_CYMEDAGE_CYMEDHINC_CYS27_BUSS27_SALESS27_EMP
001US.ZIP5Olympia98501USQuery:US.ZIP52.1912.5761...15169451028368263533441.980354112140533933
112US.ZIP5Olympia98502USQuery:US.ZIP52.1912.5761...1141674794299922829739.8763551822660621790
223US.ZIP5Lacey98503USQuery:US.ZIP52.1912.5761...13269151318330653130137.570669962422721643
334US.ZIP5Olympia98506USQuery:US.ZIP52.1912.5761...714440718162721575545.3723981712971614198
445US.ZIP5Olympia98512USQuery:US.ZIP52.1912.5761...1039610621263052527343.2810024139399370
556US.ZIP5Olympia98513USQuery:US.ZIP52.1912.5761...938489452273632626238.1825934156421
667US.ZIP5Olympia98516USQuery:US.ZIP52.1912.5761...1160663497219492110540.7894785368868451
778US.ZIP5Bucoda98530USQuery:US.ZIP52.1912.5761...117529929344.250000000
889US.ZIP5Centralia98531USQuery:US.ZIP52.1912.5761...901670861197661886438.352691881125731719
9910US.ZIP5Rainier98576USQuery:US.ZIP52.1912.5761...15177664491431944.47836111941
101011US.ZIP5Rochester98579USQuery:US.ZIP52.1912.5761...466260172113611087442.1768996933455
111112US.ZIP5Tenino98589USQuery:US.ZIP52.1912.5761...3031681317238698145.4707656337727
121213US.ZIP5Yelm98597USQuery:US.ZIP52.1912.5761...647371303203021941137.5749693230649200
\n", "

13 rows × 78 columns

\n", "
" ], "text/plain": [ " ID OBJECTID StdGeographyLevel StdGeographyName StdGeographyID \\\n", "0 0 1 US.ZIP5 Olympia 98501 \n", "1 1 2 US.ZIP5 Olympia 98502 \n", "2 2 3 US.ZIP5 Lacey 98503 \n", "3 3 4 US.ZIP5 Olympia 98506 \n", "4 4 5 US.ZIP5 Olympia 98512 \n", "5 5 6 US.ZIP5 Olympia 98513 \n", "6 6 7 US.ZIP5 Olympia 98516 \n", "7 7 8 US.ZIP5 Bucoda 98530 \n", "8 8 9 US.ZIP5 Centralia 98531 \n", "9 9 10 US.ZIP5 Rainier 98576 \n", "10 10 11 US.ZIP5 Rochester 98579 \n", "11 11 12 US.ZIP5 Tenino 98589 \n", "12 12 13 US.ZIP5 Yelm 98597 \n", "\n", " sourceCountry aggregationMethod populationToPolygonSizeRating \\\n", "0 US Query:US.ZIP5 2.191 \n", "1 US Query:US.ZIP5 2.191 \n", "2 US Query:US.ZIP5 2.191 \n", "3 US Query:US.ZIP5 2.191 \n", "4 US Query:US.ZIP5 2.191 \n", "5 US Query:US.ZIP5 2.191 \n", "6 US Query:US.ZIP5 2.191 \n", "7 US Query:US.ZIP5 2.191 \n", "8 US Query:US.ZIP5 2.191 \n", "9 US Query:US.ZIP5 2.191 \n", "10 US Query:US.ZIP5 2.191 \n", "11 US Query:US.ZIP5 2.191 \n", "12 US Query:US.ZIP5 2.191 \n", "\n", " apportionmentConfidence HasData ... POP75_CY POP80_CY POP85_CY \\\n", "0 2.576 1 ... 1516 945 1028 \n", "1 2.576 1 ... 1141 674 794 \n", "2 2.576 1 ... 1326 915 1318 \n", "3 2.576 1 ... 714 440 718 \n", "4 2.576 1 ... 1039 610 621 \n", "5 2.576 1 ... 938 489 452 \n", "6 2.576 1 ... 1160 663 497 \n", "7 2.576 1 ... 11 7 5 \n", "8 2.576 1 ... 901 670 861 \n", "9 2.576 1 ... 151 77 66 \n", "10 2.576 1 ... 466 260 172 \n", "11 2.576 1 ... 303 168 131 \n", "12 2.576 1 ... 647 371 303 \n", "\n", " POP18UP_CY POP21UP_CY MEDAGE_CY MEDHINC_CY S27_BUS S27_SALES S27_EMP \n", "0 36826 35334 41.9 80354 112 140533 933 \n", "1 29992 28297 39.8 76355 182 266062 1790 \n", "2 33065 31301 37.5 70669 96 242272 1643 \n", "3 16272 15755 45.3 72398 171 297161 4198 \n", "4 26305 25273 43.2 81002 41 39399 370 \n", "5 27363 26262 38.1 82593 4 1564 21 \n", "6 21949 21105 40.7 89478 53 68868 451 \n", "7 299 293 44.2 50000 0 0 0 \n", "8 19766 18864 38.3 52691 88 112573 1719 \n", "9 4491 4319 44.4 78361 1 194 1 \n", "10 11361 10874 42.1 76899 6 9334 55 \n", "11 7238 6981 45.4 70765 6 3377 27 \n", "12 20302 19411 37.5 74969 32 30649 200 \n", "\n", "[13 rows x 78 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "enrich_df = enrich(zip_dict, analysis_variables=list(sv.name), return_geometry=False, gis=gis)\n", "\n", "enrich_df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Add Aliases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Pandas Series for Alias Lookup\n", "\n", "We can create a Pandas series enabling easy column alias lookup by removing duplicate names, set the index to the column name, and just keeping the alias column." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "name\n", "AGE0_CY 2021 Population Age <1\n", "AGE1_CY 2021 Population Age 1\n", "AGE2_CY 2021 Population Age 2\n", "AGE3_CY 2021 Population Age 3\n", "AGE4_CY 2021 Population Age 4\n", " ... \n", "RELRMV1989 2019 RHHs/Moved In: 1989/Before REL (ACS 5-Yr)\n", "ACSMEDYRMV 2019 Median Year Householder Moved In (ACS 5-Yr)\n", "MOEMEDYRMV 2019 Median Year Householder Moved In MOE (ACS...\n", "RELMEDYRMV 2019 Median Year Householder Moved In REL (ACS...\n", "RELOWNER 2019 Owner Households REL (ACS 5-Yr)\n", "Name: alias, Length: 16275, dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "var_lookup = ev.drop_duplicates('name').set_index('name')['alias']\n", "\n", "var_lookup" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Use Alias List to Look Up Relevant Column Aliases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using a ternerary operator in a list comprehension with the Pandas Series created in the last step enables us to look up aliases if there is a match and keep the existing column name if there is not a match. This enables us to create a list of column names for the output data." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ID',\n", " 'OBJECTID',\n", " 'StdGeographyLevel',\n", " 'StdGeographyName',\n", " 'StdGeographyID',\n", " 'sourceCountry',\n", " 'aggregationMethod',\n", " 'populationToPolygonSizeRating',\n", " 'apportionmentConfidence',\n", " 'HasData',\n", " '2021 Health Care',\n", " '2021 Health Insurance',\n", " '2021 Fee for Svc Health Plan Excl Blue Cross/Blue Shield',\n", " '2021 HMO Excl Blue Cross/Blue Shield',\n", " '2021 Medicare Supplements Excl Blue Cross/Blue Shield',\n", " '2021 Blue Cross/Blue Shield',\n", " '2021 Blue Cross/Blue Shield Fee-Svc Health Plan',\n", " '2021 Blue Cross/Blue Shield HMO',\n", " '2021 Blue Cross/Blue Shield Commercial Medicare Supplmt',\n", " '2021 Blue Cross/Blue Shield Dental Care Insur',\n", " '2021 Blue Cross/Blue Shield Vision Care Insur',\n", " '2021 Blue Cross/Blue Shield Prescription Drug Insur',\n", " '2021 Blue Cross/Blue Shield Long Term Care Insur',\n", " '2021 Blue Cross/Blue Shield Oth Single Svc Insur',\n", " '2021 Medicare Payments',\n", " '2021 Medicaid Premiums',\n", " '2021 Tricare/Military Premiums',\n", " '2021 Children`s Health Insur Program (CHIP) Premiums',\n", " '2021 Medical Care',\n", " '2021 Medical Services',\n", " '2021 Physician Services',\n", " '2021 Dental Services',\n", " '2021 Eyecare Services',\n", " '2021 Non-Physician Svcs Inside Home',\n", " '2021 Non-Physician Svcs Outside Home',\n", " '2021 Lab Tests/X-Rays',\n", " '2021 Hospital Room/Hospital Svc',\n", " '2021 Convalescent/Nursing Home Care',\n", " '2021 Oth Medical Svcs',\n", " '2021 Drugs/Vitamins',\n", " '2021 NonPrescription Drugs',\n", " '2021 Prescription Drugs',\n", " '2021 NonPrescription Vitamins',\n", " '2021 Medicare Prescription Drug Premium',\n", " '2021 Medical Supplies',\n", " '2021 Eyeglasses/Contact Lenses',\n", " '2021 Hearing Aids',\n", " '2021 Topicals/Dressings',\n", " '2021 Medical Equip-General Use',\n", " '2021 Supportive/Convalescent Medical Equip',\n", " '2021 Rental of Medical Equip-General Use',\n", " '2021 Rental of Supportive/Convalescent Medical Equip',\n", " '2021 Dom Tapestry Segment Name',\n", " '2021 Population Age 0-4',\n", " '2021 Population Age 5-9',\n", " '2021 Population Age 10-14',\n", " '2021 Population Age 15-19',\n", " '2021 Population Age 20-24',\n", " '2021 Population Age 25-29',\n", " '2021 Population Age 30-34',\n", " '2021 Population Age 35-39',\n", " '2021 Population Age 40-44',\n", " '2021 Population Age 45-49',\n", " '2021 Population Age 50-54',\n", " '2021 Population Age 55-59',\n", " '2021 Population Age 60-64',\n", " '2021 Population Age 65-69',\n", " '2021 Population Age 70-74',\n", " '2021 Population Age 75-79',\n", " '2021 Population Age 80-84',\n", " '2021 Population Age 85+',\n", " '2021 Population Age 18+',\n", " '2021 Population Age 21+',\n", " '2021 Median Age',\n", " '2021 Median Household Income',\n", " '2021 Health Services - Businesses (SIC)',\n", " '2021 Health Services - Sales ($000) (SIC)',\n", " '2021 Health Services - Employees (SIC)']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alias_lst = [var_lookup.loc[c] if c in var_lookup.index else c for c in enrich_df.columns]\n", "\n", "alias_lst" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Prune Columns\n", "\n", "If the intention for the output data is for subsequent analysis, it is easier to just have the unique identifier, in this case the zip code, and the enriched columns in the final output. We can create a list of these columns using a list comprehension to filter the column names. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StdGeographyIDX8001_XX8002_XX8041_XX8012_XX8015_XX8006_XX8042_XX8009_XX8010_X...POP75_CYPOP80_CYPOP85_CYPOP18UP_CYPOP21UP_CYMEDAGE_CYMEDHINC_CYS27_BUSS27_SALESS27_EMP
098501125564861829464941748048214884608717066121398640148558395515270480180...15169451028368263533441.980354112140533933
19850295864256633813271291962011663331558200016126731111953094135517370443...1141674794299922829739.8763551822660621790
29850393164805615543031289564710808456538719116054938111524534129485358297...13269151318330653130137.570669962422721643
3985065308708535003255706381460651163285978881780861163082249011222312...714440718162721575545.3723981712971614198
4985128446469155700618113524759668400505883914429007100613863651490352823...1039610621263052527343.2810024139399370
\n", "

5 rows × 69 columns

\n", "
" ], "text/plain": [ " StdGeographyID X8001_X X8002_X X8041_X X8012_X X8015_X X8006_X \\\n", "0 98501 125564861 82946494 17480482 14884608 7170661 21398640 \n", "1 98502 95864256 63381327 12919620 11663331 5582000 16126731 \n", "2 98503 93164805 61554303 12895647 10808456 5387191 16054938 \n", "3 98506 53087085 35003255 7063814 6065116 3285978 8817808 \n", "4 98512 84464691 55700618 11352475 9668400 5058839 14429007 \n", "\n", " X8042_X X8009_X X8010_X ... POP75_CY POP80_CY POP85_CY POP18UP_CY \\\n", "0 14855839 5515270 480180 ... 1516 945 1028 36826 \n", "1 11195309 4135517 370443 ... 1141 674 794 29992 \n", "2 11152453 4129485 358297 ... 1326 915 1318 33065 \n", "3 6116308 2249011 222312 ... 714 440 718 16272 \n", "4 10061386 3651490 352823 ... 1039 610 621 26305 \n", "\n", " POP21UP_CY MEDAGE_CY MEDHINC_CY S27_BUS S27_SALES S27_EMP \n", "0 35334 41.9 80354 112 140533 933 \n", "1 28297 39.8 76355 182 266062 1790 \n", "2 31301 37.5 70669 96 242272 1643 \n", "3 15755 45.3 72398 171 297161 4198 \n", "4 25273 43.2 81002 41 39399 370 \n", "\n", "[5 rows x 69 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id_col = 'StdGeographyID'\n", "\n", "keep_col_lst = [id_col] + [c for c in enrich_df.columns if c in var_lookup.index]\n", "\n", "sel_df = enrich_df.loc[:,keep_col_lst]\n", "\n", "sel_df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Apply Aliases\n", "\n", "Using the same method as above, we can create a list of aliases. These aliases can then be applied to the output data frame. Also, to faciliate quick retrieval by ID, we can set the index to this ID." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021 Health Care2021 Health Insurance2021 Fee for Svc Health Plan Excl Blue Cross/Blue Shield2021 HMO Excl Blue Cross/Blue Shield2021 Medicare Supplements Excl Blue Cross/Blue Shield2021 Blue Cross/Blue Shield2021 Blue Cross/Blue Shield Fee-Svc Health Plan2021 Blue Cross/Blue Shield HMO2021 Blue Cross/Blue Shield Commercial Medicare Supplmt2021 Blue Cross/Blue Shield Dental Care Insur...2021 Population Age 75-792021 Population Age 80-842021 Population Age 85+2021 Population Age 18+2021 Population Age 21+2021 Median Age2021 Median Household Income2021 Health Services - Businesses (SIC)2021 Health Services - Sales ($000) (SIC)2021 Health Services - Employees (SIC)
StdGeographyID
98501125564861829464941748048214884608717066121398640148558395515270480180327850...15169451028368263533441.980354112140533933
9850295864256633813271291962011663331558200016126731111953094135517370443248872...1141674794299922829739.8763551822660621790
9850393164805615543031289564710808456538719116054938111524534129485358297258021...13269151318330653130137.570669962422721643
985065308708535003255706381460651163285978881780861163082249011222312138775...714440718162721575545.3723981712971614198
985128446469155700618113524759668400505883914429007100613863651490352823221560...1039610621263052527343.2810024139399370
\n", "

5 rows × 68 columns

\n", "
" ], "text/plain": [ " 2021 Health Care 2021 Health Insurance \\\n", "StdGeographyID \n", "98501 125564861 82946494 \n", "98502 95864256 63381327 \n", "98503 93164805 61554303 \n", "98506 53087085 35003255 \n", "98512 84464691 55700618 \n", "\n", " 2021 Fee for Svc Health Plan Excl Blue Cross/Blue Shield \\\n", "StdGeographyID \n", "98501 17480482 \n", "98502 12919620 \n", "98503 12895647 \n", "98506 7063814 \n", "98512 11352475 \n", "\n", " 2021 HMO Excl Blue Cross/Blue Shield \\\n", "StdGeographyID \n", "98501 14884608 \n", "98502 11663331 \n", "98503 10808456 \n", "98506 6065116 \n", "98512 9668400 \n", "\n", " 2021 Medicare Supplements Excl Blue Cross/Blue Shield \\\n", "StdGeographyID \n", "98501 7170661 \n", "98502 5582000 \n", "98503 5387191 \n", "98506 3285978 \n", "98512 5058839 \n", "\n", " 2021 Blue Cross/Blue Shield \\\n", "StdGeographyID \n", "98501 21398640 \n", "98502 16126731 \n", "98503 16054938 \n", "98506 8817808 \n", "98512 14429007 \n", "\n", " 2021 Blue Cross/Blue Shield Fee-Svc Health Plan \\\n", "StdGeographyID \n", "98501 14855839 \n", "98502 11195309 \n", "98503 11152453 \n", "98506 6116308 \n", "98512 10061386 \n", "\n", " 2021 Blue Cross/Blue Shield HMO \\\n", "StdGeographyID \n", "98501 5515270 \n", "98502 4135517 \n", "98503 4129485 \n", "98506 2249011 \n", "98512 3651490 \n", "\n", " 2021 Blue Cross/Blue Shield Commercial Medicare Supplmt \\\n", "StdGeographyID \n", "98501 480180 \n", "98502 370443 \n", "98503 358297 \n", "98506 222312 \n", "98512 352823 \n", "\n", " 2021 Blue Cross/Blue Shield Dental Care Insur ... \\\n", "StdGeographyID ... \n", "98501 327850 ... \n", "98502 248872 ... \n", "98503 258021 ... \n", "98506 138775 ... \n", "98512 221560 ... \n", "\n", " 2021 Population Age 75-79 2021 Population Age 80-84 \\\n", "StdGeographyID \n", "98501 1516 945 \n", "98502 1141 674 \n", "98503 1326 915 \n", "98506 714 440 \n", "98512 1039 610 \n", "\n", " 2021 Population Age 85+ 2021 Population Age 18+ \\\n", "StdGeographyID \n", "98501 1028 36826 \n", "98502 794 29992 \n", "98503 1318 33065 \n", "98506 718 16272 \n", "98512 621 26305 \n", "\n", " 2021 Population Age 21+ 2021 Median Age \\\n", "StdGeographyID \n", "98501 35334 41.9 \n", "98502 28297 39.8 \n", "98503 31301 37.5 \n", "98506 15755 45.3 \n", "98512 25273 43.2 \n", "\n", " 2021 Median Household Income \\\n", "StdGeographyID \n", "98501 80354 \n", "98502 76355 \n", "98503 70669 \n", "98506 72398 \n", "98512 81002 \n", "\n", " 2021 Health Services - Businesses (SIC) \\\n", "StdGeographyID \n", "98501 112 \n", "98502 182 \n", "98503 96 \n", "98506 171 \n", "98512 41 \n", "\n", " 2021 Health Services - Sales ($000) (SIC) \\\n", "StdGeographyID \n", "98501 140533 \n", "98502 266062 \n", "98503 242272 \n", "98506 297161 \n", "98512 39399 \n", "\n", " 2021 Health Services - Employees (SIC) \n", "StdGeographyID \n", "98501 933 \n", "98502 1790 \n", "98503 1643 \n", "98506 4198 \n", "98512 370 \n", "\n", "[5 rows x 68 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alias_lst = [var_lookup.loc[c] if c in var_lookup.index else c for c in keep_col_lst]\n", "\n", "sel_df.columns = alias_lst\n", "\n", "sel_df.set_index('StdGeographyID', inplace=True)\n", "\n", "sel_df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Final Product - Export to Excel\n", "\n", "For follow on analysis using Excel, Pandas data frames can easily be saved to Excel." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "sel_df.to_excel(dir_int/'esri_enriched.xlsx')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" } }, "nbformat": 4, "nbformat_minor": 4 }