Getting Data for Medical Fraud Detection¶
The need, in this case, is to get potentially relevant variables to Excel with intelligible column names for follow on analyses for medical fraud detection using Excel.
This example will run with Python API version 1.9.1 if you are using a Web GIS, but this will use credits (a LOT) if you are using ArcGIS Online. Once the enrich
method is added, this will run locally as well.
[1]:
from pathlib import Path
import os
from arcgis.features import GeoAccessor
from arcgis.geoenrichment import Country, enrich
from arcgis.gis import GIS
from dotenv import load_dotenv, find_dotenv
import pandas as pd
[2]:
# paths to common data locations
dir_prj = Path.cwd().parent
dir_data = dir_prj/'data'
dir_int = dir_data/'interim'
# load environment variables from .env
load_dotenv(find_dotenv())
# create a GIS object instance; if you did not enter any information here, it defaults to anonymous access to ArcGIS Online
gis = GIS(
url=os.getenv('ESRI_GIS_URL'),
username=os.getenv('ESRI_GIS_USERNAME'),
password=None if len(os.getenv('ESRI_GIS_PASSWORD')) is 0 else os.getenv('ESRI_GIS_PASSWORD')
)
gis
[2]:
Create a Country Object Instance¶
The starting point is creation of a Country
object instance to work with.
[3]:
cntry = Country('usa', gis=gis)
cntry
[3]:
<Country - United States (GIS @ https://bateam.maps.arcgis.com version:9.4)>
Retrieve Standard Geographies for Analysis¶
Standard geographies can be retrieved from the country object. Especially for the CBSA’s, the exact string can be difficult to figure out. Thankfully, the `standard_geography_query
method <https://developers.arcgis.com/python/api-reference/arcgis.geoenrichment.html#standard-geography-query>`__ can be used to search for the exact string to use for retrieving subgeographies.
[4]:
zip_dict = cntry.subgeographies.cbsa['Olympia-Lacey-Tumwater,_WA_Metropolitan_Statistical_Area'].zip5
zip_dict
[4]:
{'98501': <NamedArea name:"Olympia" area_id="98501", level="US.ZIP5", country="147">,
'98502': <NamedArea name:"Olympia" area_id="98502", level="US.ZIP5", country="147">,
'98503': <NamedArea name:"Lacey" area_id="98503", level="US.ZIP5", country="147">,
'98506': <NamedArea name:"Olympia" area_id="98506", level="US.ZIP5", country="147">,
'98512': <NamedArea name:"Olympia" area_id="98512", level="US.ZIP5", country="147">,
'98513': <NamedArea name:"Olympia" area_id="98513", level="US.ZIP5", country="147">,
'98516': <NamedArea name:"Olympia" area_id="98516", level="US.ZIP5", country="147">,
'98530': <NamedArea name:"Bucoda" area_id="98530", level="US.ZIP5", country="147">,
'98531': <NamedArea name:"Centralia" area_id="98531", level="US.ZIP5", country="147">,
'98576': <NamedArea name:"Rainier" area_id="98576", level="US.ZIP5", country="147">,
'98579': <NamedArea name:"Rochester" area_id="98579", level="US.ZIP5", country="147">,
'98589': <NamedArea name:"Tenino" area_id="98589", level="US.ZIP5", country="147">,
'98597': <NamedArea name:"Yelm" area_id="98597", level="US.ZIP5", country="147">}
Enrich¶
Get Variables for Enrichment¶
Using the filtering capabilities for Pandas data frames, we can quickly create a list of variables to work with. It is useful to note, even though I do not take advantage of it below, the Pandas ``contains` <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html>`__ method supports Python regular expression string syntax facilitating very powerful filtering.
[5]:
ev = cntry.enrich_variables
sv = ev[
(ev.data_collection == 'Health') # "Health" data collection
& (ev.vintage.str.endswith('2021')) # 2021 variables
& (~ev.alias.str.contains('Avg:')) # Exclude averages
& (~ev.alias.str.contains('Index:')) # Exclude index variables
].drop_duplicates('name').reset_index(drop=True)
sv
[5]:
name | alias | data_collection | enrich_name | enrich_field_name | description | vintage | units | |
---|---|---|---|---|---|---|---|---|
0 | X8001_X | 2021 Health Care | Health | Health.X8001_X | Health_X8001_X | 2021 Health Care | 2021 | currency |
1 | X8002_X | 2021 Health Insurance | Health | Health.X8002_X | Health_X8002_X | 2021 Health Insurance | 2021 | currency |
2 | X8041_X | 2021 Fee for Svc Health Plan Excl Blue Cross/B... | Health | Health.X8041_X | Health_X8041_X | 2021 Fee for Service Health Plan Excluding Blu... | 2021 | currency |
3 | X8012_X | 2021 HMO Excl Blue Cross/Blue Shield | Health | Health.X8012_X | Health_X8012_X | 2021 HMO Excluding Blue Cross/Blue Shield | 2021 | currency |
4 | X8015_X | 2021 Medicare Supplements Excl Blue Cross/Blue... | Health | Health.X8015_X | Health_X8015_X | 2021 Medicare Supplements Excluding Blue Cross... | 2021 | currency |
... | ... | ... | ... | ... | ... | ... | ... | ... |
63 | MEDAGE_CY | 2021 Median Age | Health | Health.MEDAGE_CY | Health_MEDAGE_CY | 2021 Median Age (Esri) | 2021 | count |
64 | MEDHINC_CY | 2021 Median Household Income | Health | Health.MEDHINC_CY | Health_MEDHINC_CY | 2021 Median Household Income (Esri) | 2021 | currency |
65 | S27_BUS | 2021 Health Services - Businesses (SIC) | Health | Health.S27_BUS | Health_S27_BUS | 2021 Health Services (SIC80) Businesses | 2021 | count |
66 | S27_SALES | 2021 Health Services - Sales ($000) (SIC) | Health | Health.S27_SALES | Health_S27_SALES | 2021 Health Services (SIC80) Sales ($000) | 2021 | currency |
67 | S27_EMP | 2021 Health Services - Employees (SIC) | Health | Health.S27_EMP | Health_S27_EMP | 2021 Health Services (SIC80) Employees | 2021 | count |
68 rows × 8 columns
Perform Enrichment¶
Enrichment is as straightforward as running the `enrich
method <https://developers.arcgis.com/python/api-reference/arcgis.geoenrichment.html#enrich>`__. Please notice the input for the analysis_variables
parameter, a list of variable names. In the next release, you will be able to just input the filtered enrichment variables dataframe to make this easier, but for now, we still need to prepare the input for this parameter a bit.
[6]:
enrich_df = enrich(zip_dict, analysis_variables=list(sv.name), return_geometry=False, gis=gis)
enrich_df
[6]:
ID | OBJECTID | StdGeographyLevel | StdGeographyName | StdGeographyID | sourceCountry | aggregationMethod | populationToPolygonSizeRating | apportionmentConfidence | HasData | ... | POP75_CY | POP80_CY | POP85_CY | POP18UP_CY | POP21UP_CY | MEDAGE_CY | MEDHINC_CY | S27_BUS | S27_SALES | S27_EMP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | US.ZIP5 | Olympia | 98501 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | ... | 1516 | 945 | 1028 | 36826 | 35334 | 41.9 | 80354 | 112 | 140533 | 933 |
1 | 1 | 2 | US.ZIP5 | Olympia | 98502 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | ... | 1141 | 674 | 794 | 29992 | 28297 | 39.8 | 76355 | 182 | 266062 | 1790 |
2 | 2 | 3 | US.ZIP5 | Lacey | 98503 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | ... | 1326 | 915 | 1318 | 33065 | 31301 | 37.5 | 70669 | 96 | 242272 | 1643 |
3 | 3 | 4 | US.ZIP5 | Olympia | 98506 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | ... | 714 | 440 | 718 | 16272 | 15755 | 45.3 | 72398 | 171 | 297161 | 4198 |
4 | 4 | 5 | US.ZIP5 | Olympia | 98512 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | ... | 1039 | 610 | 621 | 26305 | 25273 | 43.2 | 81002 | 41 | 39399 | 370 |
5 | 5 | 6 | US.ZIP5 | Olympia | 98513 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | ... | 938 | 489 | 452 | 27363 | 26262 | 38.1 | 82593 | 4 | 1564 | 21 |
6 | 6 | 7 | US.ZIP5 | Olympia | 98516 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | ... | 1160 | 663 | 497 | 21949 | 21105 | 40.7 | 89478 | 53 | 68868 | 451 |
7 | 7 | 8 | US.ZIP5 | Bucoda | 98530 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | ... | 11 | 7 | 5 | 299 | 293 | 44.2 | 50000 | 0 | 0 | 0 |
8 | 8 | 9 | US.ZIP5 | Centralia | 98531 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | ... | 901 | 670 | 861 | 19766 | 18864 | 38.3 | 52691 | 88 | 112573 | 1719 |
9 | 9 | 10 | US.ZIP5 | Rainier | 98576 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | ... | 151 | 77 | 66 | 4491 | 4319 | 44.4 | 78361 | 1 | 194 | 1 |
10 | 10 | 11 | US.ZIP5 | Rochester | 98579 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | ... | 466 | 260 | 172 | 11361 | 10874 | 42.1 | 76899 | 6 | 9334 | 55 |
11 | 11 | 12 | US.ZIP5 | Tenino | 98589 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | ... | 303 | 168 | 131 | 7238 | 6981 | 45.4 | 70765 | 6 | 3377 | 27 |
12 | 12 | 13 | US.ZIP5 | Yelm | 98597 | US | Query:US.ZIP5 | 2.191 | 2.576 | 1 | ... | 647 | 371 | 303 | 20302 | 19411 | 37.5 | 74969 | 32 | 30649 | 200 |
13 rows × 78 columns
Add Aliases¶
Create Pandas Series for Alias Lookup¶
We can create a Pandas series enabling easy column alias lookup by removing duplicate names, set the index to the column name, and just keeping the alias column.
[7]:
var_lookup = ev.drop_duplicates('name').set_index('name')['alias']
var_lookup
[7]:
name
AGE0_CY 2021 Population Age <1
AGE1_CY 2021 Population Age 1
AGE2_CY 2021 Population Age 2
AGE3_CY 2021 Population Age 3
AGE4_CY 2021 Population Age 4
...
RELRMV1989 2019 RHHs/Moved In: 1989/Before REL (ACS 5-Yr)
ACSMEDYRMV 2019 Median Year Householder Moved In (ACS 5-Yr)
MOEMEDYRMV 2019 Median Year Householder Moved In MOE (ACS...
RELMEDYRMV 2019 Median Year Householder Moved In REL (ACS...
RELOWNER 2019 Owner Households REL (ACS 5-Yr)
Name: alias, Length: 16275, dtype: object
Use Alias List to Look Up Relevant Column Aliases¶
Using a ternerary operator in a list comprehension with the Pandas Series created in the last step enables us to look up aliases if there is a match and keep the existing column name if there is not a match. This enables us to create a list of column names for the output data.
[8]:
alias_lst = [var_lookup.loc[c] if c in var_lookup.index else c for c in enrich_df.columns]
alias_lst
[8]:
['ID',
'OBJECTID',
'StdGeographyLevel',
'StdGeographyName',
'StdGeographyID',
'sourceCountry',
'aggregationMethod',
'populationToPolygonSizeRating',
'apportionmentConfidence',
'HasData',
'2021 Health Care',
'2021 Health Insurance',
'2021 Fee for Svc Health Plan Excl Blue Cross/Blue Shield',
'2021 HMO Excl Blue Cross/Blue Shield',
'2021 Medicare Supplements Excl Blue Cross/Blue Shield',
'2021 Blue Cross/Blue Shield',
'2021 Blue Cross/Blue Shield Fee-Svc Health Plan',
'2021 Blue Cross/Blue Shield HMO',
'2021 Blue Cross/Blue Shield Commercial Medicare Supplmt',
'2021 Blue Cross/Blue Shield Dental Care Insur',
'2021 Blue Cross/Blue Shield Vision Care Insur',
'2021 Blue Cross/Blue Shield Prescription Drug Insur',
'2021 Blue Cross/Blue Shield Long Term Care Insur',
'2021 Blue Cross/Blue Shield Oth Single Svc Insur',
'2021 Medicare Payments',
'2021 Medicaid Premiums',
'2021 Tricare/Military Premiums',
'2021 Children`s Health Insur Program (CHIP) Premiums',
'2021 Medical Care',
'2021 Medical Services',
'2021 Physician Services',
'2021 Dental Services',
'2021 Eyecare Services',
'2021 Non-Physician Svcs Inside Home',
'2021 Non-Physician Svcs Outside Home',
'2021 Lab Tests/X-Rays',
'2021 Hospital Room/Hospital Svc',
'2021 Convalescent/Nursing Home Care',
'2021 Oth Medical Svcs',
'2021 Drugs/Vitamins',
'2021 NonPrescription Drugs',
'2021 Prescription Drugs',
'2021 NonPrescription Vitamins',
'2021 Medicare Prescription Drug Premium',
'2021 Medical Supplies',
'2021 Eyeglasses/Contact Lenses',
'2021 Hearing Aids',
'2021 Topicals/Dressings',
'2021 Medical Equip-General Use',
'2021 Supportive/Convalescent Medical Equip',
'2021 Rental of Medical Equip-General Use',
'2021 Rental of Supportive/Convalescent Medical Equip',
'2021 Dom Tapestry Segment Name',
'2021 Population Age 0-4',
'2021 Population Age 5-9',
'2021 Population Age 10-14',
'2021 Population Age 15-19',
'2021 Population Age 20-24',
'2021 Population Age 25-29',
'2021 Population Age 30-34',
'2021 Population Age 35-39',
'2021 Population Age 40-44',
'2021 Population Age 45-49',
'2021 Population Age 50-54',
'2021 Population Age 55-59',
'2021 Population Age 60-64',
'2021 Population Age 65-69',
'2021 Population Age 70-74',
'2021 Population Age 75-79',
'2021 Population Age 80-84',
'2021 Population Age 85+',
'2021 Population Age 18+',
'2021 Population Age 21+',
'2021 Median Age',
'2021 Median Household Income',
'2021 Health Services - Businesses (SIC)',
'2021 Health Services - Sales ($000) (SIC)',
'2021 Health Services - Employees (SIC)']
Prune Columns¶
If the intention for the output data is for subsequent analysis, it is easier to just have the unique identifier, in this case the zip code, and the enriched columns in the final output. We can create a list of these columns using a list comprehension to filter the column names.
[9]:
id_col = 'StdGeographyID'
keep_col_lst = [id_col] + [c for c in enrich_df.columns if c in var_lookup.index]
sel_df = enrich_df.loc[:,keep_col_lst]
sel_df.head()
[9]:
StdGeographyID | X8001_X | X8002_X | X8041_X | X8012_X | X8015_X | X8006_X | X8042_X | X8009_X | X8010_X | ... | POP75_CY | POP80_CY | POP85_CY | POP18UP_CY | POP21UP_CY | MEDAGE_CY | MEDHINC_CY | S27_BUS | S27_SALES | S27_EMP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 98501 | 125564861 | 82946494 | 17480482 | 14884608 | 7170661 | 21398640 | 14855839 | 5515270 | 480180 | ... | 1516 | 945 | 1028 | 36826 | 35334 | 41.9 | 80354 | 112 | 140533 | 933 |
1 | 98502 | 95864256 | 63381327 | 12919620 | 11663331 | 5582000 | 16126731 | 11195309 | 4135517 | 370443 | ... | 1141 | 674 | 794 | 29992 | 28297 | 39.8 | 76355 | 182 | 266062 | 1790 |
2 | 98503 | 93164805 | 61554303 | 12895647 | 10808456 | 5387191 | 16054938 | 11152453 | 4129485 | 358297 | ... | 1326 | 915 | 1318 | 33065 | 31301 | 37.5 | 70669 | 96 | 242272 | 1643 |
3 | 98506 | 53087085 | 35003255 | 7063814 | 6065116 | 3285978 | 8817808 | 6116308 | 2249011 | 222312 | ... | 714 | 440 | 718 | 16272 | 15755 | 45.3 | 72398 | 171 | 297161 | 4198 |
4 | 98512 | 84464691 | 55700618 | 11352475 | 9668400 | 5058839 | 14429007 | 10061386 | 3651490 | 352823 | ... | 1039 | 610 | 621 | 26305 | 25273 | 43.2 | 81002 | 41 | 39399 | 370 |
5 rows × 69 columns
Apply Aliases¶
Using the same method as above, we can create a list of aliases. These aliases can then be applied to the output data frame. Also, to faciliate quick retrieval by ID, we can set the index to this ID.
[10]:
alias_lst = [var_lookup.loc[c] if c in var_lookup.index else c for c in keep_col_lst]
sel_df.columns = alias_lst
sel_df.set_index('StdGeographyID', inplace=True)
sel_df.head()
[10]:
2021 Health Care | 2021 Health Insurance | 2021 Fee for Svc Health Plan Excl Blue Cross/Blue Shield | 2021 HMO Excl Blue Cross/Blue Shield | 2021 Medicare Supplements Excl Blue Cross/Blue Shield | 2021 Blue Cross/Blue Shield | 2021 Blue Cross/Blue Shield Fee-Svc Health Plan | 2021 Blue Cross/Blue Shield HMO | 2021 Blue Cross/Blue Shield Commercial Medicare Supplmt | 2021 Blue Cross/Blue Shield Dental Care Insur | ... | 2021 Population Age 75-79 | 2021 Population Age 80-84 | 2021 Population Age 85+ | 2021 Population Age 18+ | 2021 Population Age 21+ | 2021 Median Age | 2021 Median Household Income | 2021 Health Services - Businesses (SIC) | 2021 Health Services - Sales ($000) (SIC) | 2021 Health Services - Employees (SIC) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
StdGeographyID | |||||||||||||||||||||
98501 | 125564861 | 82946494 | 17480482 | 14884608 | 7170661 | 21398640 | 14855839 | 5515270 | 480180 | 327850 | ... | 1516 | 945 | 1028 | 36826 | 35334 | 41.9 | 80354 | 112 | 140533 | 933 |
98502 | 95864256 | 63381327 | 12919620 | 11663331 | 5582000 | 16126731 | 11195309 | 4135517 | 370443 | 248872 | ... | 1141 | 674 | 794 | 29992 | 28297 | 39.8 | 76355 | 182 | 266062 | 1790 |
98503 | 93164805 | 61554303 | 12895647 | 10808456 | 5387191 | 16054938 | 11152453 | 4129485 | 358297 | 258021 | ... | 1326 | 915 | 1318 | 33065 | 31301 | 37.5 | 70669 | 96 | 242272 | 1643 |
98506 | 53087085 | 35003255 | 7063814 | 6065116 | 3285978 | 8817808 | 6116308 | 2249011 | 222312 | 138775 | ... | 714 | 440 | 718 | 16272 | 15755 | 45.3 | 72398 | 171 | 297161 | 4198 |
98512 | 84464691 | 55700618 | 11352475 | 9668400 | 5058839 | 14429007 | 10061386 | 3651490 | 352823 | 221560 | ... | 1039 | 610 | 621 | 26305 | 25273 | 43.2 | 81002 | 41 | 39399 | 370 |
5 rows × 68 columns
Final Product - Export to Excel¶
For follow on analysis using Excel, Pandas data frames can easily be saved to Excel.
[11]:
sel_df.to_excel(dir_int/'esri_enriched.xlsx')