Getting Data for Medical Fraud Detection

The need, in this case, is to get potentially relevant variables into Excel with intelligible column names for follow-on medical fraud detection analysis.

This example runs with ArcGIS API for Python version 1.9.1 when using a Web GIS, but be aware it consumes credits (a LOT of them) if that Web GIS is ArcGIS Online. Once the enrich method gains local support, this will run locally as well.

[1]:
from pathlib import Path
import os

from arcgis.features import GeoAccessor
from arcgis.geoenrichment import Country, enrich
from arcgis.gis import GIS
from dotenv import load_dotenv, find_dotenv
import pandas as pd
[2]:
# paths to common data locations
dir_prj = Path.cwd().parent
dir_data = dir_prj/'data'
dir_int = dir_data/'interim'

# load environment variables from .env
load_dotenv(find_dotenv())

# create a GIS object instance; if no credentials are provided, this defaults to anonymous access to ArcGIS Online
gis = GIS(
    url=os.getenv('ESRI_GIS_URL'),
    username=os.getenv('ESRI_GIS_USERNAME'),
    password=os.getenv('ESRI_GIS_PASSWORD') or None  # avoid passing an empty string as the password
)

gis

Create a Country Object Instance

The starting point is creation of a Country object instance to work with.

[3]:
cntry = Country('usa', gis=gis)

cntry
[3]:
<Country - United States (GIS @ https://bateam.maps.arcgis.com version:9.4)>

Retrieve Standard Geographies for Analysis

Standard geographies can be retrieved from the country object. Especially for CBSAs, the exact string can be difficult to figure out. Thankfully, the `standard_geography_query method <https://developers.arcgis.com/python/api-reference/arcgis.geoenrichment.html#standard-geography-query>`__ can be used to search for the exact string to use for retrieving subgeographies.
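
For example, here is a minimal sketch of searching for the CBSA name string. The layer id (``US.CBSA``), the query text, and the fuzzy search flag are assumptions based on the method's documented parameters, so adjust them for your own search.

from arcgis.geoenrichment import standard_geography_query

# search standard geography names matching "Olympia" at the CBSA level
cbsa_df = standard_geography_query(
    source_country='US',
    layers=['US.CBSA'],   # assumed CBSA layer id
    geoquery='Olympia',   # partial name to search for
    use_fuzzy_search=True,
    return_geometry=False
)

cbsa_df.head()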

[4]:
zip_dict = cntry.subgeographies.cbsa['Olympia-Lacey-Tumwater,_WA_Metropolitan_Statistical_Area'].zip5

zip_dict
[4]:
{'98501': <NamedArea name:"Olympia" area_id="98501", level="US.ZIP5", country="147">,
 '98502': <NamedArea name:"Olympia" area_id="98502", level="US.ZIP5", country="147">,
 '98503': <NamedArea name:"Lacey" area_id="98503", level="US.ZIP5", country="147">,
 '98506': <NamedArea name:"Olympia" area_id="98506", level="US.ZIP5", country="147">,
 '98512': <NamedArea name:"Olympia" area_id="98512", level="US.ZIP5", country="147">,
 '98513': <NamedArea name:"Olympia" area_id="98513", level="US.ZIP5", country="147">,
 '98516': <NamedArea name:"Olympia" area_id="98516", level="US.ZIP5", country="147">,
 '98530': <NamedArea name:"Bucoda" area_id="98530", level="US.ZIP5", country="147">,
 '98531': <NamedArea name:"Centralia" area_id="98531", level="US.ZIP5", country="147">,
 '98576': <NamedArea name:"Rainier" area_id="98576", level="US.ZIP5", country="147">,
 '98579': <NamedArea name:"Rochester" area_id="98579", level="US.ZIP5", country="147">,
 '98589': <NamedArea name:"Tenino" area_id="98589", level="US.ZIP5", country="147">,
 '98597': <NamedArea name:"Yelm" area_id="98597", level="US.ZIP5", country="147">}

Enrich

Get Variables for Enrichment

Using the filtering capabilities of Pandas data frames, we can quickly create a list of variables to work with. It is useful to note, even though I do not take advantage of it below, that the Pandas `contains <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html>`__ method supports Python regular expression syntax, facilitating very powerful filtering.
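
For instance, a regular expression can match several related aliases in one pass; this is a minimal illustration only and is not used in the filter below.

# illustrative only: match aliases containing either "Medicare" or "Medicaid"
medi_vars = ev[ev.alias.str.contains(r'Medi(?:care|caid)', case=False, regex=True)]

medi_vars.head()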

[5]:
ev = cntry.enrich_variables

sv = ev[
    (ev.data_collection == 'Health')      # "Health" data collection
    & (ev.vintage.str.endswith('2021'))   # 2021 variables
    & (~ev.alias.str.contains('Avg:'))    # Exclude averages
    & (~ev.alias.str.contains('Index:'))  # Exclude index variables
].drop_duplicates('name').reset_index(drop=True)

sv
[5]:
name alias data_collection enrich_name enrich_field_name description vintage units
0 X8001_X 2021 Health Care Health Health.X8001_X Health_X8001_X 2021 Health Care 2021 currency
1 X8002_X 2021 Health Insurance Health Health.X8002_X Health_X8002_X 2021 Health Insurance 2021 currency
2 X8041_X 2021 Fee for Svc Health Plan Excl Blue Cross/B... Health Health.X8041_X Health_X8041_X 2021 Fee for Service Health Plan Excluding Blu... 2021 currency
3 X8012_X 2021 HMO Excl Blue Cross/Blue Shield Health Health.X8012_X Health_X8012_X 2021 HMO Excluding Blue Cross/Blue Shield 2021 currency
4 X8015_X 2021 Medicare Supplements Excl Blue Cross/Blue... Health Health.X8015_X Health_X8015_X 2021 Medicare Supplements Excluding Blue Cross... 2021 currency
... ... ... ... ... ... ... ... ...
63 MEDAGE_CY 2021 Median Age Health Health.MEDAGE_CY Health_MEDAGE_CY 2021 Median Age (Esri) 2021 count
64 MEDHINC_CY 2021 Median Household Income Health Health.MEDHINC_CY Health_MEDHINC_CY 2021 Median Household Income (Esri) 2021 currency
65 S27_BUS 2021 Health Services - Businesses (SIC) Health Health.S27_BUS Health_S27_BUS 2021 Health Services (SIC80) Businesses 2021 count
66 S27_SALES 2021 Health Services - Sales ($000) (SIC) Health Health.S27_SALES Health_S27_SALES 2021 Health Services (SIC80) Sales ($000) 2021 currency
67 S27_EMP 2021 Health Services - Employees (SIC) Health Health.S27_EMP Health_S27_EMP 2021 Health Services (SIC80) Employees 2021 count

68 rows × 8 columns

Perform Enrichment

Enrichment is as straightforward as running the `enrich method <https://developers.arcgis.com/python/api-reference/arcgis.geoenrichment.html#enrich>`__. Notice the input for the analysis_variables parameter: a list of variable names. In the next release, you will be able to pass the filtered enrichment variables data frame directly to make this easier, but for now we still need to prepare the input for this parameter a bit.

[6]:
enrich_df = enrich(zip_dict, analysis_variables=list(sv.name), return_geometry=False, gis=gis)

enrich_df
[6]:
ID OBJECTID StdGeographyLevel StdGeographyName StdGeographyID sourceCountry aggregationMethod populationToPolygonSizeRating apportionmentConfidence HasData ... POP75_CY POP80_CY POP85_CY POP18UP_CY POP21UP_CY MEDAGE_CY MEDHINC_CY S27_BUS S27_SALES S27_EMP
0 0 1 US.ZIP5 Olympia 98501 US Query:US.ZIP5 2.191 2.576 1 ... 1516 945 1028 36826 35334 41.9 80354 112 140533 933
1 1 2 US.ZIP5 Olympia 98502 US Query:US.ZIP5 2.191 2.576 1 ... 1141 674 794 29992 28297 39.8 76355 182 266062 1790
2 2 3 US.ZIP5 Lacey 98503 US Query:US.ZIP5 2.191 2.576 1 ... 1326 915 1318 33065 31301 37.5 70669 96 242272 1643
3 3 4 US.ZIP5 Olympia 98506 US Query:US.ZIP5 2.191 2.576 1 ... 714 440 718 16272 15755 45.3 72398 171 297161 4198
4 4 5 US.ZIP5 Olympia 98512 US Query:US.ZIP5 2.191 2.576 1 ... 1039 610 621 26305 25273 43.2 81002 41 39399 370
5 5 6 US.ZIP5 Olympia 98513 US Query:US.ZIP5 2.191 2.576 1 ... 938 489 452 27363 26262 38.1 82593 4 1564 21
6 6 7 US.ZIP5 Olympia 98516 US Query:US.ZIP5 2.191 2.576 1 ... 1160 663 497 21949 21105 40.7 89478 53 68868 451
7 7 8 US.ZIP5 Bucoda 98530 US Query:US.ZIP5 2.191 2.576 1 ... 11 7 5 299 293 44.2 50000 0 0 0
8 8 9 US.ZIP5 Centralia 98531 US Query:US.ZIP5 2.191 2.576 1 ... 901 670 861 19766 18864 38.3 52691 88 112573 1719
9 9 10 US.ZIP5 Rainier 98576 US Query:US.ZIP5 2.191 2.576 1 ... 151 77 66 4491 4319 44.4 78361 1 194 1
10 10 11 US.ZIP5 Rochester 98579 US Query:US.ZIP5 2.191 2.576 1 ... 466 260 172 11361 10874 42.1 76899 6 9334 55
11 11 12 US.ZIP5 Tenino 98589 US Query:US.ZIP5 2.191 2.576 1 ... 303 168 131 7238 6981 45.4 70765 6 3377 27
12 12 13 US.ZIP5 Yelm 98597 US Query:US.ZIP5 2.191 2.576 1 ... 647 371 303 20302 19411 37.5 74969 32 30649 200

13 rows × 78 columns
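
Since the output includes a HasData flag (1 when the geography was successfully apportioned, as shown above), a quick sanity check can confirm every ZIP Code actually returned enriched values before continuing.

# confirm every geography returned data
assert (enrich_df['HasData'] == 1).all()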

Add Aliases

Create Pandas Series for Alias Lookup

We can create a Pandas Series enabling easy column alias lookup by removing duplicate names, setting the index to the variable name, and keeping just the alias column.

[7]:
var_lookup = ev.drop_duplicates('name').set_index('name')['alias']

var_lookup
[7]:
name
AGE0_CY                                  2021 Population Age <1
AGE1_CY                                   2021 Population Age 1
AGE2_CY                                   2021 Population Age 2
AGE3_CY                                   2021 Population Age 3
AGE4_CY                                   2021 Population Age 4
                                    ...
RELRMV1989       2019 RHHs/Moved In: 1989/Before REL (ACS 5-Yr)
ACSMEDYRMV     2019 Median Year Householder Moved In (ACS 5-Yr)
MOEMEDYRMV    2019 Median Year Householder Moved In MOE (ACS...
RELMEDYRMV    2019 Median Year Householder Moved In REL (ACS...
RELOWNER                   2019 Owner Households REL (ACS 5-Yr)
Name: alias, Length: 16275, dtype: object

Use Alias List to Look Up Relevant Column Aliases

Using a ternary conditional expression in a list comprehension with the Pandas Series created in the last step, we can look up the alias when there is a match and keep the existing column name when there is not. This produces a list of column names for the output data.

[8]:
alias_lst = [var_lookup.loc[c] if c in var_lookup.index else c for c in enrich_df.columns]

alias_lst
[8]:
['ID',
 'OBJECTID',
 'StdGeographyLevel',
 'StdGeographyName',
 'StdGeographyID',
 'sourceCountry',
 'aggregationMethod',
 'populationToPolygonSizeRating',
 'apportionmentConfidence',
 'HasData',
 '2021 Health Care',
 '2021 Health Insurance',
 '2021 Fee for Svc Health Plan Excl Blue Cross/Blue Shield',
 '2021 HMO Excl Blue Cross/Blue Shield',
 '2021 Medicare Supplements Excl Blue Cross/Blue Shield',
 '2021 Blue Cross/Blue Shield',
 '2021 Blue Cross/Blue Shield Fee-Svc Health Plan',
 '2021 Blue Cross/Blue Shield HMO',
 '2021 Blue Cross/Blue Shield Commercial Medicare Supplmt',
 '2021 Blue Cross/Blue Shield Dental Care Insur',
 '2021 Blue Cross/Blue Shield Vision Care Insur',
 '2021 Blue Cross/Blue Shield Prescription Drug Insur',
 '2021 Blue Cross/Blue Shield Long Term Care Insur',
 '2021 Blue Cross/Blue Shield Oth Single Svc Insur',
 '2021 Medicare Payments',
 '2021 Medicaid Premiums',
 '2021 Tricare/Military Premiums',
 '2021 Children`s Health Insur Program (CHIP) Premiums',
 '2021 Medical Care',
 '2021 Medical Services',
 '2021 Physician Services',
 '2021 Dental Services',
 '2021 Eyecare Services',
 '2021 Non-Physician Svcs Inside Home',
 '2021 Non-Physician Svcs Outside Home',
 '2021 Lab Tests/X-Rays',
 '2021 Hospital Room/Hospital Svc',
 '2021 Convalescent/Nursing Home Care',
 '2021 Oth Medical Svcs',
 '2021 Drugs/Vitamins',
 '2021 NonPrescription Drugs',
 '2021 Prescription Drugs',
 '2021 NonPrescription Vitamins',
 '2021 Medicare Prescription Drug Premium',
 '2021 Medical Supplies',
 '2021 Eyeglasses/Contact Lenses',
 '2021 Hearing Aids',
 '2021 Topicals/Dressings',
 '2021 Medical Equip-General Use',
 '2021 Supportive/Convalescent Medical Equip',
 '2021 Rental of Medical Equip-General Use',
 '2021 Rental of Supportive/Convalescent Medical Equip',
 '2021 Dom Tapestry Segment Name',
 '2021 Population Age 0-4',
 '2021 Population Age 5-9',
 '2021 Population Age 10-14',
 '2021 Population Age 15-19',
 '2021 Population Age 20-24',
 '2021 Population Age 25-29',
 '2021 Population Age 30-34',
 '2021 Population Age 35-39',
 '2021 Population Age 40-44',
 '2021 Population Age 45-49',
 '2021 Population Age 50-54',
 '2021 Population Age 55-59',
 '2021 Population Age 60-64',
 '2021 Population Age 65-69',
 '2021 Population Age 70-74',
 '2021 Population Age 75-79',
 '2021 Population Age 80-84',
 '2021 Population Age 85+',
 '2021 Population Age 18+',
 '2021 Population Age 21+',
 '2021 Median Age',
 '2021 Median Household Income',
 '2021 Health Services - Businesses (SIC)',
 '2021 Health Services - Sales ($000) (SIC)',
 '2021 Health Services - Employees (SIC)']

Prune Columns

If the output data is intended for subsequent analysis, it is easier to keep only the unique identifier, in this case the ZIP Code, and the enriched columns in the final output. We can create a list of these columns using a list comprehension to filter the column names.

[9]:
id_col = 'StdGeographyID'

keep_col_lst = [id_col] + [c for c in enrich_df.columns if c in var_lookup.index]

sel_df = enrich_df.loc[:,keep_col_lst]

sel_df.head()
[9]:
StdGeographyID X8001_X X8002_X X8041_X X8012_X X8015_X X8006_X X8042_X X8009_X X8010_X ... POP75_CY POP80_CY POP85_CY POP18UP_CY POP21UP_CY MEDAGE_CY MEDHINC_CY S27_BUS S27_SALES S27_EMP
0 98501 125564861 82946494 17480482 14884608 7170661 21398640 14855839 5515270 480180 ... 1516 945 1028 36826 35334 41.9 80354 112 140533 933
1 98502 95864256 63381327 12919620 11663331 5582000 16126731 11195309 4135517 370443 ... 1141 674 794 29992 28297 39.8 76355 182 266062 1790
2 98503 93164805 61554303 12895647 10808456 5387191 16054938 11152453 4129485 358297 ... 1326 915 1318 33065 31301 37.5 70669 96 242272 1643
3 98506 53087085 35003255 7063814 6065116 3285978 8817808 6116308 2249011 222312 ... 714 440 718 16272 15755 45.3 72398 171 297161 4198
4 98512 84464691 55700618 11352475 9668400 5058839 14429007 10061386 3651490 352823 ... 1039 610 621 26305 25273 43.2 81002 41 39399 370

5 rows × 69 columns

Apply Aliases

Using the same method as above, we can create a list of aliases and apply them to the output data frame. Also, to facilitate quick retrieval by ID, we can set the index to this ID.

[10]:
alias_lst = [var_lookup.loc[c] if c in var_lookup.index else c for c in keep_col_lst]

sel_df.columns = alias_lst

sel_df.set_index('StdGeographyID', inplace=True)

sel_df.head()
[10]:
2021 Health Care 2021 Health Insurance 2021 Fee for Svc Health Plan Excl Blue Cross/Blue Shield 2021 HMO Excl Blue Cross/Blue Shield 2021 Medicare Supplements Excl Blue Cross/Blue Shield 2021 Blue Cross/Blue Shield 2021 Blue Cross/Blue Shield Fee-Svc Health Plan 2021 Blue Cross/Blue Shield HMO 2021 Blue Cross/Blue Shield Commercial Medicare Supplmt 2021 Blue Cross/Blue Shield Dental Care Insur ... 2021 Population Age 75-79 2021 Population Age 80-84 2021 Population Age 85+ 2021 Population Age 18+ 2021 Population Age 21+ 2021 Median Age 2021 Median Household Income 2021 Health Services - Businesses (SIC) 2021 Health Services - Sales ($000) (SIC) 2021 Health Services - Employees (SIC)
StdGeographyID
98501 125564861 82946494 17480482 14884608 7170661 21398640 14855839 5515270 480180 327850 ... 1516 945 1028 36826 35334 41.9 80354 112 140533 933
98502 95864256 63381327 12919620 11663331 5582000 16126731 11195309 4135517 370443 248872 ... 1141 674 794 29992 28297 39.8 76355 182 266062 1790
98503 93164805 61554303 12895647 10808456 5387191 16054938 11152453 4129485 358297 258021 ... 1326 915 1318 33065 31301 37.5 70669 96 242272 1643
98506 53087085 35003255 7063814 6065116 3285978 8817808 6116308 2249011 222312 138775 ... 714 440 718 16272 15755 45.3 72398 171 297161 4198
98512 84464691 55700618 11352475 9668400 5058839 14429007 10061386 3651490 352823 221560 ... 1039 610 621 26305 25273 43.2 81002 41 39399 370

5 rows × 68 columns
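
With the ZIP Code as the index, a record or a single value can now be retrieved directly by ID. A small illustration, assuming the ZIP Codes are stored as strings as shown above:

# look up one value by ZIP Code and aliased column name
sel_df.loc['98501', '2021 Median Household Income']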

Final Product - Export to Excel

For follow-on analysis using Excel, Pandas data frames can easily be saved to an Excel workbook.

[11]:
sel_df.to_excel(dir_int/'esri_enriched.xlsx')
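
If it helps the follow-on analysis, the variable alias lookup can be written alongside the enriched data as a second worksheet. This is just a sketch; the sheet names are arbitrary, and an Excel engine such as openpyxl must be installed.

# optionally write the enriched data and the alias lookup into one workbook
with pd.ExcelWriter(dir_int/'esri_enriched.xlsx') as xl_writer:
    sel_df.to_excel(xl_writer, sheet_name='enriched_data')
    var_lookup.to_frame().to_excel(xl_writer, sheet_name='variable_aliases')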