Getting Data for Medical Fraud Detection¶

The need, in this case, is to get potentially relevant variables to Excel with intelligible column names for follow on analyses for medical fraud detection using Excel.

This example will run with Python API version 1.9.1 if you are using a Web GIS, but this will use credits (a LOT) if you are using ArcGIS Online. Once the enrich method is added, this will run locally as well.

[1]:

from pathlib import Path
import os

from arcgis.features import GeoAccessor
from arcgis.geoenrichment import Country, enrich
from arcgis.gis import GIS
from dotenv import load_dotenv, find_dotenv
import pandas as pd

[2]:

# paths to common data locations
dir_prj = Path.cwd().parent
dir_data = dir_prj/'data'
dir_int = dir_data/'interim'

# load environment variables from .env
load_dotenv(find_dotenv())

# create a GIS object instance; if you did not enter any information here, it defaults to anonymous access to ArcGIS Online
gis = GIS(
    url=os.getenv('ESRI_GIS_URL'),
    username=os.getenv('ESRI_GIS_USERNAME'),
    password=None if len(os.getenv('ESRI_GIS_PASSWORD')) is 0 else os.getenv('ESRI_GIS_PASSWORD')
)

gis

[2]:

GIS @ https://bateam.maps.arcgis.com

Create a Country Object Instance¶

The starting point is creation of a Country object instance to work with.

[3]:

cntry = Country('usa', gis=gis)

cntry

[3]:

<Country - United States (GIS @ https://bateam.maps.arcgis.com version:9.4)>

Retrieve Standard Geographies for Analysis¶

Standard geographies can be retrieved from the country object. Especially for the CBSA’s, the exact string can be difficult to figure out. Thankfully, the `standard_geography_query method <https://developers.arcgis.com/python/api-reference/arcgis.geoenrichment.html#standard-geography-query>`__ can be used to search for the exact string to use for retrieving subgeographies.

[4]:

zip_dict = cntry.subgeographies.cbsa['Olympia-Lacey-Tumwater,_WA_Metropolitan_Statistical_Area'].zip5

zip_dict

[4]:

{'98501': <NamedArea name:"Olympia" area_id="98501", level="US.ZIP5", country="147">,
 '98502': <NamedArea name:"Olympia" area_id="98502", level="US.ZIP5", country="147">,
 '98503': <NamedArea name:"Lacey" area_id="98503", level="US.ZIP5", country="147">,
 '98506': <NamedArea name:"Olympia" area_id="98506", level="US.ZIP5", country="147">,
 '98512': <NamedArea name:"Olympia" area_id="98512", level="US.ZIP5", country="147">,
 '98513': <NamedArea name:"Olympia" area_id="98513", level="US.ZIP5", country="147">,
 '98516': <NamedArea name:"Olympia" area_id="98516", level="US.ZIP5", country="147">,
 '98530': <NamedArea name:"Bucoda" area_id="98530", level="US.ZIP5", country="147">,
 '98531': <NamedArea name:"Centralia" area_id="98531", level="US.ZIP5", country="147">,
 '98576': <NamedArea name:"Rainier" area_id="98576", level="US.ZIP5", country="147">,
 '98579': <NamedArea name:"Rochester" area_id="98579", level="US.ZIP5", country="147">,
 '98589': <NamedArea name:"Tenino" area_id="98589", level="US.ZIP5", country="147">,
 '98597': <NamedArea name:"Yelm" area_id="98597", level="US.ZIP5", country="147">}

Enrich¶

Get Variables for Enrichment¶

Using the filtering capabilities for Pandas data frames, we can quickly create a list of variables to work with. It is useful to note, even though I do not take advantage of it below, the Pandas ``contains` <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html>`__ method supports Python regular expression string syntax facilitating very powerful filtering.

[5]:

ev = cntry.enrich_variables

sv = ev[
    (ev.data_collection == 'Health')      # "Health" data collection
    & (ev.vintage.str.endswith('2021'))   # 2021 variables
    & (~ev.alias.str.contains('Avg:'))    # Exclude averages
    & (~ev.alias.str.contains('Index:'))  # Exclude index variables
].drop_duplicates('name').reset_index(drop=True)

sv

[5]:

	name	alias	data_collection	enrich_name	enrich_field_name	description	vintage	units
0	X8001_X	2021 Health Care	Health	Health.X8001_X	Health_X8001_X	2021 Health Care	2021	currency
1	X8002_X	2021 Health Insurance	Health	Health.X8002_X	Health_X8002_X	2021 Health Insurance	2021	currency
2	X8041_X	2021 Fee for Svc Health Plan Excl Blue Cross/B...	Health	Health.X8041_X	Health_X8041_X	2021 Fee for Service Health Plan Excluding Blu...	2021	currency
3	X8012_X	2021 HMO Excl Blue Cross/Blue Shield	Health	Health.X8012_X	Health_X8012_X	2021 HMO Excluding Blue Cross/Blue Shield	2021	currency
4	X8015_X	2021 Medicare Supplements Excl Blue Cross/Blue...	Health	Health.X8015_X	Health_X8015_X	2021 Medicare Supplements Excluding Blue Cross...	2021	currency
...	...	...	...	...	...	...	...	...
63	MEDAGE_CY	2021 Median Age	Health	Health.MEDAGE_CY	Health_MEDAGE_CY	2021 Median Age (Esri)	2021	count
64	MEDHINC_CY	2021 Median Household Income	Health	Health.MEDHINC_CY	Health_MEDHINC_CY	2021 Median Household Income (Esri)	2021	currency
65	S27_BUS	2021 Health Services - Businesses (SIC)	Health	Health.S27_BUS	Health_S27_BUS	2021 Health Services (SIC80) Businesses	2021	count
66	S27_SALES	2021 Health Services - Sales ($000) (SIC)	Health	Health.S27_SALES	Health_S27_SALES	2021 Health Services (SIC80) Sales ($000)	2021	currency
67	S27_EMP	2021 Health Services - Employees (SIC)	Health	Health.S27_EMP	Health_S27_EMP	2021 Health Services (SIC80) Employees	2021	count

68 rows × 8 columns

Perform Enrichment¶

Enrichment is as straightforward as running the `enrich method <https://developers.arcgis.com/python/api-reference/arcgis.geoenrichment.html#enrich>`__. Please notice the input for the analysis_variables parameter, a list of variable names. In the next release, you will be able to just input the filtered enrichment variables dataframe to make this easier, but for now, we still need to prepare the input for this parameter a bit.

[6]:

enrich_df = enrich(zip_dict, analysis_variables=list(sv.name), return_geometry=False, gis=gis)

enrich_df

[6]:

	ID	OBJECTID	StdGeographyLevel	StdGeographyName	StdGeographyID	sourceCountry	aggregationMethod	populationToPolygonSizeRating	apportionmentConfidence	HasData	...	POP75_CY	POP80_CY	POP85_CY	POP18UP_CY	POP21UP_CY	MEDAGE_CY	MEDHINC_CY	S27_BUS	S27_SALES	S27_EMP
0	0	1	US.ZIP5	Olympia	98501	US	Query:US.ZIP5	2.191	2.576	1	...	1516	945	1028	36826	35334	41.9	80354	112	140533	933
1	1	2	US.ZIP5	Olympia	98502	US	Query:US.ZIP5	2.191	2.576	1	...	1141	674	794	29992	28297	39.8	76355	182	266062	1790
2	2	3	US.ZIP5	Lacey	98503	US	Query:US.ZIP5	2.191	2.576	1	...	1326	915	1318	33065	31301	37.5	70669	96	242272	1643
3	3	4	US.ZIP5	Olympia	98506	US	Query:US.ZIP5	2.191	2.576	1	...	714	440	718	16272	15755	45.3	72398	171	297161	4198
4	4	5	US.ZIP5	Olympia	98512	US	Query:US.ZIP5	2.191	2.576	1	...	1039	610	621	26305	25273	43.2	81002	41	39399	370
5	5	6	US.ZIP5	Olympia	98513	US	Query:US.ZIP5	2.191	2.576	1	...	938	489	452	27363	26262	38.1	82593	4	1564	21
6	6	7	US.ZIP5	Olympia	98516	US	Query:US.ZIP5	2.191	2.576	1	...	1160	663	497	21949	21105	40.7	89478	53	68868	451
7	7	8	US.ZIP5	Bucoda	98530	US	Query:US.ZIP5	2.191	2.576	1	...	11	7	5	299	293	44.2	50000	0	0	0
8	8	9	US.ZIP5	Centralia	98531	US	Query:US.ZIP5	2.191	2.576	1	...	901	670	861	19766	18864	38.3	52691	88	112573	1719
9	9	10	US.ZIP5	Rainier	98576	US	Query:US.ZIP5	2.191	2.576	1	...	151	77	66	4491	4319	44.4	78361	1	194	1
10	10	11	US.ZIP5	Rochester	98579	US	Query:US.ZIP5	2.191	2.576	1	...	466	260	172	11361	10874	42.1	76899	6	9334	55
11	11	12	US.ZIP5	Tenino	98589	US	Query:US.ZIP5	2.191	2.576	1	...	303	168	131	7238	6981	45.4	70765	6	3377	27
12	12	13	US.ZIP5	Yelm	98597	US	Query:US.ZIP5	2.191	2.576	1	...	647	371	303	20302	19411	37.5	74969	32	30649	200

13 rows × 78 columns

Add Aliases¶

Create Pandas Series for Alias Lookup¶

We can create a Pandas series enabling easy column alias lookup by removing duplicate names, set the index to the column name, and just keeping the alias column.

[7]:

var_lookup = ev.drop_duplicates('name').set_index('name')['alias']

var_lookup

[7]:

name
AGE0_CY                                  2021 Population Age <1
AGE1_CY                                   2021 Population Age 1
AGE2_CY                                   2021 Population Age 2
AGE3_CY                                   2021 Population Age 3
AGE4_CY                                   2021 Population Age 4
                                    ...
RELRMV1989       2019 RHHs/Moved In: 1989/Before REL (ACS 5-Yr)
ACSMEDYRMV     2019 Median Year Householder Moved In (ACS 5-Yr)
MOEMEDYRMV    2019 Median Year Householder Moved In MOE (ACS...
RELMEDYRMV    2019 Median Year Householder Moved In REL (ACS...
RELOWNER                   2019 Owner Households REL (ACS 5-Yr)
Name: alias, Length: 16275, dtype: object

Use Alias List to Look Up Relevant Column Aliases¶

Using a ternerary operator in a list comprehension with the Pandas Series created in the last step enables us to look up aliases if there is a match and keep the existing column name if there is not a match. This enables us to create a list of column names for the output data.

[8]:

alias_lst = [var_lookup.loc[c] if c in var_lookup.index else c for c in enrich_df.columns]

alias_lst

[8]:

['ID',
 'OBJECTID',
 'StdGeographyLevel',
 'StdGeographyName',
 'StdGeographyID',
 'sourceCountry',
 'aggregationMethod',
 'populationToPolygonSizeRating',
 'apportionmentConfidence',
 'HasData',
 '2021 Health Care',
 '2021 Health Insurance',
 '2021 Fee for Svc Health Plan Excl Blue Cross/Blue Shield',
 '2021 HMO Excl Blue Cross/Blue Shield',
 '2021 Medicare Supplements Excl Blue Cross/Blue Shield',
 '2021 Blue Cross/Blue Shield',
 '2021 Blue Cross/Blue Shield Fee-Svc Health Plan',
 '2021 Blue Cross/Blue Shield HMO',
 '2021 Blue Cross/Blue Shield Commercial Medicare Supplmt',
 '2021 Blue Cross/Blue Shield Dental Care Insur',
 '2021 Blue Cross/Blue Shield Vision Care Insur',
 '2021 Blue Cross/Blue Shield Prescription Drug Insur',
 '2021 Blue Cross/Blue Shield Long Term Care Insur',
 '2021 Blue Cross/Blue Shield Oth Single Svc Insur',
 '2021 Medicare Payments',
 '2021 Medicaid Premiums',
 '2021 Tricare/Military Premiums',
 '2021 Children`s Health Insur Program (CHIP) Premiums',
 '2021 Medical Care',
 '2021 Medical Services',
 '2021 Physician Services',
 '2021 Dental Services',
 '2021 Eyecare Services',
 '2021 Non-Physician Svcs Inside Home',
 '2021 Non-Physician Svcs Outside Home',
 '2021 Lab Tests/X-Rays',
 '2021 Hospital Room/Hospital Svc',
 '2021 Convalescent/Nursing Home Care',
 '2021 Oth Medical Svcs',
 '2021 Drugs/Vitamins',
 '2021 NonPrescription Drugs',
 '2021 Prescription Drugs',
 '2021 NonPrescription Vitamins',
 '2021 Medicare Prescription Drug Premium',
 '2021 Medical Supplies',
 '2021 Eyeglasses/Contact Lenses',
 '2021 Hearing Aids',
 '2021 Topicals/Dressings',
 '2021 Medical Equip-General Use',
 '2021 Supportive/Convalescent Medical Equip',
 '2021 Rental of Medical Equip-General Use',
 '2021 Rental of Supportive/Convalescent Medical Equip',
 '2021 Dom Tapestry Segment Name',
 '2021 Population Age 0-4',
 '2021 Population Age 5-9',
 '2021 Population Age 10-14',
 '2021 Population Age 15-19',
 '2021 Population Age 20-24',
 '2021 Population Age 25-29',
 '2021 Population Age 30-34',
 '2021 Population Age 35-39',
 '2021 Population Age 40-44',
 '2021 Population Age 45-49',
 '2021 Population Age 50-54',
 '2021 Population Age 55-59',
 '2021 Population Age 60-64',
 '2021 Population Age 65-69',
 '2021 Population Age 70-74',
 '2021 Population Age 75-79',
 '2021 Population Age 80-84',
 '2021 Population Age 85+',
 '2021 Population Age 18+',
 '2021 Population Age 21+',
 '2021 Median Age',
 '2021 Median Household Income',
 '2021 Health Services - Businesses (SIC)',
 '2021 Health Services - Sales ($000) (SIC)',
 '2021 Health Services - Employees (SIC)']

Prune Columns¶

If the intention for the output data is for subsequent analysis, it is easier to just have the unique identifier, in this case the zip code, and the enriched columns in the final output. We can create a list of these columns using a list comprehension to filter the column names.

[9]:

id_col = 'StdGeographyID'

keep_col_lst = [id_col] + [c for c in enrich_df.columns if c in var_lookup.index]

sel_df = enrich_df.loc[:,keep_col_lst]

sel_df.head()

[9]:

	StdGeographyID	X8001_X	X8002_X	X8041_X	X8012_X	X8015_X	X8006_X	X8042_X	X8009_X	X8010_X	...	POP75_CY	POP80_CY	POP85_CY	POP18UP_CY	POP21UP_CY	MEDAGE_CY	MEDHINC_CY	S27_BUS	S27_SALES	S27_EMP
0	98501	125564861	82946494	17480482	14884608	7170661	21398640	14855839	5515270	480180	...	1516	945	1028	36826	35334	41.9	80354	112	140533	933
1	98502	95864256	63381327	12919620	11663331	5582000	16126731	11195309	4135517	370443	...	1141	674	794	29992	28297	39.8	76355	182	266062	1790
2	98503	93164805	61554303	12895647	10808456	5387191	16054938	11152453	4129485	358297	...	1326	915	1318	33065	31301	37.5	70669	96	242272	1643
3	98506	53087085	35003255	7063814	6065116	3285978	8817808	6116308	2249011	222312	...	714	440	718	16272	15755	45.3	72398	171	297161	4198
4	98512	84464691	55700618	11352475	9668400	5058839	14429007	10061386	3651490	352823	...	1039	610	621	26305	25273	43.2	81002	41	39399	370

5 rows × 69 columns

Apply Aliases¶

Using the same method as above, we can create a list of aliases. These aliases can then be applied to the output data frame. Also, to faciliate quick retrieval by ID, we can set the index to this ID.

[10]:

alias_lst = [var_lookup.loc[c] if c in var_lookup.index else c for c in keep_col_lst]

sel_df.columns = alias_lst

sel_df.set_index('StdGeographyID', inplace=True)

sel_df.head()

[10]:

	2021 Health Care	2021 Health Insurance	2021 Fee for Svc Health Plan Excl Blue Cross/Blue Shield	2021 HMO Excl Blue Cross/Blue Shield	2021 Medicare Supplements Excl Blue Cross/Blue Shield	2021 Blue Cross/Blue Shield	2021 Blue Cross/Blue Shield Fee-Svc Health Plan	2021 Blue Cross/Blue Shield HMO	2021 Blue Cross/Blue Shield Commercial Medicare Supplmt	2021 Blue Cross/Blue Shield Dental Care Insur	...	2021 Population Age 75-79	2021 Population Age 80-84	2021 Population Age 85+	2021 Population Age 18+	2021 Population Age 21+	2021 Median Age	2021 Median Household Income	2021 Health Services - Businesses (SIC)	2021 Health Services - Sales ($000) (SIC)	2021 Health Services - Employees (SIC)
StdGeographyID
98501	125564861	82946494	17480482	14884608	7170661	21398640	14855839	5515270	480180	327850	...	1516	945	1028	36826	35334	41.9	80354	112	140533	933
98502	95864256	63381327	12919620	11663331	5582000	16126731	11195309	4135517	370443	248872	...	1141	674	794	29992	28297	39.8	76355	182	266062	1790
98503	93164805	61554303	12895647	10808456	5387191	16054938	11152453	4129485	358297	258021	...	1326	915	1318	33065	31301	37.5	70669	96	242272	1643
98506	53087085	35003255	7063814	6065116	3285978	8817808	6116308	2249011	222312	138775	...	714	440	718	16272	15755	45.3	72398	171	297161	4198
98512	84464691	55700618	11352475	9668400	5058839	14429007	10061386	3651490	352823	221560	...	1039	610	621	26305	25273	43.2	81002	41	39399	370

5 rows × 68 columns

Final Product - Export to Excel¶

For follow on analysis using Excel, Pandas data frames can easily be saved to Excel.

[11]:

sel_df.to_excel(dir_int/'esri_enriched.xlsx')

BA-Python-API-Examples

Navigation

Related Topics