
API Utilities

These utilities are supporting capabilities, documented here for convenience.

format_pandas_for_logging(pandas_df, title, line_tab_prefix='\t\t')

Helper function facilitating output of a Pandas `DataFrame` into a logfile. This function only formats the data frame into text for output; it should be used in conjunction with a logging method.

logging.info(format_pandas_for_logging(df, title='Summary Statistics'))

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `pandas_df` | `DataFrame` | Pandas DataFrame to be converted to a string and included in the logfile. | required |
| `title` | `str` | String title describing the data frame. | required |
| `line_tab_prefix` | `str` | Optional string comprised of tabs (`\t\t`) to prefix each line with, providing indentation. | `'\t\t'` |
Source code in src/arcpy_parquet/utils/logging_utils.py
def format_pandas_for_logging(
    pandas_df: "pd.DataFrame", title: str, line_tab_prefix="\t\t"
) -> str:
    """
    Helper function facilitating outputting a :class:`Pandas DataFrame<pandas.DataFrame>` into a logfile. This function only
        formats the data frame into text for output. It should be used in conjunction with a logging method.

    ``` python
    logging.info(format_pandas_for_logging(df, title='Summary Statistics'))
    ```

    Args:
        pandas_df: Pandas ``DataFrame`` to be converted to a string and included in the logfile.
        title: String title describing the data frame.
        line_tab_prefix: Optional string comprised of tabs (``\\t\\t``) to prefix each line with providing indentation.
    """
    import pandas as pd

    log_str = line_tab_prefix.join(pandas_df.to_string(index=False).splitlines(True))
    log_str = f"{title}:\n{line_tab_prefix}{log_str}"
    return log_str
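The core of the formatting is the `splitlines(True)` join. A minimal sketch of the same technique in isolation (`prefix_lines` is a hypothetical stand-in, not part of the package):

``` python
# Minimal sketch of the line-prefixing technique used above: splitlines
# with keepends=True preserves each newline, so joining on the prefix
# indents every subsequent line; the first line is prefixed in the f-string.
def prefix_lines(text: str, title: str, prefix: str = "\t\t") -> str:
    body = prefix.join(text.splitlines(True))
    return f"{title}:\n{prefix}{body}"

print(prefix_lines("col_a  col_b\n    1      2", title="Summary"))
```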

get_logger(level='INFO', logger_name=None, logfile_path=None, propagate=True)

Get a Python Logger configured to provide stream, file, or, if available, ArcPy output. Logging will be routed through ArcPy if ArcPy is available. If ArcPy is not available, messages will be sent to the console using a StreamHandler. Additionally, if logfile_path is provided, log messages will also be written to a logfile at the provided path using a FileHandler.

Valid level inputs include:

  • DEBUG - Detailed information, typically of interest only when diagnosing problems.
  • INFO - Confirmation that things are working as expected.
  • WARNING - An indication that something unexpected happened, or indicative of some problem in the near future (e.g. "disk space low"). The software is still working as expected.
  • ERROR - Due to a more serious problem, the software has not been able to perform some function.
  • CRITICAL - A serious error, indicating that the program itself may be unable to continue running.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `level` | `Optional[Union[str, int]]` | Logging level to use. Default is `'INFO'`. | `'INFO'` |
| `logger_name` | `Optional[str]` | Optional logger name to use. If not provided, gets and returns the default logger. | `None` |
| `logfile_path` | `Union[Path, str]` | Where to save the logfile if file output is desired. | `None` |
| `propagate` | `bool` | Whether to propagate messages up to the default logger or not. | `True` |
# only output to console and potentially Pro if ArcPy is available
configure_logging('DEBUG')
logging.debug('nauseatingly detailed debugging message')
logging.info('something actually useful to know')
logging.warning('The sky may be falling')
logging.error('The sky is falling.')
logging.critical('The sky appears to be falling because a giant meteor is colliding with the earth.')
Source code in src/arcpy_parquet/utils/logging_utils.py
def get_logger(
    level: Optional[Union[str, int]] = "INFO",
    logger_name: Optional[str] = None,
    logfile_path: Optional[Union[Path, str]] = None,
    propagate: bool = True,
) -> logging.Logger:
    """
    Get a Python [`Logger`][logging.Logger] configured to provide stream, file, or, if available, ArcPy output.
    Logging will be routed through ArcPy if ArcPy is available. If ArcPy is *not* available, messages will be
    sent to the console using a [`StreamHandler`][logging.StreamHandler]. Additionally, if `logfile_path` is
    provided, log messages will also be written to a logfile at the provided path using a
    [`FileHandler`][logging.FileHandler].

    Valid `level` inputs include:

    * `DEBUG` - Detailed information, typically of interest only when diagnosing problems.
    * `INFO` - Confirmation that things are working as expected.
    * `WARNING` -  An indication that something unexpected happened, or indicative of some problem in the
        near future (e.g. "disk space low"). The software is still working as expected.
    * `ERROR` - Due to a more serious problem, the software has not been able to perform some function.
    * `CRITICAL` - A serious error, indicating that the program itself may be unable to continue running.

    Args:
        level: Logging level to use. Default is `'INFO'`.
        logger_name: Optional logger name to use. If not provided, gets and returns default logger.
        logfile_path: Where to save the logfile if file output is desired.
        propagate: Whether to propagate messages up to the default logger or not.

    ``` python
    # only output to console and potentially Pro if ArcPy is available
    configure_logging('DEBUG')
    logging.debug('nauseatingly detailed debugging message')
    logging.info('something actually useful to know')
    logging.warning('The sky may be falling')
    logging.error('The sky is falling.')
    logging.critical('The sky appears to be falling because a giant meteor is colliding with the earth.')
    ```

    """
    # ensure valid logging level
    log_str_lst = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL", "WARN", "FATAL"]
    log_int_lst = [0, 10, 20, 30, 40, 50]

    if not isinstance(level, (str, int)):
        raise ValueError(
            "You must define a specific logging level for log_level as a string or integer."
        )
    elif isinstance(level, str) and level not in log_str_lst:
        raise ValueError(
            f'The log_level must be one of {log_str_lst}. You provided "{level}".'
        )
    elif isinstance(level, int) and level not in log_int_lst:
        raise ValueError(
            f"If providing an integer for log_level, it must be one of the following, {log_int_lst}."
        )

    # get default logger and set logging level at the same time
    logger = logging.getLogger(name=logger_name)
    logger.setLevel(level=level)

    # clear handlers if default logger so we can configure formatting for the stream handler
    if logger_name is None:
        logger.handlers.clear()

    # control bubbling if desired
    logger.propagate = propagate

    # make sure at least a stream handler is present
    if not logfile_path and not has_arcpy:
        ch = logging.StreamHandler()
        ch.setFormatter(standard_log_format)
        logger.addHandler(ch)

    # if in an environment with ArcPy, add handler to bubble logging up to ArcGIS through ArcPy
    if has_arcpy:
        ah = ArcpyHandler()
        ah.setFormatter(standard_log_format)
        logger.addHandler(ah)

    # if a path for the logfile is provided, log results to the file
    if logfile_path is not None:
        # ensure a Path object so str inputs work, and ensure the full path exists
        logfile_path = Path(logfile_path)
        if not logfile_path.parent.exists():
            logfile_path.parent.mkdir(parents=True)

        # create and add the file handler
        fh = logging.FileHandler(str(logfile_path))
        fh.setFormatter(standard_log_format)
        logger.addHandler(fh)

    return logger
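Outside an ArcPy environment, the handler wiring above reduces to stdlib `logging`. A minimal sketch of the same pattern (`make_logger` is a hypothetical stand-in, not the package API):

``` python
import logging
import tempfile
from pathlib import Path

# Sketch of the handler pattern get_logger uses when ArcPy is absent:
# always a StreamHandler, plus a FileHandler when a logfile path is given.
def make_logger(name, level="INFO", logfile_path=None):
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()
    logger.addHandler(logging.StreamHandler())
    if logfile_path is not None:
        logfile_path = Path(logfile_path)
        logfile_path.parent.mkdir(parents=True, exist_ok=True)
        logger.addHandler(logging.FileHandler(str(logfile_path)))
    return logger

log_path = Path(tempfile.mkdtemp()) / "run.log"
lgr = make_logger("demo", "DEBUG", log_path)
lgr.info("something actually useful to know")
```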

ensure_parquet_dataset(parquet_dataset)

Ensure the input is a ParquetDataset object.

Source code in src/arcpy_parquet/utils/parquet.py
def ensure_parquet_dataset(
    parquet_dataset: Union[str, Path, pq.ParquetDataset]
) -> pq.ParquetDataset:
    """Ensure the input is a ParquetDataset object."""
    # don't do anything if already a ParquetDataset
    if not isinstance(parquet_dataset, pq.ParquetDataset):
        # if a string, make into a path
        if isinstance(parquet_dataset, str):
            parquet_dataset = Path(parquet_dataset)

        # ensure the path exists
        if not parquet_dataset.exists():
            raise FileNotFoundError(
                f"Cannot resolve Parquet dataset path: {parquet_dataset}"
            )

        # create a ParquetDataset object
        parquet_dataset = pq.ParquetDataset(parquet_dataset)

    return parquet_dataset

get_geoparquet_bbox(parquet_dataset)

For a GeoParquet dataset, get the full maximum bounding box.

Source code in src/arcpy_parquet/utils/parquet.py
def get_geoparquet_bbox(
    parquet_dataset: Union[str, Path, pq.ParquetDataset]
) -> list[float]:
    """For a Geoparquet dataset, get the full maximum bounding box."""
    dataset = ensure_parquet_dataset(parquet_dataset)

    # get the explicitly added metadata for all the files
    meta_lst = [pq.read_metadata(fl).metadata for fl in dataset.files]

    # get the geography information - the metadata making the parquet dataset Geoparquet
    geo_binary_lst = [meta.get(b"geo") for meta in meta_lst]

    # convert the binary string into a list of dictionaries
    geo_lst = [json.loads(geo) for geo in geo_binary_lst]

    # get the geography definitions without the bounding boxes, and convert back to strings so can be compared in a set
    geo_set = set(
        json.dumps(
            {
                nm: {k: v for k, v in col_dict.items() if k != "bbox"}
                for nm, col_dict in geo.get("columns").items()
            }
        )
        for geo in geo_lst
    )

    # ensure only one geography is present
    if len(geo_set) > 1:
        raise ValueError(
            "More than one spatial reference detected. Cannot convert data."
        )

    # get the bounding box for all the files, the entire parquet dataset
    coords_lst = list(
        zip(*[geo.get("columns").get("geometry").get("bbox") for geo in geo_lst])
    )

    min_coords = [min(coords) for coords in coords_lst[:2]]
    max_coords = [max(coords) for coords in coords_lst[2:]]

    # create the bounding box list of coordinates
    bbox = min_coords + max_coords

    return bbox
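The per-file bbox merge at the end of the function can be seen in isolation with hypothetical coordinates:

``` python
# Each GeoParquet file carries a [xmin, ymin, xmax, ymax] bbox; the
# dataset-wide bbox is the min of the mins and the max of the maxes.
# Coordinates below are made up for illustration.
bboxes = [
    [-122.5, 45.4, -122.3, 45.6],
    [-122.7, 45.2, -122.4, 45.5],
]

coords_lst = list(zip(*bboxes))  # columns: xmins, ymins, xmaxs, ymaxs
bbox = [min(c) for c in coords_lst[:2]] + [max(c) for c in coords_lst[2:]]
print(bbox)
```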

get_parquet_max_string_lengths(parquet_dataset)

For a Parquet dataset, get the maximum string lengths for all string columns.

Parameters:

Name Type Description Default
parquet_dataset Union[str, Path, ParquetDataset]

Path to Parquet dataset.

required
Source code in src/arcpy_parquet/utils/parquet.py
def get_parquet_max_string_lengths(
    parquet_dataset: Union[str, Path, pq.ParquetDataset]
) -> dict[str, int]:
    """
    For a Parquet dataset, get the maximum string lengths for all string columns.

    Args:
        parquet_dataset: Path to Parquet dataset.
    """
    dataset = ensure_parquet_dataset(parquet_dataset)

    # identify string columns
    string_columns = [
        field.name for field in dataset.schema if pa.types.is_string(field.type)
    ]

    # initialize dictionary to store max lengths
    max_lengths = {col: 0 for col in string_columns}

    # initialize the reader
    reader = dataset.read()

    # iterate over string columns
    for col in string_columns:
        column = reader.column(col)

        # iterate chunks in the column
        for chunk in column.chunks:
            max_len = max(
                (len(str(val)) for val in chunk if val is not None), default=0
            )
            max_lengths[col] = max(max_lengths[col], max_len)

    return max_lengths

get_spatial_reference_projjson(spatial_reference)

Get the PROJJSON representation of a Spatial Reference.

!!! note

    A spatial reference can be submitted as an `arcpy.SpatialReference` object, a dictionary containing the
    well-known identifier (WKID), or the bare WKID integer. For instance, for WGS84, any of the following works:

    * `arcpy.SpatialReference(4326)`
    * `{'wkid': 4326}`
    * `4326`

Parameters:

Name Type Description Default
spatial_reference Union[int, dict, SpatialReference]

The spatial reference to get the PROJJSON for.

required
Source code in src/arcpy_parquet/utils/parquet.py
def get_spatial_reference_projjson(
    spatial_reference: Union[int, dict, "arcpy.SpatialReference"]
) -> dict:
    """
    Get the PROJJSON representation of a Spatial Reference.

    !!! note

        A spatial reference can be submitted as an `arcpy.SpatialReference` object, a dictionary containing the
        well-known identifier (WKID), or the bare WKID integer. For instance, for WGS84, any of the following works:

        * `arcpy.SpatialReference(4326)`
        * `{'wkid': 4326}`
        * `4326`

    Args:
        spatial_reference: The spatial reference to get the PROJJSON for.
    """
    # late import arcpy to avoid dependency if not needed
    import arcpy

    # message if cannot figure out spatial reference
    err_msg = (
        "Cannot determine the spatial reference from the input. Please provide either an arcpy.SpatialReference "
        "or the well known identifier (WKID) integer for the spatial reference."
    )

    # try to convert to string representation of a dict
    if isinstance(spatial_reference, str):
        try:
            # try to load the spatial reference string to a dictionary
            spatial_reference = json.loads(spatial_reference)

        except ValueError:
            raise ValueError(err_msg)

    # if a dictionary, try to get the wkid out of it before converting, so a missing key raises cleanly
    if isinstance(spatial_reference, dict):
        wkid = spatial_reference.get("wkid")

        if wkid is None:
            raise ValueError(err_msg)

        spatial_reference = int(wkid)

    # if the spatial reference is a string representing a number, convert to an integer
    if isinstance(spatial_reference, str) and spatial_reference.isnumeric():
        spatial_reference = int(spatial_reference)

    # create an ArcPy SpatialReference object from the wkid
    if not isinstance(spatial_reference, arcpy.SpatialReference):
        spatial_reference = arcpy.SpatialReference(spatial_reference)

    # convert the spatial reference to the well known text representation
    wkt2_str = spatial_reference.exportToString("WKT2")

    # silence future warning, and ensure any issues encountered bubble up
    osr.UseExceptions()

    # use OSGeo to convert the spatial reference to PROJJSON from WKT2
    srs = osr.SpatialReference()
    srs.ImportFromWkt(wkt2_str)

    prjson = json.loads(srs.ExportToPROJJSON())

    return prjson
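The input-normalization portion of the function is pure Python and can be exercised without ArcPy. A minimal sketch (`normalize_wkid` is a hypothetical helper, not part of the package):

``` python
import json

# Sketch of the normalization performed before handing off to arcpy:
# a JSON string, a dict with 'wkid', or a bare integer all reduce to a WKID.
def normalize_wkid(spatial_reference):
    if isinstance(spatial_reference, str):
        spatial_reference = json.loads(spatial_reference)
    if isinstance(spatial_reference, dict):
        wkid = spatial_reference.get("wkid")
        if wkid is None:
            raise ValueError("cannot determine the spatial reference")
        spatial_reference = int(wkid)
    return spatial_reference

print(normalize_wkid('{"wkid": 4326}'))
```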