Manipulate DataFrames

The commonly used Pandas and GeoPandas libraries are well documented, and many examples showing how to use them to perform data analysis and manipulation are publicly available. Generally, data is in a tabular representation where each cell of the table contains one value with a defined data type (numeric, string, or other basic type).

Map data, and catalog data in general, can be highly structured and follow a complex, nested schema, which can be difficult to handle in Pandas. For this reason, the here-geopandas-adapter package of the HERE Data SDK for Python includes utility functions that perform repetitive tasks and manipulate complex DataFrames, in particular DataFrames whose columns contain dictionaries instead of single values.

Unpacking series and DataFrames

Pandas provides the explode function to turn list objects contained in a column into multiple rows. Similarly, HERE Data SDK for Python provides the unpack and unpack_columns functions to turn a single column containing dict objects into multiple columns. These convenience functions unpack the data structures that sometimes result from reading data from catalogs or working with complex data models.
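The contrast between unpacking into rows and unpacking into columns can be sketched with plain Pandas. The DataFrame below and its column names are made up for illustration, and the dictionary column is expanded with the DataFrame constructor as a stand-in, not with the SDK's own functions.

```python
import pandas as pd

# Hypothetical sample data: one list column and one dict column.
df = pd.DataFrame({
    "id": [1, 2],
    "tags": [["a", "b"], ["c"]],
    "size": [{"w": 10, "h": 20}, {"w": 30, "h": 40}],
})

# explode: each list element becomes its own row; index values repeat.
exploded = df.explode("tags")
print(exploded[["id", "tags"]])

# Plain-Pandas stand-in for unpacking a dict column:
# each dictionary key becomes a column (one level only).
size_columns = pd.DataFrame(df["size"].tolist(), index=df.index)
print(size_columns)
```

Exploding changes the number of rows, while expanding dictionaries changes the number of columns and leaves the rows untouched.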

unpack is applied to a Series containing dict objects and returns a DataFrame. unpack_columns is applied to a DataFrame and replaces one or more columns that contain dict objects with multiple columns, one for each field of the dictionaries. Unpacking is recursive, so deeply nested data structures are handled as well.
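To illustrate what recursive unpacking means, here is a minimal pure-Python sketch that flattens nested dictionaries into dotted names. The flatten helper and the sample record are hypothetical; the sketch shows the idea only and makes no claim about how the SDK implements unpack.

```python
def flatten(record, parent="", sep="."):
    """Recursively flatten nested dicts into a single-level dict.

    Hypothetical helper for illustration, not part of the SDK.
    """
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Descend into the nested dict, carrying the dotted path along.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Made-up nested record with two levels of nesting.
record = {"id": 7, "geometry": {"lat": 1.5, "lon": 2.5, "crs": {"name": "WGS84"}}}
flat_record = flatten(record)
print(flat_record)
```

Each leaf value ends up under a name that spells out its full path, which is the naming scheme of the unpacked columns.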

Example: unpacking a DataFrame column that contains dictionaries

Given the example DataFrame df, derived from structured objects:

import pandas as pd

berlin = {
    "name": "Berlin",
    "location": {
        "longitude": 13.408333,
        "latitude": 52.518611,
        "country": { "name": "Deutschland", "code": "DE" }
    },
    "zip_codes": { "min": 10115, "max": 14199 },
    "population": 3664088
}

paris = {
    "name": "Paris",
    "location": {
        "longitude": 2.351667,
        "latitude": 48.856667,
        "country": { "name": "France", "code": "FR" }
    },
    "zip_codes": { "min": 75001, "max": 75020 },
    "population": 2175601
}

df = pd.DataFrame([berlin, paris])

resulting in:

	name	location	zip_codes	population
0	Berlin	`{'longitude': 13.408333, 'latitude': 52.518611, 'country': {'name': 'Deutschland', 'code': 'DE'}}`	`{'min': 10115, 'max': 14199}`	3664088
1	Paris	`{'longitude': 2.351667, 'latitude': 48.856667, 'country': {'name': 'France', 'code': 'FR'}}`	`{'min': 75001, 'max': 75020}`	2175601

We can unpack the location and zip_codes columns, whose dictionaries would otherwise be difficult to work with. Unpacking is recursive, so nested dictionaries are unpacked as well, for example country, which is contained in location.

from here.geopandas_adapter.utils.dataframe import unpack_columns

unpacked_df = unpack_columns(df, columns=["location", "zip_codes"])

resulting in:

	name	location.longitude	location.latitude	location.country.name	location.country.code	zip_codes.min	zip_codes.max	population
0	Berlin	13.4083	52.5186	Deutschland	DE	10115	14199	3664088
1	Paris	2.35167	48.8567	France	FR	75001	75020	2175601
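Once the nested fields are ordinary columns, regular Pandas operations apply to them. The sketch below uses pd.json_normalize as a plain-Pandas stand-in that produces the same dotted column names for this data, so that it runs without the SDK. It is an approximation, and its column order may differ from the unpack_columns output shown above.

```python
import pandas as pd

cities = [
    {
        "name": "Berlin",
        "location": {
            "longitude": 13.408333,
            "latitude": 52.518611,
            "country": {"name": "Deutschland", "code": "DE"},
        },
        "zip_codes": {"min": 10115, "max": 14199},
        "population": 3664088,
    },
    {
        "name": "Paris",
        "location": {
            "longitude": 2.351667,
            "latitude": 48.856667,
            "country": {"name": "France", "code": "FR"},
        },
        "zip_codes": {"min": 75001, "max": 75020},
        "population": 2175601,
    },
]

# Stand-in for the unpacked DataFrame: nested keys become dotted column names.
flat = pd.json_normalize(cities)

# Arithmetic across two formerly nested fields.
zip_span = flat["zip_codes.max"] - flat["zip_codes.min"]

# Filter on a field that was nested two levels deep.
french = flat.loc[flat["location.country.code"] == "FR", "name"].tolist()
```

Because the column names contain dots, they have to be accessed with bracket indexing, as in flat["zip_codes.max"], and not with attribute access.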

Replacing a column with one or more columns

The replace_column function replaces a single column of a DataFrame with one or more columns of another DataFrame.

Example: replacing one column with multiple columns

Given the example DataFrames df and df2:

import pandas as pd

df = pd.DataFrame({
    "col_A": [11, 31, 41],
    "col_B": [12, 32, 42],
    "col_C": [14, 34, 42]
}, index = [1, 3, 4])

df2 = pd.DataFrame({
    "col_Bx": [110, 130, 140],
    "col_By": [115, 135, 145]
}, index = [1, 3, 4])

resulting in:

	col_A	col_B	col_C
1	11	12	14
3	31	32	34
4	41	42	42

and:

	col_Bx	col_By
1	110	115
3	130	135
4	140	145

We can replace col_B with col_Bx and col_By:

from here.geopandas_adapter.utils.dataframe import replace_column

replaced_df = replace_column(df, "col_B", df2)

resulting in:

	col_A	col_Bx	col_By	col_C
1	11	110	115	14
3	31	130	135	34
4	41	140	145	42
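The effect of replace_column on this data can be approximated in plain Pandas by splitting the DataFrame around the column and concatenating the pieces. The helper name replace_column_sketch is hypothetical, and the sketch assumes a unique column name and identical indexes in both DataFrames; how the SDK function handles other cases is not covered here.

```python
import pandas as pd


def replace_column_sketch(df, column, replacement):
    """Hypothetical plain-Pandas approximation, not the SDK implementation.

    Assumes `column` is unique in `df` and both frames share the same index.
    """
    pos = df.columns.get_loc(column)   # integer position of the column
    left = df.iloc[:, :pos]            # columns before the replaced one
    right = df.iloc[:, pos + 1:]       # columns after the replaced one
    return pd.concat([left, replacement, right], axis=1)


df = pd.DataFrame({
    "col_A": [11, 31, 41],
    "col_B": [12, 32, 42],
    "col_C": [14, 34, 42],
}, index=[1, 3, 4])

df2 = pd.DataFrame({
    "col_Bx": [110, 130, 140],
    "col_By": [115, 135, 145],
}, index=[1, 3, 4])

result = replace_column_sketch(df, "col_B", df2)
print(result)
```

The new columns take the position of the replaced column, so the surrounding column order is preserved.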

Adding and removing column name prefixes

The functions prefix_columns and unprefix_columns add a prefix to, or remove a prefix from, the names of selected columns of a DataFrame. A . separator is placed between the prefix and the column name.

This is useful to group related columns of a DataFrame under a common prefix (prefix) or to remove a lengthy, verbose prefix present in multiple columns (unprefix), obtaining a derived DataFrame that is easier to work with.

Example: prefixing columns with a common prefix

Given the example DataFrame df:

import pandas as pd

df = pd.DataFrame({
    "name": ["Sarah", "Vivek", "Marco"],
    "age": [41, 29, 35],
    "house_nr": ["1492", "34-35", "48A"],
    "road": ["SE 36th Ave", "Seshadri Road", "Via Giosuè Carducci"],
    "city": ["Portland", "Bengaluru", "Milan"],
    "zip": [97214, 560009, 20123],
    "state": ["OR", "KA", pd.NA],
    "country": ["US", "IN", "IT"],
})

resulting in:

	name	age	house_nr	road	city	zip	state	country
0	Sarah	41	1492	SE 36th Ave	Portland	97214	OR	US
1	Vivek	29	34-35	Seshadri Road	Bengaluru	560009	KA	IN
2	Marco	35	48A	Via Giosuè Carducci	Milan	20123	`<NA>`	IT

We can group columns that are part of the address, prefixing them with address:

from here.geopandas_adapter.utils.dataframe import prefix_columns

prefixed_df = prefix_columns(df, "address", ["house_nr", "road", "city", "zip", "country", "state"])

resulting in:

	name	age	address.house_nr	address.road	address.city	address.zip	address.state	address.country
0	Sarah	41	1492	SE 36th Ave	Portland	97214	OR	US
1	Vivek	29	34-35	Seshadri Road	Bengaluru	560009	KA	IN
2	Marco	35	48A	Via Giosuè Carducci	Milan	20123	`<NA>`	IT
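For comparison, the same renaming can be sketched in plain Pandas with DataFrame.rename. The helper name prefix_columns_sketch and the shortened sample data are hypothetical; this only mirrors the behaviour described above and is not the SDK implementation.

```python
import pandas as pd


def prefix_columns_sketch(df, prefix, columns, sep="."):
    """Hypothetical plain-Pandas approximation of prefixing selected columns."""
    mapping = {name: f"{prefix}{sep}{name}" for name in columns}
    # rename returns a new DataFrame; column order and data are unchanged.
    return df.rename(columns=mapping)


people = pd.DataFrame({
    "name": ["Sarah", "Vivek"],
    "city": ["Portland", "Bengaluru"],
    "zip": [97214, 560009],
})

prefixed = prefix_columns_sketch(people, "address", ["zip", "city"])
print(list(prefixed.columns))
```

The order of the names in the list does not affect the result: columns stay where they were and only their labels change.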

Example: removing a common prefix

Continuing the example above, we can remove the address prefix and obtain the original DataFrame:

from here.geopandas_adapter.utils.dataframe import unprefix_columns

unprefixed_df = unprefix_columns(prefixed_df, "address")

resulting in:

	name	age	house_nr	road	city	zip	state	country
0	Sarah	41	1492	SE 36th Ave	Portland	97214	OR	US
1	Vivek	29	34-35	Seshadri Road	Bengaluru	560009	KA	IN
2	Marco	35	48A	Via Giosuè Carducci	Milan	20123	`<NA>`	IT
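The reverse operation can be sketched in plain Pandas as well. The helper name unprefix_columns_sketch and the sample data are hypothetical, and the sketch assumes that stripping the prefix does not produce a name that already exists in the DataFrame; how the SDK function treats such collisions is not covered here.

```python
import pandas as pd


def unprefix_columns_sketch(df, prefix, sep="."):
    """Hypothetical plain-Pandas approximation of removing a common prefix."""
    start = prefix + sep
    mapping = {
        name: name[len(start):]
        for name in df.columns
        if isinstance(name, str) and name.startswith(start)
    }
    # Columns without the prefix are not in the mapping and keep their names.
    return df.rename(columns=mapping)


prefixed = pd.DataFrame({
    "name": ["Sarah", "Vivek"],
    "address.city": ["Portland", "Bengaluru"],
    "address.zip": [97214, 560009],
})

plain = unprefix_columns_sketch(prefixed, "address")
print(list(plain.columns))
```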