Manipulate DataFrames
The commonly used Pandas and GeoPandas libraries are well documented, and many examples showing how to use them to perform data analysis and manipulation are publicly available. Generally, data is in a tabular representation where each cell of the table contains one value with a defined data type (numeric, string, or other basic type).
Map data and, more generally, data stored in a catalog can be highly structured and follow a complex, nested schema. Dealing with this complexity in Pandas can be difficult. Therefore, the here-geopandas-adapter package of the HERE Data SDK for Python includes utility functions to perform repetitive tasks and manipulate complex DataFrames, in particular DataFrames with columns that contain dictionaries instead of single values.
Unpacking series and DataFrames
Pandas provides the explode function to turn objects of type list contained in a column into multiple rows. Similarly, HERE Data SDK for Python provides the unpack and unpack_columns functions to turn single columns containing dict objects into multiple columns. These are convenience functions for unpacking data structures that sometimes result from reading data from catalogs or working with complex data models.
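For comparison, the row-wise behavior of explode can be seen with plain Pandas. The DataFrame below is a made-up illustration and is not taken from a catalog.

```python
import pandas as pd

# Illustrative data: the "districts" column holds a list in each cell.
cities = pd.DataFrame({
    "name": ["Berlin", "Paris"],
    "districts": [["Mitte", "Pankow"], ["Louvre"]],
})

# explode() emits one row per list element and repeats the original index label.
exploded = cities.explode("districts")
print(exploded)
```

The exploded DataFrame has three rows, and the index label 0 appears twice because both Berlin rows originate from the same input row.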
unpack is applied to a Series containing dict objects and returns a DataFrame. unpack_columns is applied to a DataFrame to replace one or more columns that contain dict objects with multiple columns, one for each field of the dictionaries. Unpacking is also recursive, which makes it easy to deal with deeply nested data structures.
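To make the Series-to-DataFrame idea concrete without the SDK installed, the following pandas-only sketch flattens a Series of nested dictionaries with pd.json_normalize. It is an approximation of the described behavior, not the implementation of unpack; the Series contents and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical Series of nested dictionaries with a non-default index.
locations = pd.Series(
    [
        {"longitude": 13.408333, "country": {"name": "Deutschland", "code": "DE"}},
        {"longitude": 2.351667, "country": {"name": "France", "code": "FR"}},
    ],
    index=["berlin", "paris"],
)

# json_normalize flattens nested dictionaries recursively and joins keys with sep.
flat = pd.json_normalize(locations.tolist(), sep=".")

# json_normalize returns a fresh RangeIndex, so restore the original labels.
flat.index = locations.index

print(flat.columns.tolist())
```

Each dictionary field becomes its own column, and the nested country dictionary yields the dotted names country.name and country.code.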
Example: unpacking a DataFrame column that contains dictionaries
Given the example DataFrame df, derived from structured objects:
import pandas as pd
berlin = {
    "name": "Berlin",
    "location": {
        "longitude": 13.408333,
        "latitude": 52.518611,
        "country": {"name": "Deutschland", "code": "DE"}
    },
    "zip_codes": {"min": 10115, "max": 14199},
    "population": 3664088
}
paris = {
    "name": "Paris",
    "location": {
        "longitude": 2.351667,
        "latitude": 48.856667,
        "country": {"name": "France", "code": "FR"}
    },
    "zip_codes": {"min": 75001, "max": 75020},
    "population": 2175601
}
df = pd.DataFrame([berlin, paris])
resulting in:
| | name | location | zip_codes | population |
|---|---|---|---|---|
| 0 | Berlin | {'longitude': 13.408333, 'latitude': 52.518611, 'country': {'name': 'Deutschland', 'code': 'DE'}} | {'min': 10115, 'max': 14199} | 3664088 |
| 1 | Paris | {'longitude': 2.351667, 'latitude': 48.856667, 'country': {'name': 'France', 'code': 'FR'}} | {'min': 75001, 'max': 75020} | 2175601 |
We can unpack the location and zip_codes columns, which contain dictionaries that would otherwise be difficult to work with. Unpacking is recursive and also unpacks nested dictionaries, for example country contained in location.
from here.geopandas_adapter.utils.dataframe import unpack_columns
unpacked_df = unpack_columns(df, columns=["location", "zip_codes"])
resulting in:
| | name | location.longitude | location.latitude | location.country.name | location.country.code | zip_codes.min | zip_codes.max | population |
|---|---|---|---|---|---|---|---|---|
| 0 | Berlin | 13.4083 | 52.5186 | Deutschland | DE | 10115 | 14199 | 3664088 |
| 1 | Paris | 2.35167 | 48.8567 | France | FR | 75001 | 75020 | 2175601 |
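Once the dictionaries are flattened, ordinary Pandas operations apply to the resulting scalar columns. The sketch below uses a hand-built stand-in for part of the unpacked DataFrame shown above, so that it runs without the SDK; the variable names are illustrative.

```python
import pandas as pd

# Stand-in for a subset of the unpacked DataFrame, built by hand.
unpacked = pd.DataFrame({
    "name": ["Berlin", "Paris"],
    "location.latitude": [52.518611, 48.856667],
    "location.country.code": ["DE", "FR"],
    "zip_codes.min": [10115, 75001],
    "zip_codes.max": [14199, 75020],
})

# Dotted names are not valid attribute names, so use bracket indexing.
zip_span = unpacked["zip_codes.max"] - unpacked["zip_codes.min"]

# Boolean masks work directly on the flattened scalar columns.
northern = unpacked.loc[unpacked["location.latitude"] > 50, "name"].tolist()

# Select every column that came from the same dictionary by its prefix.
location_cols = [c for c in unpacked.columns if c.startswith("location.")]
```

None of these operations would work directly on a column of dict objects, which is the practical motivation for unpacking.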
Replacing a column with one or more columns
The function replace_column can be used to replace a single column of a DataFrame with one or more columns of another DataFrame.
Example: replacing one column with multiple columns
Given the example DataFrames df and df2:
import pandas as pd
df = pd.DataFrame({
    "col_A": [11, 31, 41],
    "col_B": [12, 32, 42],
    "col_C": [14, 34, 42]
}, index=[1, 3, 4])
df2 = pd.DataFrame({
    "col_Bx": [110, 130, 140],
    "col_By": [115, 135, 145]
}, index=[1, 3, 4])
resulting in:
| | col_A | col_B | col_C |
|---|---|---|---|
| 1 | 11 | 12 | 14 |
| 3 | 31 | 32 | 34 |
| 4 | 41 | 42 | 42 |
and:
| | col_Bx | col_By |
|---|---|---|
| 1 | 110 | 115 |
| 3 | 130 | 135 |
| 4 | 140 | 145 |
We can replace col_B with col_Bx and col_By:
from here.geopandas_adapter.utils.dataframe import replace_column
replaced_df = replace_column(df, "col_B", df2)
resulting in:
| | col_A | col_Bx | col_By | col_C |
|---|---|---|---|---|
| 1 | 11 | 110 | 115 | 14 |
| 3 | 31 | 130 | 135 | 34 |
| 4 | 41 | 140 | 145 | 42 |
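For readers who want to see the mechanics, a similar result can be obtained with plain Pandas by splitting the DataFrame around the column and concatenating the pieces. The helper replace_column_sketch below is hypothetical and is not the SDK implementation; it assumes that both DataFrames share the same index labels and that the column name is unique.

```python
import pandas as pd

def replace_column_sketch(df, column, replacement):
    """Return a new DataFrame with `column` swapped for the columns of `replacement`."""
    # Position of the column to replace (an int when the name is unique).
    position = df.columns.get_loc(column)
    left = df.iloc[:, :position]
    right = df.iloc[:, position + 1:]
    # concat aligns rows on the index, so both frames should share index labels.
    return pd.concat([left, replacement, right], axis=1)

df = pd.DataFrame({
    "col_A": [11, 31, 41],
    "col_B": [12, 32, 42],
    "col_C": [14, 34, 42],
}, index=[1, 3, 4])
df2 = pd.DataFrame({
    "col_Bx": [110, 130, 140],
    "col_By": [115, 135, 145],
}, index=[1, 3, 4])

replaced = replace_column_sketch(df, "col_B", df2)
print(replaced)
```

The new columns land at the position of the replaced column rather than at the end, which matches the table above.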
Adding and removing prefixes to column names
The functions prefix_columns and unprefix_columns add a prefix to, or remove a prefix from, the names of selected columns of a DataFrame. A separator . is placed between the prefix and the column name.
This is useful to group related columns of a DataFrame under a common prefix (prefix), or to remove a lengthy, verbose prefix present in multiple columns (unprefix) and obtain a derived DataFrame that is more convenient to work with.
Example: prefixing columns with a common prefix
Given the example DataFrame df:
import pandas as pd
df = pd.DataFrame({
    "name": ["Sarah", "Vivek", "Marco"],
    "age": [41, 29, 35],
    "house_nr": ["1492", "34-35", "48A"],
    "road": ["SE 36th Ave", "Seshadri Road", "Via Giosuè Carducci"],
    "city": ["Portland", "Bengaluru", "Milan"],
    "zip": [97214, 560009, 20123],
    "state": ["OR", "KA", pd.NA],
    "country": ["US", "IN", "IT"],
})
resulting in:
| | name | age | house_nr | road | city | zip | state | country |
|---|---|---|---|---|---|---|---|---|
| 0 | Sarah | 41 | 1492 | SE 36th Ave | Portland | 97214 | OR | US |
| 1 | Vivek | 29 | 34-35 | Seshadri Road | Bengaluru | 560009 | KA | IN |
| 2 | Marco | 35 | 48A | Via Giosuè Carducci | Milan | 20123 | <NA> | IT |
We can group columns that are part of the address, prefixing them with address:
from here.geopandas_adapter.utils.dataframe import prefix_columns
prefixed_df = prefix_columns(df, "address", ["house_nr", "road", "city", "zip", "country", "state"])
resulting in:
| | name | age | address.house_nr | address.road | address.city | address.zip | address.state | address.country |
|---|---|---|---|---|---|---|---|---|
| 0 | Sarah | 41 | 1492 | SE 36th Ave | Portland | 97214 | OR | US |
| 1 | Vivek | 29 | 34-35 | Seshadri Road | Bengaluru | 560009 | KA | IN |
| 2 | Marco | 35 | 48A | Via Giosuè Carducci | Milan | 20123 | <NA> | IT |
Example: removing a common prefix
Continuing the example above, we can remove the address prefix and obtain the original DataFrame:
from here.geopandas_adapter.utils.dataframe import unprefix_columns
unprefixed_df = unprefix_columns(prefixed_df, "address")
resulting in:
| | name | age | house_nr | road | city | zip | state | country |
|---|---|---|---|---|---|---|---|---|
| 0 | Sarah | 41 | 1492 | SE 36th Ave | Portland | 97214 | OR | US |
| 1 | Vivek | 29 | 34-35 | Seshadri Road | Bengaluru | 560009 | KA | IN |
| 2 | Marco | 35 | 48A | Via Giosuè Carducci | Milan | 20123 | <NA> | IT |
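The prefixing round trip can also be approximated with DataFrame.rename in plain Pandas. The two helpers below are hypothetical sketches and not the SDK functions; they assume string column names and use . as the separator, as described above.

```python
import pandas as pd

def prefix_columns_sketch(df, prefix, columns, sep="."):
    """Rename the selected columns to '<prefix><sep><name>'; other columns are kept."""
    mapping = {c: f"{prefix}{sep}{c}" for c in columns}
    return df.rename(columns=mapping)

def unprefix_columns_sketch(df, prefix, sep="."):
    """Strip '<prefix><sep>' from every column name that starts with it."""
    start = f"{prefix}{sep}"
    mapping = {c: c[len(start):] for c in df.columns if c.startswith(start)}
    return df.rename(columns=mapping)

# Reduced, illustrative version of the address example.
people = pd.DataFrame({
    "name": ["Sarah", "Marco"],
    "city": ["Portland", "Milan"],
    "zip": [97214, 20123],
})

prefixed = prefix_columns_sketch(people, "address", ["city", "zip"])
restored = unprefix_columns_sketch(prefixed, "address")
print(prefixed.columns.tolist())
```

Because rename leaves the column order untouched, the order of names passed in the list does not affect the layout of the result.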