rwskit.pandas

Utilities for working with pandas.

Attributes

log

Functions

flatten_data_frame(→ pandas.DataFrame)

Converts columns containing lists into (new) individual columns in the DataFrame.

Module Contents

rwskit.pandas.log

rwskit.pandas.flatten_data_frame(df: pandas.DataFrame, string_fill: str = '[UNK]', in_place: bool = False) → pandas.DataFrame

Converts columns containing lists into (new) individual columns in the DataFrame.

If one or more columns in a DataFrame consist of lists, this function removes each such column and replaces it with N new columns, where N is the maximum length of the lists in the original column.
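The expansion described above can be approximated with plain pandas. The following is a minimal sketch under stated assumptions, not the rwskit implementation: the helper name `expand_list_column` is hypothetical, and the `__` separator is inferred from the example output at the end of this page. Unlike the example output, this sketch appends the new columns at the right edge of the frame instead of keeping them in the original column's position, and it does not apply any fill value.

```python
import pandas as pd


def expand_list_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: replace one list column with N scalar columns."""
    # Build one row per list; shorter lists are padded with missing values.
    expanded = pd.DataFrame(df[column].tolist(), index=df.index)
    # Name the new columns "<column>__0", "<column>__1", ...
    expanded = expanded.add_prefix(f"{column}__")
    # Drop the original column and attach the new columns on the right.
    return df.drop(columns=[column]).join(expanded)


frame = pd.DataFrame({"A": [["1"], ["2", "3"]], "D": [True, False]})
result = expand_list_column(frame, "A")
print(list(result.columns))  # ['D', 'A__0', 'A__1']
```

Here N is 2 because the longest list in column A has two elements.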

If the lists are of unequal length, shorter lists are padded on the right so that every row fills all N columns. Lists of strings are padded with the given string_fill value. All other lists are padded with np.nan. Note that most NumPy types convert np.nan into the appropriate missing value for that type. For example, when np.nan is used to fill np.datetime64 values, the resulting value is np.datetime64('NaT').

If the lists are numeric (including boolean) and they do not have equal lengths, the new columns will have dtype=np.float64 regardless of the original dtype.
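The dtype change follows from how NumPy and pandas represent missing values, and can be seen without calling flatten_data_frame at all. The snippet below is an illustrative sketch of that general behavior only; it makes no claim about how rwskit performs the padding internally.

```python
import numpy as np
import pandas as pd

# Padding integers with np.nan forces a float dtype, because NaN is a float.
ints = pd.Series([1, np.nan])
print(ints.dtype)  # float64

# Booleans are upcast the same way by NumPy's type promotion rules.
bools = np.array([True, np.nan])
print(bools.dtype)  # float64

# For datetimes, pandas stores the missing value as NaT instead of NaN.
dates = pd.Series([pd.Timestamp("2024-01-01"), np.nan])
print(pd.api.types.is_datetime64_any_dtype(dates))  # True
print(dates[1] is pd.NaT)  # True
```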

Note

Nested lists within a column are not supported and will not be flattened.
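Because nested lists are left as they are, it can help to check a column before flattening. The helper below is a hypothetical sketch (the name `has_nested_lists` is not part of rwskit) that reports whether any list in a column itself contains a list.

```python
import pandas as pd


def has_nested_lists(series: pd.Series) -> bool:
    """Hypothetical check: True if any list in the series contains a list."""
    return any(
        isinstance(item, list)
        for value in series
        if isinstance(value, list)
        for item in value
    )


flat = pd.Series([[1], [2, 3]])
nested = pd.Series([[[1, 2]], [[3], [4]]])
print(has_nested_lists(flat))    # False
print(has_nested_lists(nested))  # True
```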

Parameters:

df (pandas.DataFrame) – The input DataFrame to flatten.
string_fill (str, default = '[UNK]') – Use this value to pad lists of strings. All other data types are padded with np.nan.
in_place (bool, default = False) – Whether to modify the DataFrame in place or return a copy.

Returns:

df – The modified DataFrame

Return type:

pandas.DataFrame

Examples

>>> import pandas as pd
>>> from rwskit.pandas import flatten_data_frame
>>> input_df = pd.DataFrame({
...     "A": [["1"], ["2", "3"]],
...     "B": [["4", "5"], ["6", "7", "8"]],
...     "C": [[1], [2, 3]],
...     "D": [True, False]
... })
>>> print(input_df)
        A          B       C      D
0     [1]     [4, 5]     [1]   True
1  [2, 3]  [6, 7, 8]  [2, 3]  False

>>> flatten_data_frame(input_df)
  A__0   A__1 B__0 B__1   B__2  C__0  C__1      D
0    1  [UNK]    4    5  [UNK]   1.0   NaN   True
1    2      3    6    7      8   2.0   3.0  False