rwskit.pandas
=============

.. py:module:: rwskit.pandas

.. autoapi-nested-parse::

   Utilities for working with pandas.


Attributes
----------

.. autoapisummary::

   rwskit.pandas.log


Functions
---------

.. autoapisummary::

   rwskit.pandas.flatten_data_frame


Module Contents
---------------

.. py:data:: log

.. py:function:: flatten_data_frame(df: pandas.DataFrame, string_fill: str = '[UNK]', in_place: bool = False) -> pandas.DataFrame

   Converts columns containing lists into (new) individual columns in the
   ``DataFrame``.

   If one or more columns in a DataFrame consist of lists, this function
   removes each such column and replaces it with ``N`` columns, where ``N``
   is the maximum length of the lists in the original column. The new
   columns are named ``<column>__<i>``, where ``i`` is the position within
   the list. If the lists have unequal lengths, the shorter lists are padded
   on the right. Lists of strings are padded with the given ``string_fill``
   value. All others are padded with ``np.nan``. Note that most numpy types
   convert ``np.nan`` into an appropriate missing value for that type. For
   example, when used to fill ``np.datetime64`` objects, the resulting
   object is ``np.datetime64('NaT')``. If the lists are numeric (including
   boolean) and do not have equal lengths, the new columns have
   ``dtype=np.float64`` regardless of the original dtype.

   .. note::

      Nested lists within a column are not supported and will not be
      flattened.

   :param df: The input DataFrame to flatten.
   :type df: pandas.DataFrame
   :param string_fill: The value used to pad lists of strings. All other data
                       types are padded with ``np.nan``.
   :type string_fill: str, default = '[UNK]'
   :param in_place: Whether to modify the DataFrame in place or return a copy.
   :type in_place: bool, default = False

   :returns: **df** -- The modified DataFrame
   :rtype: pandas.DataFrame

   .. rubric:: Examples

   .. code-block:: python

      >>> import pandas as pd
      >>> from rwskit.pandas import flatten_data_frame
      >>> input_df = pd.DataFrame({
      ...     "A": [["1"], ["2", "3"]],
      ...     "B": [["4", "5"], ["6", "7", "8"]],
      ...     "C": [[1], [2, 3]],
      ...     "D": [True, False]
      ... })
      >>> print(input_df)
              A          B       C      D
      0     [1]     [4, 5]     [1]   True
      1  [2, 3]  [6, 7, 8]  [2, 3]  False
      >>> flatten_data_frame(input_df)
        A__0   A__1 B__0 B__1   B__2  C__0  C__1      D
      0    1  [UNK]    4    5  [UNK]   1.0   NaN   True
      1    2      3    6    7      8   2.0   3.0  False
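
The padding and dtype rules described above can be illustrated with plain pandas and numpy. The following is a minimal sketch under stated assumptions, not the actual implementation of ``flatten_data_frame``: the helper name ``pad_list_column`` is hypothetical, it handles a single column only, and it treats a column as a string column if any list element is a ``str``.

```python
import numpy as np
import pandas as pd


def pad_list_column(series: pd.Series, string_fill: str = "[UNK]") -> pd.DataFrame:
    """Hypothetical helper: expand one list-valued column into padded columns."""
    # N is the length of the longest list in the column.
    width = int(series.map(len).max())
    # Assumption: a column counts as a string column if any element is a str.
    is_string = any(isinstance(value, str) for row in series for value in row)
    fill = string_fill if is_string else np.nan
    # Pad shorter lists on the right so every row has exactly `width` items.
    rows = [list(row) + [fill] * (width - len(row)) for row in series]
    # Numeric rows that contain np.nan are promoted to float64 by numpy.
    values = np.array(rows, dtype=object) if is_string else np.array(rows)
    columns = [f"{series.name}__{i}" for i in range(width)]
    return pd.DataFrame(values, index=series.index, columns=columns)


strings = pad_list_column(pd.Series([["1"], ["2", "3"]], name="A"))
numbers = pad_list_column(pd.Series([[1], [2, 3]], name="C"))

print(strings)
print(numbers.dtypes)  # both C__0 and C__1 are float64
```

Because ``np.nan`` is a float, padding an integer or boolean list forces the whole block of new columns to ``float64``, which is why ``C__0`` appears as ``1.0`` and ``2.0`` in the example output even though it contains no missing values.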