NaN filter of section elements too strong
Hello @dyb
The problem comes when we analyse a deck like 704. There we have repeated variables in two different sections of the data set; sometimes one section will be present and sometimes it won't.
The mapper is coded to delete any NaN before transforming an element according to the functions written in imodel.py. The main mapper.py is probably the easiest place to code a solution and filter out the section that is not present.
Alternatively, we could catch these errors in the mdf_reader() output, but that would also need coding.
For the CDM mapper, the proposed solution is stated below.
PROBLEM
e.g. ship_speed is found in two sections of the c99 data:
data_raw.data.c99_data5.ship_speed.head()
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: ship_speed, dtype: float16
data_raw.data.c99_data4.ship_speed.head()
0 8.5
1 8.5
2 8.5
3 8.0
4 8.5
Name: ship_speed, dtype: float16
We want to be able to write this variable in the mapper as a field in the header.json file:
"station_speed": {
"sections": ["c99_data4","c99_data5"],
"elements": ["ship_speed", "ship_speed"],
"transform": "speed_converter",
"decimal_places": 2
}
And use the speed_converter function defined in imodel.py:
import numpy as np  # needed at the top of imodel.py

def speed_converter(self, ds):
    """
    Picks the ship speed from whichever column has data and converts it
    to m/s.

    Parameters
    ----------
    ds: a pandas.DataFrame() with the input data from the header.json

    Returns
    -------
    ship_speed: the ship speed that is not NaN, in m/s
    """
    # Identify the column(s) of ds that contain only NaN
    col_na = ds.columns[ds.isna().all()].tolist()
    # Drop the all-NaN column(s), keeping the section that has data
    speed = ds.drop(columns=col_na)
    # Convert the remaining speed from knots to m/s
    return np.round(speed.iloc[:, 0] * 0.514444, 2)
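For illustration, here is a minimal sketch of what the function body does on toy data shaped like deck 704 (the DataFrame values below are made up):

import numpy as np
import pandas as pd

# Toy stand-in for the two ship_speed columns: one section carries data
# (in knots), the other section is absent and therefore all NaN.
ds = pd.DataFrame({
    "c99_data4": [8.5, 8.5, 8.0],
    "c99_data5": [np.nan, np.nan, np.nan],
})

col_na = ds.columns[ds.isna().all()].tolist()    # ['c99_data5']
speed = ds.drop(columns=col_na)                  # only c99_data4 remains
print(np.round(speed.iloc[:, 0] * 0.514444, 2))  # 4.37, 4.37, 4.12 m/s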
The problem lies in mapper.py#99: the functions inside imodel.py won't get executed if there is no data to reach them, due to the condition on line#105. That condition (having data) is never met, and the code that 'transforms' the data according to imodel.py never gets executed.
c99_data4.ship_speed also has NaN in some rows. Therefore, the .all() statement in line 99 is too strong a NaN filter and eliminates both columns, c99_data5 and c99_data4. Even if the data in c99_data4 has only a few NaNs, it gets dropped because the NaN filtering is done on rows along the concatenated df[['c99_data4','c99_data5']]. So, if there is a NaN in c99_data5, the corresponding row in c99_data4 goes as well (the .all() condition is too strong). Since both arrays are the same length and c99_data5 is entirely empty, all c99_data4 rows get eliminated.
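The effect is easy to reproduce with a toy frame (a sketch with made-up values, not the actual mapper data):

import numpy as np
import pandas as pd

idata = pd.DataFrame({
    "c99_data4": [8.5, np.nan, 8.0],        # mostly valid data
    "c99_data5": [np.nan, np.nan, np.nan],  # section absent: all NaN
})
elements = ["c99_data4", "c99_data5"]

# Current filter (as in mapper.py line 99): a row survives only if ALL
# columns are non-NaN, so an all-NaN section wipes out every row.
print(np.where(idata[elements].notna().all(axis=1))[0])  # [] -> nothing left

# A row-wise .any() instead keeps rows with data in at least one column.
print(np.where(idata[elements].notna().any(axis=1))[0])  # [0 2]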
SOLUTION
My suggestion is to filter the NaNs by column first and then by row. Or, better, let the user decide: add an imapping attribute to the code here and let the user control how he or she wants to deal with NaNs from the json file. This attribute would be set per variable for cases where the data looks like deck 704.
Yes, we want to keep imodel.py as simple as possible, but sometimes you have to code extra lines.
e.g. the if statement to go in the code:
if nan_filter == 'any':
    notna_idx = np.where(idata[elements].notna().any(axis=1))[0]
else:
    notna_idx = np.where(idata[elements].notna().all(axis=1))[0]
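With 'all' as the fallback when the attribute is missing, existing imodels would keep their current behaviour; only mappings that explicitly declare "nan_filter": "any" would change.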
e.g. add a new attribute in header.json:
"station_speed": {
"sections": ["c99_data4","c99_data5"],
"nan_filter": "any",
"elements": ["ship_speed", "ship_speed"],
"transform": "speed_converter",
"decimal_places": 2
}
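And a minimal, self-contained sketch of how the attribute could be picked up, falling back to the current behaviour when it is not declared (imapping, idata and elements are hypothetical names mirroring the ones used above):

import numpy as np
import pandas as pd

# imapping: the dict parsed from the header.json entry above (assumed name)
imapping = {"nan_filter": "any", "elements": ["ship_speed", "ship_speed"]}
idata = pd.DataFrame({
    "c99_data4": [8.5, np.nan, 8.0],
    "c99_data5": [np.nan, np.nan, np.nan],
})
elements = ["c99_data4", "c99_data5"]

# Fall back to the current (strict) behaviour when the attribute is missing
nan_filter = imapping.get("nan_filter", "all")
if nan_filter == "any":
    notna_idx = np.where(idata[elements].notna().any(axis=1))[0]
else:
    notna_idx = np.where(idata[elements].notna().all(axis=1))[0]
print(notna_idx)  # [0 2]: rows with data in at least one section survive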