NaN filter of section elements too strong
Hello @dyb
The problem comes when we analyse a deck like 704. There we have repeated variables in two different sections of the data set; sometimes one section will be present and sometimes it won't.
The mapper is coded to delete any NaN before transforming an element according to the functions written in imodel.py. The main mapper.py is probably the easiest place to code a solution and filter out the section that is not present.
Alternatively, we could catch these errors in the mdf_reader() output, but that would also need coding.
For the CDM mapper, the proposed solution is stated below.
PROBLEM
e.g. ship_speed is found in two sections of the c99 data:
data_raw.data.c99_data5.ship_speed.head()
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: ship_speed, dtype: float16
data_raw.data.c99_data4.ship_speed.head()
0 8.5
1 8.5
2 8.5
3 8.0
4 8.5
Name: ship_speed, dtype: float16
We want to be able to write this variable in the mapper as a field in the header.json file:
"station_speed": {
"sections": ["c99_data4","c99_data5"],
"elements": ["ship_speed", "ship_speed"],
"transform": "speed_converter",
"decimal_places": 2
}
And use the speed_converter function defined in imodel.py:
import numpy as np  # needed at the top of imodel.py

def speed_converter(self, ds):
    """
    Picks the ship speed from whichever column has data and converts it
    to m/s.

    Parameters
    ----------
    ds: a pandas.DataFrame() with the input data from the header.json

    Returns
    -------
    ship_speed: the ship speed that is not NaN, in m/s
    """
    # Identify the column(s) of ds that contain only NaN
    col_na = ds.columns[ds.isna().all()].tolist()
    # Drop the all-NaN column(s), keeping the section that has data
    speed = ds.drop(columns=col_na)
    # Convert the remaining speed from knots to m/s
    return np.round(speed.iloc[:, 0] * 0.514444, 2)
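For illustration, here is a minimal sketch of what the function body does on toy data shaped like deck 704 (the DataFrame values below are made up):

import numpy as np
import pandas as pd

# Toy stand-in for the two ship_speed columns: one section carries data
# (in knots), the other section is absent and therefore all NaN.
ds = pd.DataFrame({
    "c99_data4": [8.5, 8.5, 8.0],
    "c99_data5": [np.nan, np.nan, np.nan],
})

col_na = ds.columns[ds.isna().all()].tolist()    # ['c99_data5']
speed = ds.drop(columns=col_na)                  # only c99_data4 remains
print(np.round(speed.iloc[:, 0] * 0.514444, 2))  # 4.37, 4.37, 4.12 m/s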
The problem lies in mapper.py#99: the functions inside imodel.py won't get executed if there is no data to reach them, due to the condition on line#105. That condition (having data) is never met, and the code that 'transforms' the data according to imodel.py never gets executed.
c99_data4.ship_speed also has NaN in some rows. Therefore, the .all() statement in line 99 is too strong a NaN filter and eliminates both columns, c99_data5 and c99_data4. Even if the data in c99_data4 has only a few NaNs, it gets dropped because the NaN filtering is done on rows along the concatenated df[['c99_data4','c99_data5']]. So, if there is a NaN in c99_data5, the corresponding row in c99_data4 goes as well (the .all() condition is too strong). Since both arrays are the same length and c99_data5 is entirely empty, all c99_data4 rows get eliminated.
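The effect is easy to reproduce with a toy frame (a sketch with made-up values, not the actual mapper data):

import numpy as np
import pandas as pd

idata = pd.DataFrame({
    "c99_data4": [8.5, np.nan, 8.0],        # mostly valid data
    "c99_data5": [np.nan, np.nan, np.nan],  # section absent: all NaN
})
elements = ["c99_data4", "c99_data5"]

# Current filter (as in mapper.py line 99): a row survives only if ALL
# columns are non-NaN, so an all-NaN section wipes out every row.
print(np.where(idata[elements].notna().all(axis=1))[0])  # [] -> nothing left

# A row-wise .any() instead keeps rows with data in at least one column.
print(np.where(idata[elements].notna().any(axis=1))[0])  # [0 2]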
SOLUTION
My suggestion is to filter the NaNs by column first and then by row. Or, better, let the user decide: add an imapping attribute to the code here and let the user control how he or she wants to deal with NaNs from the json file. This attribute would be set per variable for cases where the data looks like deck 704.
Yes, we want to keep imodel.py as simple as possible, but sometimes you have to code extra lines.
e.g. the if statement to go in the code:
if nan_filter == 'any':
    notna_idx = np.where(idata[elements].notna().any(axis=1))[0]
else:
    notna_idx = np.where(idata[elements].notna().all(axis=1))[0]
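With 'all' as the fallback when the attribute is missing, existing imodels would keep their current behaviour; only mappings that explicitly declare "nan_filter": "any" would change.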
e.g. add a new attribute in header.json:
"station_speed": {
"sections": ["c99_data4","c99_data5"],
"nan_filter": "any",
"elements": ["ship_speed", "ship_speed"],
"transform": "speed_converter",
"decimal_places": 2
}
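And a minimal, self-contained sketch of how the attribute could be picked up, falling back to the current behaviour when it is not declared (imapping, idata and elements are hypothetical names mirroring the ones used above):

import numpy as np
import pandas as pd

# imapping: the dict parsed from the header.json entry above (assumed name)
imapping = {"nan_filter": "any", "elements": ["ship_speed", "ship_speed"]}
idata = pd.DataFrame({
    "c99_data4": [8.5, np.nan, 8.0],
    "c99_data5": [np.nan, np.nan, np.nan],
})
elements = ["c99_data4", "c99_data5"]

# Fall back to the current (strict) behaviour when the attribute is missing
nan_filter = imapping.get("nan_filter", "all")
if nan_filter == "any":
    notna_idx = np.where(idata[elements].notna().any(axis=1))[0]
else:
    notna_idx = np.where(idata[elements].notna().all(axis=1))[0]
print(notna_idx)  # [0 2]: rows with data in at least one section survive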