This repository consists of a collection of R scripts that homogenise platform identifier information and identify duplicate observations in the International Comprehensive Ocean-Atmosphere Data Set (ICOADS) marine data source.
ICOADS is the world's most extensive collection of surface marine meteorological data. It contains ocean surface and atmospheric observations from the 1600s to the present, and new data are still added every year. The database is made up of observation reports from many different sources: several hundred combinations of the DCK (deck) and SID (source ID) flags indicate the origin of the data. Typically, DCK indicates the type of data (e.g. US Navy ships; Japanese Whaling Fleet) and SID provides more information about the data system or format (e.g. a data stream extracted from the WMO Global Telecommunication System, GTS).
Sometimes a single DCK is associated with a single SID; sometimes a single DCK contains several SIDs, and vice versa. This overlap leads to a number of duplicated entries of meteorological observations.
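To see how DCK and SID overlap in a given set of reports, it can help to tabulate their combinations. The following is a minimal sketch in base R, not code taken from this repository: the data frame `reports`, the column names `dck` and `sid`, and the code values are all made up for illustration.

```r
# Hypothetical reports: the dck and sid values are arbitrary placeholders,
# not real ICOADS deck or source codes.
reports <- data.frame(
  dck = c(1, 1, 2, 2, 2, 3),
  sid = c(10, 10, 20, 21, 21, 20)
)

# Count the number of reports for each DCK/SID combination.
combos <- aggregate(n ~ dck + sid,
                    data = transform(reports, n = 1L),
                    FUN = sum)

# Count how many distinct SIDs appear under each DCK.
sids_per_dck <- tapply(reports$sid, reports$dck,
                       function(x) length(unique(x)))

print(combos)
print(sids_per_dck)
```

A DCK with more than one SID in `sids_per_dck` (or a SID shared by several DCKs in `combos`) marks a place where the same observations may have arrived by more than one route.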
Historically, archives of marine data were maintained by individual nations and were often shared, so the same observations appear in the archives of several nations. Truncated formats often did not contain enough information to identify the observations made by a particular ship or platform, and these compact formats sometimes converted or encoded data in different ways. For example, many observations have no identifier linking them to the ship (ID) or platform (pt), and where such identifiers exist they may differ between data sources. The main types of duplicates are:
- Observations historically shared among national archives, likely to have different formats, precision, conversions and metadata.
- Re-ingestion of the same data more than once.
- Data from near-real-time sources that can be replaced with higher-quality delayed-mode data.
- Re-digitisation of logbooks, where the newer data are likely to have higher precision, more metadata, etc.
- Planned redundancy, for example the ingestion of several near-real-time data streams.
A protocol and other tools written in Python already exist to read the data, perform some quality control and identify duplicate observations, as described in Freeman et al. (2017). The processing in this repository adds further quality control, duplicate identification and linking of IDs between each pair of duplicate reports. It also identifies the best duplicate by assessing the track (path in lat/lon) of the observation.
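The idea of pairing duplicate reports and linking their IDs can be illustrated with a small base R sketch. This is not the algorithm used by the scripts in this repository: the column names (`uid`, `id`, `yr`, `mo`, `dy`, `hr`, `lat`, `lon`), the sample values and the 0.1 degree tolerance are assumptions chosen for illustration only.

```r
# Hypothetical reports from two sources; all values are illustrative.
obs <- data.frame(
  uid = c("A1", "B7", "C3", "D9"),
  id  = c("SHIP1", NA, "SHIP2", "SHP-2"),
  yr  = c(1950, 1950, 1950, 1950),
  mo  = c(6, 6, 6, 6),
  dy  = c(1, 1, 2, 2),
  hr  = c(12, 12, 0, 0),
  lat = c(45.2, 45.23, -10.0, -10.04),
  lon = c(-30.1, -30.12, 100.5, 100.46),
  stringsAsFactors = FALSE
)

# Assumed position tolerance in degrees.
tol <- 0.1

# Self-join on date and hour to get every pair of reports made at the same time.
pairs <- merge(obs, obs, by = c("yr", "mo", "dy", "hr"),
               suffixes = c(".a", ".b"))

# Keep each unordered pair once (uid.a < uid.b) and require the positions
# to agree within the tolerance. Longitude wrap-around at +/-180 is ignored.
pairs <- pairs[pairs$uid.a < pairs$uid.b &
               abs(pairs$lat.a - pairs$lat.b) <= tol &
               abs(pairs$lon.a - pairs$lon.b) <= tol, ]

# Link the identifiers of the two reports in each candidate duplicate pair.
id_links <- pairs[, c("uid.a", "uid.b", "id.a", "id.b")]
print(id_links)
```

In this toy example the second pair links `SHIP2` to `SHP-2`, and the first pair links a report with no ID to `SHIP1`, which is the kind of information that could then be used to homogenise identifiers. A real comparison would also need to consider the observed variables, differing precision between formats, and the plausibility of the resulting ship track.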
References
Freeman, E., Woodruff, S.D., Worley, S.J., Lubker, S.J., Kent, E.C., Angel, W.E., Berry, D.I., Brohan, P., Eastman, R., Gates, L., Gloeden, W., Ji, Z., Lawrimore, J., Rayner, N.A., Rosenhagen, G. and Smith, S.R. (2017), ICOADS Release 3.0: a major update to the historical marine climate record. Int. J. Climatol., 37: 2211-2232. doi:10.1002/joc.4775