This repository is a collection of R-scripts to homogenise platform
identifier information and to identify duplicate observations in the
International Comprehensive Ocean-Atmosphere Data Set (ICOADS) marine data source. Text in `this format`
denotes an ICOADS variable name (see the API reference for information on the variables).
ICOADS is the world's most extensive surface marine meteorological data collection.
It contains ocean surface and atmospheric observations from the late 1600s
to the present and is updated every month with observations from near-real-time data streams.
The database is made up of observation reports from many different sources:
there are several hundred combinations of the `dck` (deck) and `sid` (source)
flags that indicate the origin of the data.
Typically, `dck` indicates the type of data
(e.g. US Navy ships; Japanese Whaling Fleet) and `sid` provides more information
about the data system or format
(e.g. data stream extracted from the WMO Global Telecommunication System, GTS).
Sometimes a single `dck` is associated with a single `sid`,
sometimes a single `dck` contains several `sid`, and vice versa.
Not all of the `dck` and `sid` are independent,
so there can be duplicated reports, which need to be identified and flagged.
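
As a rough illustration of the point above, the following base R sketch lists which `dck` and `sid` combinations occur in a set of reports and how many sources feed each deck. The data frame `reports` and its code values are made up for illustration; they are not real ICOADS deck or source codes, and this is not taken from the repository's scripts.

```r
# Hypothetical toy records: the dck and sid values are invented,
# and real ICOADS reports carry many more fields.
reports <- data.frame(
  dck = c(1, 1, 2, 2, 3),
  sid = c(10, 10, 20, 21, 10)
)

# Distinct dck/sid combinations present in the data.
combos <- unique(reports[, c("dck", "sid")])

# Number of distinct sid values per dck; values above 1 mark decks
# that contain reports from several sources.
sid_per_dck <- tapply(reports$sid, reports$dck, function(x) length(unique(x)))

# Number of distinct dck values per sid; values above 1 mark sources
# that contribute to several decks.
dck_per_sid <- tapply(reports$dck, reports$sid, function(x) length(unique(x)))

print(combos)
print(sid_per_dck)
print(dck_per_sid)
```

Decks and sources that appear in more than one combination are the natural places to start looking for overlapping reports.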
Historically, archives of marine data have been maintained by individual nations,
and often these were shared so that the same observations appear in the archives
of several nations. Truncated formats often did not contain sufficient information
to identify the observations made by a particular ship or platform,
and these compact formats sometimes converted or encoded data in different ways.
For example, many observations do not have an identifier linking to the ship
(`id`) or platform (`pt`), and for those that do have such identifiers,
the identifiers may differ between data sources. The main types of duplicates are:
- Observations historically shared among national archives, likely to have different formats, precision, conversions and metadata.
- Re-ingestion of the same data more than once.
- Data from near-real-time sources that can be replaced with higher-quality delayed-mode data.
- Re-digitisation of logbooks, where the newer data are likely to have higher precision, more metadata, etc.
- Planned redundancy, for example the ingestion of several near-real-time data streams.
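
Because shared copies of the same observation can differ in precision, an exact comparison of fields will miss many duplicates. The sketch below shows one simple way to find candidate duplicates in base R, by matching on date, hour and position rounded to a coarser precision. It is only an illustration under stated assumptions: the column names (`yr`, `mo`, `dy`, `hr`, `lat`, `lon`), the 0.1 degree tolerance and the toy values are assumptions, and this is not presented as the method used by the scripts in this repository.

```r
# Hypothetical toy reports; the uid values and observations are invented.
reports <- data.frame(
  uid = c("A00001", "B00002", "C00003", "D00004"),
  yr  = c(1950, 1950, 1950, 1950),
  mo  = c(6, 6, 6, 6),
  dy  = c(15, 15, 15, 16),
  hr  = c(12, 12, 12, 12),
  lat = c(45.23, 45.2, -10.0, 45.23),
  lon = c(-30.48, -30.5, 100.0, -30.48),
  stringsAsFactors = FALSE
)

# Build a matching key from date, hour and position expressed in whole
# tenths of a degree (the 0.1 degree tolerance is an assumption).
key <- with(reports,
            paste(yr, mo, dy, hr, round(lat * 10), round(lon * 10), sep = "_"))

# A report is a candidate duplicate if any other report shares its key.
reports$dup_candidate <- duplicated(key) | duplicated(key, fromLast = TRUE)

print(reports[, c("uid", "dup_candidate")])
```

Here the first two reports match after rounding, while the fourth differs by a day and is left unflagged. Key-based rounding misses pairs that fall on either side of a rounding boundary, so a fuller treatment would compare fields within a tolerance instead.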
The processing software used by ICOADS (https://icoads.noaa.gov/software/) is written in FORTRAN and includes code to translate data to the IMMA1 format (Smith et al., 2016), to apply quality control (QC) and flags, and to identify (and in earlier releases remove) reports likely to be duplicates (Freeman et al., 2017).
The code in this repository offers additional quality control on the data, homogenisation of ID information between different `dck` and `sid`, and duplicate identification (DI) that preserves information on the reports associated by the DI through the use of ICOADS unique identifiers (`uid`).
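
To make the idea of preserving associations concrete, the following base R sketch keeps every report and records, for each one, the `uid` values of the other reports in the same duplicate group, instead of deleting the duplicates. The grouping column `group`, the output columns `dup_uids` and `dup_flag`, and the toy `uid` values are all hypothetical names chosen for this example; the repository's own output format may differ.

```r
# Hypothetical input: each report already carries a duplicate-group label.
reports <- data.frame(
  uid   = c("A00001", "B00002", "C00003"),
  group = c("g1", "g1", "g2"),
  stringsAsFactors = FALSE
)

# For each report, collect the uids of the other reports in its group.
# A report with no partners gets an empty string.
reports$dup_uids <- vapply(seq_len(nrow(reports)), function(i) {
  same <- reports$group == reports$group[i] & reports$uid != reports$uid[i]
  paste(reports$uid[same], collapse = ",")
}, character(1))

# Flag reports that have at least one associated duplicate (1) or none (0).
reports$dup_flag <- ifelse(nzchar(reports$dup_uids), 1L, 0L)

print(reports)
```

Keeping the links rather than removing reports means a later user can still trace which reports were matched and choose a different one to retain.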
References
Freeman, E., Woodruff, S.D., Worley, S.J., Lubker, S.J., Kent, E.C., Angel, W.E., Berry, D.I., Brohan, P., Eastman, R., Gates, L., Gloeden, W., Ji, Z., Lawrimore, J., Rayner, N.A., Rosenhagen, G. and Smith, S.R. (2017). ICOADS Release 3.0: a major update to the historical marine climate record. Int. J. Climatol., 37: 2211-2232. doi:10.1002/joc.4775
Smith, S.R., Freeman, E., Lubker, S.J., Woodruff, S.D., Worley, S.J., Angel, W.E., Berry, D.I., Brohan, P., Ji, Z. and Kent, E.C. (2016). The International Maritime Meteorological Archive (IMMA) Format. Unpublished document.