This repository is a collection of R scripts that homogenise platform identifier information and identify duplicate observations in the **International Comprehensive Ocean-Atmosphere Data Set** (ICOADS) marine data source.

ICOADS is the world's most extensive collection of surface marine meteorological data. It contains ocean surface and atmospheric observations from the 1600s to the present and continues to receive new data every year. The database is made up of observation reports from many different sources; several hundred combinations of the **DCK** (deck) and **SID** (source ID) flags indicate the origin of the data. Typically, **DCK** indicates the **type of data** (e.g. US Navy ships; Japanese Whaling Fleet) and **SID** provides more information about the data system or format (e.g. data stream extracted from the WMO Global Telecommunication System, GTS).

Sometimes a single DCK is associated with a single SID; sometimes a single DCK contains several SIDs, and vice versa. This leads to a number of duplicated entries of meteorological observations.

Historically, archives of marine data were maintained by individual nations and were often shared, so the same observations appear in the archives of several nations. Truncated formats often did not contain sufficient information to identify the observations made by a particular ship or platform, and these compact formats sometimes converted or encoded data in different ways. For example, many observations have no identifier linking them to the ship (**ID**) or platform (**pt**), and where such identifiers exist they may differ between data sources. The main types of duplicates are:

* Observations historically shared among national archives, likely to have different formats, precision, conversions and metadata.
* Re-ingestion of the same data more than once.
* Planned redundancy, for example the ingestion of several near-real-time data streams.

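As a rough illustration of how reports shared between archives at different precision might be flagged, the sketch below matches toy reports on date, hour and rounded position. It is not the method used by the scripts in this repository: the column names (`yr`, `mo`, `dy`, `hr`, `lat`, `lon`, `dck`, `sid`, `id`), the helper `candidate_duplicates()` and all values are assumptions made up for the example.

```r
# Toy reports. Column names and values are invented for this sketch and are
# not the repository's actual schema or real ICOADS records.
reports <- data.frame(
  yr  = c(1950, 1950, 1950, 1950),
  mo  = c(1, 1, 1, 1),
  dy  = c(15, 15, 15, 16),
  hr  = c(12, 12, 12, 0),
  lat = c(45.23, 45.2, -10.5, 45.23),
  lon = c(-30.11, -30.1, 120.0, -30.11),
  dck = c(1, 2, 1, 1),
  sid = c(10, 20, 10, 10),
  id  = c("SHIP1", NA, "SHIP2", "SHIP1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build a match key from date, hour and position,
# rounding position to `digits` decimals to absorb precision differences.
candidate_duplicates <- function(df, digits = 1) {
  key <- paste(df$yr, df$mo, df$dy, df$hr,
               round(df$lat, digits), round(df$lon, digits), sep = "_")
  # Flag every report whose key occurs more than once.
  dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
  # Reports sharing a key get the same group number; others get NA.
  df$dup_group <- ifelse(dup, match(key, unique(key)), NA_integer_)
  df
}

flagged <- candidate_duplicates(reports)
print(flagged[, c("dck", "sid", "id", "dup_group")])
```

In this toy data the first two reports fall into the same group even though they carry different DCK and SID values and one of them has no ID, which is the situation the first bullet describes. Exact matching on a rounded key is only a starting point; it would miss pairs that fall on either side of a rounding boundary.
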
A protocol and other tools written in Python already exist to read the data, perform some quality control and identify duplicate observations, as described in [Freeman *et al.* (2017)](https://doi.org/10.1002/joc.4775). However, the processing methods in this repository offer additional quality control on the data, duplicate identification and linking of IDs between each pair of duplicate reports. They also identify the best duplicate by assessing the track (path in lat/lon) of the observation.

References
----------