Workflow

Data input

The following data is required to run the scripts in this repository:

ICOADS v3.0 (Freeman et al., 2017).
Metadata from WMO Publication 47 (Kent et al., 2007).
CLIWOC logbook IDs
Inventory of ship names in the US Maury Collection
generate_id (needs description)
Precision criteria file. An estimate of the precision of each key variable (e.g. sst, lat, lon) per dck, yr and/or sid. These precision criteria are required to set tolerances when comparing variables from ICOADS (see the list of ICOADS variables used in this repository). Comparing variables allows reports to be matched in the duplicate identification procedure.
JSON files containing the ITU callsign prefixes associated with each country.
seq IDS. (needs description)
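
To illustrate how a precision criteria file could drive the comparison of variables, the sketch below looks up a precision per variable and deck and treats two values as matching when they differ by no more than the coarser of the two precisions. It is written in Python purely for illustration (the scripts in this repository are in R). The table layout, the placeholder deck numbers 1 and 2, the precision values, the fallback value and the function names `tolerance_for` and `values_match` are all assumptions, not the repository's actual file format or API.

```python
# Hypothetical precision table: (variable, deck) -> precision in the variable's units.
# Real values would come from the precision criteria file; these are invented.
PRECISION = {
    ("sst", 1): 0.1,
    ("sst", 2): 0.5,
    ("lat", 1): 0.1,
    ("lat", 2): 1.0,
}

DEFAULT_PRECISION = 1.0  # assumed fallback when the table has no entry


def tolerance_for(variable, dck_a, dck_b):
    """Use the coarser precision of the two decks as the comparison tolerance."""
    prec_a = PRECISION.get((variable, dck_a), DEFAULT_PRECISION)
    prec_b = PRECISION.get((variable, dck_b), DEFAULT_PRECISION)
    return max(prec_a, prec_b)


def values_match(variable, value_a, dck_a, value_b, dck_b):
    """Two values match if both are present and differ by at most the tolerance."""
    if value_a is None or value_b is None:
        return False  # assumption: a missing value never counts as a match
    tol = tolerance_for(variable, dck_a, dck_b)
    # A small epsilon guards against floating-point rounding at the boundary.
    return abs(value_a - value_b) <= tol + 1e-9
```

Under these assumed values, two sst readings of 15.2 and 15.5 would match when one report comes from the coarser deck 2, but not when both come from deck 1.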

Processing stages

The diagram below is a summary of the data processing workflow followed by the shell scripts defined in scr. Each block represents a main task performed by one script in rscripts. The corresponding .R file name is shown in grey between each pair of blocks. For more information on each .R script, see the API reference page.

Orange blocks represent pre-processing tasks applied to the ICOADS data in order to:

Select data taken only by commercial ships, excluding specialist ship data sources such as research vessels (for more information, see the selection criteria).
Pre-process IDs to improve duplicate identification and the linking of IDs between each pair of duplicate reports.
Perform quality control on the data to identify the best duplicate.
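
One way to picture how a best duplicate might be chosen within a group of matched reports is to rank the reports and keep the top one. The Python sketch below is illustrative only (the repository's scripts are in R). The record fields (`uid`, `dck`, `track_ok`, `values`), the `deck_priority` mapping and the order in which the criteria are applied are assumptions, not the procedure actually implemented in the repository.

```python
def rank_key(report, deck_priority):
    """Sort key: track-check pass first, then preferred deck, then completeness."""
    n_present = sum(1 for v in report["values"].values() if v is not None)
    return (
        0 if report["track_ok"] else 1,                          # failed track check sorts last
        deck_priority.get(report["dck"], len(deck_priority)),    # unknown decks rank lowest
        -n_present,                                              # more observed values is better
    )


def flag_best_and_worst(group, deck_priority):
    """Return (uid of the best report, list of uids flagged as worse)."""
    ranked = sorted(group, key=lambda r: rank_key(r, deck_priority))
    return ranked[0]["uid"], [r["uid"] for r in ranked[1:]]
```

Here `deck_priority` maps a deck number to a rank (lower is better), so the relative quality of decks stays in one place and is not scattered through the comparison logic.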

The rest of the blocks represent processing scripts that concentrate on duplicate identification and the matching of report IDs.

More details on the data processing can be found in this technical report.

graph TB
A1[rscripts]

 id1[(ICOADS v3.0)] --> |split_by_type.R|id2[Separate records according <br> to the different platform types.]
 id2 --> |simple_dup.R|id3[Check for cross-type duplicates between <br> ship data and the different platform types. <br> Considers the records as duplicates if they <br> show matching date, time & position, <br> with DCK and ID specific selection criteria.]
 id3 --> |ship2plat.R|id4[Exclude non-ship data identified in <br> cross-type duplicate analysis.]
 id4 --> id5[(ICOADS SHIP data)]
 id5 --> |process_ships.R|id6[Reformat selected ship IDs to homogenize <br> information between DCKs. <br> Uses IDs from Pub. 47 metadata <br> in ID prioritisation. <br> Corrects dates & times.]
 id6 --> |get_pairs.R|id7[Groups ship reports as potential duplicates <br> if the contents match within tolerance]
 id7 --> |get_dups.R|id8[Assesses the groups of potential duplicates, <br> accepting those where the ID match is appropriate.<br> Reports from DCK that are of lower quality, <br> or that are less complete, or that fail the <br> track check are flagged as the worst.]
 id8 --> |merge_ids_year.R|id9[Assesses IDs that have been associated <br> in previous processing to decide whether to replace <br> all IDs in the associated group with the preferred ID.]
 id9 --> |nrt_dup.R|id10[Process near-real-time data collected after 2014.]
 id10 --> |clean_data.R|id11[Runs track checking on data to produce <br> clean tracks for all IDs.]
 id11 --> |clean2track.R|id12[Selects data for ship-tracking software Carella et al. 2017,<br> choosing only data with missing or generic IDs.]
 id12 --> id13[(Output data)]

classDef pre-processing fill:#fcc679,stroke:#333,stroke-width:1px,font-size:16px,font-weight:100,text-align:center
classDef scripts fill:#8C929D,stroke:#333,stroke-width:1px,font-size:16px,font-weight:100,text-align:center
classDef rest fill:#e8eaf6,stroke:#333,stroke-width:1px,font-size:16px,font-weight:100,text-align:center
class id2,id3,id4 pre-processing;
class A1,id1,id5,id13 scripts;
class id6,id7,id8,id9,id10,id11,id12 rest;
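
The cross-type duplicate check described in the diagram (matching date, time and position between ship data and other platform types) can be sketched as a grouping on a shared key. The Python example below is a minimal illustration only (the repository's scripts are in R). The field names (`platform`, `date`, `hour`, `lat`, `lon`) and the function name `cross_type_duplicates` are hypothetical, exact equality stands in for any tolerance, and the DCK and ID specific selection criteria are omitted.

```python
from collections import defaultdict


def cross_type_duplicates(records):
    """Return indices of ship records that share date, time and position
    with at least one record from a different platform type."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        key = (rec["date"], rec["hour"], rec["lat"], rec["lon"])
        groups[key].append(idx)

    flagged = []
    for indices in groups.values():
        types = {records[i]["platform"] for i in indices}
        # Only groups that mix ship data with another platform type qualify.
        if "ship" in types and len(types) > 1:
            flagged.extend(i for i in indices if records[i]["platform"] == "ship")
    return sorted(flagged)
```

The function only reports which ship records coincide with a non-ship record. What is then done with them (for example, exclusion from the ship data) is not modelled here.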

Output data

Pending