The pre-processing steps (orange blocks in the [Workflow](workflow)) improve the [homogeneity](Workflow/Processing-of-IDs#homogenisation) of the `id` information across and within the different `dck`'s, correct some pervasive miscoding in the time information, and append to each report the information required for duplicate selection, including an estimate of the precision of the elements of the data record, the number of extant elements (after application of [quality control](Workflow/quality-control)), and the priority assigned to the `dck`. Potential report pairs are identified based on position and time, with some constraints on the closeness of variables within the report.

The identification and pairing of duplicates happen in the following stages of the [Workflow](workflow):

1. [`simple_dup.R`](https://git.noc.ac.uk/brecinosrivas/icoads-r-hostace/-/blob/master/rscripts/simple_dup.R)
2. [`get_pairs.R`](https://git.noc.ac.uk/brecinosrivas/icoads-r-hostace/-/blob/master/rscripts/get_pairs.R)
3. [`get_dups.R`](https://git.noc.ac.uk/brecinosrivas/icoads-r-hostace/-/blob/master/rscripts/get_dups.R)
4. [`merge_ids_year.R`](https://git.noc.ac.uk/brecinosrivas/icoads-r-hostace/-/blob/master/rscripts/merge_ids_year.R)

First stage
-----------

The first stage identifies duplicate records between the **ship data** and data taken by **different platform types** (e.g. DRIFT, PLAT). This is done in [`simple_dup.R`](https://git.noc.ac.uk/brecinosrivas/icoads-r-hostace/-/blob/master/rscripts/simple_dup.R). The code considers two records to be duplicates if they show a full match in date, time and position.

Second stage
------------

The second stage identifies duplicate records within the ship data and pairs reports that have associated ship `id`'s. The candidate pairs are selected according to i) the number of matching elements (similar content of variables within a specific tolerance), ii) the `dck`'s, and iii) a comparison of the `id`'s (a minimal sketch of the exact and tolerance-based matching is given after the fourth stage). For more information on the criteria used in [`get_pairs.R`](https://git.noc.ac.uk/brecinosrivas/icoads-r-hostace/-/blob/master/rscripts/get_pairs.R) to consider two records a duplicate pair, see Tables 7 and 8 of the [technical report](https://git.noc.ac.uk/brecinosrivas/icoads-r-hostace/-/blob/master/docs/C3S_D311a_Lot2.dup_doc_v3.pdf).

Third stage
-----------

At this stage we are able to count the number of duplicated records and flag the best one according to the [quality control criteria](Workflow/quality-control). The duplicate pairs are also combined into groups, and each group of possible duplicates is then assessed for quality control. This step is important to account for known differences between `dck`'s that are not captured in the precision information from the previous processing stages.

Fourth stage
------------

Once the date/time/location/parameter-value duplicates have been identified and flagged, the next stage of the processing considers together the data that have associated `id`'s. Sometimes the link between `id`'s can be used to homogenise the `id`'s beyond the individual pairs; sometimes the link is specific to a particular pair of reports, particularly if one of the matched `id`'s is generic. `id` matches are therefore only considered within-group. At the end of the processing the suffix “_gN” is appended to the `id`'s, where N is the group number (see the grouping sketch below).
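As an illustration of the first two stages, the sketch below pairs reports by an exact match on date, time and position, and then screens a candidate pair by counting elements that agree within a tolerance. The data frame layout, column names (`yr`, `mo`, `dy`, `hr`, `lat`, `lon`, `at`, `sst`, `slp`) and tolerance values are assumptions made for illustration, not the actual structures used in `simple_dup.R` or `get_pairs.R`.

```r
# Sketch only: column names and tolerances are illustrative assumptions,
# not the actual structures used in simple_dup.R / get_pairs.R.

# First stage: exact match on date, time and position between ship
# reports and reports from another platform type.
find_simple_dups <- function(ship, other) {
  keys <- c("yr", "mo", "dy", "hr", "lat", "lon")
  merge(ship, other, by = keys, suffixes = c(".ship", ".other"))
}

# Second stage screening: count the elements of a candidate pair whose
# values agree within a per-variable tolerance.
n_matching_elements <- function(rep1, rep2,
                                tol = c(at = 0.5, sst = 0.5, slp = 1.0)) {
  agree <- vapply(names(tol), function(v) {
    x <- rep1[[v]]
    y <- rep2[[v]]
    !is.na(x) && !is.na(y) && abs(x - y) <= tol[[v]]
  }, logical(1))
  sum(agree)
}
```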
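The grouping of pairs and the `id` suffixing can be sketched by treating the duplicate pairs as edges of a graph and taking connected components as the groups. The use of the `igraph` package and the column names `rep1`/`rep2` are assumptions for illustration, not necessarily how `get_dups.R` and `merge_ids_year.R` implement it.

```r
library(igraph)

# Sketch only: 'pairs' is assumed to be a data frame with one duplicate
# pair per row (columns rep1, rep2 holding report id's).
group_duplicates <- function(pairs) {
  g <- graph_from_data_frame(pairs[, c("rep1", "rep2")], directed = FALSE)
  membership <- components(g)$membership
  # Append the "_gN" suffix, where N is the group number.
  data.frame(id        = names(membership),
             id_gN     = paste0(names(membership), "_g", membership),
             group     = as.integer(membership),
             row.names = NULL)
}

pairs <- data.frame(rep1 = c("A", "B", "D"),
                    rep2 = c("B", "C", "E"))
group_duplicates(pairs)
#> A, B and C receive "_g1"; D and E receive "_g2"
```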
More information on the group assignments by `dck` and `id` can be found in Tables 15 and 16 of the [technical report](https://git.noc.ac.uk/brecinosrivas/icoads-r-hostace/-/blob/master/docs/C3S_D311a_Lot2.dup_doc_v3.pdf). The linked `id`'s are then checked using the [MOQC track check](Workflow/Quality-control#met-office-track-check) and checked for time duplicates. Reports that fail the track check are flagged as worst duplicates. Where positions are similar, the best duplicate is selected by `dck` priority and by the number of elements with matching variable content.
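The final selection of the best duplicate within a group might look like the sketch below, ordering first by `dck` priority and then by the number of elements. The column names (`dck_priority`, `n_elements`) and the flag values are illustrative assumptions.

```r
# Sketch only: 'group' is assumed to be a data frame of reports in one
# duplicate group, with a dck priority (lower = preferred) and a count
# of elements that passed quality control.
best_duplicate <- function(group) {
  ord <- order(group$dck_priority, -group$n_elements)
  group$dup_flag <- "worst"
  group$dup_flag[ord[1]] <- "best"
  group
}
```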