@@ -19,7 +19,7 @@ Second stage
------------

The second stage identifies duplicate records within the ship data. Two reports are paired as duplicates if they have associated ship `id`'s. The candidate pairs are selected according to i) the number of matching elements (variables whose content is similar within a specific tolerance), ii) the `dck`'s, and iii) a comparison of the `id`'s.

For more information on the selection criteria applied in [`new_get_pairs.R`](https://git.noc.ac.uk/brecinosrivas/icoads-r-hostace/-/blob/master/rscripts/new_get_pairs.R) to treat two records as a duplicate pair, see Tables 7 and 8 of the [technical report](https://git.noc.ac.uk/brecinosrivas/icoads-r-hostace/-/blob/master/docs/C3S_D311a_Lot2.dup_doc_v3.pdf).
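As a rough illustration of how the three criteria could combine, the Python sketch below pairs reports whose `dck`'s form an allowed combination, whose `id`'s are associated, and which share enough elements within tolerance. It is not a translation of `new_get_pairs.R`: the variable names, the tolerances, the minimum number of matches, the allowed `dck` combinations and the `id` comparison rule are all placeholder assumptions, and the actual criteria are those given in the technical report.

```python
from itertools import combinations

# Hypothetical per-variable tolerances; the real values are in the technical report.
TOLERANCES = {"sst": 0.5, "at": 0.5, "slp": 1.0}
# Hypothetical minimum number of matching elements for a candidate pair.
MIN_MATCHES = 2
# Hypothetical deck combinations that may be compared with each other.
ALLOWED_DCK_PAIRS = {frozenset({926, 992})}


def count_matching_elements(a, b, tolerances=TOLERANCES):
    """Count variables present in both reports that agree within tolerance."""
    n = 0
    for var, tol in tolerances.items():
        va, vb = a.get(var), b.get(var)
        if va is not None and vb is not None and abs(va - vb) <= tol:
            n += 1
    return n


def ids_associated(id_a, id_b):
    """Placeholder id comparison: equal after trimming and upper-casing."""
    return id_a.strip().upper() == id_b.strip().upper()


def candidate_pairs(reports, min_matches=MIN_MATCHES, allowed=ALLOWED_DCK_PAIRS):
    """Return index pairs (i, j) of reports that satisfy all three criteria."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(reports), 2):
        if frozenset((a["dck"], b["dck"])) not in allowed:
            continue  # ii) deck combination not considered
        if not ids_associated(a["id"], b["id"]):
            continue  # iii) ids are not associated
        if count_matching_elements(a, b) >= min_matches:
            pairs.append((i, j))  # i) enough matching elements
    return pairs
```

Keeping the thresholds and the deck combinations as arguments makes it easy to swap in the documented values without touching the pairing logic.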

Third stage
-----------

@@ -28,6 +28,6 @@ At this stage we are able to count the number of duplicated records and flag the

Fourth stage
------------

Once the date/time/location parameter value duplicates have been identified and flagged, the next stage in the processing considers together the data that have associated `id`'s. Sometimes the link between `id`'s can be used to homogenise the `id`'s beyond the individual pairs; sometimes the link is specific to a particular pair of reports, particularly if one of the matched `id`'s is generic. `id` matches are therefore only considered within-group. At the end of the processing the suffix `_gN` is appended to the `id`'s, where N is the group number. More information on the group assignments by `dck` and `id` can be found in Tables 15 and 16 of the [technical report](https://git.noc.ac.uk/brecinosrivas/icoads-r-hostace/-/blob/master/docs/C3S_D311a_Lot2.dup_doc_v3.pdf).
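The `_gN` suffix and the within-group restriction can be pictured with the short Python sketch below. The function names and the exact-match rule are hypothetical and only illustrate the idea; the real group assignments are the ones documented in the technical report.

```python
def add_group_suffix(ship_id: str, group: int) -> str:
    """Append the group suffix to an id, e.g. group 3 gives 'ABCD_g3'."""
    return f"{ship_id}_g{group}"


def ids_match_within_group(id_a: str, group_a: int, id_b: str, group_b: int) -> bool:
    """Compare two ids only when both reports belong to the same group.

    Assumption: an exact string match stands in for the real id comparison.
    """
    return group_a == group_b and id_a == id_b
```

With this shape, the same generic `id` appearing in two different groups is never treated as a match.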

The linked `id`'s are then checked using the [MOQC track check](Workflow/Quality-control#met-office-track-check) and for time duplicates. Reports that fail the track check are flagged as worst duplicates. Where positions are similar, the best duplicate is selected by `dck` priority and by the number of elements with similar variable content.
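A minimal Python sketch of this selection is shown below, assuming each report in a set of linked duplicates already carries its `dck`, a track check result and an element count. The field names, the flag labels and the `dck` priority order are hypothetical; only the ordering logic (track check first, then `dck` priority, then number of elements) follows the description above.

```python
# Hypothetical deck priority: earlier in the list means higher priority.
DCK_PRIORITY = [992, 926, 700]


def flag_duplicates(group, dck_priority=DCK_PRIORITY):
    """Return one flag per report: 'best', 'duplicate' or 'worst'."""
    rank = {dck: i for i, dck in enumerate(dck_priority)}
    # Reports that fail the track check are flagged as worst duplicates.
    flags = ["worst" if r["track_check_failed"] else "duplicate" for r in group]
    passed = [i for i, r in enumerate(group) if not r["track_check_failed"]]
    if passed:
        # Best: highest dck priority, ties broken by the larger element count.
        best = min(
            passed,
            key=lambda i: (rank.get(group[i]["dck"], len(rank)), -group[i]["n_elements"]),
        )
        flags[best] = "best"
    return flags
```

Decks missing from the priority list are ranked last here, which is a design choice of the sketch rather than documented behaviour.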
\ No newline at end of file