Duplicate identification

The pre-processing steps (orange blocks in the Workflow) improve the homogeneity of the id information across and within the different dck's, correct some pervasive miscoding in time information, and append to each report the information required for duplicate selection: an estimate of the precision of the elements of the data record, the number of extant elements (after application of quality control), and the priority assigned to the dck. Potential report pairs are then identified based on position and time, with some constraints on the closeness of variables within the reports.

The identification and pairing of duplicates happens in the following stages of the Workflow.

First stage

The first stage identifies duplicate records between the ship data and data taken by other platform types (e.g. DRIFT, PLAT). This is done in simple_dup.R. The code treats records as duplicates if they match fully in date, time and position.
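As an illustration of the full-match rule, here is a minimal Python sketch. It is not the code in simple_dup.R; the function name `find_exact_duplicates` and the field names `date`, `hour`, `lat`, `lon` and `platform` are assumptions made for this example only.

```python
from collections import defaultdict


def find_exact_duplicates(ship_reports, other_reports):
    """Return (ship_index, other_index) pairs matching exactly on date, time and position.

    The field names used here are hypothetical, chosen for illustration.
    """
    def key(report):
        return (report["date"], report["hour"], report["lat"], report["lon"])

    # Index the non-ship reports by their date/time/position key.
    index = defaultdict(list)
    for j, report in enumerate(other_reports):
        index[key(report)].append(j)

    # Look up every ship report in that index.
    pairs = []
    for i, report in enumerate(ship_reports):
        for j in index.get(key(report), []):
            pairs.append((i, j))
    return pairs


ships = [
    {"date": "1950-01-01", "hour": 12.0, "lat": 50.1, "lon": -20.3},
    {"date": "1950-01-01", "hour": 18.0, "lat": 50.4, "lon": -20.9},
]
others = [
    {"date": "1950-01-01", "hour": 12.0, "lat": 50.1, "lon": -20.3, "platform": "DRIFT"},
    {"date": "1950-01-02", "hour": 12.0, "lat": 10.0, "lon": 5.0, "platform": "PLAT"},
]
print(find_exact_duplicates(ships, others))
```

Indexing one set of reports by key avoids comparing every ship report with every non-ship report, which matters when the number of reports is large.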

Second stage

The second stage identifies duplicate records within the ship data. Reports are paired as duplicates if they have associated ship id's. The candidate pairs are selected according to i) the number of matching elements (variables with similar content, within a specific tolerance), ii) the dck's, and iii) a comparison of the id's.

For more information on the criteria used in new_get_pairs.R to select a pair of records as duplicates, see Tables 7 and 8 of the technical report.
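Criterion i), the number of matching elements, can be sketched roughly as follows. This is a hedged Python illustration and not the logic of new_get_pairs.R: the element names, the tolerance values and the `min_matches` threshold are all invented for the example, and the actual criteria are those given in Tables 7 and 8 of the technical report.

```python
# Hypothetical tolerances per element, for illustration only.
# The real selection criteria are in Tables 7 and 8 of the technical report.
TOLERANCE = {"sst": 0.5, "at": 0.5, "slp": 1.0}


def count_matching_elements(a, b, tolerance=TOLERANCE):
    """Count the elements present in both reports that agree within tolerance."""
    n = 0
    for name, tol in tolerance.items():
        va, vb = a.get(name), b.get(name)
        if va is None or vb is None:
            continue  # element missing in at least one report, so it cannot match
        if abs(va - vb) <= tol:
            n += 1
    return n


def is_candidate_pair(a, b, min_matches=2):
    """Assumed rule: a pair is a candidate if enough elements match."""
    return count_matching_elements(a, b) >= min_matches


report_a = {"sst": 15.0, "at": 14.2, "slp": 1012.0}
report_b = {"sst": 15.3, "at": 14.2, "slp": None}
print(count_matching_elements(report_a, report_b))
print(is_candidate_pair(report_a, report_b))
```

In this sketch a missing element neither counts for nor against a match. Criteria ii) and iii), the dck's and the id comparison, are not covered here.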

Third stage

At this stage we are able to count the number of duplicated records and flag the best according to quality control criteria. The duplicate pairs are also combined into groups. Each group of possible duplicates is then assessed for quality control. This process is important to account for known differences between dck's that are not captured in the precision information of previous processing stages.
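One common way to combine pairs into groups is to treat each pair as a link and collect the connected reports, for example with a union-find structure. The Python sketch below shows that technique under the assumption that reports are identified by integer indices. It is not necessarily how the Workflow builds its groups.

```python
def group_pairs(pairs):
    """Merge duplicate pairs into groups of mutually linked reports (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    # Link the two reports of every pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each group under their root.
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Reports 1, 2 and 3 are linked through two pairs; 7 and 8 form a separate group.
print(group_pairs([(1, 2), (2, 3), (7, 8)]))
```

With this approach, reports 1 and 3 end up in the same group even though they were never paired directly, which is why each group then needs its own quality control assessment.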

Fourth stage

Once the date/time/location parameter value duplicates have been identified and flagged, the next stage in the processing considers together the data that have associated id's. Sometimes the link between id's can be used to homogenise the id's beyond the individual pairs; sometimes the link is specific to a particular pair of reports, particularly if one of the matched id's is generic. Matches between id's are therefore only considered within-group. At the end of the processing the suffix “_gN” is appended to the id's, where N is the group number. More information on the group assignments by dck and id can be found in Tables 15 and 16 of the technical report.
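The suffix step can be pictured with the short Python sketch below. The function name `append_group_suffix` and the two dictionaries are hypothetical. The sketch also assumes that a report without a group keeps its id unchanged, which this page does not state.

```python
def append_group_suffix(report_ids, group_of):
    """Append "_gN" to each id, where N is the group number of the report.

    report_ids: dict mapping report index -> id string
    group_of:   dict mapping report index -> group number
    Assumption: a report without a group keeps its id unchanged.
    """
    out = {}
    for idx, rid in report_ids.items():
        group = group_of.get(idx)
        out[idx] = f"{rid}_g{group}" if group is not None else rid
    return out


ids = {0: "SHIP", 1: "GBTT", 2: "GBTT"}
groups = {1: 4, 2: 4}
print(append_group_suffix(ids, groups))
```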

The linked id's are then checked using the MOQC track check and checked for time duplicates. Reports that fail the track check are flagged as the worst duplicate. Where positions are similar, the best duplicate is selected by dck priority and by the number of elements with similar variable content.
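The selection rule in the last paragraph can be sketched as below. This Python example rests on several assumptions that are not taken from the Workflow code: the field names are invented, a lower `dck_priority` number is taken to mean a higher priority, ties are broken by the larger element count, and the flag labels "best", "worst" and "duplicate" are placeholders.

```python
def flag_group(reports):
    """Flag the reports of one duplicate group.

    Each report is a dict with the hypothetical keys:
      track_check_failed (bool), dck_priority (int), n_elements (int)
    Assumption: a lower dck_priority number means a higher priority.
    """
    flags = {}
    passed = []
    for i, report in enumerate(reports):
        if report["track_check_failed"]:
            flags[i] = "worst"  # failing the track check overrides everything else
        else:
            passed.append(i)

    if passed:
        # Best report: highest dck priority first, then most matching elements.
        best = min(
            passed,
            key=lambda i: (reports[i]["dck_priority"], -reports[i]["n_elements"]),
        )
        for i in passed:
            flags[i] = "best" if i == best else "duplicate"
    return flags


group = [
    {"track_check_failed": False, "dck_priority": 2, "n_elements": 8},
    {"track_check_failed": False, "dck_priority": 1, "n_elements": 5},
    {"track_check_failed": False, "dck_priority": 1, "n_elements": 7},
    {"track_check_failed": True, "dck_priority": 1, "n_elements": 9},
]
print(flag_group(group))
```

In this example the fourth report has the most elements and the top dck priority, but it is still flagged as worst because it fails the track check.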