Matching criteria · Changes

changed match criteria authored Jun 10, 2020 by bearecinos
Showing with 11 additions and 63 deletions
Workflow/Matching-criteria.md
View page @ f61f0168
A flag indicating whether an `id` match is allowed is added to each report by [INSERT-THE-LINK-TO-SCRIPT](). Generic `id`'s (e.g. blank, "SHIP", "MASKSTID") are allowed to match within a `dck`. The table below contains the information used to decide whether the `id`'s in a pair are an allowed match. *Italics* in the table below represent the *“`id` type”*.
A flag indicating whether an `id` match is allowed is added to each report by [`new_add_match_id.R`](https://git.noc.ac.uk/brecinosrivas/icoads-r-hostace/-/blob/master/rutils/new_add_match_id.R). Generic `id`'s (e.g. blank, "SHIP", "MASKSTID") are allowed to match within a `dck`. Table 8 of the [technical document]() contains the information used to decide whether `id`'s in a pair are allowed to match.
________________
DCK | ID
:----- |:------------
Within any `dck` | blank, SHIP, MASKSTID
116, 117, 218 | any `id` to blank
150, 151, 152, 155, 156, 192, 193, 215, 720, 901 | any `id` to blank
128, 254, 720 | any `id` to blank
187, 196, 197, 229, 230, 720, 732 | any `id` to blank
227, 246, 732 | any `id` to blank
761, 898 | any `id` to blank
204, 245 | any `id` to blank
230, 254 | any `id` to blank
128, 230 | any `id` to blank
195, 281 | any `id` to blank
192, 193, 194, 201, 202, 706, 732 | any `id` to blank
194, 201, 202, 203, 207, 221, 223, 227, 233, 239, 254, 926 | substring; <br><br> or 1 digit `dck` 194 `id`
254, 926 | 2-5 characters of 254 match 3-6 characters of 926
194, 927 | "7- " in 194 with "00" in 227
194 with 207 or 227 | 3-6 characters of 194 match 2-5 characters of 207 or 227
194 with 194, 201, 202, 203, 207, 227 | DL = 1 or substring of length at least 3 and number of occurrences of one of the `id`'s = 1
194 with 201, 203, 207, 227 | substring
194, 201 | DL &lt;= 2 and one of the `id`'s classed as invalid
184, 209 | characters 5-8 of 184 with 2-5 of 209
555, 733 | add "N" at start of 555 <br><br> match of characters 2-4 <br><br> 555 `id` is SHIP
733 with 849, 888 | 849, 888 `id` is SHIP
733 with 888, 892 | 888, 892 `id` has DL &lt;= 1 with ROBB <br><br> 888, 892 `id` is EMIO, UYAJ, UFRE
186, 733 | 186 has 4 digit `id`, 733 is *north_pole_station*
750, 888 | 888 `id` is SHIP
781 with 128, 735, 849, 888, 926, 927| 781 is AAAA with callsign
927 with 230, 720 | 927 `id` is SHIP
213, 902 | 213 is characters 4-8 of 902
926 with 888, 892 | 888, 892 is characters 4-8 of 926
892, 926 | 892 is characters 1-4 of 926
117 | any match to *invalid* `id`
116, 117 | *id_over_X* to *id_minus*, match of characters 2-4 <br><br> characters 3-4 of 116 with 117 and 116 is *osv_onstation* <br><br> match of characters 1-3 and 116 is *osv_onstation* <br><br> match characters 2-3 of 116 with 1-2 of 117 and 116 is *osv_noship* <br><br> match of characters 1-4 and 116 is *other* <br><br> match of - at start of 116 with 5 at start of 117 <br><br> prepend 5 to 117 <br><br> prepend - to 116 <br><br> 1 digit `id` in each <br><br> match of start of 116 with 2 character 117 <br><br> within or between 116 and 117 when DL &lt;=2 when one `id` has 3 or fewer occurrences <br><br> 116 missing to extant 117 <br><br> 116 is osv_onstation and characters 3-4 are 00 with 117 `id` of length 4 <br><br> substring, one `id` has &lt;= 4 occurrences, the other &gt;= 10 occurrences
117 | prepend "-" to one of the `id`'s <br><br> DL = 1 if 3 or fewer occurrences of one `id`
116, 116| 22014, 22004
116, 226| *osv_noship* to *ows_logbook*
117, 218| prepend 0 to 117 and 218 is *us_ows_folio* <br><br> characters 1-3 of 117 with 2-4 of 218 is *us_ows_folio*
117, 128 | both 4 characters in length and match of characters 1-2 in 117 with 3-4 in 128
192, 215, 720 | match blank `id` <br><br> match characters 1-4 with 4 character `id` <br><br> allow letter as 5th character in 192 in 8 character `id` <br><br> one `id` *invalid* and not *id_5digit_pership* and not containing 0000 and DL&lt;=2 or substring <br><br> DL&lt;=2 and one `id` has &lt;= 3 occurrences and other has &gt;= 8
192, 215, 254, 720 | one is 5 character `id`, the other is not
246 | "PQP PTMNI" to "PORQUOIP"
762 | 2617A to 26174
128, 233, 254, 255, 555, 700, 708, 709, 732, 735, 749, 781, 792, 849, 874, 875, 888, 889, 892, 926, 927, 992, 993, 995, 999 (hereafter "call.dcks") | subset<br><br> one is invalid and DL &lt;= 2 <br><br> DL &lt;= 2 and one has a single occurrence and the other at least 3 <br><br> one has a single occurrence and the other at least 20
call.dcks, 850 | SHIP, MASKSTID or AAAA to anything
call.dcks, 896 | SHIP to OWS
128, 555 | 128 platform type = 3 and 555 `id` starts 4Y
128, 230 with 555| ship number to call sign if at least 3 occurrences
128, 230 with 555, 720| matches with blank `id`
Any| match when 0 replaced with O; 0/O <br><br> I/J <br><br> UU/VV <br><br> U/V with DL=1 <br><br> WZC/WCZ
892, 896| replace C7O/C7 <br><br> MQR/C7R
992| replace XP42/MP42
700 with 792, 992| BBXX removed from start of `id`
711, 201| &gt; 3 occurrences
720, 734| &gt; 3 occurrences
246, 720| TERRANOVA to `id` starting 610426
193 with 705, 706, 707 | 705, 706, 707 starting NL or DN
118, 762 with 705, 706, 707 | 705, 706, 707 starting JP
203 with 705, 706, 707 | 705, 706, 707 starting UK
705, 706, 707 | with matching characters 1-2 of original `id`
703, 927| 927 `id` starts 05 and has &gt;=5 occurrences
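
Two of the simpler rule types in the table, the generic-`id` check and the character-range comparisons (for example, characters 2-5 of a `dck` 254 `id` against characters 3-6 of a `dck` 926 `id`), could be sketched as follows. This is an illustrative Python sketch, not the project's R implementation: the function names, the treatment of whitespace-only `id`'s as blank, and the example `id`'s are all assumptions.

```python
# Illustrative sketch only; names and example ids are hypothetical.

# Generic ids named in the text: blank, SHIP, MASKSTID.
GENERIC_IDS = {"", "SHIP", "MASKSTID"}


def is_generic(id_value: str) -> bool:
    """Return True for a blank or generic id.

    Assumption: an id made up only of whitespace counts as blank.
    """
    return id_value.strip() in GENERIC_IDS


def chars(id_value: str, first: int, last: int) -> str:
    """Return characters first..last of an id, counted from 1, inclusive."""
    return id_value[first - 1:last]


def range_match(id_a: str, a_range: tuple, id_b: str, b_range: tuple) -> bool:
    """Return True if a character range of id_a equals a range of id_b."""
    part_a = chars(id_a, *a_range)
    part_b = chars(id_b, *b_range)
    # Require a full-length slice so that a short id cannot match
    # merely because the slice ran off the end of the string.
    expected = a_range[1] - a_range[0] + 1
    return len(part_a) == expected and part_a == part_b


# Hypothetical ids: characters 2-5 of the first against 3-6 of the second.
print(range_match("X1234", (2, 5), "AB1234", (3, 6)))
print(is_generic("SHIP"), is_generic("WZC"))
```

Keeping the character positions 1-based and inclusive mirrors the way the ranges are written in the table, which makes each rule easier to check against its row.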
These criteria have been developed by inspection of the
paired `id`'s and are therefore likely to be approximate.
Damerau–Levenshtein (DL) distance is the minimum number of insertions, deletions, substitutions and transpositions of adjacent characters needed to convert one string into another ([Van der Loo M, 2014](https://journal.r-project.org/archive/2014-1/loo.pdf)). A substring match is one where one `id` is contained within the other. Italics denote the “`id` type”.
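
As a rough illustration of these two measures, the sketch below computes an edit distance and a substring test in Python. It is not the project's code, which relies on the R `stringdist` package. The function names and the `min_length` parameter are assumptions, and the distance shown is the restricted "optimal string alignment" variant, which can differ from the unrestricted Damerau–Levenshtein distance for some inputs.

```python
# Illustrative sketch only; function names are hypothetical.


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions and transpositions of
    adjacent characters, with no substring edited more than once.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or exact match)
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def is_substring(id_a: str, id_b: str, min_length: int = 1) -> bool:
    """Return True if the shorter id is contained within the longer one.

    min_length is a hypothetical guard for rules that require a
    substring of a given minimum length.
    """
    shorter, longer = sorted((id_a, id_b), key=len)
    return len(shorter) >= min_length and shorter in longer


print(osa_distance("WZC", "WCZ"))      # one transposition -> 1
print(osa_distance("2617A", "26174"))  # one substitution -> 1
```

Both example pairs appear in the table, and each is a single edit apart.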
References
-----
[Van der Loo M (2014)](https://journal.r-project.org/archive/2014-1/loo.pdf). The stringdist package for approximate string matching. The R Journal, 6, 111-122. https://CRAN.R-project.org/package=stringdist.