{ "cells": [ { "cell_type": "markdown", "id": "nutritional-indie", "metadata": {}, "source": [ "## Generating CLIWOC missing code tables " ] }, { "cell_type": "markdown", "id": "grateful-coalition", "metadata": {}, "source": [ "The Climatological Database for the World's Oceans 1750-1850 ([CLIWOC](https://stvno.github.io/page/cliwoc/)) has valuable information on its supplemental data stored in the [IMMA](https://icoads.noaa.gov/e-doc/imma/R3.0-imma1.pdf) format under the C99 column. \n", "\n", "We have successfully extracted this information with the [mdf_reader()](https://git.noc.ac.uk/brecinosrivas/mdf_reader) tool, but several important variables are missing their code tables. \n", "\n", "List of variables: \n", "\n", "- Ship types\n", "- latitude indicator\n", "- longitude indicator,\n", "- air temperature units\n", "- sst units\n", "- air pressure units\n", "- units of attached thermometer\n", "- longitude units\n", "- Barometer type\n", "\n", "According to the [documentation](https://stvno.github.io/page/cliwoc/) of this deck (730) there are up to 20 different ways of writing down the air pressure but the code tables are not available anymore on the website. Therefore, we extracted from the supplemental data all possible entries for those fields which are missing a code table. We count each entry in order to construct a code table for that particular variable.\n", "\n", "The code to extract multiple variables from the CLIWOC supplemental data can be found [here](https://git.noc.ac.uk/brecinosrivas/mdf_reader/-/blob/master/tests/gather_stats_c99.py)\n", "\n", "\n", "### Set up " ] }, { "cell_type": "code", "execution_count": 1, "id": "described-wallet", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import pandas as pd\n", "import numpy as np\n", "import pickle\n", "from collections import defaultdict\n", "import matplotlib.pyplot as plt\n", "import matplotlib.dates as mdates\n", "import seaborn as sns\n", "# PARAMS for plots\n", "from matplotlib import rcParams\n", "sns.set_style(\"whitegrid\")\n", "rcParams['axes.labelsize'] = 14\n", "rcParams['xtick.labelsize'] = 14\n", "rcParams['ytick.labelsize'] = 14\n", "rcParams['legend.fontsize'] = 14\n", "rcParams['legend.title_fontsize'] = 14" ] }, { "cell_type": "markdown", "id": "ranging-ferry", "metadata": {}, "source": [ "We stored the statistics per year in python pickle dictionaries." ] }, { "cell_type": "code", "execution_count": 2, "id": "favorite-maria", "metadata": {}, "outputs": [], "source": [ "# Paths to data\n", "dirs = '/Users/brivas/c3s_work/mdf_reader/tests/data/133-730/133-730'\n", "file_names = sorted(os.listdir(dirs))" ] }, { "cell_type": "code", "execution_count": 3, "id": "recent-knife", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['1662.pkl', '1663.pkl', '1677.pkl', '1699.pkl', '1745.pkl']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file_names[0:5]" ] }, { "cell_type": "code", "execution_count": 4, "id": "traditional-terminal", "metadata": {}, "outputs": [], "source": [ "def get_values(dic, key, year):\n", " \"\"\"\n", " Get individual sets of values from the pickle df\n", " Params:\n", " ------\n", " dic: python dictionary containing all variables stats per year\n", " key: variable name \n", " year: year to extract\n", " Returns:\n", " --------\n", " indexes: these are the variable types (e.g. 
" series.values: the counts of how many times each category appears\n", " year: array with the year repeated to match indexes\n", " \"\"\"\n", " series = dic[key]\n", " indexes = series.index.values\n", " year = np.repeat(year, len(indexes))\n", " return indexes, series.values, year" ] }, { "cell_type": "code", "execution_count": 5, "id": "forward-context", "metadata": {}, "outputs": [], "source": [ "def extract_year_arrays(path_to_file, key):\n", " \"\"\"\n", " Reads a pickle file and extracts the variable arrays for that year\n", " Params:\n", " -----\n", " path_to_file: path to the pickle file\n", " key: variable to extract\n", " Returns:\n", " --------\n", " data: indexes, counts and year arrays from get_values\n", " \n", " \"\"\"\n", " with open(path_to_file, 'rb') as handle:\n", " base = os.path.basename(path_to_file)\n", " year = os.path.splitext(base)[0]\n", " dic_pickle = pickle.load(handle)\n", " data = get_values(dic_pickle, key, year)\n", " return data" ] }, { "cell_type": "code", "execution_count": 6, "id": "ongoing-franchise", "metadata": {}, "outputs": [], "source": [ "def make_data_frame(list_of_files, main_directory, key):\n", " \"\"\"\n", " Builds a dataframe of per-year counts for one variable\n", " \"\"\"\n", " # Define empty arrays to store the data \n", " years = np.array([])\n", " types_of_var = np.array([])\n", " counts_var = np.array([])\n", " \n", " for file in list_of_files:\n", " full_path = os.path.join(main_directory, file)\n", " var_type, count, year_f = extract_year_arrays(full_path, key)\n", " years = np.concatenate([years, year_f])\n", " types_of_var = np.concatenate([types_of_var, var_type])\n", " counts_var = np.concatenate([counts_var, count])\n", " \n", " dataset = pd.DataFrame({'Year': years, \n", " key: types_of_var, 'Count': counts_var})\n", " \n", " return dataset" ] }, { "cell_type": "code", "execution_count": 7, "id": "sufficient-jacob", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/Users/brivas/c3s_work/mdf_reader/tests/data/133-730/133-730'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dirs" ] }, { "cell_type": "code", "execution_count": 8, "id": "settled-tribune", "metadata": {}, "outputs": [], "source": [ "# List of variable names stored in the pickle files \n", "dic_keys = ['ship_types', \n", " 'lan_inds', # 'lat' was accidentally written as 'lan' in the output data set\n", " 'lon_inds', \n", " 'at_units', \n", " 'sst_units', \n", " 'ap_units', \n", " 'bart_units', \n", " 'lon_units', \n", " 'baro_types']\n", "\n", "df_ships = make_data_frame(file_names, dirs, dic_keys[0]).dropna()\n", "df_lati = make_data_frame(file_names, dirs, dic_keys[1]).dropna()\n", "df_loni = make_data_frame(file_names, dirs, dic_keys[2]).dropna()\n", "df_atu = make_data_frame(file_names, dirs, dic_keys[3]).dropna()\n", "df_sstu = make_data_frame(file_names, dirs, dic_keys[4]).dropna()\n", "df_apu = make_data_frame(file_names, dirs, dic_keys[5]).dropna()\n", "df_bartu = make_data_frame(file_names, dirs, dic_keys[6]).dropna()\n", "df_lonu = make_data_frame(file_names, dirs, dic_keys[7]).dropna()\n", "df_barot = make_data_frame(file_names, dirs, dic_keys[8]).dropna()" ] }, { "cell_type": "markdown", "id": "pressing-discovery", "metadata": {}, "source": [ "- Ship types \n", "\n", "Something might be off in the `mdf_reader()` .json schema file, because some of the ship type strings appear to be cut off; a quick check for this follows below. The plot used in an earlier version of this notebook was also easier to read with less data."
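] }, { "cell_type": "markdown", "id": "truncation-check", "metadata": {}, "source": [ "A quick way to flag possibly truncated entries is to look for strings that exactly fill the field width. This is only a sketch: the 15-character width below is an assumption based on the longest strings printed further down, not a value confirmed from the C99 schema." ] }, { "cell_type": "code", "execution_count": null, "id": "truncation-check-code", "metadata": {}, "outputs": [], "source": [ "# Hypothetical check: if the ship-type field in the C99 schema is 15\n", "# characters wide (an assumption), entries that exactly fill it are\n", "# probably cut off.\n", "FIELD_WIDTH = 15\n", "suspect = df_ships[df_ships.ship_types.str.len() >= FIELD_WIDTH]\n", "sorted(suspect.ship_types.unique())" ] },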
] }, { "cell_type": "code", "execution_count": 9, "id": "occasional-abuse", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GALJOOT\n", "NAVIO\n", "5TH RATE\n", "6TH RATE\n", "FRIGATE\n", "SHIP O.T. LINE\n", "SLOOP\n", "FREGAT\n", "4TH RATE\n", "SNOW\n", "3RD RATE\n", "FR�GATE\n", "FREGATE\n", "FREGATTE\n", "2ND RATE\n", "NAV�O\n", "SNAUW\n", "BOMB VESSEL, SL\n", "OORLOGSSCHIP\n", "BOMB/EXPLORATIO\n", "OORLOGSSNAUW\n", "SLOOP (?)\n", "STORESHIP\n", "BRIK\n", "PAQUEBOTE\n", "CUTTER\n", "FRAGATA\n", "FRAGATA CORREO\n", "PAQUEBOT\n", "BALANDRA\n", "BARK\n", "BERGANTIN\n", "SLOOP, THREE MA\n", "6TH RATE FRIGAT\n", "SPIEGELRETOURSC\n", "TRANSPORT\n", "EXPLORATION VES\n", "MERCHANT BRIG\n", "CHAMBEQU�N\n", "BUQUE\n", "FRAGATA DE GUER\n", "FIRESHIP\n", "SNAAUW\n", "NAV�O DE LA REA\n", "BRIG\n", "ADVIJSJAGT\n", "KOTTER\n", "7TH RATE\n", "BRIGANTIJN\n", "8TH RATE\n", "CORVETTE\n", "COTTER\n", "GABARRE\n", "BRIG/SLOOP\n", "PINK\n", "BARGENTIJN\n", "HOEKERSCHIP\n", "L'AVISO\n", "FLUTE\n", "GOLETA GUARDA C\n", "HOEKER\n", "CORVETA\n", "FLUIT\n", "POLACRA\n", "WHALER\n", "PAKKETBOOT (BRI\n", "ARMED STORESHIP\n", "SLOEP\n", "SCHOENER\n", "PACKET SHIP\n", "KORVET\n", "STORE SHIP\n", "TROOP SHIP\n", "CORVET\n", "LINIESCHIP\n", "KORVET V OORLOG\n", "BRIK VAN OORLOG\n", "KOOPVAARDER\n", "FREGATSCHIP\n", "STEAMPOWERED WA\n", "TRANSPORTSCHIP\n", "GOLETA\n", "KORVET VAN OORL\n", "BRICK\n", "ORVET M\n", "STEAMER\n", "SCHOENERBRIK\n", "MISTICO\n", "STOOMSCHIP\n", "FALUCHO\n" ] } ], "source": [ "types_of_ships = df_ships.ship_types.unique()\n", "for t in types_of_ships:\n", " print(t)" ] }, { "cell_type": "markdown", "id": "waiting-syndicate", "metadata": {}, "source": [ "For example that `ORVET M` ? It looks that maybe that string does not belong to the ship_types field? \n", "- AT units \n", "\n", "This works perfect" ] }, { "cell_type": "code", "execution_count": 10, "id": "communist-selection", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['CELSIUS', 'FAHRENHEIT', 'REAMUR', 'REAUMUR', 'AHRENHEIT'],\n", " dtype=object)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_atu.at_units.unique()" ] }, { "cell_type": "markdown", "id": "unnecessary-binary", "metadata": {}, "source": [ "- SST units \n", "\n", "This one not so much, and from all the years only those had data." ] }, { "cell_type": "code", "execution_count": 11, "id": "welsh-legend", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | Year | \n", "sst_units | \n", "Count | \n", "
---|---|---|---|
31 | \n", "1772 | \n", "FAHRENHEIT | \n", "2.0 | \n", "
33 | \n", "1773 | \n", "FAHRENHEIT | \n", "1.0 | \n", "
81 | \n", "1820 | \n", "FAHRENHEIT | \n", "3.0 | \n", "
99 | \n", "1837 | \n", "FAHRENHEIT | \n", "15.0 | \n", "
101 | \n", "1838 | \n", "D | \n", "1.0 | \n", "
111 | \n", "1847 | \n", "FAHRENHEIT | \n", "16.0 | \n", "
117 | \n", "1852 | \n", "FAHRENHEIT | \n", "32.0 | \n", "
119 | \n", "1853 | \n", "FAHRENHEIT | \n", "51.0 | \n", "