{
 "cells": [
  {
   "cell_type": "raw",
   "id": "e7b59e56-fae1-4511-891a-dedc31596a25",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    },
    "raw_mimetype": "text/restructuredtext",
    "tags": []
   },
   "source": [
    ".. index:: Encoding, pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1d7565f-cf6f-4e1d-ac56-73a2c22840fb",
   "metadata": {},
   "source": [
    "# Encoding ODB-2 Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cab47b84-9d4c-4939-8df0-c591d249d919",
   "metadata": {},
   "source": [
    "## Trivial Example"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24c5e67c-a67a-4b91-931c-235adfacc749",
   "metadata": {},
   "source": [
    "Given a pandas ``DataFrame`` to encode it, the data should simply be passed to ``encode_odb()`` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f92f8104-1459-4f9b-b1fd-6c7a9138821f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "sys.path.insert(0, os.path.abspath('../../..'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "3bb95e4c-0d49-4128-aa61-2bb2103202e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import pyodc as odc\n",
    "\n",
    "df = pd.read_csv('data-1.csv')\n",
    "\n",
    "odc.encode_odb(df, 'example-1.odb')"
   ]
  },
  {
   "cell_type": "raw",
   "id": "29cbb156-ef0a-4d35-a902-d461e73726be",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    },
    "raw_mimetype": "text/restructuredtext",
    "tags": []
   },
   "source": [
    ".. index:: Encoding; File Type"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca0d1957-9739-4278-9bdc-77b8683a0397",
   "metadata": {},
   "source": [
    "## File Type Object"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35a1ba64-252a-4462-9300-b46faf1507d3",
   "metadata": {},
   "source": [
    "Encoding of ODB-2 data works with file-like objects as well as with file names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "91a2541e-cb54-4c26-a377-9a1cddbb6104",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('example-1.odb', 'wb') as f:\n",
    "    odc.encode_odb(df, f)"
   ]
  },
  {
   "cell_type": "raw",
   "id": "3dfecf99-6079-409e-8013-6bbe519b9dea",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    },
    "raw_mimetype": "text/restructuredtext",
    "tags": []
   },
   "source": [
    ".. index:: Encoding; Columns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22434345-6ad9-4ec3-a1ac-2f6304bab289",
   "metadata": {},
   "source": [
    "## Configuring Encoded Columns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f714d8cc-e6e5-4a1d-a504-a5d12cd67cba",
   "metadata": {},
   "source": [
    "By default, **pyodc** will always encode ODB-2 data in a lossless manner. In particular, most values are encoded as 8-byte DOUBLE values.\n",
    "\n",
    "Typically, the encoder will automatically select a data type and corresponding encoder to use. This data type can be overridden by supplying a types dictionary, for example to encode a column as a 4-byte REAL value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "8e7e7b96-ab36-4212-9075-90dd44f19a35",
   "metadata": {},
   "outputs": [],
   "source": [
    "odc.encode_odb(df, 'example-3.odb', types={'obsvalue@body': odc.REAL})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3eb73959-2f99-4243-b91d-76e34b56a4cb",
   "metadata": {},
   "source": [
    "The interrogation of the frame headers shows that the data type has changed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "51315f36-a733-4fe4-8766-730b9125ea1d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "original: DataType.DOUBLE\n",
      "updated:  DataType.REAL\n"
     ]
    }
   ],
   "source": [
    "r1 = odc.Reader('example-1.odb', aggregated=False)\n",
    "r3 = odc.Reader('example-3.odb', aggregated=False)\n",
    "\n",
    "print('original:', r1.frames[0].column_dict['obsvalue@body'].dtype)\n",
    "print('updated: ', r3.frames[0].column_dict['obsvalue@body'].dtype)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d88f6a5c-cbf0-4b26-b4ce-b03efbc9580c",
   "metadata": {},
   "source": [
    "Decoded data also confirms that the precision has been appropriately reduced:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "c66de62d-e7e9-4466-aac3-49fa85fc00a6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \\\n",
      "0       1  20210420     stat00  0-12345-0-67890       0.000000   \n",
      "1       1  20210420     stat01  0-12345-0-67891      12.345600   \n",
      "2       1  20210420     stat02  0-12345-0-67892      24.691200   \n",
      "3       1  20210420     stat03  0-12345-0-67893      37.036800   \n",
      "4       1  20210420     stat04  0-12345-0-67894      49.382401   \n",
      "5       1  20210420     stat05  0-12345-0-67895      61.728001   \n",
      "6       1  20210420     stat06  0-12345-0-67896      74.073601   \n",
      "7       1  20210420     stat07  0-12345-0-67897      86.419197   \n",
      "8       1  20210420     stat08  0-12345-0-67898      98.764801   \n",
      "9       1  20210420     stat09  0-12345-0-67899     111.110397   \n",
      "\n",
      "   integer_missing  double_missing  \n",
      "0           1234.0           12.34  \n",
      "1           4321.0           43.21  \n",
      "2              NaN             NaN  \n",
      "3           1234.0           12.34  \n",
      "4           4321.0           43.21  \n",
      "5              NaN             NaN  \n",
      "6           1234.0           12.34  \n",
      "7           4321.0           43.21  \n",
      "8              NaN             NaN  \n",
      "9           1234.0           12.34  \n"
     ]
    }
   ],
   "source": [
    "df_decoded = odc.read_odb('example-3.odb', single=True)\n",
    "print(df_decoded)"
   ]
  },
  {
   "cell_type": "raw",
   "id": "5732017d-6808-46e6-9d80-fe856132c2aa",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    },
    "raw_mimetype": "text/restructuredtext",
    "tags": []
   },
   "source": [
    ".. index:: Encoding; Frame Structure"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7dda3322-6a9c-425e-b2a7-6ee07e2a22aa",
   "metadata": {},
   "source": [
    "## Configuring Frame Structure"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2ea4175-1fa2-4781-a0cd-a8164ef6e8d3",
   "metadata": {},
   "source": [
    "ODB-2 data is broken down into frames. By default a maximum of 10 000 rows of data will be encoded into each frame. If more than 10 000 rows are supplied, then the data will be split into a sequence of frames with at maximum 10 000 rows.\n",
    "\n",
    "To modify the threshold, pass ``rows_per_frame`` argument:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "8c29dff9-2e62-4e8c-a6d3-17d46f4d6411",
   "metadata": {},
   "outputs": [],
   "source": [
    "odc.encode_odb(df, 'example-4.odb', rows_per_frame=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dad4c97e-6068-46e0-ac7d-eff6dcb1be59",
   "metadata": {},
   "source": [
    "Examination of the frame structure clearly shows that the data now contains multiple frames:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5d041e58-e9c8-479d-88d9-be5410ddbc38",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "original frames: [<pyodc.frame.Frame object at 0x7f85a83d99d0>]\n",
      "updated  frames: [<pyodc.frame.Frame object at 0x7f85b8818df0>, <pyodc.frame.Frame object at 0x7f85a83b6c70>, <pyodc.frame.Frame object at 0x7f85a8570940>, <pyodc.frame.Frame object at 0x7f85a8570be0>]\n",
      "original row counts: [10]\n",
      "updated  row counts: [3, 3, 3, 1]\n"
     ]
    }
   ],
   "source": [
    "r1 = odc.Reader('example-1.odb', aggregated=False)\n",
    "r4 = odc.Reader('example-4.odb', aggregated=False)\n",
    "\n",
    "print('original frames:', r1.frames)\n",
    "print('updated  frames:', r4.frames)\n",
    "\n",
    "print('original row counts:', [f.nrows for f in r1.frames])\n",
    "print('updated  row counts:', [f.nrows for f in r4.frames])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2eca1c5-0402-4b1a-95f1-13d4e0069f60",
   "metadata": {},
   "source": [
    "Despite these differences, if decoded the data is the same:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "31184a1d-2b5e-4677-b48b-9c97914b0c3f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \\\n",
      "0       1  20210420     stat00  0-12345-0-67890         0.0000   \n",
      "1       1  20210420     stat01  0-12345-0-67891        12.3456   \n",
      "2       1  20210420     stat02  0-12345-0-67892        24.6912   \n",
      "3       1  20210420     stat03  0-12345-0-67893        37.0368   \n",
      "4       1  20210420     stat04  0-12345-0-67894        49.3824   \n",
      "5       1  20210420     stat05  0-12345-0-67895        61.7280   \n",
      "6       1  20210420     stat06  0-12345-0-67896        74.0736   \n",
      "7       1  20210420     stat07  0-12345-0-67897        86.4192   \n",
      "8       1  20210420     stat08  0-12345-0-67898        98.7648   \n",
      "9       1  20210420     stat09         0-12345-       111.1104   \n",
      "\n",
      "   integer_missing  double_missing  \n",
      "0           1234.0           12.34  \n",
      "1           4321.0           43.21  \n",
      "2              NaN             NaN  \n",
      "3           1234.0           12.34  \n",
      "4           4321.0           43.21  \n",
      "5              NaN             NaN  \n",
      "6           1234.0           12.34  \n",
      "7           4321.0           43.21  \n",
      "8              NaN             NaN  \n",
      "9           1234.0           12.34  \n"
     ]
    }
   ],
   "source": [
    "df_decoded = odc.read_odb('example-4.odb', single=True)\n",
    "print(df_decoded)"
   ]
  },
  {
   "cell_type": "raw",
   "id": "085ad95e-8a77-4a07-8fcb-6ffd0352f049",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    },
    "raw_mimetype": "text/restructuredtext",
    "tags": []
   },
   "source": [
    ".. index:: Encoding; Additional Properties"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2023b97e-116c-4ac1-952c-7d503f9f0010",
   "metadata": {},
   "source": [
    "## Additional Properties"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e761fe0-24b6-47cc-83b3-2fc7b4af22a5",
   "metadata": {},
   "source": [
    "To encode additional properties as part of frame’s data, specify ``properties`` parameter to ``encode_odb()`` function with a dictionary value you want to include:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "3710d13c-7149-4046-bd75-abbbc73320f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "metadata = {\n",
    "    'encoded_by': 'ECMWF',\n",
    "    'data_source': 'pyodc_docs',\n",
    "}\n",
    "odc.encode_odb(df, 'example-1.odb', properties=metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1656deb7-fc79-4256-b881-94300061fcb7",
   "metadata": {},
   "source": [
    "Encoded properties are accessible via ``properties`` key of the frame object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "a1564566-0fba-4ffe-bb87-ab9d8fa05f2f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{}]\n"
     ]
    }
   ],
   "source": [
    "r1 = odc.Reader('example-1.odb')\n",
    "print([f.properties for f in r1.frames])"
   ]
  },
  {
   "cell_type": "raw",
   "id": "d600330c-b166-4f8a-a37c-0e93226374e3",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    },
    "raw_mimetype": "text/restructuredtext",
    "tags": []
   },
   "source": [
    ".. index:: Encoding; Incompatible Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b2a8e8e-9ec4-46fe-8c48-7b2a15886fd6",
   "metadata": {},
   "source": [
    "## A Sequence of (Unrelated) Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8b59588-8595-4514-83af-932010559b0b",
   "metadata": {},
   "source": [
    "ODB-2 frames are self-contained and passed as a stream of data, which means there is no requirement that they are related with each other.\n",
    "\n",
    "For example, we can encode frames of two different structures (also known as *incompatible data*):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "ccbc8aca-b023-4ff5-8105-68185f6409e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "df2 = pd.read_csv('data-2.csv')\n",
    "\n",
    "with open('example-2.odb', 'wb') as f:\n",
    "   odc.encode_odb(df, f)\n",
    "   odc.encode_odb(df2, f)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6542536-07f4-44f0-80c2-283e9c4b02ad",
   "metadata": {},
   "source": [
    "The trivial decoder will now result in a ``DataFrame`` with a substantial number of missing values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "6550d626-ee3e-40f4-8a25-b4ddd77d46c8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \\\n",
      "0        1  20210420     stat00  0-12345-0-67890         0.0000   \n",
      "1        1  20210420     stat01  0-12345-0-67891        12.3456   \n",
      "2        1  20210420     stat02  0-12345-0-67892        24.6912   \n",
      "3        1  20210420     stat03  0-12345-0-67893        37.0368   \n",
      "4        1  20210420     stat04  0-12345-0-67894        49.3824   \n",
      "5        1  20210420     stat05  0-12345-0-67895        61.7280   \n",
      "6        1  20210420     stat06  0-12345-0-67896        74.0736   \n",
      "7        1  20210420     stat07  0-12345-0-67897        86.4192   \n",
      "8        1  20210420     stat08  0-12345-0-67898        98.7648   \n",
      "9        1  20210420     stat09  0-12345-0-67899       111.1104   \n",
      "10       2  20210420     stat00             None         0.0000   \n",
      "11       2  20210420     stat01             None        12.3456   \n",
      "12       2  20210420     stat02             None        24.6912   \n",
      "13       2  20210420     stat03             None        37.0368   \n",
      "14       2  20210420     stat04             None        49.3824   \n",
      "15       2  20210420     stat05             None        61.7280   \n",
      "16       2  20210420     stat06             None        74.0736   \n",
      "17       2  20210420     stat07             None        86.4192   \n",
      "18       2  20210420     stat08             None        98.7648   \n",
      "19       2  20210420     stat09             None       111.1104   \n",
      "\n",
      "    integer_missing  double_missing  \n",
      "0            1234.0           12.34  \n",
      "1            4321.0           43.21  \n",
      "2               NaN             NaN  \n",
      "3            1234.0           12.34  \n",
      "4            4321.0           43.21  \n",
      "5               NaN             NaN  \n",
      "6            1234.0           12.34  \n",
      "7            4321.0           43.21  \n",
      "8               NaN             NaN  \n",
      "9            1234.0           12.34  \n",
      "10              NaN             NaN  \n",
      "11              NaN             NaN  \n",
      "12              NaN             NaN  \n",
      "13              NaN             NaN  \n",
      "14              NaN             NaN  \n",
      "15              NaN             NaN  \n",
      "16              NaN             NaN  \n",
      "17              NaN             NaN  \n",
      "18              NaN             NaN  \n",
      "19              NaN             NaN  \n"
     ]
    }
   ],
   "source": [
    "with open('example-2.odb', 'rb') as f:\n",
    "    df_decoded = odc.read_odb(f, single=True)\n",
    "\n",
    "print(df_decoded)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
