More Data Analysis with Iris

Ok, so in the last notebook we loaded a single file and looked at what's in there. But Iris can do much more. In this notebook we'll look at loading a subset of MOGREPS-G data.

Loading Data

Last time we retrieved Met Office data by downloading from public links. Instead, this time let's use boto3, a Python module for interacting with Amazon Web Services, to download some files. Note that, to use boto3, you will need an AWS account, and you will need to include your credentials in your boto requests (more here).

We'll look at the model run started at midnight on June 01, 2016, taking the forecasts at 12:00, 15:00 and 18:00. For each forecast we'll grab the unperturbed forecast (the 0th realisation) and three ensemble members - all in all, about 600 MB of data.

First, let's generate the names of the relevant data objects.

In [1]:
def make_data_object_name(dataset_name, year, month, day, hour, realization, forecast_period):
    """Build an S3 key name such as 'prods_op_mogreps-g_20160601_00_00_012.nc'."""
    template_string = "prods_op_{}_{:02d}{:02d}{:02d}_{:02d}_{:02d}_{:03d}.nc"
    return template_string.format(dataset_name, year, month, day, hour, realization, forecast_period)
In [2]:
from itertools import product

realizations = [0, 1, 2, 3]

forecast_periods = [12, 15, 18]

key_names = [
    make_data_object_name('mogreps-g', 2016, 6, 1, 0, r, f) for r, f in product(realizations, forecast_periods)
]

key_names
Out[2]:
['prods_op_mogreps-g_20160601_00_00_012.nc',
 'prods_op_mogreps-g_20160601_00_00_015.nc',
 'prods_op_mogreps-g_20160601_00_00_018.nc',
 'prods_op_mogreps-g_20160601_00_01_012.nc',
 'prods_op_mogreps-g_20160601_00_01_015.nc',
 'prods_op_mogreps-g_20160601_00_01_018.nc',
 'prods_op_mogreps-g_20160601_00_02_012.nc',
 'prods_op_mogreps-g_20160601_00_02_015.nc',
 'prods_op_mogreps-g_20160601_00_02_018.nc',
 'prods_op_mogreps-g_20160601_00_03_012.nc',
 'prods_op_mogreps-g_20160601_00_03_015.nc',
 'prods_op_mogreps-g_20160601_00_03_018.nc']

Now to download those files using boto3. Note the environment variables detailing your AWS credentials.
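
As a minimal sketch (with placeholder values - not part of the original notebook), you could set those variables from within Python before running the next cell:

import os

# Placeholder credentials - replace with your own AWS access key and secret.
# The boto_download function below reads these two environment variables.
os.environ['ACCESS_KEY'] = 'your-aws-access-key-id'
os.environ['SECRET_KEY'] = 'your-aws-secret-access-key'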

In [3]:
import os
import boto3
from botocore.client import Config

def boto_download(key_name, download_dir='./'):
    # ACCESS_KEY and SECRET_KEY must be set as environment variables,
    # as per your AWS credentials.
    session = boto3.session.Session(aws_access_key_id=os.environ['ACCESS_KEY'],
                                    aws_secret_access_key=os.environ['SECRET_KEY'])
    s3 = session.resource('s3', config=Config(signature_version='s3v4'))
    bucket = s3.Bucket('mogreps-g')
    fpath = os.path.join(download_dir, key_name)
    if os.path.isfile(fpath):
        print('{} already exists, skipping...'.format(fpath))
    else:
        with open(fpath, 'wb') as outf:
            print('Downloading {}...'.format(key_name))
            bucket.download_fileobj(Key=key_name, Fileobj=outf)
    return fpath

file_paths = [boto_download(key) for key in key_names]
Downloading prods_op_mogreps-g_20160601_00_00_012.nc...
Downloading prods_op_mogreps-g_20160601_00_00_015.nc...
Downloading prods_op_mogreps-g_20160601_00_00_018.nc...
Downloading prods_op_mogreps-g_20160601_00_01_012.nc...
Downloading prods_op_mogreps-g_20160601_00_01_015.nc...
Downloading prods_op_mogreps-g_20160601_00_01_018.nc...
Downloading prods_op_mogreps-g_20160601_00_02_012.nc...
Downloading prods_op_mogreps-g_20160601_00_02_015.nc...
Downloading prods_op_mogreps-g_20160601_00_02_018.nc...
Downloading prods_op_mogreps-g_20160601_00_03_012.nc...
Downloading prods_op_mogreps-g_20160601_00_03_015.nc...
Downloading prods_op_mogreps-g_20160601_00_03_018.nc...

And now to load them, including a callback to adjust the ensemble realisation metadata (a common task - see here for more on this). You will need Iris, which you can install with !conda install -y -c conda-forge iris.

In [4]:
import iris

# Opt in to Iris's future netCDF loading behaviour; this avoids a deprecation warning.
iris.FUTURE.netcdf_promote = True

def format_cube(cube, field, filename):
    """
    This dataset is a raw dump of our archived data.
    As such we need to do some metadata parsing to make our lives easier.
    In this case we're just going to add an extra 'realization' coordinate,
    but we can put anything in here, as long as a cube is returned at the end.
    """
    if not cube.coords('realization'):
        realization = int(filename.split('_')[-2])
        realization_coord = iris.coords.AuxCoord(
            realization, standard_name='realization', var_name='realization')
        cube.add_aux_coord(realization_coord)
    return cube

# we can also pass constraints to the load function. In this case we'll load only one parameter
constraint = 'dew_point_temperature'

cube = iris.load_cube(file_paths, constraint=constraint, callback=format_cube)
print(cube)
dew_point_temperature / (K)         (realization: 4; time: 3; latitude: 600; longitude: 800)
     Dimension coordinates:
          realization                           x        -            -               -
          time                                  -        x            -               -
          latitude                              -        -            x               -
          longitude                             -        -            -               x
     Auxiliary coordinates:
          forecast_period                       -        x            -               -
     Scalar coordinates:
          forecast_reference_time: 2016-06-01 00:00:00
          height: 1.5 m
     Attributes:
          Conventions: CF-1.5
          STASH: m01s03i250
          source: Data from Met Office Unified Model
          um_version: 10.2

With a little help from our callback function, Iris has loaded these 12 input files, extracted the dew_point_temperature parameter and merged them into a single cube, giving us a unified view of the dataset.
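
As a quick, hedged example (not an executed cell from this notebook), the merged cube can be subset again with Iris constraints, for instance to pull out a single ensemble member:

# Sketch: extract just ensemble member 2 from the merged cube.
member = cube.extract(iris.Constraint(realization=2))
print(member.summary(shorten=True))  # expect (time: 3; latitude: 600; longitude: 800)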

Iris is lazy by default, so until we access the data the memory footprint is much smaller than the ~600 MB dataset. When we access a piece of data, Iris will read it from the individual source file. This is very powerful, as it allows us to explore datasets that are much larger than will fit in memory. The downside is that if we delete the files on disk, this cube will break.

In [5]:
cube.has_lazy_data()
Out[5]:
True
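
To illustrate the laziness (a sketch, not an executed cell), indexing the cube is also lazy, and only accessing .data triggers a read from the source files:

# Sketch: slicing keeps the data lazy; touching .data realises only that slice.
subset = cube[0, 0, :10, :10]     # first realization and time, a 10x10 corner
print(subset.has_lazy_data())     # True - nothing read from disk yet
print(float(subset.data.mean()))  # forces a read of just this small piece
print(subset.has_lazy_data())     # False - this slice is now in memory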

Interactive plots

The HoloViews and GeoViews libraries are powerful tools for creating interactive plots. They work directly with cubes, and will auto-create widgets for navigating these 3- and 4-dimensional datasets. I'm just going to cover the absolute basics here, but they're great tools for visualising all sorts of data.

In [6]:
!conda install -y -c ioam geoviews holoviews pandas
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /opt/miniconda3:

The following NEW packages will be INSTALLED:

    bkcharts:  0.2-py35_0                    

The following packages will be UPDATED:

    bokeh:     0.12.5-py35_1                  --> 0.12.6-py35_0          
    conda:     4.3.22-py35_0                  --> 4.3.23-py35_0          
    geoviews:  1.2.0-py35_0       ioam        --> 1.3.0-py35_0       ioam
    holoviews: 1.7-py35_0         ioam        --> 1.8.1-py35_0       ioam

The following packages will be SUPERSEDED by a higher-priority channel:

    pandas:    0.19.2-np110py35_1 conda-forge --> 0.18.1-np110py35_0     

The following packages will be DOWNGRADED:

    dask:      0.15.0-py35_0                  --> 0.13.0-py35_0          

bkcharts-0.2-p 100% |################################| Time: 0:00:05  25.84 kB/s
bokeh-0.12.6-p 100% |################################| Time: 0:00:07 528.76 kB/s
conda-4.3.23-p 100% |################################| Time: 0:00:00 695.04 kB/s
holoviews-1.8. 100% |################################| Time: 0:00:22 291.16 kB/s
geoviews-1.3.0 100% |################################| Time: 0:00:02 256.67 kB/s
In [7]:
import cartopy # the Met Office's open source map projection module
import datetime

import geoviews as gv
import holoviews as hv

hv.Dimension.type_formatters[datetime.datetime] = "%m/%y %Hh"

hv.notebook_extension()

First we'll convert our cube into a GeoViews 'Dataset' object. All we have to do is tell GeoViews about the dimensions of our data.

In [8]:
dataset = gv.Dataset(
    cube,
    kdims=['realization', 'time', 'latitude', 'longitude'],
    vdims=['dew_point_temperature'])
dataset
Out[8]:
:Dataset   [realization,time,latitude,longitude]   (dew_point_temperature)

Once we have a Dataset we can use the to method to render it in various ways, from box plots to maps like this one.

In [9]:
%%opts Image [colorbar=True fig_size=400 projection=cartopy.crs.PlateCarree()] (cmap='viridis')

dataset.to(gv.Image, ['longitude', 'latitude']) * gv.feature.coastline
Out[9]:
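
The same pattern works on subsets too; as a hedged sketch (reusing the dataset object above), you could render only the unperturbed member:

%%opts Image [colorbar=True fig_size=400 projection=cartopy.crs.PlateCarree()] (cmap='viridis')

# Sketch: select just the unperturbed member (realization 0) before rendering.
dataset.select(realization=0).to(gv.Image, ['longitude', 'latitude']) * gv.feature.coastline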