Large datasets (2D data)¶
The data map is great for working with the data directly, but it falls short for large-scale data manipulation or for exploring multiple runs or simulations. This notebook explores how 2D data can be handled.
To work with large datasets, duqtools uses xarray to store the data in a more manageable format. This section shows how to get the data into xarray. First, we define where to get the data from.
from duqtools.api import ImasHandle
paths = (
    'g2vazizi/jet/94875/8000',
    'g2vazizi/jet/94875/8001',
    'g2vazizi/jet/94875/8002',
)
handles = [ImasHandle.from_string(p) for p in paths]
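If the runs follow a simple numbering scheme, the handles can also be generated in a loop. A minimal sketch using only from_string, assuming the run numbers are contiguous:
from duqtools.api import ImasHandle

# Hypothetical alternative: build the same handles from a range of run numbers.
runs = range(8000, 8003)
handles = [ImasHandle.from_string(f'g2vazizi/jet/94875/{run}') for run in runs]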
Defining variables¶
Then we must define the relations between the data. This is done via the Variable model. This gives the name of the variable (user defined), the ids to grab it from, and the path to the data. Note that the time dimension can be defined by *, which links this dimension to the variable named time.
2D variables like psi have two dimensions; in this case we use grid/dim1 and grid/dim2. Because these may vary between time slices, we assign them to placeholder dimensions, $dim1 and $dim2. These will be squashed to dim1 and dim2 when duqtools loads the variables.
from duqtools.api import Variable
variables = (
    Variable(name='dim1',
             ids='equilibrium',
             path='time_slice/*/profiles_2d/0/grid/dim1',
             dims=['time', '$dim1']),
    Variable(name='dim2',
             ids='equilibrium',
             path='time_slice/*/profiles_2d/0/grid/dim2',
             dims=['time', '$dim2']),
    Variable(name='psi',
             ids='equilibrium',
             path='time_slice/*/profiles_2d/0/psi',
             dims=['time', '$dim1', '$dim2']),
    Variable(name='time', ids='equilibrium', path='time', dims=['time']),
)
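To see what squashing a placeholder dimension amounts to conceptually, the sketch below uses plain numpy and xarray (not duqtools internals) to interpolate two time slices with slightly different grids onto a single shared dim1 coordinate:
import numpy as np
import xarray as xr

grid_t0 = np.linspace(2.0, 4.0, 5)  # grid at the first time slice
grid_t1 = np.linspace(2.1, 4.1, 5)  # slightly shifted grid at a later slice

slice_t0 = xr.DataArray(grid_t0**2, coords={'dim1': grid_t0}, dims='dim1')
slice_t1 = xr.DataArray(grid_t1**2, coords={'dim1': grid_t1}, dims='dim1')

# Interpolate the second slice onto the grid of the first, then stack:
# both slices now share one dim1 dimension, just like after squashing.
# Points outside the second grid become NaN.
stacked = xr.concat([slice_t0, slice_t1.interp(dim1=grid_t0)], dim='time')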
As a side note, if we know that dim1 and dim2 do not vary per time slice, we can define the variables in a different way. In this case, we can skip dealing with placeholder dimensions. The difference is that for dim1 and dim2, we do not set the time index (i.e. the path no longer contains the index *) and the dimension name refers to itself.
dim_variables = (
    Variable(name='dim1',
             ids='equilibrium',
             path='time_slice/0/profiles_2d/0/grid/dim1',
             dims=['dim1']),
    Variable(name='dim2',
             ids='equilibrium',
             path='time_slice/0/profiles_2d/0/grid/dim2',
             dims=['dim2']),
    Variable(name='psi',
             ids='equilibrium',
             path='time_slice/*/profiles_2d/0/psi',
             dims=['time', 'dim1', 'dim2']),
    Variable(name='time', ids='equilibrium', path='time', dims=['time']),
)
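Loading with these alternative definitions goes through the same get_variables call shown below. Assuming the grids are indeed constant across time slices for these runs, the result should match what variables produces:
# Hypothetical: load with the placeholder-free definitions instead.
alt_datasets = tuple(handle.get_variables(dim_variables) for handle in handles)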
Loading the data¶
Whichever option we choose, the resulting datasets now have dim1 and dim2 as dimensions. The data within each group are interpolated to match this grid.
datasets = tuple(handle.get_variables(variables) for handle in handles)
datasets[0]
<xarray.Dataset>
Dimensions:  (time: 101, dim1: 101, dim2: 101)
Coordinates:
  * time     (time) float64 48.35 48.36 48.36 48.37 ... 48.98 48.99 48.99 49.0
  * dim1     (dim1) float64 1.957 1.976 1.995 2.015 ... 3.805 3.824 3.843 3.862
  * dim2     (dim2) float64 -1.528 -1.495 -1.463 -1.431 ... 1.617 1.65 1.682
Data variables:
    psi      (time, dim1, dim2) float64 1.322e-10 1.322e-10 ... -1.659e-09
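Standard xarray access patterns apply from here on. For example, selecting the psi variable and taking the first time slice:
# psi has dims (time, dim1, dim2), so one time slice is a 101 x 101 array.
psi_t0 = datasets[0]['psi'].isel(time=0)
print(psi_t0.shape)  # (101, 101)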
Grid rebasing¶
IMAS data may not be on the same grid (i.e. x-values do not correspond between data sets) or use the same time steps. Therefore, the data must be standardized to the same set of reference coordinates so that the grid and time stamps correspond between different data sets. Because this is such a common operation, duqtools has helper functions to deal with these special cases. rebase_on_grid helps to rebase on the grid, and rebase_on_time to rebase on the time stamps. standardize_grid_and_time combines these two functions and can make a sequence of datasets consistent.
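For reference, a combined call might look like the sketch below; the keyword names here are assumptions based on this notebook's dimension names, so check the duqtools API documentation for the exact signature:
from duqtools.ids import standardize_grid_and_time

# Hypothetical keyword names; consult the duqtools API docs before use.
consistent_datasets = standardize_grid_and_time(datasets, grid_var='dim1',
                                                time_var='time')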
Now that all datasets have internally consistent dimensions, we can interpolate all datasets to the same reference grid. We could do this using two calls to rebase_on_grid()...
from duqtools.ids import rebase_on_grid
# Use the grid of the first dataset as the reference for both dimensions.
for dim in ('dim1', 'dim2'):
    reference_grid = datasets[0][dim].data
    datasets = [
        rebase_on_grid(ds, coord_dim=dim, new_coords=reference_grid)
        for ds in datasets
    ]
...but we can also use xarray.Dataset.interp() directly. Here we change the grid to one of our own choosing.
import numpy as np

# interp() returns new datasets; the result is discarded here (assigned to _)
# so the rest of the notebook keeps working on the original grid.
_ = [
    ds.interp({
        'dim1': np.linspace(2.0, 4.0, 51),
        'dim2': np.linspace(-1.5, 1.5, 51)
    }) for ds in datasets
]
Time standardizing¶
Sometimes we have datasets with different starting times, but we want to compare them anyway. For this you can use the standardize_time() function, which operates in place:
from duqtools.ids import standardize_time
for ds in datasets:
    standardize_time(ds, start=0.1)
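After standardization, every dataset's time axis starts at the same value:
# Each time coordinate now starts at 0.1.
for ds in datasets:
    print(float(ds['time'][0]))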
Time rebasing¶
We do the same for the time coordinate using rebase_on_time(). If you know your data have the same time stamps, for example if they are from the same set of simulations, you can skip this step. Here we take the first dataset as the reference.
from duqtools.ids import rebase_on_time
reference_time = datasets[0]['time'].data
datasets = [
    rebase_on_time(ds, time_dim='time', new_coords=reference_time)
    for ds in datasets
]
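A quick sanity check: all datasets should now share identical time stamps:
# Every dataset's time coordinate matches the reference exactly.
assert all((ds['time'].data == reference_time).all() for ds in datasets)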
Data concatenation¶
Finally, we can concatenate along the run dimension. We set the run coordinate to the handle paths so the individual runs can be identified later.
import xarray as xr
dataset = xr.concat(datasets, 'run')
dataset['run'] = list(paths)
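Because the run coordinate holds the original paths, a single run can be pulled back out by name:
# Select one run by its path label.
single_run = dataset.sel(run='g2vazizi/jet/94875/8000')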
Now we have the data in a nicely structured xarray dataset.
dataset
<xarray.Dataset>
Dimensions:  (run: 3, time: 101, dim1: 101, dim2: 101)
Coordinates:
  * dim1     (dim1) float64 1.957 1.976 1.995 2.015 ... 3.805 3.824 3.843 3.862
  * dim2     (dim2) float64 -1.528 -1.495 -1.463 -1.431 ... 1.617 1.65 1.682
  * time     (time) float64 0.1 0.1077 0.1149 0.1209 ... 0.7389 0.7449 0.75
  * run      (run) <U23 'g2vazizi/jet/94875/8000' ... 'g2vazizi/jet/94875/8002'
Data variables:
    psi      (run, time, dim1, dim2) float64 1.322e-10 1.322e-10 ... -1.414e-10
Plotting¶
Now that we have standardized and rebased the grid and time coordinates, plotting and other operations on the data become straightforward. xarray has some built-in functionality to make plots using matplotlib.
dataset['psi'].isel(time=[0, 1, 2, 3]).plot(row='time', col='run')
[Figure: facet grid of psi, one row per time slice and one column per run]
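xarray supports other plot types as well. For a single 2D slice, a filled contour plot can be easier to read than the default image plot:
# Filled contours of the first time slice of the first run.
dataset['psi'].isel(time=0, run=0).plot.contourf()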
Data reduction¶
To reduce the data along some dimension, we can use dataset.reduce(). This method takes a function as its first argument and applies it to each slice along the given dimension. xarray has some shortcuts for common operators, so dataset.reduce(np.mean, dim='run') is equivalent to dataset.mean(dim='run').
mean = dataset.mean(dim='run').isel(time=[0, 1, 2, 3])
mean['psi'].plot(col='time')
[Figure: mean psi over runs, one facet per time slice]
This can be used to calculate the uncertainty in different regions of the 2D map.
std = dataset.isel(time=[0, 1, 2, 3]).std(dim='run')
std['psi'].plot(col='time')
[Figure: standard deviation of psi across runs, one facet per time slice]
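The mean and standard deviation from the previous cells can also be combined into a relative spread, which highlights where the runs disagree most. A small sketch; the 1e-12 cutoff guarding against near-zero means is an arbitrary choice:
import numpy as np

# Relative spread per grid point; mask regions where the mean is near zero.
rel_spread = std['psi'] / np.abs(mean['psi']).where(np.abs(mean['psi']) > 1e-12)
rel_spread.plot(col='time')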