Large datasets¶
The data map is great for working with the data directly, but it falls short for large-scale data manipulation or exploration across multiple runs or simulations.
To work with large datasets, duqtools uses xarray to store the data in a more manageable format. This section shows how you can get the data into xarray. First, we define where to get the data from.
from duqtools.api import ImasHandle
paths = (
'g2vazizi/jet/94875/8000',
'g2vazizi/jet/94875/8001',
'g2vazizi/jet/94875/8002',
)
handles = [ImasHandle.from_string(p) for p in paths]
Getting variables¶
Many variables are pre-defined by duqtools. These variables define the relations between the data. For more information, see the documentation on variables.
To get the data, we can use ImasHandle.get_variables() and pass the names of the variables we are interested in. Note that they must all belong to the same root IDS (core_profiles in this case).
variables = 'zeff', 't_i_ave', 'rho_tor_norm', 'time'
handle = handles[0]
ds = handle.get_variables(variables)
ds
<xarray.Dataset>
Dimensions:       (time: 101, rho_tor_norm: 100)
Coordinates:
  * time          (time) float64 48.35 48.36 48.36 48.37 ... 48.99 48.99 49.0
  * rho_tor_norm  (rho_tor_norm) float64 0.005025 0.01508 0.02513 ... 0.9899 1.0
Data variables:
    zeff          (time, rho_tor_norm) float64 1.18 1.181 1.181 ... 1.41 1.436
    t_i_ave       (time, rho_tor_norm) float64 9.499e+03 9.493e+03 ... 643.8
Getting all variables¶
If you want to get all available pre-defined variables, you can use the get_all_variables function (extra variables can still be defined).
handle.get_all_variables(extra_variables=[], ids='core_profiles')
<xarray.Dataset>
Dimensions:         (time: 101, rho_tor_norm: 100, neutral: 3, ion: 3)
Coordinates:
  * time            (time) float64 48.35 48.36 48.36 48.37 ... 48.99 48.99 49.0
  * rho_tor_norm    (rho_tor_norm) float64 0.005025 0.01508 ... 0.9899 1.0
Dimensions without coordinates: neutral, ion
Data variables: (12/17)
    omega_tor       (time, rho_tor_norm) float64 7.481e+04 ... 1.169e+04
    j_ohm           (time, rho_tor_norm) float64 -2.717e+06 ... -1.269e+05
    n_e             (time, rho_tor_norm) float64 7.116e+19 ... 1.184e+19
    j_tot           (time, rho_tor_norm) float64 -2.749e+06 ... -6.49e+04
    t_i_ave         (time, rho_tor_norm) float64 9.499e+03 9.493e+03 ... 643.8
    t_e             (time, rho_tor_norm) float64 8.835e+03 8.826e+03 ... 262.2
    ...              ...
    n_i_tot         (time, ion, rho_tor_norm) float64 6.968e+19 ... 1.027e+16
    smag            (time, rho_tor_norm) float64 0.000494 0.002472 ... 14.15
    e_par           (time, rho_tor_norm) float64 -0.003783 ... -0.01356
    j_ni            (time, rho_tor_norm) float64 -2.606e+04 ... -6.726e+04
    e_r             (time, rho_tor_norm) float64 -2.823e+03 ... -8.531e+04
    collisionality  (time, rho_tor_norm) float64 3.258 0.4021 ... 1.392 1.805
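The extra_variables argument lets you add definitions that are not pre-defined. Below is a minimal sketch, assuming it accepts the same Variable models that are introduced in the next section; the name and path are purely illustrative.
from duqtools.api import Variable

# Hypothetical extra variable (illustrative name and path only):
# electron pressure from the same core_profiles IDS.
p_e = Variable(name='p_e',
               ids='core_profiles',
               path='profiles_1d/*/electrons/pressure',
               dims=['time', '$rho_tor_norm'])

handle.get_all_variables(extra_variables=[p_e], ids='core_profiles')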
Defining custom variables¶
To explain a bit better how variables work, we will define custom relations between the data in the next cell. You can use this to define your own variables which are not available in duqtools. Note that variable names and variable models can be mixed.
Custom variables are defined via the Variable model. It specifies the name of the variable (user defined, can be anything), the IDS to grab it from, the path to the data, and the dimensions (all three are part of the data spec).
The special character * in the path denotes a dimension specified in the dims list. In this example, the * in the path refers to the first dimension defined in dims: time. In turn, time must also be defined as a Variable.
Variables like zeff and others use rho_tor_norm as the coordinate. Unfortunately, rho_tor_norm differs slightly between time steps. Therefore, we first assign it to a placeholder dimension by prefixing it with $: $rho_tor_norm. Duqtools knows to squash this dimension and make rho_tor_norm consistent for all time steps.
If you are interested in seeing what the data would look like without squashing, you can pass squash=False to handle.get_variables() (see the example after the output below).
from duqtools.api import Variable
variable_models = (
Variable(name='zeff',
ids='core_profiles',
path='profiles_1d/*/zeff',
dims=['time', '$rho_tor_norm']),
Variable(name='t_i_ave',
ids='core_profiles',
path='profiles_1d/*/t_i_average',
dims=['time', '$rho_tor_norm']),
Variable(name='rho_tor_norm',
ids='core_profiles',
path='profiles_1d/*/grid/rho_tor_norm',
dims=['time', '$rho_tor_norm']),
Variable(name='time', ids='core_profiles', path='time', dims=['time']),
)
ds = handle.get_variables(variable_models)
ds
<xarray.Dataset>
Dimensions:       (time: 101, rho_tor_norm: 100)
Coordinates:
  * time          (time) float64 48.35 48.36 48.36 48.37 ... 48.99 48.99 49.0
  * rho_tor_norm  (rho_tor_norm) float64 0.005025 0.01508 0.02513 ... 0.9899 1.0
Data variables:
    zeff          (time, rho_tor_norm) float64 1.18 1.181 1.181 ... 1.41 1.436
    t_i_ave       (time, rho_tor_norm) float64 9.499e+03 9.493e+03 ... 643.8
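As mentioned above, passing squash=False shows the data before the placeholder dimension is collapsed. A quick illustration; the exact shape of the unsquashed output depends on the data.
# Without squashing, rho_tor_norm is kept per time step instead of being
# collapsed into a single consistent coordinate.
ds_unsquashed = handle.get_variables(variable_models, squash=False)
ds_unsquashed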
Standardize grid and time¶
IMAS data may not be on the same grid (i.e. x-values do not correspond between data sets) or use the same time steps. Therefore, the data must be standardized to the same set of reference coordinates so that the grid and time stamps correspond between different data sets. Because this is such a common operation, duqtools has helper functions to deal with these special cases. rebase_on_grid helps to rebase on the grid, and rebase_on_time to rebase on the time stamps. standardize_grid_and_time combines these two functions and can make a sequence of datasets consistent.
from duqtools.ids import standardize_grid_and_time
variables = 'zeff', 't_i_ave', 'rho_tor_norm', 'time'
datasets = tuple(handle.get_variables(variables) for handle in handles)
datasets = standardize_grid_and_time(
datasets,
grid_var='rho_tor_norm',
time_var='time',
reference_dataset=0,
)
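As a quick sanity check (plain numpy/xarray, not a duqtools API), the coordinates should now agree across all datasets:
import numpy as np

# Every dataset should share the coordinates of the reference (index 0).
assert all(
    np.allclose(ds['rho_tor_norm'], datasets[0]['rho_tor_norm'])
    and np.allclose(ds['time'], datasets[0]['time'])
    for ds in datasets
)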
Grid rebasing¶
Alternatively, you may standardize the grid for each dataset separately using rebase_on_grid(). In this way you can explicitly give the reference grid you want to interpolate to. Note that standardize_grid_and_time() already performs this step.
from duqtools.ids import rebase_on_grid
datasets2 = tuple(handle.get_variables(variables) for handle in handles)
reference_grid = datasets2[0]['rho_tor_norm'].data
datasets2 = [
rebase_on_grid(ds, coord_dim='rho_tor_norm', new_coords=reference_grid)
for ds in datasets2
]
Time standardizing¶
Sometimes we have datasets with different starting times, but we want to compare them anyway. For this, you can use the rezero_time() function, which is an in-place operation:
from duqtools.ids import rezero_time
for ds in datasets2:
    rezero_time(ds, start=0.1)
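To verify the shift, you can inspect the first few time stamps (assuming that the start argument sets the new first time stamp):
# The time coordinate should now start at 0.1 (an assumption about `start`).
print(datasets2[0]['time'].values[:3])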
Time rebasing¶
We can do the same for the time coordinate using rebase_on_time(). Note that standardize_grid_and_time() already performs this step.
For example, if your reference data set has timestamps (1.0, 2.0, 3.0, 4.0) and another has (1.0, 3.0, 5.0), the data in the second dataset will be interpolated to match the timestamps of the reference.
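To make this concrete, here is a toy sketch using plain xarray interpolation, which is conceptually what rebase_on_time does for the time axis (the duqtools implementation may differ in detail):
import xarray as xr

# Toy data: a reference time axis and a coarser dataset (made-up numbers).
reference = xr.DataArray([10., 20., 30., 40.],
                         dims='time', coords={'time': [1.0, 2.0, 3.0, 4.0]})
other = xr.DataArray([12., 33., 55.],
                     dims='time', coords={'time': [1.0, 3.0, 5.0]})

# Linear interpolation onto the reference time stamps:
# 1.0 -> 12.0, 2.0 -> 22.5, 3.0 -> 33.0, 4.0 -> 44.0
other.interp(time=reference['time'])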
If you know your data have the same time stamps, for example if they are from the same set of simulations, you can skip this step.
from duqtools.ids import rebase_on_time
reference_time = datasets2[0]['time'].data
datasets2 = [
rebase_on_time(ds, time_dim='time', new_coords=reference_time)
for ds in datasets2
]
Data concatenation¶
Finally, we can concatenate along the run dimension. We set the run coordinate to the data paths so they can be re-used later.
import xarray as xr
dataset = xr.concat(datasets, 'run')
dataset['run'] = list(paths)
Now we have the data in a nicely structured xarray dataset.
dataset
<xarray.Dataset>
Dimensions:       (run: 3, time: 101, rho_tor_norm: 100)
Coordinates:
  * rho_tor_norm  (rho_tor_norm) float64 0.005025 0.01508 0.02513 ... 0.9899 1.0
  * time          (time) float64 48.35 48.36 48.36 48.37 ... 48.99 48.99 49.0
  * run           (run) <U23 'g2vazizi/jet/94875/8000' ... 'g2vazizi/jet/9487...
Data variables:
    zeff          (run, time, rho_tor_norm) float64 1.18 1.181 ... 1.41 1.436
    t_i_ave       (run, time, rho_tor_norm) float64 9.499e+03 ... 760.9
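Because run is now a regular dimension, plain xarray operations apply directly. A couple of illustrative examples:
# Select a single run by name, or reduce over all runs in one line.
dataset.sel(run='g2vazizi/jet/94875/8000')
dataset['t_i_ave'].mean('run')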
Plotting¶
Now that we have standardized and rebased the grid and time coordinates, plotting and other operations on the data become straightforward.
xarray has some built-in functionality to make plots using matplotlib.
dataset.isel(time=[0, 1, 2, 3]).plot.scatter('rho_tor_norm',
't_i_ave',
hue='run',
col='time',
marker='.')
[Output: faceted scatter plot of t_i_ave against rho_tor_norm for the first four time steps, coloured by run]
dataset['t_i_ave'].isel(time=[0, 1, 2, 3]).plot.line(
x='rho_tor_norm',
hue='run',
col='time',
)
[Output: faceted line plot of t_i_ave against rho_tor_norm for the first four time steps, one line per run]
Duqtools has some built-in plots as well, based on altair, which are designed with larger data sets and interactivity in mind.
from duqtools.api import alt_line_chart
chart = alt_line_chart(dataset.isel(time=[0, 1, 2, 3]),
x='rho_tor_norm',
y='t_i_ave')
chart
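Assuming the returned object is a regular altair chart, it can also be saved for viewing outside the notebook:
# Save to a standalone html file (assumes a standard altair chart object).
chart.save('chart.html')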