Context and Metadata system

Overview

The Context and Metadata system is intended to provide easy programmatic access to combinations of time-ordered detector data and supporting data products (“metadata”). The section on Context deals with the configuration files that describe a dataset, and the use of the Context object to load data and metadata into sotodlib containers; the section on Metadata describes how to store and index metadata information for use with this system. Information on TOD indexing and loading can be found in the section on ObsFileDb.

Context

Assuming someone has set up a Context for a particular dataset, you would instantiate it like this:

from sotodlib.core import Context
ctx = Context('path/to/the/context.yaml')

This will cause the specified context.yaml file to be parsed, as well as user and site configuration files, if those have been set up. Once the Context is loaded, you will probably have access to the various databases (detdb, obsdb, obsfiledb) and you’ll be able to load TOD and metadata using get_obs(...).

Dataset, User, and Site Context files

The three configuration files are:

Dataset Context File

This will often be called context.yaml and will contain a description of how to access a particular TOD data set (for example a set of observations output from a particular simulation run) and supporting metadata (including databases listing the observations, detector properties, and intermediate analysis products such as cuts or pointing offsets). When instantiating a Context object, the path to this file will normally be the only argument.

Site Context File

This file contains settings that are common to a particular computing site but are likely to differ from one site to the next. It is expected that a single file will be made available to all sotodlib users on a system. The main purpose of this is to describe file locations, so that the Dataset Context File can be written in a way that is portable between computing sites.

User Context File

This is a yaml file that contains user-specific settings; it is loaded after the Site Context File but before the Dataset Context File. It plays the same role as the Site Context File but allows for per-user tweaking of parameters.

The User and Site Context Files will be looked for in certain places if not specified explicitly; see Context for details.
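
As an illustration only, a minimal Site Context File might do nothing more than define tags that establish local paths (the paths and tag names below are hypothetical):

tags:
  metadata_lib: /local/path/to/metadata
  depot_scratch: /local/path/to/scratch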

Annotated Example

The Dataset Context File is YAML that decodes to a dict. Here is an annotated example.

# Define some "tags".  These are string variables that are eligible
# for substitution into other settings.  The most common use is to
# establish base paths for data and metadata products.
tags:
  actpol_shared:   /mnt/so1/shared/data/actpol/depots/scratch
  depot_scratch:  /mnt/so1/shared/data/actpol/depots/scratch
  metadata_lib:   /mnt/so1/shared/data/actpol/depots/scratch/uranus_200327/metadata

# List of modules to import.  Importing modules such as
# moby2.analysis.socompat or sotodlib.io.metadata causes certain IO
# functions to be registered for the metadata system to use.
imports:
  - sotodlib.io.metadata
  - moby2.analysis.socompat

# The basic databases.  The sotodlib TOD loading system uses an
# ObsFileDb to determine what files to read.  The metadata association
# and loading system uses a DetDb and an ObsDb to find and read
# different kinds of metadata.
obsfiledb: '{metadata_lib}/obsfiledb_200407.sqlite'
detdb:     '{metadata_lib}/detdb_200109.sqlite'
obsdb:     '{metadata_lib}/obsdb_200618.sqlite'

# Additional settings related to TOD loading.
obs_colon_tags: ['band']
obs_loader_type: actpol_moby2

# A list of metadata products that should be loaded along with the
# TOD.  Each entry in the list points to a metadata database (an
# sqlite file) and specifies the name under which that information
# should be associated (unpacked) in the loaded data structure.
# The entries here are all fairly simple -- see documentation for
# more complex examples.
metadata:
  - db: "{metadata_lib}/cuts_s17_c11_200327_cuts.sqlite"
    unpack: "glitch_flags&flags"
  - db: "{metadata_lib}/cuts_s17_c11_200327_planet_cuts.sqlite"
    unpack: "source_flags&flags"
  - db: "{metadata_lib}/cal_s17_c11_200327.sqlite"
    unpack: "relcal&cal"
  - db: "{metadata_lib}/timeconst_200327.sqlite"
    unpack: "timeconst&"
    loader: "PerDetectorHdf5"
  - db: "{metadata_lib}/abscal_190126.sqlite"
    unpack: "abscal&cal"
  - db: "{metadata_lib}/detofs_200218.sqlite"
    unpack: "focal_plane"
  - db: "{metadata_lib}/pointofs_200218.sqlite"
    unpack: "pointofs"

With a context like the one above, a user can load a TOD and its supporting data very simply:

from sotodlib.core import Context
from moby2.analysis import socompat   # For special ACT loader functions

context = Context('context.yaml')

# Get a random obs_id from the ObsDb we loaded:
context.obsdb.get()[10]
# output is: OrderedDict([('obs_id', '1500022312.1500087647.ar6'), ('timestamp', 1500022313.0)])

# Load the TOD and metadata for that obs_id:
tod = context.get_obs('1500022312.1500087647.ar6')

# The result is an AxisManager with members [axes]:
# - signal ['dets', 'samps']
# - timestamps ['samps']
# - flags ['samps']
# - boresight ['samps']
# - array_data ['dets']
# - glitch_flags ['dets', 'samps']
# - source_flags ['dets', 'samps']
# - relcal ['dets']
# - timeconst ['dets']
# - abscal ['dets']
# - focal_plane ['dets']
# - pointofs ['dets']

Context Schema

Here are all of the top-level entries with special meanings in the Context system:

tags

A map from string to string. This entry is treated in a special way when the Site, User, and Dataset context files are evaluated in series; see below.

imports

A list of modules that should be imported prior to attempting any metadata operations. The purpose of this is to allow IO functions to register themselves for use by the Metadata system. This list will usually need to include at least sotodlib.io.metadata.

obsfiledb, obsdb, detdb

Each of these should provide a string, giving the path to a file carrying an ObsFileDb, ObsDb, or DetDb, respectively. These are all technically optional, but it will be difficult to load TOD data without the ObsFileDb and difficult to load metadata without the ObsDb and DetDb.

obs_colon_tags

A list of strings. The strings in this list must refer to columns from the DetDb. When a string appears in this list, the values appearing in that column of the DetDb may be used to modify an obs_id when requesting TOD data. For example, suppose DetDb has a column ‘band’ with values [‘f027’, ‘f039’, …]. Suppose that 'obs201230' is an observation ID for an array that has detectors in bands ‘f027’ and ‘f039’. Then passing 'obs201230:f027' to Context.get_obs will read and return only the timestream and metadata for the ‘f027’ detectors.
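
For example (a sketch; the context file and obs_id are hypothetical):

from sotodlib.core import Context

ctx = Context('context.yaml')
# Load only the 'f027' detectors of this observation:
tod = ctx.get_obs('obs201230:f027')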

obs_loader_type

A string, giving the name of a loader function that should be used to load the TOD. The functions are registered in the module variable sotodlib.core.OBSLOADER_REGISTRY; also see sotodlib.core.context.obsloader_template().

metadata

A list of metadata specs. Each metadata spec is a dict with the schema as described by metadata.MetadataSpec.

context_hooks

A string that identifies a set of functions to hook into particular parts of the Context system. See uses of _call_hook() in the Context code, for any implemented hook function names and their signatures. To use this feature, an imported module must register the hook set in Context.hook_sets (a dict), e.g. Context.hook_sets["my_sim_hooks"] = {"on-context-ready": ...}, and then also set context_hooks: "my_sim_hooks" in the context.yaml.
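
A minimal sketch of how a module might register a hook set (the hook name follows the example above; the function signature is illustrative):

from sotodlib.core import Context

def on_context_ready(*args, **kwargs):
    # Illustrative body; a real hook would inspect the arguments
    # passed by _call_hook() for this hook name.
    print('context is ready')

Context.hook_sets['my_sim_hooks'] = {'on-context-ready': on_context_ready}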

Context system APIs

Context

class sotodlib.core.Context(filename=None, site_file=None, user_file=None, data=None, load_list='all')[source]
__init__(filename=None, site_file=None, user_file=None, data=None, load_list='all')[source]

Construct a Context object. Note this is an ordereddict with a few attributes added on.

Parameters:
  • filename (str) – Path to the dataset context file. If None, that’s fine.

  • site_file (str) – Path to the site file. If None, then the value of SOTODLIB_SITECONFIG environment variable is used; unless that’s unset in which case the file site.yaml in the current directory will be used.

  • user_file (str) – Path to the user file. If None, then the value of SOTODLIB_USERCONFIG environment variable is used; unless that’s unset in which case the file ~/.sotodlib.yaml will be used.

  • data (dict or None) – Optional dict of context data to merge in, after loading the site, user and main context files. Note the data are merged in with the usual rules (so items in data[‘tags’] will be merged into self[‘tags’].)

  • load_list (str or list) – A list of databases to load; some combination of ‘obsdb’, ‘detdb’, ‘obsfiledb’, or the string ‘all’ to load all of them (default).

reload(load_list='all')[source]

Load (or reload) certain databases associated with this dataset. (Note we don’t load any per-observation metadata here.)

get_obs(obs_id=None, dets=None, samples=None, filename=None, detsets=None, meta=None, ignore_missing=None, on_missing=None, free_tags=None, no_signal=None, loader_type=None)[source]

Load TOD and supporting metadata for some observation.

Most arguments to this function are also accepted by (and in fact passed directly to) get_meta(), but are documented here.

Parameters:
  • obs_id (multiple) – The observation to load (see Notes).

  • dets (list, array, dict or ResultSet) – The detectors to read. If None, all dets will be read.

  • samples (tuple of ints) – The start and stop sample indices. If None, read all samples. (Note that some loader functions might not support this argument.)

  • filename (str) – The path to a file to load, instead of using obs_id. It is still required that this file appear in the obsfiledb, but this shortcut will automatically determine the obs_id and the detector and sample range selections that correspond to this single file.

  • detsets (list, array) – The detsets to read (with None equivalent to requesting all detsets).

  • meta (AxisManager) – An AxisManager returned by get_meta (though possibly with additional axis restrictions applied) to use as a starting point for detector selection and sample range. (This will eventually be passed back to get_meta in the meta= argument, to fill in any missing metadata fields.)

  • free_tags (list) – Strings to match against the obs_colon_tags fields for detector restrictions.

  • ignore_missing (bool) – If True, don’t fail when a metadata item can’t be loaded, just try to proceed without it.

  • on_missing (dict) – If a metadata entry has a label that matches a key in this dict, the corresponding value in this dict will override the on_missing setting from the metadata entry.

  • no_signal (bool) – If True, the .signal will be set to None. This is a way to get the axes and pointing info without the (large) TOD blob. Not all loaders may support this.

  • loader_type (str) – Name of the registered TOD loader function to use (this will override whatever is specified in context.yaml).

Notes

It is acceptable to pass the obs_id argument by position (first), but all other arguments should be passed by keyword.

The obs_id can be any of the following:

  • a string – this is interpreted as the literal obs_id as used in the ObsDb and ObsFileDb. Note however that this string may include “free tags” (see below).

  • a dict – this is understood to be an ObsDb record, and the value under key ‘obs_id’ will be used as the obs_id (the other items will be ignored).

  • an AxisManager – this is a short-hand for passing an object through meta=…; i.e., get_obs(obs_id=axisman) is treated the same way as get_obs(obs_id=None, meta=axisman).

Detector subselection is achieved through the dets argument. If this is a dict, the keys must all be fields appearing in det_info. Typically det_info will include at least readout_id and detset (this is the indexing information from ObsFileDb). Some examples are:

dets={'readout_id': ['det_00', 'det_01']}
dets={'detset': 'wafer21'}
dets={'band': ['f090']}
dets={'detset': ['wafer21', 'wafer22'], 'band': ['f150']}

Each value in dets can be a single item, or a list or numpy array of items. The keys may include an optional ‘dets:’ prefix.

If dets is passed as a list or numpy array, that is equivalent to passing that value in through a dict with key ‘readout_id’; e.g.:

dets=['det_00', 'det_01']

You can instead pass a “det_info” ResultSet directly into the dets argument; that is equivalent to passing dets=det_info[‘readout_id’]. This is to accommodate the following sort of pattern:

det_info = context.get_det_info(obs_id)
det_info = det_info.subset(rows=(det_info['band'] == 'f090'))
tod = context.get_obs(obs_id, dets=det_info)

The sample range to load is determined by the samples argument. Use Python start/stop indexing; for example samples=(0, -2) will try to read all but the last two samples and samples=(100, None) will read all samples except the first 100.
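
For instance (a sketch, with a made-up obs_id):

# Read only the first 10000 samples of the observation:
tod = context.get_obs('obs123456', samples=(0, 10000))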

When passing in meta, the obs_id, detector list, and sample range will be extracted from that object. It is an error to also specify obs_id, dets, samples, filename, or free_tags (but this could change).

get_meta(obs_id=None, dets=None, samples=None, filename=None, detsets=None, meta=None, free_tags=None, check=False, ignore_missing=False, on_missing=None, det_info_scan=False)[source]

Load supporting metadata for an observation and return it in an AxisManager.

The arguments shared with get_obs() (obs_id, dets, samples, filename, detsets, meta, free_tags, ignore_missing, on_missing) have the same meaning as in that function and are treated in the same way.

Parameters:
  • check (bool) – If True, run in a check mode where an attempt is made to load each metadata entry, but the results are not kept; instead the function returns a report on what entries could / could not be loaded.

  • det_info_scan (bool) – If True, only process the metadata entries that explicitly modify det_info.

Returns:

AxisManager with a .dets LabelAxis and .det_info and .obs_info entries. If samples is specified, or if any metadata loads triggered its creation, then the .samps OffsetAxis is also created.

Notes

When meta is passed in, it will be used to figure out the obs_id and detector and sample selections; however a new metadata AxisManager is returned. Users should not rely on this; future improvements might modify meta in place, and try to re-use entries already present rather than loading them a second time.

get_det_info(obs_id=None, dets=None, samples=None, filename=None, detsets=None, meta=None, free_tags=None, on_missing=None)[source]

Pass all arguments to get_meta() (with det_info_scan=True), and then return only the det_info, as a ResultSet.

obsloader

Data formats are abstracted in the Context system, and “obsloader” functions provide the implementations to load data for a particular storage format. The API is documented in the obsloader_template function:

sotodlib.core.context.obsloader_template(db, obs_id, dets=None, prefix=None, samples=None, no_signal=None, **kwargs)[source]

This function is here to document the API for “obsloader” functions used by the Context system. “obsloader” functions are used to load time-ordered detector data (rather than supporting metadata) from file archives, and return an AxisManager.

Parameters:
  • db (ObsFileDB) – The database supporting the data files.

  • obs_id (str) – The obs_id (as recognized by ObsFileDb).

  • dets (list of str) – The dets to load. If None, all dets are loaded. If an empty list, ancillary data for the observation is still loaded.

  • samples (tuple) – The (start, end) indices of samples which should be loaded. If start is None, 0 is used. If end is None, sample_count is used. Passing None is equivalent to passing (None, None).

  • prefix (str) – The root address of the data files, if not already known to the ObsFileDb. (This is passed through to ObsFileDb prefix= argument.)

  • no_signal (bool) – If True, loader should avoid reading signal data (if possible) and should set .signal=None in the output. Passing None is equivalent to passing False.

Notes

This interface is subject to further extension. When possible such extensions should take the form of optional arguments, whose default value is None and which are not activated except when needed. This permits existing loaders to future-proof themselves by including **kwargs in the function signature but raising an exception if kwargs contains anything strange. See the body of this example function for template code to reject unexpected kwargs.

Returns:

An AxisManager with the data.
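
As a sketch of how a custom obsloader might be written and registered (the registry key and the body are illustrative; a real loader must populate the AxisManager from the files listed in the ObsFileDb):

from sotodlib import core

def my_obsloader(db, obs_id, dets=None, prefix=None, samples=None,
                 no_signal=None, **kwargs):
    # Reject unexpected kwargs, as recommended by obsloader_template.
    if any(v is not None for v in kwargs.values()):
        raise RuntimeError(f"This loader does not understand kwargs: {kwargs}")
    # ... determine the file list from db, resolve dets / samples /
    # no_signal, read the data, and build an AxisManager ...
    aman = core.AxisManager(core.LabelAxis('dets', dets or []))
    return aman

core.OBSLOADER_REGISTRY['my_format'] = my_obsloader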

Metadata

Background

The “metadata” we are talking about here consists of things like detector pointing offsets, calibration factors, detector time constants, and so on. It may also include more complicated or voluminous data, such as per-sample flags, or signal modes. The purpose of the “metadata system” in sotodlib is to help identify, load, and correctly label such supporting ancillary data for a particular observation and set of detectors.

Storing metadata information on disk requires both a Metadata Archive, which is a set of files containing the actual data, and a Metadata Index, represented by the ManifestDb class, which encodes instructions for associating metadata information from the archive with particular observations and detector sets.

When the Context system processes a metadata entry, it will make use of the DetDb and ObsDb in its interactions with the ManifestDb. Ultimately the system will load numbers and associate them with detectors in a particular observation. But a Metadata Archive does not need to have separate records of each number, for each detector and each observation. When appropriate, the data could be stored so that there is a single value for each detector, applicable for an entire season. Or there could be different calibration numbers for each wafer in an array, that change depending on the observation ID. As long as the ObsDb and DetDb know about the parameters being used to index things in the Metadata Archive (e.g., the timestamp of observations and the wafer for each detector), the system can support resolving the metadata request, loading the results, and broadcasting them to their intended targets.

Examples

These examples demonstrate the creation of a metadata archive and index, and adding an entry to the context.yaml metadata list.

Example 1

Let’s say we want to build an HDF5 file containing a number, thing, for each detector in each observation:

from sotodlib.core import Context, metadata
import sotodlib.io.metadata as io_meta

context = Context('context_file_no_thing.yaml')
obs_rs = context.obsdb.query()
h5_file = 'thing.h5'

for i in range(len(obs_rs)):
    obs_id = obs_rs[i]['obs_id']
    aman = context.get_obs(obs_id)
    # calculate_thing() stands in for whatever analysis produces one
    # value per detector.
    things = calculate_thing(aman)
    thing_rs = metadata.ResultSet(keys=['dets:readout_id', 'thing'])
    for d, det in enumerate(aman.dets.vals):
        thing_rs.rows.append((det, things[d]))
    io_meta.write_dataset(thing_rs, h5_file, f'thing_for_{obs_id}')

Once we’ve built the lower level HDF5 file we need to add it to a metadata index:

scheme = metadata.ManifestScheme()
scheme.add_exact_match('obs:obs_id')
scheme.add_data_field('dataset')

db = metadata.ManifestDb(scheme=scheme)
for i in range(len(obs_rs)):
    obs_id = obs_rs[i]['obs_id']
    db.add_entry({'obs:obs_id': obs_id,
                  'dataset': f'thing_for_{obs_id}',},
                   filename=h5_file)

db.to_file('thing_db.gz')

Then have a new context file that includes:

metadata:
    - db: 'thing_db.gz'
      unpack: 'thing'

Using that context file:

context = Context('context_file_with_thing.yaml')
aman = context.get_obs(your_favorite_obs_id)

will return an AxisManager that includes aman.thing for that specific observation.

Example 2

In this example, we loop over observations found in an ObsDb and create an AxisManager for each one that contains new, interesting supporting data. The AxisManager is written to HDF5, and the ManifestDb is updated with indexing information so the metadata system can find the right dataset automatically.

Here is the script to generate the HDF5 and ManifestDb on disk:

from sotodlib import core

# We will create an entry for every obs found in this context.
ctx = core.Context('context/context-basic.yaml')

# Set up our scheme -- one result per obs_id, to be looked up in an
# archive of HDF5 files at address stored in dataset.
scheme = core.metadata.ManifestScheme()
scheme.add_exact_match('obs:obs_id')
scheme.add_data_field('dataset')
man_db = core.metadata.ManifestDb(scheme=scheme)

# Path for the ManifestDb
man_db_filename = 'my_new_info.sqlite'

# Use a single HDF5 file for now.
output_filename = 'my_new_info.h5'

# Loop over obs.
obs = ctx.obsdb.get()
for obs_id in obs['obs_id']:
    print(obs_id)

    # Load the observation, without signal, so we can see the samps
    # and dets axes, timestamps, etc.
    tod = ctx.get_obs(obs_id, no_signal=True)

    # Create an AxisManager using tod's axes.  (Note that the dets
    # axis is required for compatibility with the metadata system,
    # even if you're not going to use it.)
    new_info = core.AxisManager(tod.dets, tod.samps)

    # Add some stuff to it
    new_info.wrap_new('my_special_vector', shape=('samps', ))

    # Save the info to HDF5
    h5_address = obs_id
    new_info.save(output_filename, h5_address, overwrite=True)

    # Add an entry to the database
    man_db.add_entry({'obs:obs_id': obs_id, 'dataset': h5_address},
                      filename=output_filename)

# Commit the ManifestDb to file.
man_db.to_file(man_db_filename)

The new context.yaml file, that includes this new metadata, would have a metadata list that includes:

metadata:
  - db: 'my_new_info.sqlite'
    unpack: 'new_info'

If you want to load the single vector my_special_vector into the top level of the AxisManager, under name special, use this syntax:

metadata:
  - db: 'my_new_info.sqlite'
    unpack: 'special&my_special_vector'

Example 3

In this example we store a calibration table, with a separate number for each wafer slot and passband combination.

from sotodlib import core
import sotodlib.io.metadata as io_meta

# This cal result has one number per wafer x passband.
data = [
    ('ws0', 'f090', 1.02),
    ('ws1', 'f090', 1.00),
    ('ws2', 'f090', 1.08),
    ('ws0', 'f150', 0.95),
    ('ws1', 'f150', 0.98),
    ('ws2', 'f150', 0.96),
    ('ws0', 'NC',   1.00),
    ('ws1', 'NC',   1.00),
    ('ws2', 'NC',   1.00),
]

# Write to HDF5
rs = core.metadata.ResultSet(
    keys=['dets:wafer_slot', 'dets:bandpass', 'cal'])
rs.rows = data
io_meta.write_dataset(rs, 'db.h5', 'bandpass_cal')

# Record in ManifestDb.
db = core.metadata.ManifestScheme(). \
    add_data_field('dataset'). \
    new_db()
db.add_entry({'dataset': 'bandpass_cal'}, filename='db.h5')
db.to_file('db.sqlite')
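
A context.yaml metadata entry to attach this product might look like the following (the unpack target name is arbitrary):

metadata:
  - db: 'db.sqlite'
    unpack: 'bandpass_cal&cal'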

Example 4

This example uses an AxisManager to store a generic data structure that is not tied to specific detectors (and also does not refer to the “samps” axis). It also uses obs:timestamp to index the results.

from sotodlib import core

# Model data (t0, t1, scale, offset)
data = [
    (1700000000., 1800000000., 1.00, 4.50),
    (1800000000., 1900000000., 0.98, 4.42),
]

# Get a new database ready
db = core.metadata.ManifestScheme(). \
    add_range_match('obs:timestamp'). \
    add_data_field('dataset'). \
    new_db()

# Store each dataset to HDF5 and add record to ManifestDb.
filename = 'db.h5'
for dset, (t0, t1, scale, offset) in enumerate(data):
    group = f'encoder_model_{dset}'
    output = core.AxisManager()
    output.wrap('model_version', 'v1')
    output.wrap('scale', scale)
    output.wrap('offset', offset)
    output.save(filename, group=group)
    db.add_entry({'obs:timestamp': (t0, t1), 'dataset': group},
                 filename=filename)

# Save db.
db.to_file('db.sqlite')
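
To check what the index resolves to for a particular observation time, the database can be queried directly (a sketch; the timestamp value is arbitrary):

# Query the index for an observation at a given time:
print(db.match({'obs:timestamp': 1750000000.}))
# e.g. {'filename': 'db.h5', 'dataset': 'encoder_model_0'}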

Tips

If these examples are almost, but not quite, what you need, consider the following:

  • You can use multiple HDF5 files in your Metadata Archive – the filename is a parameter to db.add_entry. That helps to keep your HDF5 files a manageable size, and is good practice in cases where there are regular (e.g. daily or hourly) updates to an archive.

  • You can use an archive format other than HDF5 (if you must), see the example in Custom Archive Formats.

  • The entries in the Index do not have to be per-obs_id. You can associate results to ranges of time or to other fields in the ObsDb. See examples in Metadata Indexes.

  • The entries in the Archive do not need to be per-detector. You can specify results for a whole group of detectors, if that group is enumerated in the DetDb (or added to det_info using other metadata). For example, if DetDb contains a column passband, the dataset could contain columns dets:passband and cal and simply report one calibration number for each frequency band. (On load, the Context Metadata system will automatically broadcast the cal number so that it has shape (n_dets,) in the fully populated AxisManager.)

  • You can store all the results (i.e., results for multiple obs_id) in a single HDF5 dataset. This is not usually a good idea if your results are per-detector, per-observation… the dataset will be huge, and not easy to update incrementally. But for smaller things (one or two numbers per observation, as in the dets:passband example above) it can be convenient. Doing this requires including obs:obs_id (or some other ObsDb column) in the dataset.
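
As a sketch of that last approach, a single dataset can carry results for several observations by including an obs:obs_id column (the values here are made up):

from sotodlib.core import metadata
import sotodlib.io.metadata as io_meta

rs = metadata.ResultSet(keys=['obs:obs_id', 'dets:passband', 'cal'])
rs.rows.extend([
    ('obs123456', 'f090', 1.02),
    ('obs123456', 'f150', 0.97),
    ('obs123500', 'f090', 1.01),
    ('obs123500', 'f150', 0.99),
])
io_meta.write_dataset(rs, 'per_band_cal.h5', 'cal_by_obs')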

Metadata Archives

A Metadata Archive is a set of files that hold metadata to be accessed by the Context/Metadata system. The metadata system is designed to be flexible with respect to how such archives are structured, at least in terms of the numbers and formats of the files.

A Metadata Instance is a set of numerical and string data, stored at a particular address within a Metadata Archive. The Instance will typically have information organized in a tabular form, equivalent to a grid where the columns have names and each row contains a set of coherent data. Some of the columns describe the detectors or observations to which the row applies. The other columns provide numerical or string information that is to be associated with those detectors or observations.

For example, consider a table with columns called dets:id and cal. The entry for dets:id in each row specifies what detector ID the number in the cal column of that row should be applied to. The indexing information in dets:id is called intrinsic indexing information, because it is carried within the Metadata Instance. In contrast, extrinsic indexing information is specified elsewhere. Usually, extrinsic indexing information is provided through the Metadata Index (ManifestDb).

One possible form for an archive is a set of HDF5 files, where simple tabular data are stored in datasets. This type of archive is used for the reference implementation of the Metadata system interface, so we will describe support for that first and then later go into using other formats.

ResultSet over HDF5

The ResultSet is a container for tabular data. Functions are available in sotodlib.io.metadata for reading and writing ResultSet data to HDF5 datasets.

Here’s an example using h5py that creates a compatible dataset, and writes it to an HDF5 file:

import h5py
import numpy as np

obs_id = 'obs123456'
timeconst = np.array([('obs123456', 0.001)], dtype=[('obs:obs_id', 'S20'),
                                                    ('timeconst', float)])
with h5py.File('test.h5', 'w') as fout:
    fout.create_dataset('timeconst_for_obs123456', data=timeconst)

Here’s the equivalent operation, accomplished using ResultSet and write_dataset:

from sotodlib.core import metadata
from sotodlib.io.metadata import write_dataset

timeconst = metadata.ResultSet(keys=['obs:obs_id', 'timeconst'])
timeconst.rows.append(('obs123456', 0.001))

write_dataset(timeconst, 'test2.h5', 'timeconst_for_obs123456', overwrite=True)

The advantages of using write_dataset instead of h5py primitives are:

  • You can pass it a ResultSet directly and it will handle creating the right types.

  • Passing overwrite=True will handle the removal of any existing entries at the target path.

To inspect datasets in a HDF5 file, you can load them using h5py primitives, or with read_dataset:

from sotodlib.io.metadata import read_dataset

timeconst = read_dataset('test2.h5', 'timeconst_for_obs123456')

The returned object looks like this:

>>> print(timeconst)
ResultSet<[obs:obs_id,timeconst], 1 rows>

The metadata handling code does not use read_dataset. Instead it uses ResultSetHdfLoader, which has some optimizations for loading batches of metadata from HDF5 files and datasets, and will forcefully reconcile any columns prefixed by obs: or dets: against the provided request (using detdb and obsdb, potentially). Loading the time constants for obs123456 is done like this:

from sotodlib.io.metadata import ResultSetHdfLoader

loader = ResultSetHdfLoader()

request = {'filename': 'test2.h5',
           'dataset': 'timeconst_for_obs123456',
           'obs:obs_id': 'obs123456'}

timeconst = loader.from_loadspec(request)

The resulting object looks like this:

>>> print(timeconst)
ResultSet<[timeconst], 1 rows>

Note the obs:obs_id column is gone – it was taken as index information, and matched against the obs:obs_id in the request.

Custom Archive Formats

HDF5 is cool but sometimes you need or want to use a different storage system. Setting up a custom loader function involves the following:

  • A loader class that can read the metadata from that storage system, respecting the request API.

  • A module, containing the loader class, that also registers the loader class with sotodlib under a particular loader name.

  • A ManifestDb data field called loader, with the value set to the loader name.

Here’s a sketchy example. We start by defining a loader class, that will read a number from a text file:

from sotodlib.io import metadata
from sotodlib.core.metadata import ResultSet, SuperLoader, LoaderInterface

class TextLoader(LoaderInterface):
    def from_loadspec(self, load_params):
        with open(load_params['filename']) as fin:
            the_answer = float(fin.read())
        rs = ResultSet(keys=['answer'])
        rs.rows.append((the_answer, ))
        return rs

SuperLoader.register_metadata('text_loader', TextLoader)

Let’s suppose that code (including the SuperLoader business) is in a module called othertel.textloader. To get this code to run whenever we’re working with a certain dataset, add it to the imports list in the context.yaml:

# Standard i/o import, and TextLoader for othertel.
imports:
  - sotodlib.io.metadata
  - othertel.textloader

Now for the ManifestDb:

scheme = metadata.ManifestScheme()
scheme.add_exact_match('obs:obs_id')
scheme.add_data_field('loader')

db = metadata.ManifestDb(scheme=scheme)
db.add_entry({'obs:obs_id': 'obs12345',
              'loader': 'text_loader'},
             filename='obs12345_timeconst.txt')
db.add_entry({'obs:obs_id': 'obs12600',
              'loader': 'text_loader'},
             filename='obs12600_timeconst.txt')

Now if a metadata request is made for obs12345, for example, a single number will be loaded from obs12345_timeconst.txt.

Note the thing returned by TextLoader.from_loadspec is a ResultSet. Presently the only types you can return from a loader class function are ResultSet and AxisManager.
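
Assuming the ManifestDb above is saved to disk, a context.yaml metadata entry using it might look like this (the db filename and unpack target are illustrative):

metadata:
  - db: 'timeconst_txt.sqlite'
    unpack: 'timeconst&answer'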

Metadata Indexes

In the last example above, a request dictionary was passed to ResultSetHdfLoader, providing instructions for locating a particular result. Such request dictionaries will normally be generated by a ManifestDb object, which is connected to an sqlite3 database that provides a means for converting high-level requests for metadata into specific request dictionaries.

The database behind a ManifestDb has 2 main tables. One of them is a table with columns for Index Criteria and Endpoint Data. The Index Criteria columns are intended to be matched against observation metadata, such as the obs_id or the timestamp of the observation. Endpoint Data contain a filename and other instructions required to locate and load the data, as well as additional restrictions to put on the result.

Please see the class documentation for ManifestDb and ManifestScheme. The remainder of this section demonstrates some basic usage patterns.

Examples

Example 1: Observation ID

The simplest Metadata index will translate an obs_id to a particular dataset in an HDF file. The ManifestScheme for this case is constructed as follows:

from sotodlib.core import metadata
scheme = metadata.ManifestScheme()
scheme.add_exact_match('obs:obs_id')
scheme.add_data_field('dataset')

Then we can instantiate a ManifestDb using this scheme, add some data rows, and write the database (including the scheme) to disk:

db = metadata.ManifestDb(scheme=scheme)
db.add_entry({'obs:obs_id': 'obs123456', 'dataset': 'timeconst_for_obs123456'},
             filename='test2.h5')
db.add_entry({'obs:obs_id': 'obs123500', 'dataset': 'timeconst_for_obs123500'},
             filename='test2.h5')
db.add_entry({'obs:obs_id': 'obs123611', 'dataset': 'timeconst_for_obs123611'},
             filename='test2.h5')
db.add_entry({'obs:obs_id': 'obs123787', 'dataset': 'timeconst_for_obs123787'},
             filename='test2.h5')
db.to_file('timeconst.gz')
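
The request dictionary produced by this index for a particular observation can be inspected directly (a sketch):

request = db.match({'obs:obs_id': 'obs123456'})
# e.g. {'filename': 'test2.h5', 'dataset': 'timeconst_for_obs123456'}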

Example 2: Inspecting and modifying the index

Starting from the previous example, suppose we were updating the Index in a cronjob and needed to first check whether we had already entered some entry. We can use ManifestDb.inspect() to retrieve records, matching on any fields (they don’t have to be index fields). Here’s a quick set of examples:

# Do we have any items in the file called "test2.h5"?
entries = db.inspect({'filename': 'test2.h5'})
if len(entries) > 0:
  # yes, we do ...

# Have we already added the item for obs_id='obs123787'?
entries = db.inspect({'obs:obs_id': 'obs123787'})
if len(entries) == 0:
  # no, so add it ...

Entries retrieved using inspect are dicts and contain an ‘_id’ element that allows you to modify or delete those records from the Index, using ManifestDb.update_entry() and ManifestDb.remove_entry(). For example:

# Delete all entries that refer to test2.h5.
for entry in db.inspect({'filename': 'test2.h5'}):
  db.remove_entry(entry)

# Change the spelling of 'timeconst' in the 'dataset' field of all records.
for entry in db.inspect({}):
  entry['dataset'] = entry['dataset'].replace('timeconst', 'TimECOnSt')
  # currently it's not possible to change the filename, so don't mention it...
  del entry['filename']
  db.update_entry(entry)

Example 3: Timestamp

Another common use case is to map to a result based on an observation’s timestamp instead of obs_id. The standardized key for timestamp is obs:timestamp, and we include it in the scheme with add_range_match instead of add_exact_match:

scheme = metadata.ManifestScheme()
scheme.add_range_match('obs:timestamp')
scheme.add_data_field('dataset')

db = metadata.ManifestDb(scheme=scheme)
db.add_entry({'obs:timestamp': (123400, 123600),
              'dataset': 'timeconst_for_early_times'},
              filename='test2.h5')
db.add_entry({'obs:timestamp': (123600, 123800),
              'dataset': 'timeconst_for_late_times'},
              filename='test2.h5')
db.to_file('timeconst_by_timestamp.gz')

In this case, when we add entries to the ManifestDb, we pass a tuple of timestamps (lower inclusive limit, higher non-inclusive limit) for the key obs:timestamp.

Example 4: Other observation selectors

Other fields from ObsDb can be used to build the Metadata Index. While timestamp or obs_id are quite general, a more compact and direct association can be made if ObsDb contains a field that is more directly connected to the metadata.

For example, suppose there was an intermittent problem with a subset of the detectors that requires us to discard those data from analysis. The problem occurred randomly, but it could be identified and each observation could be classified as either having that problem or not. We decide to eliminate those bad detectors by applying a calibration factor of 0 to the data.

We create an HDF5 file called bad_det_issue.h5 with two datasets:

  • cal_all_ok: has columns dets:name (listing all detectors) and cal, where cal is all 1s.

  • cal_mask_bad: same but with cal=0 for the bad detectors.

We update the ObsDb we are using to include a column bad_det_issue, and for each observation we set it to value 0 (if the problem is not seen in that observation) or 1 (if it is).

We build the Metadata Index to select the right dataset from bad_det_issue.h5, depending on the value of bad_det_issue in the ObsDb:

scheme = metadata.ManifestScheme()
scheme.add_exact_match('obs:bad_det_issue')
scheme.add_data_field('dataset')

db = metadata.ManifestDb(scheme=scheme)
db.add_entry({'obs:bad_det_issue': 0,
              'dataset': 'cal_all_ok'},
              filename='bad_det_issue.h5')
db.add_entry({'obs:bad_det_issue': 1,
              'dataset': 'cal_mask_bad'},
              filename='bad_det_issue.h5')
db.to_file('cal_bad_det_issue.gz')

The context.yaml metadata entry would probably look like this:

metadata:
  ...
  - db: '{metadata_lib}/cal_bad_det_issue.gz'
    unpack: 'cal_remove_bad&cal'
  ...

ManifestDb reference

The class documentation of ManifestDb should appear below.

class sotodlib.core.metadata.ManifestDb(map_file=None, scheme=None)[source]

Expose a map from Index Data to Endpoint Data, including a filename.

__init__(map_file=None, scheme=None)[source]

Instantiate a database. If map_file is provided, the database will be connected to the indicated sqlite file on disk, and any changes made to this object will be written back to the file.

If scheme is None, the scheme will be loaded from the database; pass scheme=False to prevent that and leave the db uninitialized.

copy(map_file=None, overwrite=False)[source]

Duplicate the current database into a new database object, and return it. If map_file is specified, the new database will be connected to that sqlite file on disk. Note that a quick way of writing a Db to disk is to call copy(map_file=…) and then simply discard the returned object.

to_file(filename, overwrite=True, fmt=None)[source]

Write the present database to the indicated filename.

Parameters:
  • filename (str) – the path to the output file.

  • overwrite (bool) – whether an existing file should be overwritten.

  • fmt (str) – ‘sqlite’, ‘dump’, or ‘gz’. Defaults to ‘sqlite’ unless the filename ends with ‘.gz’, in which case it is ‘gz’.

classmethod from_file(filename, fmt=None, force_new_db=True)[source]

Instantiate a ManifestDb and return it, with the data copied in from the specified file.

Parameters:
  • filename (str) – path to the file.

  • fmt (str) – format of the input; see to_file for details.

  • force_new_db (bool) – Used if connecting to an sqlite database. If True, the database is copied into memory; if False, a read-only connection to the on-disk database is returned, without reading it into memory.

Returns:

ManifestDb with an sqlite3 connection that is mapped to memory.

Notes

Note that if you want a persistent connection to the file, you should instead pass the filename to the ManifestDb constructor map_file argument.

classmethod readonly(filename)[source]

Instantiate a ManifestDb connected to an sqlite database on disk, and return it. The database remains mapped to disk, in readonly mode.

Parameters:

filename (str) – path to the file.

Returns:

ManifestDb.

match(params, multi=False, prefix=None)[source]

Given Index Data, return Endpoint Data.

Parameters:
  • params (dict) – Index Data.

  • multi (bool) – Whether more than one result may be returned.

  • prefix (str or None) – If set, it will be os.path.join-ed to the filename from the db.

Returns:

A dict of Endpoint Data, or None if no match was found. If multi=True then a list is returned, which could have 0, 1, or more items.

inspect(params={}, strict=True, prefix=None)[source]

Given (partial) Index Data and Endpoint Data, find and return the complete matching records.

Parameters:
  • params (dict) – any mix of Index Data and Endpoint Data.

  • strict (bool) – if True, a ValueError will be raised if params contains any keys that aren’t recognized as Index or Endpoint data.

  • prefix (str or None) – As in .match().

Returns:

A list of results matching the query. Each result in the list is a dict, containing complete entry data. A special entry, ‘_id’, is the database row id and can be used to update or remove specific entries.

add_entry(params, filename=None, create=True, commit=True, replace=False)[source]

Add an entry to the map table.

Parameters:
  • params – a dict of values for the Index Data columns. In the case of ‘range’ columns, a pair of values must be provided. Endpoint Data, other than the filename, should also be included here.

  • filename – the filename to associate with matching Index Data.

  • create – if False, do not create new entry in the file table (and fail if entry for filename does not already exist).

  • commit – if False, do not commit changes to the database (for batch use).

  • replace – if True, do not raise an error if the index data matches a row of the table already; instead just update the record.

Notes

The uniqueness check in the database will only prevent (or replace) identical index entries. Other inconsistent states, such as overlapping time ranges that would both match some single timestamp, are not caught here.

update_entry(params, filename=None, commit=True)[source]

Update an existing entry.

Parameters:

params – Index data to change. This must include key ‘_id’, with the value corresponding to an existing row in the table.

Notes

Only columns expressly included in params will be updated. The params can include ‘filename’, in which case a new value is set.

remove_entry(_id, commit=True)[source]

Remove the entry identified by row id _id.

If _id is a dict, _id[‘_id’] is used. Entries returned by .inspect() should have _id populated in this way, and thus can be passed directly into this function.

get_entries(fields)[source]

Return list of all entry names in database that are in the listed fields

Parameters:

fields (list of strings) – should correspond to columns in map table made through ManifestScheme.add_data_field( field_name )

Return type:

ResultSet with keys equal to field names

validate()[source]

Checks that the database is following internal rules. Specifically…

Raises SchemaError in the first case, IntervalError in the second.

ManifestScheme reference

The class documentation of ManifestScheme should appear below.

class sotodlib.core.metadata.ManifestScheme[source]
__init__()[source]
new_db(**kwargs)[source]

Use this scheme to instantiate a ManifestDb, and return it. All kwargs are passed to the ManifestDb constructor.

add_exact_match(name, dtype='numeric')[source]

Add a field to the scheme, that must be matched exactly.

add_range_match(name, purpose='in', dtype='numeric')[source]

Add a field to the scheme, that represents a range of values to be matched by a single input value.

add_data_field(name, dtype='numeric')[source]

Add a field to the scheme that is returned along with the matched filename.

as_resultset()[source]

Get the scheme structure as a ResultSet. This is a safer alternative to inspecting .cols directly.

classmethod from_database(conn, table_name='input_scheme')[source]

Decode a ManifestScheme from the provided sqlite database connection.

get_match_query(params, partial=False, strict=False)[source]

Get sql query fragments for this ManifestDb.

Parameters:
  • params – a dict of values to match against.

  • partial – if True, then operate in “inspection” mode (see notes).

  • strict – if True, then reject any requests that include entries in params that are not known to the schema.

Returns:

(where_string, values_tuple, ret_cols)

Notes

The normal mode (partial=False) requires that every “in” column in the scheme has a key=value pair in the params dict, and the ret_cols are the “out” columns. In inspection mode (partial=True), then any column can be matched against the params, and the complete row data of all matching rows is returned.

get_insertion_query(params)[source]

Get sql query fragments for inserting a new entry with the provided params.

Returns:

(fields, values) where fields is a string with the field names (comma-delimited) and values is a tuple of values.

get_update_query(params)[source]

Get sql query fragments for updating an entry.

Returns:

(setstring, values) where setstring is of the form “A=?,…,Z=?” and values is the corresponding tuple of values.

get_required_params()[source]

Returns a list of parameter names that are required for matching.

Metadata Request Processing

Metadata loading is triggered automatically when Context.get_obs() (or get_meta) is called. The parameters to get_obs define an observation of interest, through an obs_id, as well as (potentially) a limited set of detectors of interest. Processing the metadata request may require the code to refer to the ObsDb for more information about the specified obs_id, and to the DetDb or det_info dataset for more information about the detectors.

Steps in metadata request processing

For each item in the context.yaml metadata entry, a series of steps are performed:

  1. Read ManifestDb.

    The db file specified in the metadata entry is loaded into memory.

  2. Promote Request.

    The metadata request is likely to include information about what observation and what detectors are of interest. But the ManifestDb may desire a slightly different form for this information. For example, an obs_id might be provided in the request, but the ManifestDb might index its results using timestamps (obs:timestamp). In the Promote Request step, the ManifestDb is interrogated for what obs and dets fields it requires as Index Data. If those fields are not already in the request, then the request is augmented to include them; this typically requires interaction with the ObsDb and/or DetDb and det_info. The augmented request is the result of the Promotion Step.

  3. Get Endpoints.

    Once the augmented request is computed, the ManifestDb can be queried to see what endpoints match the request. The ManifestDb will return a list of Endpoint results. Each Endpoint result describes the location of some metadata (i.e. a filename and possibly some address within that file), as well as any limitations on the applicability of that data (e.g. it may specify that although the results include values for 90 GHz and 150 GHz detectors, only the results for 150 GHz detectors should be kept). The metadata results are not yet loaded.

  4. Read Metadata.

    Each Endpoint result is processed and the relevant files accessed to load the specified data products. The data within are trimmed to only include items that were actually requested by the Index data (for example, although results for 90 GHz and 150 GHz detectors are included in an HDF5 dataset, the data may be trimmed to only include the 150 GHz detector results). This will yield one metadata result per Endpoint item.

  5. Combine Metadata.

    The individual results from each Endpoint are combined into a single object, of the same type.

  6. Wrap Metadata.

    The metadata object is converted to an AxisManager, and wrapped as specified by the user (this could include storing the entire object as a field; or it could mean extracting and renaming a single field from the result, for example).

Incomplete or missing metadata

Metadata is considered “incomplete” for a particular request, if any of the following are true:

  • No endpoints were returned from the query to the ManifestDb; i.e. the database has no records of what files contain data pertinent to the request.

  • The results loaded from the metadata archive do not cover the requested sample range or detector set completely (this includes the case where there is zero overlap).

When a metadata result is incomplete, the metadata loader code will take one of the following actions:

  • trim: take the limited metadata result, and trim the merged data object down so it contains only the detectors and samples that are found in the metadata result.

  • skip: discard this metadata result; do not include it in the merged data object.

  • fail: raise an error.

In the metadata list in context.yaml, each metadata entry can declare a preferred action using the on_missing key. For example:

metadata:
  - label: optional_relcal
    on_missing: skip
    db: "relcal1.sqlite"
    unpack: "optional_relcal&cal"
  - label: critical_relcal
    on_missing: fail
    db: "relcal2.sqlite"
    unpack: "critical_relcal&cal"

The default behavior is trim. The behavior specified in context.yaml can be overridden through the call to Context.get_obs() or Context.get_meta(); just use the on_missing argument to specify what should happen for metadata with a specific label. Examples:

# Suppose this fails because relcal2.sqlite does not cover the obs ...
tod = ctx.get_obs(obs_id)

# This will instead ignore the result, and not populate critical_relcal.
tod = ctx.get_obs(obs_id, on_missing={'critical_relcal': 'skip'})

Note that “skipping” can have confusing downstream consequences, such as when a det_info entry doesn’t get added and then some metadata product tries to use it as an index field.

Rules for augmenting and using det_info

The metadata loader associates metadata results to individual detectors using fields from the observation’s det_info. For any single observation, the det_info is first initialized from:

  • The ObsFileDb; this provides the unique readout_id for each channel, as well as the detset to which each belongs.

  • If a DetDb has been specified, everything in there is copied into det_info.

From that starting point, additional fields can be loaded into the det_info; these fields can then be used to load metadata indexed in a variety of ways. For the mu-mux readout used in SO, the following additional steps will usually be performed to augment the det_info:

  • A “channel map” (a.k.a. “det map”, “det match”, …) result will be loaded, to associate a det_id with each of the loaded readout_id values (or most of them, hopefully). While the readout_id describes a specific channel of the readout hardware, the det_id corresponds to a particular optical device (e.g. a detector with known position, orientation, and passband).

  • Some “wafer info” will be loaded, containing various properties of the detectors, as designed (for example, their approximate locations; passbands; wiring information; etc.). This is a single table of data for each physical wafer, indexed naturally by det_id, but the rows here can not be associated with each readout_id until we have loaded the “channel map”.

Special metadata entries in context.yaml are used to augment the det_info table; these are marked with det_info: true and do not have an unpack: ... field. For example:

metadata:
- ...
- db: 'more_det_info.sqlite'
  det_info: true
- ...

It is expected that the database (more_det_info.sqlite) is a standard ManifestDb, which will be queried in the usual way, except that at the “Wrap Metadata” step the following is performed instead:

  • All the loaded fields are inspected, and any fields that are already found in the current det_info are used as Index fields.

  • The columns from the new metadata are merged into the active det_info, ensuring that the index field values correspond.

  • Only the rows for which the index field has the same value in the two objects are kept.

Here are a few more tips about det_info:

  • All fields in the det_info metadata should be prefixed with dets:, to signify that they are associated with the dets axis (this is similar to fields used as index fields in standard metadata).

  • The field called dets:readout_id is assumed, throughout the Context/Metadata code, to correspond to the values in the .dets axis of the TOD AxisManager.

  • By convention, dets:det_id is used for the physical detector (or pseudo-detector device) identifier. The special value “NO_MATCH” is used for cases where the det_id could not be determined.

When new det_info are merged in, any fields found in both the existing and new det_info will be used to form the association. Many-to-many matching is fully supported, meaning that a unique index (e.g. readout_id) does not need to be used in the new det_info. However, it is expected that in most cases either readout_id or det_id will be used to label det_info contributions.
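
As a sketch of what a small det_info contribution might look like on disk (the readout_id and det_id values are made up), the table is just a ResultSet whose columns all carry the dets: prefix:

from sotodlib.core import metadata
import sotodlib.io.metadata as io_meta

rs = metadata.ResultSet(keys=['dets:readout_id', 'dets:det_id'])
rs.rows.append(('ufm_mv19_chan_000', 'det_mv19_000'))
rs.rows.append(('ufm_mv19_chan_001', 'NO_MATCH'))
io_meta.write_dataset(rs, 'det_match.h5', 'assignment')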

Metadata System APIs

load_metadata

This function provides a way to load metadata for an AxisManager, from a ManifestDb, without having to encode the result in a context.yaml metadata entry.

sotodlib.core.metadata.load_metadata(tod, spec, unpack=False)[source]

Process a metadata entry for an AxisManager.

Parameters:
  • tod (AxisManager) – The data structure from which to source any obs_info and det_info that are needed to process the metadata specification.

  • spec (dict) – a metadata specification, such as one might find as an element of the “metadata” list in a context.yaml file.

  • unpack (bool) – if True, and if the spec does not identify as det_info, try to unpack the result into an AxisManager and return that. This will result in broadcasting of items indexed by det_info fields into a full .dets axis.

Returns:

The loaded metadata item, which could be an AxisManager or ResultSet. In the AxisManager case, the axes have likely not been resolved against the provided tod, so the sample count and detector count / ordering may be different. (The caller can merge after the fact.)

Notes

The tod container needs to contain obs_info and det_info (including the dets axis), in order to follow any branching instructions for loading the metadata. This would normally be an AxisManager returned by Context.get_obs() or get_meta().
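
A usage sketch (the database path and field names are hypothetical):

from sotodlib.core import Context, metadata

ctx = Context('context.yaml')
meta = ctx.get_meta('obs123456')
# A spec of the kind that would otherwise live in the context.yaml
# metadata list:
spec = {'db': 'extra_cal.sqlite', 'unpack': 'extra_cal&cal'}
extra = metadata.load_metadata(meta, spec, unpack=True)
meta.merge(extra)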

MetadataSpec

class sotodlib.core.metadata.MetadataSpec[source]

Container for the canonical metadata specification.

When constructed from a dict, the following attributes are set directly from the corresponding key:

db (str, ManifestDb)

The path to a ManifestDb file. For testing and other purposes, this may be passed as a ManifestDb object. Defaults to None.

det_info (bool)

If True, treat the metadata as a contribution to det_info. The metadata will be merged into the active det_info object. Defaults to False.

label (str)

A short string describing the metadata. This is used to target the entry when overriding the default on_missing behavior. Defaults to None.

loader (str)

The name of the loader class to use when loading the data. This will take precedence over what is specified in the ManifestDb, and is normally unnecessary but can be used for debugging / work-arounds. Defaults to None.

on_missing (str)

String describing how to proceed in the event that the metadata is incomplete (or missing entirely) for the target Observation. The value should be one of ‘trim’, ‘skip’, or ‘fail’. Defaults to ‘trim’.

unpack (list of str)

Instructions for how to populate the destination AxisManager with fields found in this metadata item. See notes below.

load_fields (list of str or None)

List of fields to load. This may include entire child AxisManagers, or fields within them using “.” for hierarchical addressing. This is only for AxisManager metadata. Default is None, which means to load all fields. Wildcards are not supported.

drop_fields (list of str)

List of fields (which may contain wildcard character *) to drop prior to merging. Only processed for AxisManager metadata. (The dropping is applied after any restrictions on the loading using load_fields).

The following dict keys are deprecated, but are processed for backwards compatibility.

name (str or list of str)

(Deprecated.) This has been renamed as “unpack”, and will be copied into that attribute if unpack is not otherwise set.

Notes

In the unpack list, each entry must be in one of 4 possible forms, shown below to the left of the :. The resulting assignment operation is shown to the right of the :.

'dest_name&source_name'  : dest[dest_name] = source[source_name]
'dest_name&'             : dest[dest_name] = source[dest_name]
'dest_name&*'            : dest[dest_name] = source[wildcard[0]]
'dest_name'              : dest[dest_name] = source

The first three forms cause a single field to be extracted. The 3rd form is used to extract a single field and rename it, assuming that name is the only one in source (it is an error otherwise).

The unpack list may include multiple single field extraction entries, but each source field may only be referenced once. (So for example ['a&a', 'b&b'] is valid but ['a&a', 'b&a'] is not).

The fourth form causes the entire item to be merged into the target at dest_name. This can operate alongside any number of individual field extractions.

Examples

Here is an example context.yaml metadata list, showing some common formations:

metadata:
  # assignment
  - label: assignment
    db: '{metadata_dir}/det_match/satp1_det_match_240220m/assignment.sqlite'
    det_info: true
    on_missing: fail
  # focal_plane
  - label: focal_plane
    db: '{manifestdir}/focal_plane/satp1_focal_plane_240308r1/db.sqlite'
    unpack: focal_plane
    on_missing: trim
  # hwp_angles
  - label: hwp_angles
    db: '{manifestdir}/hwp_angles/satp1_hwp_angles_240301m/hwp_angle.sqlite'
    load_fields:
    - hwp_angle_enc1
    - hwp_flags
    unpack:
    - 'hwp_angle&hwp_angle_enc1'
    - 'hwp_flags&'
    on_missing: skip
  # starcam
  - label: starcam
    db: '{manifestdir}/starcam_solutions/starcam_solutions_240401m/db.sqlite'
    drop_fields: 'image_data_*'
    unpack: starcam
    on_missing: skip

Note that all entries have label and db elements. Each label here happens to be unique, although that is not required. The paths for db all include a tag such as {manifestdir}; this will be replaced by the value assigned to that tag in the tags section of the context.yaml file. Referring to particular entries, by label:

  1. The “assignment” entry declares itself as “det_info: true”. Thus, it does not have an “unpack” key. The data will unpack as a simple table and be merged into the observation’s “det_info”. Because “on_missing: fail”, it is an error if this product can not be fully reconciled against an observation without dropping detectors.

  2. The “focal_plane” entry specifies “unpack: focal_plane”, which means that the entire loaded metadata will be placed into a child AxisManager called “focal_plane”. However “on_missing: trim” means that the focal_plane result does not need to be defined for all detectors. If any are missing, then all data for those dets will be dropped from the loaded observation.

  3. The “hwp_angles” entry has a “load_fields” key, which restricts what data are actually pulled in from the product on disk. This is used in cases where the on-disk product has a lot of data in it that is not needed; specifying that only a small subset of the data is needed can greatly reduce metadata construction time. The value for “unpack” is now a list, indicating that the loaded “hwp_flags” data should be placed directly into “hwp_flags”, while “hwp_angle_enc1” should be renamed to simply “hwp_angle”. The use of “on_missing: skip” means that if this product is not available for this observation, it is ok to simply continue on without it.

  4. The “starcam” entry uses “drop_fields” to discard certain fields from the loaded data, prior to merging it into the observation metadata AxisManager. In practice this doesn’t save much in terms of i/o cost; it’s better to use “load_fields” to explicitly include the list of things you care about. The drop_fields option is aimed at deleting problematic fields, e.g. fields in buggy data that fail to concatenate properly after loading.

SuperLoader

class sotodlib.core.metadata.SuperLoader(context=None, detdb=None, obsdb=None, working_dir=None)[source]
__init__(context=None, detdb=None, obsdb=None, working_dir=None)[source]

Metadata batch loader.

Parameters:
  • context (Context) – context, from which detdb and obsdb will be pulled unless they are specified explicitly.

  • detdb (DetDb) – detdb to use when resolving detector axis.

  • obsdb (ObsDb) – obsdb to use when resolving obs axis.

  • working_dir (str) – base directory for any metadata specified as relative paths. If None and context is not None, the directory containing context.filename is used; otherwise the current working directory is used.

static register_metadata(name, loader_class)[source]

Globally register a metadata “Loader Class”.

Parameters:
  • name (str) – Name under which to register the loader. Metadata archives will request the loader class using this name.

  • loader_class – Metadata loader class.

load_one(spec, request, det_info)[source]

Process a single metadata entry (spec) by loading a ManifestDb and reading metadata for a particular observation. The request must be pre-augmented with all ObsDb info that might be needed. det_info is used to screen the returned data for the various index_lines.

Parameters:
  • spec (dict) – A metadata specification dict (corresponding to a metadata list entry in context.yaml), or MetadataSpec object.

  • request (dict) – A metadata request dict (stating what observation and detectors are of interest).

  • det_info (ResultSet) – Table of detector properties to use when resolving metadata that is indexed with dets:… fields.

Notes

If passing spec as a dict, see the schema described in MetadataSpec.

Any filenames in the ManifestDb that are given as relative paths will be resolved relative to the directory where the db file is found.

The request dict specifies what times and detectors are of interest. If the metadata archive is indexed by timestamp and wafer_slot, then you might pass in:

{'obs:timestamp': 1234567000.,
 'dets:wafer_slot': 'w01'}

When this function is invoked from self.load, the request dict will have been automatically “augmented” using the ObsDb. The main purpose of this is to provide obs:timestamp (and any other useful indexing fields) from ObsDb based on obs:obs_id.

The det_info object comes into play in cases where a loaded metadata result refers to some large group of detectors, but the metadata index, or the user request, expresses that the result should be limited to only a subset of those detectors. This is notated in practice by including dets:… fields in the index data in the ManifestDb, or in the request dict. Only fields already present in det_info may be included in the request dict.

Returns:

A list of tuples (unpacker, item), corresponding to each entry in spec_list. The unpacker is an Unpacker object created based on the ‘name’ field. The item is the metadata in its native format (which could be a ResultSet or AxisManager), with all restrictions specified in request already applied.

load(spec_list, request, det_info=None, free_tags=[], free_tag_fields=[], dest=None, check=False, det_info_scan=False, ignore_missing=False, on_missing=None)[source]

Loads metadata objects and processes them into a single AxisManager.

Parameters:
  • spec_list (list of dicts) – Each dict is a metadata spec, as described in load_one.

  • request (dict) – A request dict.

  • det_info (AxisManager) – Detector info table to use for reconciling ‘dets:…’ field restrictions.

  • free_tags (list of str) – Strings that restrict the detector selection to detectors matching the string in any of the det_info fields listed in free_tag_fields.

  • free_tag_fields (list of str) – Fields (of the form dets:x) that can be inspected to match free_tags.

  • dest (AxisManager or None) – Destination container for the metadata (if None, a new one is created).

  • check (bool) – If True, run in check mode (see Notes).

  • det_info_scan (bool) – If True, only process entries that directly update det_info.

  • ignore_missing (bool) – If True, don’t fail when a metadata item can’t be loaded, just try to proceed without it.

  • on_missing (dict) – If a key here matches the label of a metadata entry, the value will override the on_missing entry of the metadata entry. (Each value must be “trim”, “skip” or “fail”.)

Returns:

In normal mode, an AxisManager containing the metadata (dest). In check mode, a list of tuples (spec, exception).

Notes

If check=True, this won’t store and return the loaded metadata; it will instead return a list of the same length as spec_list, with either None (if the entry loaded successfully) or the Exception raised when trying to load that entry. When check=False, metadata retrieval errors will raise some kind of error. When check=True, those are caught and returned to the caller.
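
For orientation, here is a minimal sketch of driving SuperLoader directly (the paths, labels, and obs_id are hypothetical; in normal use, Context.get_obs() / get_meta() invoke this machinery for you):

from sotodlib.core import Context
from sotodlib.core.metadata import SuperLoader

ctx = Context('path/to/context.yaml')
loader = SuperLoader(context=ctx)

# One metadata spec, as it would appear in the context.yaml metadata list.
spec_list = [{'label': 'focal_plane',
              'db': 'path/to/focal_plane_db.sqlite',
              'unpack': 'focal_plane'}]

# The request; load() augments this with ObsDb info (e.g. obs:timestamp).
request = {'obs:obs_id': 'myobs0'}

# Load, overriding the per-entry on_missing behavior by label.
meta = loader.load(spec_list, request,
                   on_missing={'focal_plane': 'skip'})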

ResultSet

class sotodlib.core.metadata.ResultSet(keys, src=None)[source]

ResultSet is a special container for holding the results of database queries, i.e. columnar data. The repr of a ResultSet states the name of its columns, and the number of rows:

>>> print(rset)
ResultSet<[array_code,freq_code], 17094 rows>

You can access the column names in .keys:

>>> print(rset.keys)
['array_code', 'freq_code']

You can request a column by name, and a numpy array of values will be constructed for you:

>>> rset['array_code']
array(['LF1', 'LF1', 'LF1', ..., 'LF1', 'LF1', 'LF1'], dtype='<U3')

You can request a row by number, and a dict will be constructed for you:

>>> rset[10]
{'base.array_code': 'LF1', 'base.freq_code': 'f027'}

Note that the array or dict returned by indexing the ResultSet is a copy of the data, not a reference; changing those objects will not update the original ResultSet.

You can also access the raw row data in .rows, which is a simple list of tuples. If you want to edit the data in a ResultSet, modify those data rows directly, or else use .asarray() to get a numpy array, modify the result, and create a new ResultSet from that using the .from_friend constructor.

You can get a structured numpy array using:

>>> ret.asarray()
array([('LF1', 'f027'), ('LF1', 'f027'), ('LF1', 'f027'), ...,
        ('LF1', 'f027'), ('LF1', 'f027'), ('LF1', 'f027')],
      dtype=[('array_code', '<U3'), ('freq_code', '<U4')])
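
If you want to use that to edit data, a minimal sketch of the workflow described above (continuing with the example rset) might look like:

>>> from sotodlib.core.metadata import ResultSet
>>> arr = rset.asarray()                    # a copy, as a structured array
>>> arr['freq_code'][:] = 'f039'            # modify the copy
>>> new_rset = ResultSet.from_friend(arr)   # build a new ResultSet from it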

Slicing works along the row axis, and you can combine two results. So you could reorganize results like this, if you wanted:

>>> rset[::2] + rset[1::2]
ResultSet<[array_code,freq_code], 17094 rows>

Finally, the .distinct() method returns a ResultSet containing the distinct elements:

>>> rset.distinct()
ResultSet<[array_code,freq_code], 14 rows>
__init__(keys, src=None)[source]
keys = None

Once instantiated, a list of the names of the ResultSet columns.

rows = None

Once instantiated, a list of the raw data tuples.

classmethod from_friend(source)[source]

Return a new ResultSet populated with data from source.

If source is a ResultSet, a copy is made. If source is a numpy structured array, the ResultSet is constructed based on the dtype names and rows of source.

Otherwise, a TypeError is raised.

subset(keys=None, rows=None)[source]

Returns a copy of the object, selecting only the keys and rows specified.

Parameters:
  • keys – a list of keys to keep. None keeps all.

  • rows – a list or array of the integers representing which rows to keep. This can also be specified as an array of bools, of the same length as self.rows, to select row by row. None keeps all.
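
For example (a sketch using the example rset from above):

>>> lf1_freqs = rset.subset(keys=['freq_code'],
...                         rows=(rset['array_code'] == 'LF1'))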

classmethod from_cursor(cursor, keys=None)[source]

Create a ResultSet using the results stored in cursor, an sqlite.Cursor object. The cursor must have been configured so that .description is populated.

asarray(simplify_keys=False, hdf_compat=False)[source]

Get a numpy structured array containing a copy of this data. The names of the fields are taken from self.keys.

Parameters:
  • simplify_keys – If True, then the keys are stripped of any prefix (such as ‘base.’). This is mostly for DetDb, where the table name can be annoying. An error is thrown if this results in duplicate field names.

  • hdf_compat – If True, then ‘U’-type columns (Unicode strings) are converted to ‘S’-type (byte strings), so it can be stored in an HDF5 dataset.

distinct()[source]

Returns a ResultSet that is a copy of the present one, with duplicates removed. The rows are sorted (according to python sort).

strip(patterns=[])[source]

For any keys that start with a string in patterns, remove that string prefix from the key. Operates in place.

to_axismanager(axis_name='dets', axis_key='dets')[source]

Build an AxisManager directly from a ResultSet, projecting all columns along a single axis. This requires no additional metadata to build.

Parameters:
  • axis_name – string, name of the axis in the AxisManager

  • axis_key – string, name of the key in the ResultSet to put into the axis labels. This key will not be added to the AxisManager fields.
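
As a sketch (the field names here are hypothetical; the axis_key column should hold one unique label per row):

>>> from sotodlib.core.metadata import ResultSet
>>> cal = ResultSet(keys=['name', 'cal'])
>>> cal.rows.extend([('det00', 1.2), ('det01', 0.9)])
>>> aman = cal.to_axismanager(axis_name='dets', axis_key='name')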

merge(src)[source]

Merge with src, which must have the same number of rows as self. Duplicate columns are not allowed.

sotodlib.io.metadata

This module contains functions for working with ResultSet in HDF5.

Here’s the docstring for write_dataset:

sotodlib.io.metadata.write_dataset(data, filename, address, overwrite=False, mode='a')[source]

Write a metadata object to an HDF5 file as a single dataset.

Parameters:
  • data – The metadata object. Currently only ResultSet and numpy structured arrays are supported.

  • filename – The path to the HDF5 file, or an open h5py.File.

  • address – The path within the HDF5 file at which to create the dataset.

  • overwrite – If True, remove any existing group or dataset at the specified address. If False, raise a RuntimeError if the write address is already occupied.

  • mode – The mode specification used for opening the file (ignored if filename is an open file).

Here’s the docstring for read_dataset:

sotodlib.io.metadata.read_dataset(fin, dataset)[source]

Read a dataset from an HDF5 file and return it as a ResultSet.

Parameters:
  • fin – Filename or h5py.File open for reading.

  • dataset – Dataset path.

Returns:

ResultSet populated from the dataset. Note this is passed through _decode_array, so byte strings are converted to unicode.
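
For example, a short round-trip sketch (the filename, dataset address, and field names are arbitrary):

from sotodlib.core import metadata
from sotodlib.io.metadata import write_dataset, read_dataset

# Build a small ResultSet and store it as an HDF5 dataset.
rs = metadata.ResultSet(keys=['dets:readout_id', 'cal'])
rs.rows.append(('det00', 1.20))
rs.rows.append(('det01', 0.95))
write_dataset(rs, 'cal.h5', 'cal_for_obs0', overwrite=True)

# Read it back; byte strings come back as unicode.
rs2 = read_dataset('cal.h5', 'cal_for_obs0')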

Here’s the class documentation for ResultSetHdfLoader:

class sotodlib.io.metadata.ResultSetHdfLoader(detdb=None, obsdb=None)[source]
from_loadspec(load_params, **kwargs)[source]

Retrieve a metadata result from an HDF5 file.

Parameters:

load_params – an index dictionary (see below).

Returns a ResultSet (or, for subclasses, whatever sort of thing is returned by self._populate).

The “index dictionary”, for the present case, may contain extrinsic and intrinsic selectors (for the ‘obs’ and ‘dets’ axes); it must also contain:

  • ‘filename’: full path to an HDF5 file.

  • ‘dataset’: name of the dataset within the file.

Note that this just calls batch_from_loadspec.

batch_from_loadspec(load_params, **kwargs)[source]

Retrieves a batch of metadata results. load_params should be a list of valid index data specifications. Returns a list of objects, corresponding to the elements of load_params.

This function is relatively efficient in the case that many requests are made for data from a single file.

DetDb: Detector Information Database

The purpose of the DetDb class is to give analysts access to quasi-static detector metadata. Some good examples of quasi-static detector metadata are: the name of the telescope the detector lives in, the approximate central frequency of its passband, and the approximate position of the detector in the focal plane. This database is not intended to carry precision results needed for mapping, such as calibration information or precise pointing and polarization angle data.

Note

DetDb is not planned for use in SO, because of the complexity of the readout_id to det_id mapping problem. The det_info system (described in the Metadata section) is used for making detector information available in the loaded TOD object. DetDb is still used for simulations and for wrapping of data from other readout systems.

Using a DetDb (Tutorial)

Loading a database into memory

To load an existing DetDb into memory, use the DetDb.from_file class method:

>>> from sotodlib.core import metadata
>>> my_db = metadata.DetDb.from_file('path/to/database.sqlite')

This function understands a few different formats; see the method documentation.

If you want an example database to play with, run this:

>>> from sotodlib.core import metadata
>>> my_db = metadata.get_example('DetDb')
Creating table base
Creating table geometry
Creating LF-type arrays...
Creating MF-type arrays...
Creating HF-type arrays...
Committing 17094 detectors...
Checking the work...
>>> my_db
<sotodlib.core.metadata.detdb.DetDb object at 0x7f691ccb4080>

The usage examples below are based on this example database.

Detectors and Properties

The typical use of DetDb involves alternating use of the dets and props functions. The dets function returns a list of detectors with certain indicated properties; the props function returns the properties of certain indicated detectors.

We can start by getting a list of all detectors in the database:

>>> det_list = my_db.dets()
>>> det_list
ResultSet<[name], 17094 rows>

The ResultSet is a simple container for tabular data. Follow the link to the class documentation for the detailed interface. Here we have a single column, giving the detector name:

>>> det_list['name']
array(['LF1_00000', 'LF1_00001', 'LF1_00002', ..., 'HF2_06501',
       'HF2_06502', 'HF2_06503'], dtype='<U9')

Similarly, we can retrieve all of the properties for all of the detectors in the database:

>>> props = my_db.props()
>>> props
ResultSet<[base.instrument,base.camera,base.array_code,
base.array_class,base.wafer_code,base.freq_code,base.det_type,
geometry.wafer_x,geometry.wafer_y,geometry.wafer_pol], 17094 rows>

The output of props() is also a ResultSet; but it has many columns. The property values for the first detector are:

>>> props[0]
{'base.instrument': 'simonsobs', 'base.camera': 'latr',
 'base.array_code': 'LF1', 'base.array_class': 'LF',
 'base.wafer_code': 'W1', 'base.freq_code': 'f027',
 'base.det_type': 'bolo', 'geometry.wafer_x': 0.0,
 'geometry.wafer_y': 0.0, 'geometry.wafer_pol': 0.0}

We can also inspect the data by column, e.g. props['base.camera']. Note that name isn’t a column here… each row corresponds to a single detector, in the order returned by my_db.dets().

Querying detectors based on properties

Suppose we want to get the names of the detectors in the (nominal) 93 GHz band. These are signified, in this example, by having the value 'f093' for the base.freq_code property. We call dets() with this specified:

>>> f093_dets = my_db.dets(props={'base.freq_code': 'f093'})
>>> f093_dets
ResultSet<[name], 5184 rows>

The argument passed to the props= keyword, here, is a dictionary containing certain values that must be matched in order for a detector to be included in the output ResultSet. One can also pass a list of such dictionaries (in which case a detector is included if it fully matches any of the dicts in the list). One can, to similar effect, pass a ResultSet, which results in detectors being checked against each row of the ResultSet.

Similarly, we can request the properties of some sub-set of the detectors; let’s use the f093_dets list to confirm that these detectors are all in MF arrays:

>>> f093_props = my_db.props(f093_dets, props=['base.array_class'])
>>> list(f093_props.distinct())
[{'base.array_class': 'MF'}]

Note we’ve used the ResultSet.distinct() method to eliminate duplicate entries in the output from props(). If you prefer to work with unkeyed data, you can work with .rows instead of converting to a list:

>>> f093_props.distinct().rows
[('MF',)]

Grouping detectors by property

Suppose we want to loop over all detectors, but with them grouped by array name and frequency band. There are many ways to do this, but a very general approach is to generate a list of tuples representing the distinct combinations of these properties. We then loop over that list, pulling out the names of the matching detectors for each tuple of property values.

Here’s an example, which simply counts the results:

# Get the two properties, one row per detector.
>>> props = my_db.props(props=[
...   'base.array_code', 'base.freq_code'])
# Reduce to the distinct combinations (only 14 rows remain).
>>> combos = props.distinct()
# Loop over all 14 combos:
>>> for combo in combos:
...   these_dets = my_db.dets(props=combo)
...   print('Combo {} includes {} dets.'.format(combo, len(these_dets)))
...
Combo {'base.array_code': 'HF1', 'base.freq_code': 'f225'} includes 1626 dets.
Combo {'base.array_code': 'HF1', 'base.freq_code': 'f278'} includes 1626 dets.
Combo {'base.array_code': 'HF2', 'base.freq_code': 'f225'} includes 1626 dets.
# ...
Combo {'base.array_code': 'MF4', 'base.freq_code': 'f145'} includes 1296 dets.

Extracting useful detector properties

There are a couple of standard recipes for getting data out efficiently. Suppose you want to extract two verbosely-named numerical columns, geometry.wafer_x and geometry.wafer_y. We want to be sure to only type those key names out once:

# Find all 'LF' detectors.
>>> LF_dets = my_db.dets(props={'base.array_class': 'LF'})
>>> LF_dets
ResultSet<[name], 222 rows>
# Get positions for those detectors.
>>> positions = my_db.props(LF_dets, props=['geometry.wafer_x',
... 'geometry.wafer_y'])
>>> x, y = numpy.transpose(positions.rows)
>>> y
array([0.  , 0.02, 0.04, 0.06, 0.08, 0.1 , 0.12, 0.14, 0.16, 0.18, 0.2 ,
       0.22, 0.24, 0.26, 0.28, 0.3 , 0.32, 0.34, 0.36, 0.38, ...])
# Now go plot stuff using x and y...
# ...

Note in the last line we’ve used numpy to transform the tabular data (in ResultSet.rows) into a simple (n,2) float array, which is then transposed to a (2,n) array and unpacked into the variables x and y. It is important to include the .rows there – a direct array conversion on positions will not give you what you want.

Inspecting a database

If you want to see a list of the properties defined in the database, just call props with an empty list of detectors. Then access the keys data member, if you want programmatic access to the list of properties:

>>> props = my_db.props([])
>>> props
ResultSet<[base.instrument,base.camera,base.array_code,
base.array_class,base.wafer_code,base.freq_code,base.det_type,
geometry.wafer_x,geometry.wafer_y,geometry.wafer_pol], 0 rows>
>>> props.keys
['base.instrument', 'base.camera', 'base.array_code',
'base.array_class', 'base.wafer_code', 'base.freq_code',
'base.det_type', 'geometry.wafer_x', 'geometry.wafer_y',
'geometry.wafer_pol']

Creating a DetDb

For an example, see source code of get_example.
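
As a very rough sketch of the pattern used there (the table and column names below are just for illustration):

from sotodlib.core import metadata

db = metadata.DetDb()   # in-memory; pass map_file=... to back it with a file
db.create_table('base', ["`camera` varchar(8)", "`freq_code` varchar(8)"])
for i in range(4):
    name = 'LF1_%05i' % i
    db.get_id(name)     # ensure the detector exists in the dets table
    db.add_props('base', name, time_range=db.ALWAYS,
                 camera='latr', freq_code='f027')
db.validate()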

Database organization

The dets Table

The database has a primary table, called dets. The dets table has only the following columns:

name

The string name of each detector.

id

The index, used internally, to enumerate detectors. Do not assume that the correspondence between index and name will be static – it may change if the underlying inputs are changed, or if a database subset is generated.

The Property Tables

Every other table in the database is a property table. Each row of the property table associates one or more (key,value) pairs with a detector for a particular range of time. A property table contains at least the following 3 columns:

det_id

The detector’s internal id, a reference to dets.id.

time0

The timestamp indicating the start of validity of the properties in the row.

time1

The timestamp indicating the end of validity of the properties in the row. The pair of timestamps time0 and time1 define a semi-open interval, including all times t such that time0 <= t < time1.

All other columns in the property table provide detector metadata. For a property table to contain valid data, the following criteria must be satisfied:

  1. The range of validity must be non-negative, i.e. time0 <= time1. Note that if time0 == time1, then the interval is empty.

  2. In any one property table, the time intervals associated with a single det_id must not overlap. Otherwise, there would be ambiguity about the value of a given property.

Internally, query code will assume that the above two conditions are satisfied. Functions exist, however, to verify compliance of property tables.

DetDb auto-documentation

Auto-generated documentation should appear here.

class sotodlib.core.metadata.DetDb(map_file=None, init_db=True)[source]

Detector database. The database stores data about a set of detectors.

The dets table lists all valid detectors, associating a (time-invariant) name to each id.

The other tables in the database are user configurable “property tables” that must obey certain rules:

  1. They have at least the following columns:

    • det_id integer

    • time0 integer (unix timestamp)

    • time1 integer (unix timestamp)

  2. The values time0 and time1 define an interval [time0,time1) over which the data in the row is valid. Every row shall respect the constraint that time0 <= time1.

  3. No two rows in a property table shall have the same det_id and overlapping time intervals. Note that since the intervals are half-open, the intervals [t0, t1) and [t1, t2) do not overlap.

ALWAYS = (0.0, 4000000000.0)

A time-range that is meant to signify “all reasonable times”; in this case it spans from years 1970 - 2096.

TABLE_TEMPLATE = ['`det_id` integer', '`time0` integer', '`time1` integer']

Column definitions (a list of strings) that must appear in all Property Tables.

__init__(map_file=None, init_db=True)[source]

Instantiate a DetDb. If map_file is provided, the database will be connected to the indicated sqlite file on disk, and any changes made to this object will be written back to the file.

validate()[source]

Checks that the database is following internal rules. Specifically we check that a dets table exists and has the necessary columns; then we check that all other tables do not have overlapping property intervals. Raises SchemaError in the first case, IntervalError in the second.

create_table(table_name, column_defs, raw=False, commit=True)[source]

Add a property table to the database.

Parameters:
  • table_name (str) – The name of the new table.

  • column_defs (list) – A list of sqlite column definition strings.

  • raw (bool) – See below.

  • commit (bool) – Whether to commit the changes to the db.

The special columns det_id, time0 and time1 will be prepended unless raw=True. An example of column_defs is:

column_defs=[
  "`x_pos` float",
  "`y_pos` float",
]

copy(map_file=None, overwrite=False)[source]

Duplicate the current database into a new database object, and return it. If map_file is specified, the new database will be connected to that sqlite file on disk. Note that a quick way of writing a Db to disk is to call copy(map_file=…) and then simply discard the returned object.

to_file(filename, overwrite=True, fmt=None)[source]

Write the present database to the indicated filename.

Parameters:
  • filename (str) – the path to the output file.

  • overwrite (bool) – whether an existing file should be overwritten.

  • fmt (str) – ‘sqlite’, ‘dump’, or ‘gz’. Defaults to ‘sqlite’ unless the filename ends with ‘.gz’, in which case it is ‘gz’.

classmethod from_file(filename, fmt=None, force_new_db=True)[source]

This method calls sotodlib.core.metadata.common.sqlite_from_file()

reduce(dets=None, time0=None, time1=None, inplace=False)[source]

Discard information from the database unless it is “relevant”.

Parameters:
  • dets (list or ResultSet) – A list of detector names that are relevant. If this is a ResultSet, the ‘name’ column is used. If None, all dets are relevant.

  • time0 (int) – If time1 is None, then a property’s time range must contain this time for it to be considered relevant. If time1 is not None, see below.

  • time1 (int) – Along with time0, forms a time range that must have non-zero intersection with the property’s time range for the entry to be considered relevant.

  • inplace (bool) – Whether to act on the present object, or to return a modified copy.

Returns the reduced data (which is self, if inplace is True).

get_id(name, commit=True, create=True)[source]

Returns a detector’s internal id. If the detector isn’t in the dets table yet, and create==True, then it is added.

add_props(table_, name_, time_range=None, commit=True, **kw)[source]

Add property information for a detector.

Parameters:
  • table (str) – The property table name.

  • name (str) – The detector name.

  • time_range (pair of ints) – The time range over which the property value is applicable.

  • commit (bool) – Whether or not to commit the db.

All other keyword arguments are interpreted as data to write into the property table.

dets(timestamp=None, props={})[source]

Get a list of detectors matching the conditions listed in the “props” dict. If timestamp is not provided, then time range restriction is not applied.

Returns a list of detector names.

props(dets=None, timestamp=None, props=None, concise=False)[source]

Get the value of the properties listed in props, for each detector identified in dets (a list of strings, or a ResultSet with a column called ‘name’).

intersect(*specs, resolve=False)[source]

Intersect the provided detector specs. Each entry is either a list (or similar iterable) of detector names, or a dictionary specifying detector properties.

If resolve=True, then the returned item is a list (rather than, possibly, a dict).

sotodlib.core.metadata.detdb.get_example()[source]

Returns an example DetDb, mapped to RAM. The two property tables are called “base” and “geometry”. This example is for demonstrating the code and interface and has no relation to any instrument’s actual detector layout!

ObsDb: Observation Database

Overview

The purpose of the ObsDb is to help a user select observations based on high level criteria (such as the target field, the speed or elevation of the scan, whether the observation was done during the day or night, etc.). The ObsDb is also used, by the Context system, to select the appropriate metadata results from a metadata archive (for example, the timestamp associated with an observation could be used to find an appropriate pointing offset).

The ObsDb is constructed from two tables. The “obs table” contains one row per observation and is appropriate for storing basic descriptive data about the observation. The “tags table” associates observations to particular labels (called tags) in a many-to-many relationship.

The ObsDb code does not really require that any particular information be present; it does not insist that there is a “timestamp” field, for example. Instead, the ObsDb can contain information that is needed in a specific context. However, recommended field names for some typical information types are given in Standardized ObsDb field names.

Note

The difference between ObsDb and ObsFileDb is that ObsDb contains arbitrary high-level descriptions of Observations, while ObsFileDb only contains information about how the Observation data is organized into files on disk. ObsFileDb is consulted when it is time to load data for an observation into memory. The ObsDb is a repository of information about Observations, independent of where on disk the data lives or how the data files are organized.

Creating an ObsDb

To create an ObsDb, one must define columns for the obs table, and then add data for one or more observations, then write the results to a file.

Here is a short program that creates an ObsDb with entries for two observations:

from sotodlib.core import metadata

# Create a new Db and add two columns.
obsdb = metadata.ObsDb()
obsdb.add_obs_columns(['timestamp float', 'hwp_speed float', 'drift string'])

# Add two rows.
obsdb.update_obs('myobs0', {'timestamp': 1900000000.,
                            'hwp_speed': 2.0,
                            'drift': 'rising'})
obsdb.update_obs('myobs1', {'timestamp': 1900010000.,
                            'hwp_speed': 1.5,
                            'drift': 'setting'})

# Apply some tags (this could have been done in the calls above).
obsdb.update_obs('myobs0', tags=['hwp_fast', 'cryo_problem'])
obsdb.update_obs('myobs1', tags=['hwp_slow'])

# Save (in gzip format).
obsdb.to_file('obsdb.gz')

The column definitions must be specified in a format compatible with sqlite; see ObsDb.add_obs_columns(). When the user adds data using ObsDb.update_obs(), the first argument is the obs_id. This is the primary key in the ObsDb and is also used to identify observations in the ObsFileDb. When we write the database using ObsDb.to_file(), using a .gz extension selects gzip output by default.

Using an ObsDb

The ObsDb.query() function is used to get a list of observations with particular properties. The user may pass in an sqlite-compatible expression that refers to columns in the obs table, or to the names of tags.

Basic queries

Using our example database from the preceding section, we can try a few queries:

>>> obsdb.query()
ResultSet<[obs_id,timestamp,hwp_speed], 2 rows>

>>> obsdb.query('hwp_speed >= 2.')
ResultSet<[obs_id,timestamp,hwp_speed], 1 rows>

>>> obsdb.query('hwp_speed > 1. and drift=="rising"')
ResultSet<[obs_id,timestamp,hwp_speed], 1 rows>

The object returned by obsdb.query is a ResultSet, from which individual columns or rows can easily be extracted:

>>> rs = obsdb.query()
>>> print(rs['obs_id'])
['myobs0' 'myobs1']
>>> print(rs[0])
OrderedDict([('obs_id', 'myobs0'), ('timestamp', 1900000000.0),
  ('hwp_speed', 2.0), ('drift', 'rising')])

Queries involving tags

Information from the tags table will only show up in the output if explicitly requested. For example, we can ask for the 'hwp_fast' and 'hwp_slow' fields to be included:

>>> obsdb.query(tags=['hwp_fast', 'hwp_slow'])
ResultSet<[obs_id,timestamp,hwp_speed,drift,hwp_fast,hwp_slow], 2 rows>

Tag columns will have value 1 if the tag has been applied to that observation, and 0 otherwise. A query can be filtered based on tags; there are two ways to do this. One is to append ‘=0’ or ‘=1’ to the end of some of the tag strings:

>>> obsdb.query(tags=['hwp_fast=1'])
ResultSet<[obs_id,timestamp,hwp_speed,drift,hwp_fast], 1 rows>

Alternately, the values of tags can be used in query strings:

>>> obsdb.query('(hwp_fast==1 and drift=="rising") or (hwp_fast==0 and drift="setting")',
  tags=['hwp_fast'])
ResultSet<[obs_id,timestamp,hwp_speed,drift,hwp_fast], 2 rows>

Getting a description of a single observation

If you just want the basic information for an observation of known obs_id, use the ObsDb.get() function:

>>> obsdb.get('myobs0')
OrderedDict([('obs_id', 'myobs0'), ('timestamp', 1900000000.0),
  ('hwp_speed', 2.0), ('drift', 'rising')])

If you want a list of all tags for an observation, call get with tags=True:

>>> obsdb.get('myobs0', tags=True)
OrderedDict([('obs_id', 'myobs0'), ('timestamp', 1900000000.0),
  ('hwp_speed', 2.0), ('drift', 'rising'),
  ('tags', ['hwp_fast', 'cryo_problem'])])

So here we see that the observation is associated with tags 'hwp_fast' and 'cryo_problem'.

Standardized ObsDb field names

Other than obs_id, specific field names are not enforced in code. However, there are certain typical bits of information for which it makes sense to strongly encourage standardized field names. These are defined below.

timestamp

A specific moment (as a Unix timestamp) that should be used to represent the observation. Best practice is to have this be fairly close to the start time of the observation.

duration

The approximate length of the observation in seconds.

Class auto-documentation

The class documentation of ObsDb should appear below.

class sotodlib.core.metadata.ObsDb(map_file=None, init_db=True)[source]

Observation database.

The ObsDb helps to associate observations, indexed by an obs_id, with properties of the observation that might be useful for selecting data or for identifying metadata.

The main ObsDb table is called ‘obs’, and contains the column obs_id (string), plus any others deemed important for this context (you will probably find timestamp, a float representing a unix timestamp). Additional columns may be added to this table as needed.

The second ObsDb table is called ‘tags’, and facilitates grouping observations together using string labels.

__init__(map_file=None, init_db=True)[source]

Instantiate an ObsDb.

Parameters:
  • map_file (str or sqlite3.Connection) – If this is a string, it will be treated as the filename for the sqlite3 database, and opened as an sqlite3.Connection. If this is an sqlite3.Connection, it is cached and used. If this argument is None (the default), then the sqlite3.Connection is opened on ‘:memory:’.

  • init_db (bool) – If True, then any ObsDb tables that do not already exist in the database will be created.

Notes

If map_file is provided, the database will be connected to the indicated sqlite file on disk, and any changes made to this object will be written back to the file.

add_obs_columns(column_defs, ignore_duplicates=True, commit=True)[source]

Add columns to the obs table.

Parameters:
  • column_defs (list of pairs of str) – Column descriptions, see notes.

  • ignore_duplicates (bool) – If true, requests for new columns will be ignored if the column name is already present in the table.

Returns:

self.

Notes

The input format for column_defs is somewhat flexible. First of all, if a string is passed in, it will be converted to a list by splitting on “,”. Second, if the items in the list are strings (rather than tuples), each string will be broken into 2 components by splitting on whitespace. Finally, each pair of items is interpreted as a (name, data type) pair. The name can be a simple string, or a string inside backticks; so ‘timestamp’ and ‘`timestamp`’ are equivalent. The data type can be any valid sqlite type expression (e.g. ‘float’, ‘varchar(256)’, etc) or it can be one of the three basic python type objects: str, float, int. Here are some examples of valid column_defs arguments:

[('timestamp', float), ('drift', str)]
['`timestamp` float', '`drift` varchar(32)']
'timestamp float, drift str'

update_obs(obs_id, data={}, tags=[], commit=True)[source]

Update an entry in the obs table.

Parameters:
  • obs_id (str) – The id of the obs to update.

  • data (dict) – map from column_name to value.

  • tags (list of str) – tags to apply to this observation (if a tag name is prefixed with ‘!’, then the tag will be un-applied, i.e. cleared from this observation).

Returns:

self.

copy(map_file=None, overwrite=False)[source]

Duplicate the current database into a new database object, and return it. If map_file is specified, the new database will be connected to that sqlite file on disk. Note that a quick way of writing a Db to disk is to call copy(map_file=…) and then simply discard the returned object.

to_file(filename, overwrite=True, fmt=None)[source]

Write the present database to the indicated filename.

Parameters:
  • filename (str) – the path to the output file.

  • overwrite (bool) – whether an existing file should be overwritten.

  • fmt (str) – ‘sqlite’, ‘dump’, or ‘gz’. Defaults to ‘sqlite’ unless the filename ends with ‘.gz’, in which case it is ‘gz’.

classmethod from_file(filename, fmt=None, force_new_db=True)[source]

This method calls sotodlib.core.metadata.common.sqlite_from_file()

get(obs_id=None, tags=None, add_prefix='')[source]

Returns the entry for obs_id, as an ordered dict.

If obs_id is None, returns all entries, as a ResultSet. However, this usage is deprecated in favor of self.query().

Parameters:
  • obs_id (str) – The observation id to get info for.

  • tags (bool) – Whether or not to load and return the tags.

  • add_prefix (str) – A string that will be prepended to each field name. This is for the lazy metadata system, because obsdb selectors are prefixed with ‘obs:’.

Returns:

An ordered dict with the obs table entries for this obs_id, or None if the obs_id is not found. If tags have been requested, they will be stored in ‘tags’ as a list of strings.

query(query_text='1', tags=None, sort=['obs_id'], add_prefix='')[source]

Queries the ObsDb using user-provided text. Returns a ResultSet.

Parameters:
  • query_text (str) – The sqlite query string. All fields should refer to the obs table, or to tags explicitly listed in the tags argument.

  • tags (list of str) – Tags to include in the output; if they are listed here then they can also be used in the query string. Filtering on tag value can be done here by appending ‘=0’ or ‘=1’ to a tag name.

Returns:

A ResultSet with one row for each Observation matching the criteria.

Notes

Tags are added to the output on request. For example, passing tags=[‘planet’,’stare’] will cause the output to include columns ‘planet’ and ‘stare’ in addition to all the columns defined in the obs table. The value of ‘planet’ and ‘stare’ in each row will be 0 or 1 depending on whether that tag is set for that observation. We can include expressions involving planet and stare in the query, for example:

obsdb.query('planet=1 or stare=1', tags=['planet', 'stare'])

For simple filtering on tags, pass ‘=1’ or ‘=0’, like this:

obsdb.query(tags=['planet=1','hwp=1'])

When filtering is activated in this way, the returned results must satisfy all the criteria (i.e. the individual constraints are AND-ed).

info()[source]

Return a dict summarizing the structure and contents of the obsdb; this is used by the CLI.

ObsFileDb: Observation File Database

The purpose of ObsFileDb is to provide a map into a large set of TOD files, giving the names of the files and a compressed expression of what time indices and detectors are present in each file.

The ObsFileDb is an sqlite database. It carries some information about each “Observation” and the “detectors”; but is complementary to the ObsDb and DetDb.

Data Model

We assume the following organization of the time-ordered data:

  • The data are divided into contiguous segments of time called “Observations”. An observation is identified by an obs_id, which is a string.

  • An Observation involves a certain set of co-sampled detectors. The files associated with the Observation must contain data for all the Observation’s detectors at all times covered by the Observation.

  • The detectors involved in a particular Observation are divided into groups called detsets. The purpose of detset grouping is to map cleanly onto files, thus each file in the Observation should contain the data for exactly one detset.

Here’s some ascii art showing an example of how the data in an observation must be split between files:

    sample index
  +-------------------------------------------------->
d |
e |   +-------------------------+------------------+
t |   | obs0_waf0_00000         | obs0_waf0_01000  |
e |   +-------------------------+------------------+
c |   | obs0_waf1_00000         | obs0_waf1_01000  |
t |   +-------------------------+------------------+
o |   | obs0_waf2_00000         | obs0_waf2_01000  |
r |   +-------------------------+------------------+
  V

In this example the data for the observation has been distributed into 6 files. There are three detsets, probably called waf0, waf1, and waf2. In the sample index (or time) direction, each detset is associated with two files; apparently the observation has been split at sample index 1000.

Notes:

  • Normally detsets will be coherent across a large set of observations – i.e. because we will probably always group the detectors into files in the same way. But this is not required.

  • In the case of non-cosampled arrays that are observing at the same time on the same telescope: these qualify as different observations and should be given different obs_ids.

  • It is currently assumed that in a single observation the files for each detset will be divided at the same sample index. The database structure doesn’t have this baked in, but some internal verification code assumes this behavior. So this requirement can likely be loosened, if need be.

The database consists of two main tables. The first is called detsets and associates each detector (string detsets.det) with a particular detset (string detsets.name). The second is called files and associates each file (string files.name) with an Observation (string files.obs_id), a detset (string files.detset), and a sample range (integers sample_start and sample_stop).

Constructing the ObsFileDb involves building the detsets and files tables, using functions add_detset and add_obsfile. Using the ObsFileDb is accomplished through the functions get_dets, get_detsets, get_obs, and through custom SQL queries on conn.
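
For instance, a minimal construction sketch for the layout drawn above (the file, detset, and detector names are made up):

from sotodlib.core import metadata

db = metadata.ObsFileDb('obsfiledb.sqlite')

# Declare the detsets and their member detectors.
db.add_detset('waf0', ['waf0_det000', 'waf0_det001'])
db.add_detset('waf1', ['waf1_det000', 'waf1_det001'])

# Register the files for one observation, split at sample index 1000.
db.add_obsfile('obs0/obs0_waf0_00000.g3', 'obs0', 'waf0', 0, 1000)
db.add_obsfile('obs0/obs0_waf0_01000.g3', 'obs0', 'waf0', 1000, None)
db.add_obsfile('obs0/obs0_waf1_00000.g3', 'obs0', 'waf1', 0, 1000)
db.add_obsfile('obs0/obs0_waf1_01000.g3', 'obs0', 'waf1', 1000, None)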

Relative vs. Absolute Paths

The filenames stored in ObsFileDb may be specified with relative or absolute paths. (Absolute paths are assumed if the filename starts with a /.) Relative paths are taken as being relative to the directory where the ObsFileDb sqlite file lives; this can be overridden by setting the prefix attribute. Consider the following file listing as an example:

/
 data/
      planet_obs/
                 obsfiledb.sqlite     # the database
                 observation0/        # directory for an obs
                              data0_0 # one data file
                              data0_1
                 observation1/
                              data1_0
                              data1_1
                 observation2/
                 ...

Note the obsfiledb.sqlite file, located in /data/planet_obs. The filenames in obsfiledb.sqlite might be specified in one of two ways:

  1. Using paths relative to the directory where obsfiledb.sqlite is located. For example, observation0/data0_1. Relative paths permit one to move the tree of data to other locations without needing to alter the obsfiledb.sqlite (as long as the relative locations of the data and sqlite file remain fixed).

  2. Using absolute paths on this file system; for example /data/planet_obs/observation0/data0_1. This is not portable, but it is a better choice if the ObsFileDb .sqlite file isn’t kept near the TOD data files.

A database may contain a mixture of relative and absolute paths.

Example Usage

Suppose we have a coherent archive of TOD data files living at /mnt/so1/shared/todsims/pipe-s0001/v2/. And suppose there’s a database file, obsfiledb.sqlite, in that directory. We can load the observation database like this:

from sotodlib.core import metadata
db = metadata.ObsFileDb.from_file('/mnt/so1/shared/todsims/pipe-s0001/v2/')

Note we’ve given it a directory, not a filename… in such cases the code will read obsfiledb.sqlite in the stated directory.

Now we get the list of all observations, and choose one:

all_obs = db.get_obs()
print(all_obs[0])   # -> 'CES-Atacama-LAT-Tier1DEC-035..-045_RA+040..+050-0-0_LF'

We can list the detsets present in this observation; then get all the file info (paths and sample indices) for one of the detsets:

all_detsets = db.get_detsets(all_obs[0])
print(all_detsets)  # -> ['LF1_tube_LT6', 'LF2_tube_LT6']

files = db.get_files(all_obs[0], detsets=[all_detsets[0]])
print(files['LF1_tube_LT6'])
                    # -> [('/mnt/so1/shared/todsims/pipe-s0001/v2/datadump_LAT_LF1/CES-Atacama-LAT-Tier1DEC-035..-045_RA+040..+050-0-0/LF1_tube_LT6_00000000.g3', 0, None)]

Class Documentation

The class documentation of ObsFileDb should appear below.

class sotodlib.core.metadata.ObsFileDb(map_file=None, prefix=None, init_db=True, readonly=False)[source]

sqlite3-based database for managing large archives of files.

The data model here is that each distinct “Observation” comprises co-sampled detector data for a large number of detectors. Each detector belongs to a single “detset”, and there is a set of files containing the data for each detset. Finding the file that contains data for a particular detector is a matter of looking up what detset the detector is in, and looking up what file covers that detset.

Note that many functions have a “commit” option, which simply affects whether .commit is called on the database or not (it can be faster to suppress commit ops when running a batch of updates, and commit manually at the end).
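
For example, a sketch of that batching pattern (db is an ObsFileDb instance; the detset contents are hypothetical):

# Suppress per-call commits during a batch of updates...
for name, dets in [('waf0', ['det000', 'det001']),
                   ('waf1', ['det100', 'det101'])]:
    db.add_detset(name, dets, commit=False)

# ...then commit once at the end, via the underlying sqlite3 connection.
db.conn.commit()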

__init__(map_file=None, prefix=None, init_db=True, readonly=False)[source]

Instantiate an ObsFileDb.

Parameters:
  • map_file (string) – sqlite database file to map. Defaults to ‘:memory:’.

  • prefix (string) – as described in class documentation.

  • init_db (bool) – If True, attempt to create the database tables.

  • readonly (bool) – If True, the database file will be mapped in read-only mode. Not valid on dbs held in :memory:.

conn = None

The sqlite3 database connection.

prefix = ''

Path relative to which filenames in the database should be interpreted. This only applies to relative filenames (those not starting with /).

to_file(filename, overwrite=True, fmt=None)[source]

Write the present database to the indicated filename.

Parameters:
  • filename (str) – the path to the output file.

  • overwrite (bool) – whether an existing file should be overwritten.

  • fmt (str) – ‘sqlite’, ‘dump’, or ‘gz’. Defaults to ‘sqlite’ unless the filename ends with ‘.gz’, in which case it is ‘gz’.

classmethod from_file(filename, prefix=None, fmt=None, force_new_db=True)[source]

This method calls sotodlib.core.metadata.common.sqlite_from_file()

classmethod for_dir(path, filename='obsfiledb.sqlite', readonly=True)[source]

Deprecated; use from_file().

copy(map_file=None, overwrite=False)[source]

Duplicate the current database into a new database object, and return it. If map_file is specified, the new database will be connected to that sqlite file on disk. Note that a quick way of writing a Db to disk is to call copy(map_file=…).

add_detset(detset_name, detector_names, commit=True)[source]

Add a detset to the detsets table.

Parameters:
  • detset_name (str) – The (unique) name of this detset.

  • detector_names (list of str) – The detectors belonging to this detset.

add_obsfile(filename, obs_id, detset, sample_start=None, sample_stop=None, commit=True)[source]

Add an observation file to the files table.

Parameters:
  • filename (str) – The filename, relative to the database directory and without a leading /.

  • obs_id (str) – The observation id.

  • detset (str) – The detset name.

  • sample_start (int) – The observation sample index at the start of this file.

  • sample_stop (int) – sample_start + n_samples.

get_obs()[source]

Returns a list of all obs_ids present in this database.

get_obs_with_detset(detset)[source]

Returns a list of all obs_ids that include a specified detset.

get_detsets(obs_id)[source]

Returns a list of all detsets represented in the observation specified by obs_id.

get_dets(detset)[source]

Returns a list of all detectors in the specified detset.

get_det_table(obs_id)[source]

Get a table of detectors and detsets suitable for use with Context det_info. Returns a ResultSet with keys=[‘dets:detset’, ‘dets:readout_id’].

get_files(obs_id, detsets=None, prefix=None)[source]

Get the file names associated with a particular obs_id and detsets.

Returns:

OrderedDict where the key is the detset name and the value is a list of tuples of the form (full_filename, sample_start, sample_stop).

lookup_file(filename, resolve_paths=True, prefix=None, fail_ok=False)[source]

Determine what, if any, obs_id (and detset and sample range) is associated with the specified data file.

Parameters:
  • filename (str) – a string corresponding to a file that is covered by this db. See note on how this is resolved.

  • resolve_paths (bool) – If True, then the incoming filename is treated as a path to a specific file on disk, and the database is queried for entries that resolve to that same file on disk (accounting for prefix). If False, then the incoming filename is taken as opaque text to match against the corresponding entries in the obsfiledb file “name” column (including whatever path information is in either of those strings).

  • fail_ok (bool) – If True, then None is returned if the filename is not found in the db (instead of raising RuntimeError).

Returns:

  • obs_id: The obs_id

  • detsets: A list containing the name of the single detset covered by this file.

  • sample_range: A tuple with the start and stop sample indices for this file.

Return type:

A dict with entries

verify(prefix=None)[source]

Check the filesystem for the presence of files described in the database. Returns a dictionary containing this information in various forms; see code for details.

This function is used internally by the drop_incomplete() function, and may also be useful for debugging file-finding problems.

drop_obs(obs_id)[source]

Delete the specified obs_id from the database. Returns a list of files that are no longer covered by the database (with prefix).

drop_detset(detset)[source]

Delete the specified detset from the database. Returns a list of files that are no longer covered by the database (with prefix).

drop_incomplete()[source]

Compare the files actually present on the system to the ones listed in this database. Drop detsets from each observation, as necessary, such that the database is consistent with the file system.

Returns a list of files that are on the system but are no longer included in the database.

get_file_list(fout=None)[source]

Returns a list of all files in the database, without the file prefix, sorted by observation / detset / sample_start. This is the sort of list one might use with rsync --files-from.

If you pass an open file or filename to fout, the names will be written there, too.

Command Line Tool

The so-metadata script is a tool that can be used to inspect and alter the contents of ObsFileDb, ObsDb, and ManifestDb (“metadata”) sqlite3 databases. In the case of ObsFileDb and ManifestDb, it can also be used to perform batch filename updates.

To summarize a database, pass the db type and then the path to the db file. It might be convenient to start by printing a summary of the context.yaml file, as this will give full paths to the various databases, which can be copied and pasted:

so-metadata context /path/to/context.yaml

Analyzing individual databases:

so-metadata obsdb /path/to/context/obsdb.
so-metadata obsfiledb /path/to/context/obsfiledb.sqlite
so-metadata metadata /path/to/context/some_metadata.sqlite

Usage

Read context / metadata databases and print summary or detailed information. You have to pass a db type (obsdb, obsfiledb, …) followed by a db filename (or the path to a context.yaml, from which the relevant db will be loaded).

usage: so-metadata [-h] {context,obsdb,obsfiledb,metadata} ...

Positional Arguments

_subcmd

Possible choices: context, obsdb, obsfiledb, metadata

Sub-commands

context

Inspect a context file and print out a few things about it.

so-metadata context [-h] ctx_file
Positional Arguments
ctx_file

Path to Context yaml file.

obsdb

Inspect an ObsDb. By default, prints out summary information. Pass --list to see a list of all observations; or pass an obs_id to get detailed info about that obs.

so-metadata obsdb [-h] [--type {context,db}] [--list] [--query QUERY]
                  [--key KEY] [--tag TAG]
                  db_file [obs_id]
Positional Arguments
db_file

Path to database (or context.yaml).

obs_id
Named Arguments
--type

Possible choices: context, db

Specifies how to interpret the file (to override guessing based on the filename extension).

--list, -l

List all observations

Default: False

--query, -q

Restrict listed items with an ObsDb query string.

Default: “1”

--key, -k

If listing observations, also include the specified db fields in addition to obs_id.

Default: []

--tag, -t

Add a tag as an eligible column for query.

Default: []

obsfiledb

Inspect an ObsFileDb. This can be used to list all files, and to perform batch updates of filenames.

so-metadata obsfiledb [-h] obsfiledb.sqlite {files,reroot,fix-db} ...
Positional Arguments
obsfiledb.sqlite

Path to an ObsFileDb.

mode

Possible choices: files, reroot, fix-db

Sub-commands
files

List the files referenced in the database.

Syntax:

    so-metadata obsfiledb files
    so-metadata obsfiledb files --all
    so-metadata obsfiledb files --clean

        This will print out a list of the files in the db,
        along with obs_id and detset.  Only a few lines
        will be shown, unless --all is passed.  To get a
        simple list of all files (for rsync or something),
        pass --clean.
Named Arguments
--clean

Print a simple list of all files (for script digestion).

Default: False

--all

Print all files, not an abbreviated list.

Default: False

reroot

Batch change filenames (by prefix) in the database.

Syntax:

    so-metadata obsfiledb reroot old_prefix new_prefix [output options]

Examples:

    so-metadata obsfiledb reroot /path/on/system1 /path/on/system2 -o my_new_manifest.sqlite
    so-metadata obsfiledb reroot /path/on/system1 /new_path/on/system1 --overwrite
    so-metadata obsfiledb reroot ./result1/obs_12345.h5 ./result2/obs_12345.h5 --overwrite

        These operations will create a duplicate of the source
        ObsFileDb, with only the filenames (potentially) altered.  Any
        filename that starts with the first argument will be changed,
        in the output, to instead start with the second argument.
        When you do this you must either say where to write the output
        (-o) or give the program permission to overwrite your input
        database file.  Note that the first argument need not match
        all entries in the database; you can use it to pick out a
        subset (even a single entry).
Positional Arguments
old_prefix

Prefix to match in current database.

new_prefix

Prefix to replace it with.

Named Arguments
--overwrite

Store modified database in the same file.

Default: False

--output-db, -o

Store modified database in this file.

--dry-run

Run the conversion steps but do not write the results anywhere.

Default: False
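
To make the prefix matching concrete, here is a small Python illustration of the substitution that reroot applies to each stored filename; this is a conceptual sketch, not the actual implementation.

    def reroot_name(name, old_prefix, new_prefix):
        # Only names that start with old_prefix are changed; all
        # others are carried into the output unmodified.
        if name.startswith(old_prefix):
            return new_prefix + name[len(old_prefix):]
        return name

    print(reroot_name('/path/on/system1/obs_12345.h5',
                      '/path/on/system1', '/path/on/system2'))
    # -> /path/on/system2/obs_12345.h5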

fix-db

Upgrade database (schema fixes, etc).

Syntax:

    so-metadata obsfiledb fix-db [output options]
Named Arguments
--overwrite

Store modified database in the same file.

Default: False

--output-db, -o

Store modified database in this file.

--dry-run

Run the conversion steps but do not write the results anywhere.

Default: False

metadata

Inspect (or modify) a ManifestDb.

so-metadata metadata [-h]
                     my_db.sqlite {summary,entries,files,lookup,reroot} ...
Positional Arguments
my_db.sqlite

Path to a ManifestDb.

mode

Possible choices: summary, entries, files, lookup, reroot

Sub-commands
summary

Summarize database structure and number of entries.

so-metadata metadata summary

        This will print a summary of the index fields and
        endpoint fields.  (This mode is chosen by default.)
entries

Show all entries in the database.

Syntax:

    so-metadata metadata entries

        This will print every row of the metadata map table,
        including the filename, and with two header rows.
files

List the files referenced in the database.

Syntax:

    so-metadata metadata files
    so-metadata metadata files --all
    so-metadata metadata files --clean

        This will print out a list of archive files referenced
        by the database.  --all prints all the rows, even if
        there are a lot of them.  --clean is used to get a
        simple list, one file per line, for use with rsync
        or whatever.
Named Arguments
--clean

Print a simple list of all files (for script digestion).

Default: False

--all

Print all files, not an abbreviated list.

Default: False

lookup

Query database for specific index data and display matched endpoint data.

Syntax:

    so-metadata metadata lookup val1,val2,... [val1,val2,... ]

        Each command line argument is converted to a single query, and
        must consist of comma-delimited fields to be associated
        one-to-one with the index fields.

    Example 1: if the single index field is "obs:obs_id":

        so-metadata metadata lookup obs_1840300000

    Example 2: do two queries (single index field)

        so-metadata metadata lookup obs_1840300000 obs_185040012

    Example 3: single query, two index fields (perhaps timestamp and wafer):

        so-metadata metadata lookup 1840300000,wafer29
Positional Arguments
index

Index information. Comma-delimit your data, for example: 1456453245,wafer5
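
The same lookups can be performed in Python against the ManifestDb directly. This is a minimal sketch: the from_file() and match() methods are assumed from sotodlib.core.metadata, and the two-field index names below are purely illustrative.

    from sotodlib.core import metadata

    db = metadata.ManifestDb.from_file('/path/to/my_db.sqlite')

    # Single index field (e.g. "obs:obs_id"), as in Example 1 above.
    print(db.match({'obs:obs_id': 'obs_1840300000'}))

    # Two index fields (field names here are hypothetical).
    print(db.match({'obs:timestamp': 1840300000, 'dets:wafer': 'wafer29'}))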

reroot

Batch change filenames (by prefix) in the database.

Syntax:

    so-metadata metadata reroot old_prefix new_prefix [output options]

Examples:

    so-metadata metadata reroot /path/on/system1 /path/on/system2 -o my_new_manifest.sqlite
    so-metadata metadata reroot /path/on/system1 /new_path/on/system1 --overwrite
    so-metadata metadata reroot ./result1/obs_12345.h5 ./result2/obs_12345.h5 --overwrite

        These operations will create a new ManifestDb, with all the
        entries from my_db.sqlite, but with the filenames
        (potentially) altered.  Any filename that starts with the
        first argument will be changed, in the output, to instead
        start with the second argument.  When you do this you must
        either say where to write the output (-o) or give the program
        permission to overwrite your input database file.  Note that
        the first argument need not match all entries in the database;
        you can use it to pick out a subset (even a single entry).
Positional Arguments
old_prefix

Prefix to match in current database.

new_prefix

Prefix to replace it with.

Named Arguments
--overwrite

Store modified database in the same file.

Default: False

--output-db, -o

Store modified database in this file.

--dry-run

Run the conversion steps but do not write the results anywhere.

Default: False