Empirical and bootstrap

Finite-sample distributions backed by stored draws or by an underlying data source. .n reports the count; .samples, .draws(), and .components access the stored items.

EmpiricalDistribution(samples, weights=None, *, log_weights=None, name=None)

Bases: Distribution[T], SupportsSampling, SupportsExpectation

Weighted empirical distribution over a finite set of samples.

This is the generic base. Concrete sample types T (objects, callables, opaque user values, ...) are stored in a numpy object array.

Automatic Record dispatch: EmpiricalDistribution(samples, ...) returns a RecordEmpiricalDistribution when

  • samples is a Record (each field stacked along axis 0),
  • or samples is a numeric JAX/numpy array and name=... is passed (the array auto-wraps as a single-field Record({name: arr})).

Otherwise, the generic base is returned and stores samples as a numpy object array.

Parameters:

Name Type Description Default
samples Record | sequence of T | array-like

The support points. Numeric-array inputs require name= so the auto-wrapped Record has a field name; without it, construction raises ValueError.

required
weights array-like, Weights, or None

Non-negative weights (normalised internally). Mutually exclusive with log_weights. Uniform when neither is given.

None
log_weights array-like, Weights, or None

Log-unnormalised weights. Mutually exclusive with weights.

None
name str

Distribution name. Mandatory when samples is a bare numeric array.

None
Source code in probpipe/core/_empirical.py
def __init__(
    self,
    samples: Sequence[T] | ArrayLike,
    weights: ArrayLike | Weights | None = None,
    *,
    log_weights: ArrayLike | Weights | None = None,
    name: str | None = None,
):
    # Generic-T storage: a numpy object array.
    if isinstance(samples, (jnp.ndarray, np.ndarray)):
        self._samples = samples
    else:
        self._samples = np.asarray(samples, dtype=object)
    n = len(self._samples)
    if n == 0:
        raise ValueError("samples must be a non-empty sequence.")
    self._w = Weights(n=n, weights=weights, log_weights=log_weights)
    if name is None:
        name = "empirical"
    super().__init__(name=name)
    self._approximate = True

n property

Number of samples.

samples property

Stored samples.

is_uniform property

True when all samples have equal weight.

weights property

Normalised weights, shape (n,).

log_weights property

Normalised log-weights, shape (n,).

effective_sample_size property

Kish's effective sample size (ESS).
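
As a rough illustration of how weights, log_weights, and effective_sample_size relate, the sketch below normalises log-weights with a log-sum-exp and applies the usual Kish formula, ESS = (sum w)^2 / sum w^2. The helper names normalise_log_weights and kish_ess are hypothetical, and the code uses plain NumPy rather than probpipe's Weights class, whose internals are not shown on this page.

```python
import numpy as np

def normalise_log_weights(log_w):
    """Return log-weights shifted so that exp(...) sums to 1 (stable log-sum-exp)."""
    log_w = np.asarray(log_w, dtype=float)
    shift = np.max(log_w)
    log_norm = shift + np.log(np.sum(np.exp(log_w - shift)))
    return log_w - log_norm

def kish_ess(weights):
    """Kish's effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w) ** 2 / np.sum(w ** 2))

log_w = np.array([0.0, 0.0, np.log(2.0)])  # unnormalised weights 1, 1, 2
w = np.exp(normalise_log_weights(log_w))   # normalised weights 0.25, 0.25, 0.5
ess = kish_ess(w)                          # 1 / 0.375 = 8/3
```

With these weights the ESS is 8/3, below the nominal n = 3; uniform weights give ESS = n.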

RecordEmpiricalDistribution(samples, weights=None, *, log_weights=None, sample_shape=None, name=None)

Bases: EmpiricalDistribution[Record], NumericRecordDistribution, SupportsMean, SupportsVariance, SupportsCovariance

Empirical distribution over Record-structured numeric samples.

Each sample is a row of the stored Record: if the data has fields X of shape (n, p) and y of shape (n,), then a single draw is Record(X=array(p,), y=scalar). Joint row indexing preserves per-observation correlation across fields during sampling and resampling.
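
A minimal sketch of joint row indexing, assuming a plain dict of NumPy arrays as a stand-in for a Record (the dict layout and variable names are illustrative, not probpipe API). One set of row indices is drawn from the weights and applied to every field, so each drawn X row stays paired with its y value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 2
X = np.arange(n * p, dtype=float).reshape(n, p)  # row i is [2i, 2i + 1]
y = 10.0 * np.arange(n)                          # y[i] = 10 * i
weights = np.full(n, 1.0 / n)                    # uniform, already normalised

# Draw the row indices once, then index every field with the same indices.
idx = rng.choice(n, size=4, replace=True, p=weights)
draw = {"X": X[idx], "y": y[idx]}
```

Indexing each field with its own independently drawn indices would instead sample from the product of the marginals and lose the pairing.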

A bare numeric array auto-wraps as a single-field Record keyed by name — that is the migration path for the previous NumericEmpiricalDistribution(arr) form. The auto-wrap requires name= so the field's identity is unambiguous downstream.

Inherits NumericRecordDistribution shape semantics (record_template, event_shapes, event_size, batch_shape) plus exact weighted moments (mean, variance, cov) over each field.

Parameters:

Name Type Description Default
samples Record | array-like

Sample data. A Record's fields each stack along axis 0; a numeric array auto-wraps as Record({name: arr}).

required
weights array-like, Weights, or None

Non-negative weights (normalised internally). Mutually exclusive with log_weights.

None
log_weights array-like, Weights, or None

Log-unnormalised weights. Mutually exclusive with weights.

None
sample_shape tuple of int

Only valid for numeric-array auto-wrap: leading-axis sample shape; trailing axes form the field's event shape.

None
name str

Distribution name. Required when samples is a numeric array (used as the auto-wrapped field name).

None
Notes

Construction calls Distribution.__init__ directly rather than chaining through super().__init__(). The reason: the generic EmpiricalDistribution[T] base stores samples as a flat numpy object array (self._samples), which is incompatible with the Record-structured layout this subclass uses (self._record_data). Subclasses that further specialise this class (e.g. ApproximateDistribution) should likewise call RecordEmpiricalDistribution.__init__ rather than super().__init__ if they need to skip the generic-base storage path.

Source code in probpipe/core/_empirical.py
def __init__(
    self,
    samples: Record | ArrayLike,
    weights: ArrayLike | Weights | None = None,
    *,
    log_weights: ArrayLike | Weights | None = None,
    sample_shape: tuple[int, ...] | None = None,
    name: str | None = None,
):
    if not isinstance(samples, Record):
        if not _is_numeric_array(samples):
            raise TypeError(
                f"RecordEmpiricalDistribution: samples must be a "
                f"Record or a numeric array, got "
                f"{type(samples).__name__}"
            )
        samples, name = _wrap_numeric_array_as_record(
            samples, name=name, sample_shape=sample_shape,
            role="RecordEmpiricalDistribution",
        )
    elif sample_shape is not None:
        raise TypeError(
            "sample_shape is only valid when constructing from a "
            "bare numeric array (single-field auto-wrap path)."
        )
    n = _validate_record_samples(samples)
    self._record_data = samples
    self._n_samples = n
    self._w = Weights(n=n, weights=weights, log_weights=log_weights)
    if name is None:
        name = "empirical(" + ",".join(samples.fields) + ")"
    # Skip EmpiricalDistribution.__init__ (different storage shape);
    # call Distribution.__init__ directly for name registration.
    Distribution.__init__(self, name=name)
    self._approximate = True
    self._record_template = _record_template_from_data(samples)

samples property

Stored stacked-sample data as a structured NumericRecord.

Use self.samples[field_name] for per-field array access. For a flat (n, dim) matrix view across all fields, use flat_samples instead.

flat_samples property

Flat (n, dim) view across all fields, in insertion order.

dim = sum_over_fields(prod(event_shape_f)). Multi-dim event shapes are flattened row-major; field order matches fields. Use samples for the structured per-field view.

Examples:

Single-field auto-wrap with a 1-D event:

EmpiricalDistribution(jnp.zeros((100, 5)), name="theta").flat_samples.shape
# (100, 5)

Multi-field posterior:

posterior = ApproximateDistribution(...)  # mu, log_sigma fields
posterior.flat_samples.shape  # (n, 2)
posterior.flat_samples.mean(axis=0)  # per-parameter posterior mean
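
The dim formula can be checked with a small NumPy sketch. The function flatten_fields is hypothetical and works on a plain dict of arrays, not on a probpipe Record. It only mirrors the description above: each field of shape (n, *event_shape) is reshaped row-major to (n, prod(event_shape)), and the fields are concatenated in insertion order.

```python
import numpy as np

def flatten_fields(fields):
    """Concatenate per-field arrays of shape (n, *event_shape) into (n, dim)."""
    n = next(iter(fields.values())).shape[0]
    # reshape uses C (row-major) order by default; dicts keep insertion order.
    return np.concatenate(
        [np.asarray(arr).reshape(n, -1) for arr in fields.values()], axis=1
    )

fields = {
    "mu": np.zeros(100),                             # scalar event, 1 column
    "Sigma": np.arange(600.0).reshape(100, 2, 3),    # (2, 3) event, 6 columns
}
flat = flatten_fields(fields)                        # shape (100, 1 + 6)
```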

event_shape property

Per-sample event shape, single-field only.

For a single-field record (the auto-wrap case from EmpiricalDistribution(arr, name=...)), returns the field's event shape — i.e. arr.shape[1:].

For multi-field records, raises AttributeError rather than returning (): a silent scalar fallback would let callers that aren't multi-field-aware mis-classify a structured posterior as a scalar event. Use event_shapes (plural, dict-valued) for the multi-field case.

See Also

event_shapes — the per-field dict, always available. RecordBootstrapReplicateDistribution.obs_shape — the symmetric single-field-only / multi-field-raises accessor for bootstrap replicates' per-observation event shape.

Raises:

Type Description
AttributeError

If len(self.fields) > 1.
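
The single-field-only accessor pattern can be sketched with a toy class. ShapeView is a hypothetical stand-in, not the library class; it only reproduces the documented contract that event_shape raises AttributeError for more than one field.

```python
class ShapeView:
    """Toy stand-in holding a per-field dict of event shapes."""

    def __init__(self, event_shapes):
        self.event_shapes = dict(event_shapes)

    @property
    def event_shape(self):
        if len(self.event_shapes) > 1:
            raise AttributeError(
                "event_shape is single-field only; use event_shapes instead."
            )
        (shape,) = self.event_shapes.values()
        return shape

single = ShapeView({"theta": (5,)})
multi = ShapeView({"mu": (), "log_sigma": ()})

single_shape = single.event_shape                    # (5,)
multi_has_attr = hasattr(multi, "event_shape")       # False: AttributeError is swallowed
multi_default = getattr(multi, "event_shape", None)  # None, the supplied default
```

Because the error type is AttributeError, hasattr and three-argument getattr treat the multi-field case as "attribute absent", so generic code that probes for event_shape skips structured posteriors instead of misreading them as scalar.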

event_shapes property

Per-field event shapes (sample axis stripped).

Always available, including for single-field records. Compare event_shape (singular), which is single-field-only and raises on multi-field.

dim property

Flat dimensionality of a single Record draw.

KDEDistribution(samples, weights=None, *, log_weights=None, bandwidth=None, name=None)

Bases: TFPDistribution

Gaussian kernel density estimate as a ProbPipe distribution.

Wraps a TFP MixtureSameFamily(Categorical, MultivariateNormalDiag) to provide a smooth density approximation from a set of weighted samples. Inherits all protocol implementations from TFPDistribution.

Parameters:

Name Type Description Default
samples array-like

Sample matrix of shape (n,) or (n, d).

required
weights array-like, Weights, or None

Non-negative weights. A pre-built Weights object is also accepted. Mutually exclusive with log_weights. When neither is given, uniform weights are used.

None
log_weights array-like, Weights, or None

Log-unnormalized weights. A pre-built Weights object is also accepted. Mutually exclusive with weights.

None
bandwidth array-like or None

Per-dimension bandwidth (standard deviation of each Gaussian kernel), shape (d,) or scalar. If None, Silverman's rule is used: n^{-1/(d+4)} * std_j for each dimension j.

None
name str or None

Distribution name for provenance.

None
Source code in probpipe/distributions/kde.py
def __init__(
    self,
    samples: ArrayLike,
    weights: ArrayLike | Weights | None = None,
    *,
    log_weights: ArrayLike | Weights | None = None,
    bandwidth: ArrayLike | None = None,
    name: str | None = None,
):
    samples = _as_float_array(samples)
    if samples.ndim == 0:
        raise ValueError("samples must have at least 1 dimension.")
    if samples.ndim == 1:
        samples = samples[:, None]  # (n,) -> (n, 1)
        self._scalar = True
    else:
        self._scalar = False

    n, d = samples.shape
    self._samples = samples
    self._d = d
    if name is None:
        name = "kde"
    super().__init__(name=name)

    # Weights
    self._w = Weights(n=n, weights=weights, log_weights=log_weights)
    w = self._w.normalized

    # Bandwidth (Silverman's rule default)
    if bandwidth is not None:
        bw = jnp.broadcast_to(jnp.asarray(bandwidth, dtype=samples.dtype), (d,))
    else:
        std = jnp.sqrt(self._w.variance(samples))
        # Silverman's rule: n^{-1/(d+4)} * std
        silverman_factor = n ** (-1.0 / (d + 4))
        bw = silverman_factor * jnp.maximum(std, 1e-8)
    self._bandwidth = bw

    # Build the TFP mixture distribution
    if d == 1:
        components = tfd.Normal(
            loc=samples[:, 0],
            scale=bw[0],
        )
    else:
        components = tfd.MultivariateNormalDiag(
            loc=samples,
            scale_diag=jnp.broadcast_to(bw, (n, d)),
        )
    self._tfp_dist = tfd.MixtureSameFamily(
        mixture_distribution=tfd.Categorical(probs=w),
        components_distribution=components,
    )
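
For readers who want to see the arithmetic without TFP, here is a NumPy-only sketch of the same two steps: the Silverman default bandwidth and the log-density of the resulting weighted Gaussian mixture. The function names are hypothetical, and the weighted variance is assumed to be the biased form sum_i w_i (x_i - mean)^2; the exact definition used by Weights.variance is not shown on this page.

```python
import numpy as np

def silverman_bandwidth(samples, weights):
    """Per-dimension bandwidth n^{-1/(d+4)} * std_j, floored as in the source."""
    n, d = samples.shape
    mean = weights @ samples
    var = weights @ (samples - mean) ** 2   # biased weighted variance (assumption)
    return n ** (-1.0 / (d + 4)) * np.maximum(np.sqrt(var), 1e-8)

def kde_log_prob(x, samples, weights, bw):
    """log sum_i w_i * prod_j Normal(x_j; samples[i, j], bw_j), via log-sum-exp."""
    n, d = samples.shape
    z = (x - samples) / bw
    log_kernel = (
        -0.5 * np.sum(z ** 2, axis=1)
        - np.sum(np.log(bw))
        - 0.5 * d * np.log(2.0 * np.pi)
    )
    a = np.log(weights) + log_kernel
    m = np.max(a)
    return float(m + np.log(np.sum(np.exp(a - m))))

samples = np.array([[0.0], [2.0]])          # n = 2 kernel centres, d = 1
weights = np.array([0.5, 0.5])
bw = silverman_bandwidth(samples, weights)  # 2 ** (-1/5) * 1.0
lp = kde_log_prob(np.array([1.0]), samples, weights, bw)
```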

n property

Number of kernel centres (samples).

BootstrapDistribution(evaluations, *, weights=None, log_weights=None, name=None)

Bases: NumericRecordDistribution, SupportsSampling, SupportsMean, SupportsVariance

Distribution over bootstrap-resampled means of a statistic.

Given n evaluations f(x_1), ..., f(x_n) where x_i ~ P, this represents the sampling distribution of the sample mean (1/n) sum f(x_i), capturing Monte Carlo error.
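
One way to picture this sampling distribution is to resample the evaluations with replacement and average each resample. The NumPy sketch below does that with uniform weights and synthetic evaluations; it illustrates the idea only and makes no claim about how BootstrapDistribution draws its samples internally.

```python
import numpy as np

rng = np.random.default_rng(0)
evaluations = rng.normal(loc=1.0, scale=2.0, size=500)  # stand-in for f(x_i)
n = evaluations.shape[0]

# Each row of idx selects n evaluations with replacement; average per row.
idx = rng.integers(0, n, size=(2000, n))
boot_means = evaluations[idx].mean(axis=1)

# The spread of the bootstrap means estimates the Monte Carlo standard error.
mc_std_error = boot_means.std(ddof=1)
analytic = evaluations.std(ddof=1) / np.sqrt(n)
```

The bootstrap spread should land close to the textbook standard error s / sqrt(n), which is what "capturing Monte Carlo error" refers to.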

Parameters:

Name Type Description Default
evaluations array-like, shape (n, *stat_shape)

The individual f(x_i) values.

required
weights array-like, Weights, or None

Non-negative weights (normalized internally). A pre-built Weights object is also accepted. Mutually exclusive with log_weights. When neither is given, uniform weights are used.

None
log_weights array-like, Weights, or None

Log-unnormalized weights. A pre-built Weights object is also accepted. Mutually exclusive with weights.

None
name str

Distribution name.

None
Source code in probpipe/core/_numeric_record_distribution.py
def __init__(
    self,
    evaluations: ArrayLike,
    *,
    weights: ArrayLike | Weights | None = None,
    log_weights: ArrayLike | Weights | None = None,
    name: str | None = None,
):
    self._evaluations = _as_float_array(evaluations)
    if self._evaluations.ndim == 0:
        raise ValueError("evaluations must have at least 1 dimension.")
    self._n = self._evaluations.shape[0]
    self._w = Weights(n=self._n, weights=weights, log_weights=log_weights)
    if name is None:
        name = "bootstrap_dist"
    super().__init__(name=name)
    self._approximate = True

n property

Number of function evaluations.

dtypes property

Per-field dtype — the evaluations' dtype spread across the auto-built single-field template.

supports property

Per-field support — bootstrap of mean values is real-valued.

BootstrapReplicateDistribution(source, *, n=None, name=None)

Bases: Distribution[T], SupportsSampling, SupportsExpectation

N-fold product of an empirical distribution (bootstrap resampling).

Each draw from this distribution is a bootstrapped dataset: n observations drawn i.i.d. (with replacement) from the source.

Source dispatch:

  • Record / RecordEmpiricalDistribution / numeric array / numeric-array-backed EmpiricalDistribution → returns a RecordBootstrapReplicateDistribution. The numeric array path requires name= (single-field auto-wrap).
  • Any SupportsSampling source (e.g. Normal, a custom Distribution) → stays in the generic base. n is mandatory because no canonical observation count exists; each replicate is n i.i.d. draws from source._sample.
  • Any other sequence → generic base, equally weighted, with object-array storage.

Parameters:

Name Type Description Default
source Record | EmpiricalDistribution | SupportsSampling | sequence

Data to bootstrap from.

required
n int or None

Number of observations per bootstrap dataset. Required when source is a non-array SupportsSampling (no canonical count); defaults to the source's observation count otherwise.

None
name str or None

Distribution name. Mandatory when source is a numeric array (used as the single-field auto-wrap field name).

None
Source code in probpipe/core/_empirical.py
def __init__(
    self,
    source: Any,
    *,
    n: int | None = None,
    name: str | None = None,
):
    # SupportsSampling source: each replicate is n i.i.d. draws from
    # source._sample. n is mandatory (no canonical observation count
    # for a generic sampleable source).
    if (
        isinstance(source, SupportsSampling)
        and not isinstance(source, EmpiricalDistribution)
        and not _is_numeric_array(source)
    ):
        if n is None or n < 1:
            raise ValueError(
                f"BootstrapReplicateDistribution: when source is a "
                f"SupportsSampling distribution (got "
                f"{type(source).__name__}), n must be a positive int "
                f"giving the number of observations per replicate."
            )
        self._source_kind = "sampleable"
        self._source = source
        self._data = None
        self._w = None
        default_n = n
        self._init_bootstrap_state(default_n, n=n, name=name)
        return

    self._source_kind = "data"
    self._source = None
    if isinstance(source, EmpiricalDistribution):
        self._data = source.samples
        self._w = source._w
        default_n = source.n
    elif isinstance(source, (jnp.ndarray, np.ndarray)):
        self._data = source
        if self._data.ndim == 0:
            raise ValueError(
                "source must have at least 1 dimension (the observation axis)."
            )
        if len(self._data) == 0:
            raise ValueError("source must be a non-empty sequence.")
        self._w = Weights.uniform(len(self._data))
        default_n = len(self._data)
    else:
        self._data = np.asarray(source, dtype=object)
        if len(self._data) == 0:
            raise ValueError("source must be a non-empty sequence.")
        self._w = Weights.uniform(len(self._data))
        default_n = len(self._data)
    self._init_bootstrap_state(default_n, n=n, name=name)

n property

Observations per bootstrap dataset.

source_n property

Number of source observations, or None for a sampleable source.

data property

Source data (None for a sampleable source).

weights property

Source weights (None for a sampleable source).

is_uniform property

True when source observations are equally weighted.

RecordBootstrapReplicateDistribution(source, *, n=None, name=None)

Bases: BootstrapReplicateDistribution[Record], NumericRecordDistribution

Bootstrap replicate distribution over Record-structured data.

Each sample is a full bootstrapped dataset: n rows drawn i.i.d. with replacement from the source data, with the same row indices applied jointly across fields.
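
A sketch of one replicate, assuming uniform source weights and a plain dict of NumPy arrays in place of a Record. The function bootstrap_replicate is hypothetical. It shows the two documented properties: n defaults to the source row count, and one index vector is shared by all fields.

```python
import numpy as np

def bootstrap_replicate(fields, rng, n=None):
    """Draw n rows with replacement, using the same indices for every field."""
    source_n = next(iter(fields.values())).shape[0]
    n = source_n if n is None else n
    idx = rng.integers(0, source_n, size=n)
    return {name: arr[idx] for name, arr in fields.items()}

rng = np.random.default_rng(1)
data = {
    "X": np.arange(12.0).reshape(4, 3),  # row i starts at 3 * i
    "y": np.arange(4.0),                 # y[i] = i
}
rep = bootstrap_replicate(data, rng)               # n defaults to 4 source rows
shapes = {name: arr.shape for name, arr in rep.items()}
bigger = bootstrap_replicate(data, rng, n=10)      # n may differ from source_n
```

Each field of the replicate has shape (n, *obs_event_shape), which matches the replicate event shapes reported by event_shapes.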

Inherits NumericRecordDistribution shape semantics (record_template, event_shapes, ...). A bare numeric array source auto-wraps as a single-field Record keyed by name — matching the migration path for the previous ArrayBootstrapReplicateDistribution(arr) form.

Parameters:

Name Type Description Default
source Record | RecordEmpiricalDistribution | array-like

Data to bootstrap from. A bare numeric array auto-wraps as a single-field Record keyed by name. A generic EmpiricalDistribution (object-array storage) is not accepted — see Raises.

required
n int or None

Observations per bootstrap dataset. Defaults to the source's observation count.

None
name str or None

Distribution name. Mandatory when source is a bare numeric array (used as the single-field auto-wrap field name).

None

Raises:

Type Description
TypeError

If source is a generic EmpiricalDistribution (i.e., object-array storage rather than numeric / Record-backed). The factory path routes numeric-array empiricals to RecordEmpiricalDistribution; only the generic-base instance can reach this constructor, and its non-numeric samples can't be bootstrapped meaningfully.

Source code in probpipe/core/_empirical.py
def __init__(
    self,
    source: Any,
    *,
    n: int | None = None,
    name: str | None = None,
):
    if isinstance(source, RecordEmpiricalDistribution):
        self._record_data = source._record_data
        self._w = source._w
        default_n = source.n
    elif isinstance(source, Record):
        n_rows = _validate_record_samples(source)
        self._record_data = source
        default_n = n_rows
        self._w = Weights.uniform(n_rows)
    elif isinstance(source, EmpiricalDistribution):
        # The factory path routes numeric-array-backed empiricals to
        # ``RecordEmpiricalDistribution`` (caught above), so any
        # ``EmpiricalDistribution`` reaching here is a generic-base
        # instance with object-array storage — directly constructing
        # ``RecordBootstrapReplicateDistribution`` with such a source
        # is the only way in. Reject explicitly so users get a clear
        # error instead of an ``_as_float_array(object_arr)``
        # TypeError later.
        sample_dtype = getattr(
            source.samples, "dtype", type(source.samples).__name__,
        )
        raise TypeError(
            f"RecordBootstrapReplicateDistribution does not accept "
            f"generic (object-array) EmpiricalDistribution sources "
            f"(got {type(source).__name__} with non-numeric samples; "
            f"dtype={sample_dtype}). Pass a "
            f"RecordEmpiricalDistribution, a numeric array, or wrap "
            f"your samples in a Record first."
        )
    elif _is_numeric_array(source):
        wrapped, field_name = _wrap_numeric_array_as_record(
            source, name=name,
            role="RecordBootstrapReplicateDistribution",
        )
        self._record_data = wrapped
        n_rows = _validate_record_samples(wrapped)
        default_n = n_rows
        self._w = Weights.uniform(n_rows)
        name = field_name
    else:
        raise TypeError(
            f"RecordBootstrapReplicateDistribution: source must be a "
            f"Record, RecordEmpiricalDistribution, or numeric array, "
            f"got {type(source).__name__}"
        )
    # Bootstrap-base bookkeeping. Set self._data so the base's
    # `.data` property returns the Record (matches old behaviour).
    self._source_kind = "data"
    self._source = None
    self._data = self._record_data
    self._init_bootstrap_state(
        default_n, n=n, name=name, source_n=default_n,
    )
    # Replicate produces (n, *event_shape) per field; advertise that
    # via the record_template.
    self._record_template = _record_template_from_data(
        self._record_data, leading_shape=(self._n,),
    )

event_shapes property

Per-field replicate event shapes (n, *obs_event_shape).

event_shape property

Replicate event shape, single-field only.

For a single-field replicate, returns (n, *per_observation_event_shape) — i.e. the shape of one bootstrap dataset. Multi-field replicates raise AttributeError rather than returning () so silent scalar-fallback bugs don't slip through. Use event_shapes (plural, per-field) for the multi-field case, or obs_shape for the per-observation shape on single-field replicates.

See Also

event_shapes — the per-field dict, always available. obs_shape — the per-observation event shape (replicate axis stripped) for single-field replicates. RecordEmpiricalDistribution.event_shape — the symmetric single-field-only / multi-field-raises accessor on the empirical-distribution side.

Raises:

Type Description
AttributeError

If len(self.fields) > 1.

obs_shape property

Per-observation event shape, single-field only.

For a single-field replicate, returns the per-observation event shape (the field's shape with the sample axis stripped). Multi-field replicates raise AttributeError rather than returning (); use obs_shapes (plural, per-field) for the multi-field case.

See Also

obs_shapes — the per-field dict, always available. event_shape — the full replicate event shape (n, *obs_shape) for single-field replicates.

Raises:

Type Description
AttributeError

If len(self.fields) > 1.

obs_shapes property

Per-field observation event shapes (replicate axis stripped).

dim property

Flat dimensionality of a single bootstrap dataset.

Sum across fields of n * max(1, prod(obs_event_shape)).
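
The formula is easy to evaluate by hand; the standard-library sketch below (replicate_dim is a hypothetical helper) spells it out for a two-field replicate.

```python
import math

def replicate_dim(n, obs_event_shapes):
    """Sum over fields of n * max(1, prod(obs_event_shape))."""
    # math.prod(()) is already 1 for scalar observations; the max(1, ...)
    # guard only changes the result for a shape that contains a zero.
    return sum(n * max(1, math.prod(shape)) for shape in obs_event_shapes.values())

dim = replicate_dim(5, {"X": (3,), "y": ()})  # 5 * 3 + 5 * 1 = 20
```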

JointEmpirical(*, weights=None, log_weights=None, name=None, **samples)

Bases: RecordDistribution, SupportsSampling, SupportsConditioning

Joint distribution from weighted joint samples.

Stores per-component sample arrays (all with the same number of rows) and optional weights. Sampling resamples rows jointly, preserving correlation between components.

Dynamic dispatch via __new__: when every field is a numeric array (numpy, JAX, or numeric scalar), constructing JointEmpirical returns a NumericJointEmpirical instance, which additionally supports mean and variance. Mixed or opaque data (e.g. object-dtype arrays of labels) falls through to this base class.

When used in broadcasting enumeration, the joint is treated as a single unit with n samples (no cartesian decomposition).
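
The __new__-based dispatch described above follows a general Python pattern, sketched here with two toy classes. Joint and NumericJoint are hypothetical stand-ins, and the numeric check via np.issubdtype is an assumption rather than probpipe's own test.

```python
import numpy as np

class Joint:
    """Toy base: accepts any per-field arrays."""

    def __new__(cls, **samples):
        all_numeric = all(
            np.issubdtype(np.asarray(arr).dtype, np.number)
            for arr in samples.values()
        )
        # Only redirect when the base itself is requested.
        if cls is Joint and all_numeric:
            return super().__new__(NumericJoint)
        return super().__new__(cls)

    def __init__(self, **samples):
        self.samples = samples

class NumericJoint(Joint):
    """Toy numeric subclass: adds per-field means."""

    def mean(self):
        return {k: np.asarray(v).mean(axis=0) for k, v in self.samples.items()}

a = Joint(x=np.zeros((3, 2)), y=np.arange(3))
b = Joint(x=np.zeros(3), label=np.array(["a", "b", "c"], dtype=object))
kinds = (type(a).__name__, type(b).__name__)
```

Because the object returned by __new__ is still an instance of Joint, Python goes on to call its __init__ with the original keyword arguments, so the subclass is fully initialised without extra plumbing.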

Parameters:

Name Type Description Default
weights array-like, Weights, or None

Non-negative sample weights (normalized internally). A pre-built Weights object is also accepted. Mutually exclusive with log_weights.

None
log_weights array-like, Weights, or None

Log-unnormalized sample weights. A pre-built Weights object is also accepted. Mutually exclusive with weights.

None
name str

Distribution name.

None
**samples array-like

Named component sample arrays. Each must have the same number of rows (first dimension = n).

{}
Source code in probpipe/distributions/_joint_empirical.py
def __init__(
    self,
    *,
    weights: ArrayLike | Weights | None = None,
    log_weights: ArrayLike | Weights | None = None,
    name: str | None = None,
    **samples: ArrayLike,
):
    if not samples:
        raise ValueError("JointEmpirical requires at least one component.")

    # Generic path: store samples as-is (numpy or jax arrays). Validate
    # that all components have the same leading row count. Numeric
    # coercion happens in NumericJointEmpirical.
    stored: dict[str, Any] = {}
    n: int | None = None
    for cname, arr in samples.items():
        if not hasattr(arr, "shape") or len(arr.shape) == 0:
            raise ValueError(
                f"Component '{cname}' must have at least 1 dimension "
                f"(first dim = number of samples)."
            )
        if n is None:
            n = arr.shape[0]
        elif arr.shape[0] != n:
            raise ValueError(
                f"All components must have the same number of samples. "
                f"First component has {n}, but '{cname}' has {arr.shape[0]}."
            )
        stored[cname] = arr

    self._joint_samples = stored
    self._n = n
    if name is None:
        name = "joint_empirical(" + ",".join(samples.keys()) + ")"
    super().__init__(name=name)
    self._w = Weights(n=n, weights=weights, log_weights=log_weights)
    self._components = self._build_component_dists()
    self._record_template = (
        _build_record_template(self._components)
        if self._components is not None
        else None
    )

n property

Number of joint samples.

weights property

Normalised weights, shape (n,).

fields property

Component names in insertion order.

components property

Read-only view of the component distributions (numeric case).

NumericJointEmpirical(*, weights=None, log_weights=None, name=None, **samples)

Bases: JointEmpirical, SupportsMean, SupportsVariance

Joint empirical where every field is a numeric array.

Subclass of JointEmpirical that additionally implements SupportsMean and SupportsVariance. For a density on top of empirical samples use the converter registry (from_distribution(emp, KDEDistribution, ...)) or fit a parametric distribution.

Construction coerces every field to a floating-point JAX array (preserving float64 when JAX's x64 mode is enabled, otherwise promoting integer inputs to float32); fields that aren't numeric arrays raise TypeError. Typically constructed via JointEmpirical, which dispatches here automatically when all fields are numeric.

Source code in probpipe/distributions/_joint_empirical.py
def __init__(
    self,
    *,
    weights: ArrayLike | Weights | None = None,
    log_weights: ArrayLike | Weights | None = None,
    name: str | None = None,
    **samples: ArrayLike,
):
    if not samples:
        raise ValueError("NumericJointEmpirical requires at least one component.")

    # Coerce every field to a floating-point JAX array up front.
    # Non-numeric inputs raise ``TypeError`` with a clear message.
    coerced: dict[str, Array] = {}
    for cname, arr in samples.items():
        if not _is_numeric_array(arr):
            raise TypeError(
                f"NumericJointEmpirical: field {cname!r} must be a "
                f"numeric array, got {type(arr).__name__}. Use "
                f"JointEmpirical directly for non-numeric components."
            )
        coerced[cname] = _as_float_array(arr)

    super().__init__(
        weights=weights, log_weights=log_weights, name=name, **coerced,
    )

event_shapes property

Per-component event shapes.