dfchain.backends.pandas package

Submodules

dfchain.backends.pandas.executor_impl module

Pandas backend executor implementation.

This module provides PandasExecutor, an in‑memory implementation of dfchain.core.executor.Executor that wraps a single pandas.DataFrame instance and exposes the grouping and chunking hooks used by higher‑level APIs.

class dfchain.backends.pandas.executor_impl.PandasExecutor(_groupkey: Hashable | None = None, _df: DataFrameLike | None = None, is_eager: bool = False, is_inplace: bool = False, chunksize: int | None = None)

Bases: Executor

Executor implementation backed by an in‑memory pandas.DataFrame.

PandasExecutor is a lightweight in‑memory executor that wraps a pandas.DataFrame and implements the grouping and chunking hooks defined by dfchain.core.executor.PartitionAble.

_df

The wrapped dataframe. It can be provided at construction time or later via df().

Type:

pandas.DataFrame or None, default None

is_eager

Hint for task execution mode. When True, tasks may execute eagerly rather than building a deferred plan. The exact semantics are defined by higher‑level APIs.

Type:

bool, default False

is_inplace

When True, task functions are expected to mutate _df in place. When False, tasks should treat _df as immutable and reassign a new dataframe instead.

Type:

bool, default False
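The two task styles implied by is_inplace can be sketched with plain pandas. The function names below are hypothetical and not part of dfchain; how dfchain actually invokes tasks is defined by its higher‑level APIs.

```python
import pandas as pd


def add_total_inplace(df: pd.DataFrame) -> None:
    # In-place style: mutate the frame that was passed in.
    df["total"] = df["x"] + df["y"]


def add_total_copy(df: pd.DataFrame) -> pd.DataFrame:
    # Immutable style: leave the input alone and return a new frame.
    return df.assign(total=df["x"] + df["y"])


original = pd.DataFrame({"x": [1, 2], "y": [10, 20]})

mutated = original.copy()
add_total_inplace(mutated)        # 'mutated' gains a 'total' column

fresh = add_total_copy(original)  # 'original' is left unchanged
```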

chunksize

Optional hint used by higher‑level code to determine how many rows to process per chunk when streaming or partitioning the data.

Type:

int or None, default None
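As an illustration only, higher‑level code might interpret chunksize as a row count and slice the dataframe accordingly. The helper iter_row_chunks below is hypothetical and is not provided by dfchain; it assumes (key, chunk) pairs with a running integer key.

```python
from __future__ import annotations

from collections.abc import Iterator

import pandas as pd


def iter_row_chunks(
    df: pd.DataFrame, chunksize: int | None = None
) -> Iterator[tuple[int, pd.DataFrame]]:
    """Yield (key, chunk) pairs of at most `chunksize` rows each."""
    if chunksize is None:
        # No hint given: treat the whole frame as a single chunk.
        yield 0, df
        return
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    for key, start in enumerate(range(0, len(df), chunksize)):
        # Positional slicing keeps the original index labels.
        yield key, df.iloc[start : start + chunksize]


df = pd.DataFrame({"x": range(5)})
sizes = [(key, len(chunk)) for key, chunk in iter_row_chunks(df, chunksize=2)]
```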

Note

The pandas backend is designed for in‑memory use and does not maintain an index by group key. As a result, methods that would write changes back to specific groups (update_group, clear_groups, rebuild_groups) raise NotImplementedError.

clear_groups() → None

Clear any cached grouping state.

The pandas executor does not cache grouped state keyed by a group index, so this method raises NotImplementedError.

df(df)

Set the wrapped dataframe and return self for fluent construction.

Parameters:

df (pandas.DataFrame) – Dataframe to wrap.

Returns:

The executor instance (self), so that calls can be chained.

Return type:

PandasExecutor

iter_chunks() → Iterable[DataFrame]

Iterate dataframe chunks.

The default in‑memory strategy yields a single chunk containing the entire dataframe. Callers that require more advanced chunking behaviour should subclass PandasExecutor or use a different Executor implementation.

Note

Although the return annotation is Iterable[DataFrame], the default implementation yields a single (key, chunk) pair rather than a bare dataframe: key is 0 and chunk is the full dataframe. Code that consumes the iterator should unpack each item accordingly.

iter_groups() → Iterable[tuple[Hashable, DataFrame]]

Iterate grouped data as (key, group_df) pairs.

If _groupkey is None, yield a single pair (None, self._df) containing the whole dataframe. Otherwise, perform self._df.groupby(self._groupkey) and yield the resulting (key, group) pairs produced by pandas.
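The behaviour described above can be sketched as a standalone function over plain pandas. iter_groups_sketch is a hypothetical stand‑in written for illustration, not the actual method body.

```python
from __future__ import annotations

from collections.abc import Hashable, Iterator

import pandas as pd


def iter_groups_sketch(
    df: pd.DataFrame, groupkey: Hashable | None = None
) -> Iterator[tuple[Hashable, pd.DataFrame]]:
    """Mirror the documented iter_groups() behaviour."""
    if groupkey is None:
        # No group key: the whole dataframe is the only "group".
        yield None, df
        return
    # Otherwise defer to pandas, which yields (key, group) pairs.
    yield from df.groupby(groupkey)


df = pd.DataFrame({"x": ["a", "b", "a"], "y": [10, 20, 30]})

ungrouped = list(iter_groups_sketch(df))
grouped = {key: group["y"].sum() for key, group in iter_groups_sketch(df, "x")}
```

With no group key the iterator produces exactly one pair whose key is None; with a column name it produces one pair per distinct value in that column.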

rebuild_groups(flush_every: int = 1)

Rebuild or re‑materialize groups. The in‑memory pandas backend does not support this operation, so the method raises NotImplementedError.

Parameters:

flush_every (int, optional) – Hint controlling how often to flush intermediate state. Unused by the in‑memory pandas backend.

update_group(df: DataFrame) → None

Update the current group with the provided dataframe.

The pandas backend does not maintain an index by group key, so there is no safe default way to update a single group in place. This method therefore raises NotImplementedError. Backends that support indexed group updates (for example, a database backend) should provide an implementation.
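Because a single group cannot be updated in place, one possible workaround is to transform each group separately and reassemble the full dataframe. The sketch below uses plain pandas only. Handing the result back through df() or write_chunk() is an assumption about how a caller might proceed, not a documented workflow.

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "b", "a"], "y": [10, 20, 30]})

parts = []
for key, group in df.groupby("x"):
    # assign() returns a new frame, so the source dataframe is untouched.
    parts.append(group.assign(y_centered=group["y"] - group["y"].mean()))

# Concatenate the processed groups and restore the original row order.
result = pd.concat(parts).sort_index()
```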

write_chunk(key, val)

Write a processed chunk back to the executor.

The pandas backend treats the entire dataframe as one chunk; this method therefore replaces self._df with val and ignores the provided key.
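A minimal sketch of the chunk round trip, assuming the (0, dataframe) convention noted under iter_chunks(). SingleChunkSketch is a hypothetical stand‑in that mimics the documented behaviour; it is not the real PandasExecutor.

```python
from __future__ import annotations

from collections.abc import Iterator

import pandas as pd


class SingleChunkSketch:
    """Stand-in that treats the whole dataframe as one chunk."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    def iter_chunks(self) -> Iterator[tuple[int, pd.DataFrame]]:
        # One chunk only: key 0 paired with the full dataframe.
        yield 0, self._df

    def write_chunk(self, key: int, val: pd.DataFrame) -> None:
        # The key is ignored; the wrapped dataframe is replaced wholesale.
        self._df = val


ex = SingleChunkSketch(pd.DataFrame({"x": [1, 2, 3]}))
for key, chunk in ex.iter_chunks():
    # Process the chunk, then write the new frame back.
    ex.write_chunk(key, chunk.assign(x2=chunk["x"] * 2))

result = ex._df
```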

Module contents

Pandas backend for dfchain.

This subpackage provides an Executor implementation backed by an in‑memory pandas.DataFrame.

The main entry point is PandasExecutor, which wraps a single dataframe and exposes the generic executor interface used throughout dfchain.

Typical usage:

import pandas as pd
from dfchain.backends.pandas import PandasExecutor

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
ex = PandasExecutor().df(df).build()

# iterate as a single group
for key, group in ex.iter_groups():
    ...