dfchain.backends.pandas package

Submodules

dfchain.backends.pandas.executor_impl module

Pandas backend executor implementation.

This module provides PandasExecutor, an in‑memory implementation of dfchain.core.executor.Executor that wraps a single pandas.DataFrame instance and exposes the grouping and chunking hooks used by higher‑level APIs.

class dfchain.backends.pandas.executor_impl.PandasExecutor(_groupkey: Hashable | None = None, _df: DataFrameLike | None = None, is_eager: bool = False, is_inplace: bool = False, chunksize: int | None = None)

Bases: Executor

Executor implementation backed by an in‑memory pandas.DataFrame.

PandasExecutor is lightweight: it holds the dataframe in memory and implements the grouping and chunking hooks defined by dfchain.core.executor.PartitionAble.

_df

The wrapped dataframe. It can be provided at construction time or later via df().

Type:: pandas.DataFrame or None, default None

is_eager

Hint for task execution mode. When True, tasks may execute eagerly rather than building a deferred plan. The exact semantics are defined by higher‑level APIs.

Type:: bool, default False

is_inplace

When True, task functions are expected to mutate _df in place. When False, tasks should treat _df as immutable and produce a new dataframe to replace it.

Type:: bool, default False

chunksize

Optional hint used by higher‑level code to determine how many rows to process per chunk when streaming or partitioning the data.

Type:: int or None, default None

Note

The pandas backend is designed for in‑memory use and does not maintain an index by group key. As a result, methods that would write changes back to specific groups (update_group, clear_groups, rebuild_groups) raise NotImplementedError.
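Code that targets several backends may want to tolerate this behaviour rather than fail. The following is a minimal sketch under that assumption. InMemorySketch and clear_groups_if_supported are hypothetical names used only for illustration; neither is part of dfchain.

```python
class InMemorySketch:
    """Hypothetical stand-in that mirrors the documented pandas behaviour."""

    def clear_groups(self) -> None:
        # The documented pandas backend raises here because it keeps no group index.
        raise NotImplementedError("group state is not tracked in memory")


def clear_groups_if_supported(executor) -> bool:
    """Call clear_groups() and report whether the backend supported it."""
    try:
        executor.clear_groups()
    except NotImplementedError:
        return False
    return True


cleared = clear_groups_if_supported(InMemorySketch())
print(cleared)
```

The helper only catches NotImplementedError, so any other failure raised by a backend still propagates to the caller.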

clear_groups() → None

Clear any cached grouping state.

The pandas executor does not cache grouped state keyed by a group index, so this method raises NotImplementedError.

df(df)

Set the wrapped dataframe and return self for fluent construction.

Parameters:: df (pandas.DataFrame) – Dataframe to wrap.
Returns:: The executor instance (self).
Return type:: PandasExecutor
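The fluent setter pattern described here can be sketched in isolation. This is an illustration only: SketchExecutor is a hypothetical stand-in, not the real PandasExecutor, and a plain list stands in for the dataframe so the snippet has no dependencies.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchExecutor:
    """Hypothetical stand-in; not the real PandasExecutor."""

    _df: Optional[Any] = None
    is_eager: bool = False

    def df(self, df: Any) -> "SketchExecutor":
        # Store the dataframe, then return the same instance so calls can be chained.
        self._df = df
        return self


rows = [{"x": 1}, {"x": 2}]  # any dataframe-like object works for this sketch
ex = SketchExecutor(is_eager=True).df(rows)
```

Returning self rather than a copy means that the chained expression and the original instance refer to the same object.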

iter_chunks() → Iterable[DataFrame]

Iterate dataframe chunks.

The default in‑memory strategy treats the entire dataframe as a single chunk, so the iterator produces exactly one item. Callers that need finer‑grained chunking should subclass PandasExecutor or use a different Executor implementation.

Note

The default implementation yields a single (key, chunk) pair in which key is 0 and chunk is the full dataframe, even though the return annotation reads Iterable[DataFrame]. Code that consumes the iterator should unpack the pair accordingly.
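For callers who do need row‑based chunks, one possible strategy is sketched below. It assumes that a subclass would keep the same (key, chunk) convention and use chunksize as the number of rows per chunk. iter_row_chunks is a hypothetical helper written for this illustration, not a dfchain function.

```python
from typing import Iterator, Optional, Tuple

import pandas as pd


def iter_row_chunks(
    df: pd.DataFrame, chunksize: Optional[int] = None
) -> Iterator[Tuple[int, pd.DataFrame]]:
    """Yield (key, chunk) pairs; hypothetical helper, not part of dfchain."""
    if chunksize is None:
        # Mirrors the documented default: one chunk, keyed 0.
        yield 0, df
        return
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    # Slice consecutive row ranges and number them from 0.
    for key, start in enumerate(range(0, len(df), chunksize)):
        yield key, df.iloc[start : start + chunksize]


frame = pd.DataFrame({"x": range(5)})
chunk_sizes = [(key, len(chunk)) for key, chunk in iter_row_chunks(frame, chunksize=2)]
default_chunks = list(iter_row_chunks(frame))
```

With five rows and chunksize=2 the last chunk is shorter than the others, which any consumer of such an iterator has to allow for.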

iter_groups() → Iterable[tuple[Hashable, DataFrame]]

Iterate grouped data as (key, group_df) pairs.

If _groupkey is None, yield a single pair (None, self._df) containing the whole dataframe. Otherwise, perform self._df.groupby(self._groupkey) and yield the resulting (key, group) pairs produced by pandas.
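The behaviour described above can be approximated with plain pandas. This is a sketch of the documented logic, not the library's source: iter_groups_sketch is a hypothetical free function, and the column names are made up for the example.

```python
from typing import Hashable, Iterator, Optional, Tuple

import pandas as pd


def iter_groups_sketch(
    df: pd.DataFrame, groupkey: Optional[Hashable] = None
) -> Iterator[Tuple[Hashable, pd.DataFrame]]:
    """Yield (key, group_df) pairs; hypothetical helper, not part of dfchain."""
    if groupkey is None:
        # No group key: the whole dataframe is the only group.
        yield None, df
        return
    # Iterating a pandas GroupBy object produces (key, group) pairs.
    yield from df.groupby(groupkey)


frame = pd.DataFrame({"g": ["a", "b", "a"], "x": [1, 2, 3]})
group_sizes = {key: len(group) for key, group in iter_groups_sketch(frame, "g")}
whole = list(iter_groups_sketch(frame))
```

Because pandas sorts group keys by default, the grouped pairs arrive in key order rather than in order of first appearance.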

rebuild_groups(flush_every: int = 1)

Rebuild or re‑materialize groups.

Parameters:: flush_every (int, optional) – Hint controlling how often to flush intermediate state. The in‑memory pandas backend does not implement this method; calling it raises NotImplementedError.

update_group(df: DataFrame) → None

Update the current group with the provided dataframe.

The pandas backend does not maintain an index by group key, so there is no safe default way to update a single group in place. This method therefore raises NotImplementedError. Backends that support indexed group updates (for example, a database backend) should provide an implementation.
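One way to work around the missing per‑group update, sketched here with plain pandas, is to transform each group and concatenate the results into a new dataframe. This is a suggested alternative under the assumption that rebuilding the whole frame is acceptable; it is not something dfchain is documented to do, and the column names are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({"g": ["a", "b", "a"], "x": [1, 2, 3]})

parts = []
for key, group in frame.groupby("g"):
    # Build a modified copy of each group instead of writing it back in place.
    parts.append(group.assign(x_centered=group["x"] - group["x"].mean()))

# Reassemble the groups and restore the original row order.
rebuilt = pd.concat(parts).sort_index()
```

The original frame is left untouched, so the result would still have to be handed back to the executor as a whole.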

write_chunk(key, val)

Write a processed chunk back to the executor.

The pandas backend treats the entire dataframe as one chunk; this method therefore replaces self._df with val and ignores the provided key.
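A read, transform and write‑back round trip under these rules might look like the sketch below. SingleChunkSketch is a hypothetical stand-in that mimics the documented single‑chunk behaviour; it is not the real PandasExecutor.

```python
import pandas as pd


class SingleChunkSketch:
    """Hypothetical stand-in for the documented single-chunk behaviour."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    def iter_chunks(self):
        # One chunk only: the whole dataframe, keyed 0.
        yield 0, self._df

    def write_chunk(self, key, val: pd.DataFrame) -> None:
        # The key is ignored; the wrapped dataframe is replaced outright.
        self._df = val


ex = SingleChunkSketch(pd.DataFrame({"x": [1, 2, 3]}))
for key, chunk in list(ex.iter_chunks()):
    ex.write_chunk(key, chunk.assign(double=chunk["x"] * 2))
result = ex._df
```

Since the write replaces the whole dataframe, anything not carried over into val is lost.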

Module contents

Pandas backend for dfchain.

This subpackage provides an Executor implementation backed by an in‑memory pandas.DataFrame.

The main entry point is PandasExecutor, which wraps a single dataframe and exposes the generic executor interface used throughout dfchain.

Typical usage:

import pandas as pd
from dfchain.backends.pandas import PandasExecutor

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
ex = PandasExecutor().df(df).build()

# no group key is set, so this yields a single (None, df) pair
for key, group in ex.iter_groups():
    ...