dfchain.api package
Submodules
dfchain.api.pipeline module
Pipeline utilities for dfchain.
This module defines the Transformation protocol, a lightweight type for functions and objects that transform pandas DataFrames, and the Pipeline class, which composes a sequence of such transformations and runs them using an Executor. The design intentionally keeps transformations simple: a Transformation may be any callable that accepts a DataFrame and an Executor and returns a DataFrame, so a transformation can be a plain function.
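The call shape described above can be illustrated without dfchain itself. In the sketch below, SimpleExecutor and the manual composition loop are hypothetical stand-ins for the real Executor and Pipeline, assuming only the signature stated in the docstring (DataFrame and Executor in, DataFrame out):

```python
import pandas as pd

# Hypothetical stand-in for dfchain's Executor; the real class is richer.
class SimpleExecutor:
    pass

# A Transformation is just a callable: (DataFrame, Executor) -> DataFrame.
def double_score(df: pd.DataFrame, executor: SimpleExecutor) -> pd.DataFrame:
    out = df.copy()
    out["score"] = out["score"] * 2
    return out

# Composing steps by hand, the way a Pipeline runs them in order.
executor = SimpleExecutor()
df = pd.DataFrame({"score": [1, 2, 3]})
for step in [double_score, double_score]:
    df = step(df, executor)

print(df["score"].tolist())  # [4, 8, 12]
```

Because a plain function satisfies the protocol, no registration or subclassing is needed to define a step.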
- class dfchain.api.pipeline.Pipeline(steps: Sequence[Transformation])[source]
Bases: object
Compose and run a sequence of Transformations.
The Pipeline object holds an ordered sequence of Transformation callables and provides a single run() method which executes each step in order using the provided Executor.
- steps
Ordered list of transformations.
- Type:
Sequence[Transformation]
Example
>>> pipeline = Pipeline([step1, step2])
>>> pipeline.run(executor)
- run(executor: Executor) None[source]
Execute the pipeline using executor.
For each transformation in steps, the pipeline applies the transformation across all chunks provided by the executor via transform(). By default, transformed chunks are written back to the executor using Executor.write_chunk() (unless the executor is configured as inplace). After each step, if the executor has a group key configured and is not in eager mode, the pipeline rebuilds group blobs via Executor.rebuild_groups().
- Parameters:
executor (Executor) – Executor responsible for storing and iterating over DataFrame chunks.
- steps: Sequence[Transformation]
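A minimal sketch of the run() loop, using a hypothetical in-memory executor (chunks in a dict, no group key, not inplace) to make the write-back behaviour concrete; DictExecutor and this simplified loop are assumptions based only on the description above, and the group-rebuild branch is omitted:

```python
import pandas as pd

# Hypothetical in-memory executor: just enough surface for the sketch.
class DictExecutor:
    def __init__(self, chunks):
        self.chunks = dict(chunks)   # chunk_id -> DataFrame
        self.inplace = False

    def iter_chunks(self):
        return iter(self.chunks.items())

    def write_chunk(self, chunk_id, df):
        self.chunks[chunk_id] = df

def run(steps, executor):
    # For each step: transform every chunk, then write the result back
    # (the real Pipeline also rebuilds group blobs when configured).
    for step in steps:
        for chunk_id, chunk in list(executor.iter_chunks()):
            result = step(chunk, executor)
            if not executor.inplace:
                executor.write_chunk(chunk_id, result)

def add_one(df, _executor):
    return df.assign(x=df["x"] + 1)

ex = DictExecutor({0: pd.DataFrame({"x": [1]}), 1: pd.DataFrame({"x": [10]})})
run([add_one, add_one], ex)
print(ex.chunks[0]["x"].tolist(), ex.chunks[1]["x"].tolist())  # [3] [12]
```

Materializing the chunk list before writing back keeps the iteration stable while chunks are replaced.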
- class dfchain.api.pipeline.Transformation(*args, **kwargs)[source]
Bases: Protocol
Callable protocol representing a DataFrame transformation.
A Transformation is any callable that accepts a pandas.DataFrame and, optionally, an Executor, and returns a pandas.DataFrame.
Example
A transformation can be a function
import pandas as pd
from dfchain import task

@task
def add_flag(df: pd.DataFrame, _executor, threshold: float = 0.5) -> pd.DataFrame:
    df = df.copy()
    df["flag"] = df["score"] > threshold
    return df
- dfchain.api.pipeline.transform(executor: Executor, f: Transformation) Iterator[DataFrame][source]
Apply a transformation across the executor’s stored chunks.
This helper iterates executor.iter_chunks(), applying the provided transformation f to each chunk. If the executor is configured as eager and a group key is present, the resulting fragment will be immediately merged into the stored groups via Executor.update_group().
- Parameters:
executor (Executor) – Executor providing chunk iteration.
f (Transformation) – Transformation to apply to each chunk.
- Yields:
Iterator[tuple[int, pandas.DataFrame]] – Pairs of (chunk_id, DataFrame) with the transformed DataFrame for each chunk.
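The generator shape of the helper can be sketched as follows, assuming an executor whose iter_chunks() yields (chunk_id, DataFrame) pairs; TwoChunkExecutor is a hypothetical stand-in, and the eager group-merge branch is left out:

```python
import pandas as pd
from typing import Iterator

def transform(executor, f) -> Iterator[tuple[int, pd.DataFrame]]:
    # Lazily yield (chunk_id, transformed DataFrame) pairs; nothing runs
    # until the caller consumes the generator.
    for chunk_id, chunk in executor.iter_chunks():
        yield chunk_id, f(chunk, executor)

# Hypothetical minimal executor for the demo.
class TwoChunkExecutor:
    def iter_chunks(self):
        yield 0, pd.DataFrame({"v": [1, 2]})
        yield 1, pd.DataFrame({"v": [3]})

pairs = list(transform(TwoChunkExecutor(), lambda df, ex: df * 10))
print([(i, df["v"].tolist()) for i, df in pairs])  # [(0, [10, 20]), (1, [30])]
```

Yielding pairs rather than bare DataFrames lets the caller write each result back to the correct chunk slot.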
dfchain.api.read_csv module
dfchain.api.task module
Task decorator utilities.
This module provides the task decorator adapter used to wrap simple
pandas DataFrame-processing functions so they can be used as
Transformations in the dfchain.api.pipeline Pipeline.
The adapter supports two input shapes:
- A single pandas.DataFrame -> pandas.DataFrame function (applied to one partition), and
- An iterator of (idx, DataFrame) pairs -> iterator of DataFrames.
The returned wrapper presents a consistent signature accepted by the Pipeline/Executor runtime: it accepts either a DataFrame or an iterator of (partition id, DataFrame) pairs, and returns a modified DataFrame or an iterator of DataFrames, respectively.
Docstrings follow the Google style for compatibility with Sphinx napoleon.
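The two shapes can be sketched as a single dispatching wrapper. Everything below (the isinstance test, the generator branch) is a simplified assumption derived from the description above, not dfchain's actual implementation:

```python
import functools
import pandas as pd

def task(func):
    # Hypothetical adapter: dispatch on the input shape at call time.
    @functools.wraps(func)
    def wrapper(data, _executor=None, **kwargs):
        if isinstance(data, pd.DataFrame):
            # Shape 1: single DataFrame -> DataFrame.
            return func(data, **kwargs)
        # Shape 2: iterator of (idx, DataFrame) pairs -> iterator of DataFrames.
        return (func(df, **kwargs) for _idx, df in data)
    return wrapper

@task
def add_one(df, **kwargs):
    return df.assign(x=df["x"] + 1)

single = add_one(pd.DataFrame({"x": [1]}))
many = list(add_one(iter([(0, pd.DataFrame({"x": [5]}))])))
print(single["x"].tolist(), many[0]["x"].tolist())  # [2] [6]
```

The same decorated function then works whether the runtime hands it one partition or a stream of them.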
- dfchain.api.task.is_df_iterator(x) bool[source]
Return True when x is an iterator of DataFrames (not a single DataFrame).
- Parameters:
x – Candidate object to test.
- Returns:
True if x appears to be an iterator yielding DataFrames.
- Return type:
bool
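A plausible sketch of such a predicate; the actual check in dfchain may differ, and the isinstance-based test here is an assumption:

```python
import pandas as pd
from collections.abc import Iterator

def is_df_iterator(x) -> bool:
    # A DataFrame is iterable (over its column labels) but is not an
    # iterator (no __next__), so this distinguishes the two cases.
    return isinstance(x, Iterator) and not isinstance(x, pd.DataFrame)

print(is_df_iterator(pd.DataFrame({"a": [1]})))          # False
print(is_df_iterator(iter([pd.DataFrame({"a": [1]})])))  # True
```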
- dfchain.api.task.task(func: Callable[[DataFrame], DataFrame]) Callable[[DataFrame], DataFrame][source]
- dfchain.api.task.task(func: Callable[[DataFrame], DataFrame]) Callable[[Iterator[DataFrame]], Iterator[DataFrame]]
Decorator adapter that makes simple DataFrame functions compatible with Pipeline runtime.
The original function may accept a pandas.DataFrame and optional keyword arguments. If the function accepts an _executor parameter, the wrapper will pass the Executor object through so tasks can inspect runtime configuration if necessary.
- Parameters:
func – User-provided transformation callable accepting a DataFrame.
- Returns:
Wrapped function matching the Pipeline/Executor expected signature.
- Return type:
Callable
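The _executor pass-through can be sketched with signature inspection: the wrapper forwards the executor only when the user function declares the parameter. How dfchain actually detects this is not specified above, so the inspect-based check and the Cfg stand-in are assumptions:

```python
import inspect
import pandas as pd

def task(func):
    # Hypothetical detection: forward the executor only if declared.
    wants_executor = "_executor" in inspect.signature(func).parameters

    def wrapper(df, executor=None, **kwargs):
        if wants_executor:
            return func(df, _executor=executor, **kwargs)
        return func(df, **kwargs)
    return wrapper

@task
def tag(df, _executor=None):
    # Tasks can inspect runtime configuration via the executor if needed.
    return df.assign(eager=getattr(_executor, "eager", False))

class Cfg:  # hypothetical stand-in carrying executor configuration
    eager = True

out = tag(pd.DataFrame({"a": [1]}), executor=Cfg())
print(out["eager"].tolist())  # [True]
```

Functions that never declare _executor stay plain DataFrame-to-DataFrame functions and remain trivially unit-testable.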