dfchain.api package

Submodules

dfchain.api.pipeline module

Pipeline utilities for dfchain.

This module defines the Transformation protocol, a lightweight type for functions or objects that transform pandas DataFrames, and the Pipeline class, which composes a sequence of such transformations and runs them using an Executor. The design intentionally keeps transformations simple: a Transformation may be any callable that accepts a DataFrame and an Executor and returns a DataFrame, so a plain function qualifies.
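For instance, a plain function with this shape satisfies the protocol (the function and column names here are illustrative, not part of dfchain):

```python
import pandas as pd

def drop_missing(df: pd.DataFrame, executor=None) -> pd.DataFrame:
    # A Transformation: accepts a DataFrame (and an Executor), returns a DataFrame.
    return df.dropna()

frame = pd.DataFrame({"score": [0.2, None, 0.9]})
cleaned = drop_missing(frame)
```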

class dfchain.api.pipeline.Pipeline(steps: Sequence[Transformation])[source]

Bases: object

Compose and run a sequence of Transformations.

The Pipeline object holds an ordered sequence of Transformation callables and provides a single run() method which executes each step in order using the provided Executor.

steps

Ordered list of transformations.

Type:

Sequence[Transformation]

Example

>>> pipeline = Pipeline([step1, step2])
>>> pipeline.run(executor)
run(executor: Executor) → None[source]

Execute the pipeline using executor.

For each transformation in steps, the pipeline applies the transformation across all chunks provided by the executor via transform(). By default, transformed chunks are written back to the executor using Executor.write_chunk() (unless the executor is configured as inplace). After each step, if the executor has a group key configured and is not in eager mode, the pipeline rebuilds group blobs via Executor.rebuild_groups().

Parameters:

executor (Executor) – Executor responsible for storing and iterating over DataFrame chunks.

steps: Sequence[Transformation]
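The control flow described for run() can be sketched against a stub executor. StubExecutor below only mimics the Executor surface named above (iter_chunks(), write_chunk(), rebuild_groups(), and the inplace/group_key/eager configuration); it is an illustration under those assumptions, not dfchain's implementation:

```python
import pandas as pd
from typing import Callable, Sequence

class StubExecutor:
    """Minimal stand-in mimicking the Executor surface named in the docstring."""
    def __init__(self, chunks, inplace=False, group_key=None, eager=False):
        self.chunks = dict(enumerate(chunks))
        self.inplace = inplace
        self.group_key = group_key
        self.eager = eager
        self.rebuilt = 0  # counts rebuild_groups() calls

    def iter_chunks(self):
        return iter(list(self.chunks.items()))

    def write_chunk(self, chunk_id, df):
        self.chunks[chunk_id] = df

    def rebuild_groups(self):
        self.rebuilt += 1

def run(steps: Sequence[Callable], executor: StubExecutor) -> None:
    for step in steps:
        # Apply the step to every chunk; write results back unless inplace.
        for chunk_id, df in executor.iter_chunks():
            result = step(df, executor)
            if not executor.inplace:
                executor.write_chunk(chunk_id, result)
        # After each step: rebuild group blobs if grouped and not eager.
        if executor.group_key is not None and not executor.eager:
            executor.rebuild_groups()

ex = StubExecutor([pd.DataFrame({"x": [1, 2]})], group_key="x")
run([lambda df, _ex: df.assign(y=df["x"] * 2)], ex)
```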
class dfchain.api.pipeline.Transformation(*args, **kwargs)[source]

Bases: Protocol

Callable protocol representing a DataFrame transformation.

A Transformation is any callable that accepts a pandas.DataFrame (and optionally an Executor) and returns a pandas.DataFrame.

Example

A transformation can be a function

import pandas as pd
from dfchain import task

@task
def add_flag(df: pd.DataFrame, _executor, threshold: float = 0.5) -> pd.DataFrame:
    df = df.copy()
    df["flag"] = df["score"] > threshold
    return df
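Because the protocol only requires a callable, a parameterized class instance works equally well. A sketch (ScaleColumn is a hypothetical example, not part of dfchain):

```python
import pandas as pd

class ScaleColumn:
    """A Transformation as a callable object: scales one column of a copy."""
    def __init__(self, column: str, factor: float):
        self.column = column
        self.factor = factor

    def __call__(self, df: pd.DataFrame, _executor=None) -> pd.DataFrame:
        df = df.copy()
        df[self.column] = df[self.column] * self.factor
        return df

scale = ScaleColumn("score", 2.0)
out = scale(pd.DataFrame({"score": [0.25, 0.5]}))
```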
dfchain.api.pipeline.transform(executor: Executor, f: Transformation) → Iterator[DataFrame][source]

Apply a transformation across the executor’s stored chunks.

This helper iterates over executor.iter_chunks(), applying the provided transformation f to each chunk. If the executor is configured as eager and a group key is present, each resulting fragment is immediately merged into the stored groups via Executor.update_group().

Parameters:
  • executor (Executor) – Executor providing chunk iteration.

  • f (Transformation) – Transformation to apply to each chunk.

Yields:

Pairs of (chunk_id, DataFrame) with the transformed DataFrame for each chunk.

Yield type:

Iterator[tuple[int, pandas.DataFrame]]
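The yielding behavior described above can be sketched with a stub executor (the stub only mimics the iter_chunks()/update_group() surface and eager/group_key configuration named in the docstring; it is not dfchain's implementation):

```python
import pandas as pd

class StubExecutor:
    """Stand-in exposing only what transform() is documented to use."""
    def __init__(self, chunks, eager=False, group_key=None):
        self.chunks = dict(enumerate(chunks))
        self.eager = eager
        self.group_key = group_key
        self.merged = []  # fragments merged via update_group()

    def iter_chunks(self):
        return iter(list(self.chunks.items()))

    def update_group(self, df):
        self.merged.append(df)

def transform(executor, f):
    # Yield (chunk_id, transformed DataFrame) pairs for each stored chunk.
    for chunk_id, df in executor.iter_chunks():
        result = f(df, executor)
        if executor.eager and executor.group_key is not None:
            executor.update_group(result)
        yield chunk_id, result

ex = StubExecutor([pd.DataFrame({"a": [1]}), pd.DataFrame({"a": [2]})])
pairs = list(transform(ex, lambda df, _ex: df * 10))
```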

dfchain.api.read_csv module

dfchain.api.read_csv.get_default_executor()[source]
dfchain.api.read_csv.read_csv(*ac, **av)[source]

dfchain.api.task module

Task decorator utilities.

This module provides the task decorator adapter used to wrap simple pandas DataFrame-processing functions so they can be used as Transformations in the dfchain.api.pipeline Pipeline.

The adapter supports two input shapes:

  • A single pandas.DataFrame -> pandas.DataFrame function (applied to one partition), and

  • An iterator of (idx, DataFrame) pairs -> iterator of DataFrames.

The returned wrapper presents a consistent signature accepted by the Pipeline/Executor runtime: it accepts either a DataFrame or an iterator of (partition id, DataFrame) pairs and returns either a modified DataFrame or an iterator of DataFrames respectively.

Docstrings follow the Google style for compatibility with Sphinx napoleon.
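The two-shape dispatch described above can be sketched as follows. This is a simplified stand-in for the real decorator, assuming a plain isinstance check in place of is_df_iterator():

```python
import functools
import pandas as pd

def task(func):
    """Sketch: adapt a per-DataFrame function to also accept (idx, df) iterators."""
    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        if isinstance(data, pd.DataFrame):
            # Shape 1: a single partition in, a single DataFrame out.
            return func(data, *args, **kwargs)
        # Shape 2: an iterator of (idx, DataFrame) pairs in,
        # an iterator of DataFrames out.
        return (func(df, *args, **kwargs) for _idx, df in data)
    return wrapper

@task
def double(df: pd.DataFrame) -> pd.DataFrame:
    return df * 2

single = double(pd.DataFrame({"a": [1]}))
many = list(double(iter([(0, pd.DataFrame({"a": [2]}))])))
```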

dfchain.api.task.is_df_iterator(x) → bool[source]

Return True when x is an iterator of DataFrames (not a single DF).

Parameters:

x – Candidate object to test.

Returns:

True if x appears to be an iterator yielding DataFrames.

Return type:

bool
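A minimal version of this check might look like the following. This is a sketch only; the real implementation may apply additional heuristics to decide what "appears to be" an iterator of DataFrames:

```python
import pandas as pd
from collections.abc import Iterator

def is_df_iterator(x) -> bool:
    # A DataFrame is itself iterable, so rule it out explicitly
    # before treating x as an iterator of DataFrames.
    return isinstance(x, Iterator) and not isinstance(x, pd.DataFrame)

df = pd.DataFrame({"a": [1]})
```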

dfchain.api.task.task(func: Callable[[DataFrame], DataFrame]) → Callable[[DataFrame], DataFrame][source]
dfchain.api.task.task(func: Callable[[DataFrame], DataFrame]) → Callable[[Iterator[DataFrame]], Iterator[DataFrame]]

Decorator adapter that makes simple DataFrame functions compatible with Pipeline runtime.

The original function may accept a pandas.DataFrame and optional keyword arguments. If the function accepts an _executor parameter, the wrapper will pass the Executor object through so tasks can inspect runtime configuration if necessary.

Parameters:

func – User-provided transformation callable accepting a DataFrame.

Returns:

Wrapped function matching the Pipeline/Executor expected signature.

Return type:

Callable
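The _executor passthrough described above can be sketched with a signature inspection (an illustration under that assumption, not dfchain's actual decorator):

```python
import inspect
import pandas as pd

def task(func):
    """Sketch: inject the executor only when the task declares _executor."""
    wants_executor = "_executor" in inspect.signature(func).parameters

    def wrapper(df, executor=None, **kwargs):
        if wants_executor:
            return func(df, _executor=executor, **kwargs)
        return func(df, **kwargs)
    return wrapper

@task
def tag(df: pd.DataFrame, _executor=None) -> pd.DataFrame:
    # Tasks that declare _executor can inspect runtime configuration.
    return df.assign(has_executor=_executor is not None)

out = tag(pd.DataFrame({"a": [1]}), executor=object())
```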

Module contents