Functions are Python functions that run on the DataLab server against your files. You write them locally, wrap them together with their dependencies, and upload them. Once uploaded, a function can be used as a processing step inside a pipeline. This tutorial shows you how to define a function, test it locally, and upload it.
Setup¶
The function signature¶
DataLab functions follow a specific convention. Input files are positional-only parameters (before the /), typed as Path. Configuration parameters are keyword-only (after the *), with default values. When the function produces a single output file, it returns a Path directly. When it produces multiple outputs, it returns a dict[str, Path].
This convention tells DataLab which arguments are file inputs and which are configuration, so it can wire files through a pipeline automatically.
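To illustrate the other half of the convention, here is a hypothetical function with a keyword-only configuration parameter and multiple outputs. The name, the threshold logic, and the column layout are illustrative only (modeled on the test CSV used later in this tutorial), not part of the tutorial's example:

```python
from pathlib import Path


def split_by_threshold(
    input_file: Path, /, *, threshold_db: float = 0.5
) -> dict[str, Path]:
    """Split a measurement CSV into pass/fail files by insertion loss.

    Illustrative sketch: input_file is positional-only (a file input),
    threshold_db is keyword-only (configuration), and the multiple
    outputs are returned as a dict of named Paths.
    """
    lines = input_file.read_text().splitlines()
    header, rows = lines[0], lines[1:]
    passed = [r for r in rows if float(r.split(",")[1]) <= threshold_db]
    failed = [r for r in rows if float(r.split(",")[1]) > threshold_db]
    out_pass = input_file.with_name("passed.csv")
    out_fail = input_file.with_name("failed.csv")
    out_pass.write_text("\n".join([header, *passed]) + "\n")
    out_fail.write_text("\n".join([header, *failed]) + "\n")
    return {"passed": out_pass, "failed": out_fail}
```

Because `threshold_db` is keyword-only, a pipeline can expose it as a configuration field while wiring `input_file` automatically.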
from pathlib import Path

import pandas as pd


def csv_to_parquet_example(input_file: Path, /) -> Path:
    """Convert a CSV measurement file to Parquet format."""
    df = pd.read_csv(input_file)
    output = input_file.with_suffix(".parquet")
    df.to_parquet(output, index=False)
    return output
Wrapping with dependencies¶
Before uploading, you wrap the function in a gfhub.Function and declare its third-party dependencies. This is similar to a requirements.txt scoped to just this function. The keys are package specs (with optional version constraints), and the values are the import statements that make those packages available inside the function body.
If you forget to declare a dependency that your function uses, gfhub.Function will raise an error immediately, before anything is uploaded.
func = gfhub.Function(
    csv_to_parquet_example,
    dependencies={"pandas[pyarrow]": "import pandas as pd"},
)
print(func)
# /// script
# dependencies = [
#     "pandas[pyarrow]",
# ]
# ///
from __future__ import annotations
from pathlib import Path
import pandas as pd

def main(input_file: Path, /) -> Path:
    """Convert a CSV measurement file to Parquet format."""
    df = pd.read_csv(input_file)
    output = input_file.with_suffix(".parquet")
    df.to_parquet(output, index=False)
    return output
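The header in the rendered script is standard inline script metadata (PEP 723), which is what lets uv resolve the declared dependencies before running the script. As a rough sketch of what that header encodes, here is a minimal parser for the dependency block (illustrative only, not part of gfhub):

```python
import re

SCRIPT = """\
# /// script
# dependencies = [
#     "pandas[pyarrow]",
# ]
# ///
"""


def read_script_deps(text: str) -> list[str]:
    """Extract quoted dependency specs from a '# /// script' metadata block."""
    match = re.search(r"# /// script\n(.*?)# ///", text, re.DOTALL)
    if not match:
        return []
    return re.findall(r'"([^"]+)"', match.group(1))
```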
Test locally before uploading¶
Before uploading, you can run the function on a local file with the .eval() method. Under the hood this uses uv run to install the declared dependencies in an isolated environment and execute the function exactly as DataLab would. It is a good habit to verify that the function works before uploading it.
# Create a small test file
test_csv = Path("test_measurements.csv")
test_csv.write_text("wavelength_nm,insertion_loss_db\n1550,0.4\n1560,0.5\n1570,0.45\n")
result = func.eval(test_csv)
print(result)
# {"success": True, "output": "/path/to/test_measurements.parquet"}
# Clean up test files
test_csv.unlink(missing_ok=True)
Path(result["output"]).unlink(missing_ok=True)
{'success': True, 'output': PosixPath('/home/runner/work/DataLab/DataLab/crates/sdk/examples/basics/test_measurements.parquet')}
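Conceptually, the result dict mirrors what a direct call would produce. A simplified stand-in for local experimentation might look like the following hypothetical helper (not part of gfhub; it skips the isolated uv environment and simply calls the function in-process):

```python
def local_eval(fn, *args, **kwargs) -> dict:
    """Call fn directly and wrap the result in a success/output report.

    Hypothetical sketch of the result shape shown above; the real
    .eval() runs the rendered script in an isolated environment.
    """
    try:
        return {"success": True, "output": fn(*args, **kwargs)}
    except Exception as exc:
        return {"success": False, "error": str(exc)}
```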
Upload the function¶
Once you are happy with the function, upload it. If a function with the same name already exists, it is updated by default. The returned object contains the function id and metadata.
{
  "id": "019d63a9-aa5a-7711-b3fd-22e525fc032f",
  "name": "csv_to_parquet_example",
  "parameters": {},
  "inputs": {
    "input_file": {
      "type": "Path"
    }
  },
  "outputs": {
    "0": {
      "type": "Path"
    }
  },
  "created_at": "2026-04-06T16:39:16.826931Z",
  "updated_at": "2026-05-04T15:57:34.454652Z"
}
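The metadata mirrors the function signature: inputs holds the positional-only file parameters, parameters holds keyword-only configuration (empty here), and outputs indexes return values positionally. Assuming the response is available as a plain dict, reading these fields is straightforward:

```python
# The metadata shown above, as a dict (assumed shape, copied from the output).
metadata = {
    "id": "019d63a9-aa5a-7711-b3fd-22e525fc032f",
    "name": "csv_to_parquet_example",
    "parameters": {},
    "inputs": {"input_file": {"type": "Path"}},
    "outputs": {"0": {"type": "Path"}},
}

# File inputs by name, and output types in positional order.
input_names = list(metadata["inputs"])
output_types = [spec["type"] for spec in metadata["outputs"].values()]
print(metadata["name"], input_names, output_types)
```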
List uploaded functions¶
You can list all functions available in your organization. This is useful to confirm an upload succeeded or to find function names when referencing them inside a pipeline.
functions = client.list_functions()
print(f"Available functions ({len(functions)}):")
for f in functions:
    print(f" - {f['name']} (id={f['id']})")
Available functions (22):
- Test (id=019cbe1f-7397-7913-a077-c0d6fd78f520)
- hello_world (id=019cbe50-9a35-7673-83b3-7418155f045a)
- csv2parquet (id=019d234a-1a2b-73c3-848c-a7dd997e4353)
- parquet2json (id=019d234a-7aa6-77f1-81e0-9c9c2aa49d89)
- debug_noop_5880f52e (id=019da9e2-1687-7f73-872e-66935e752a7f)
- plot_parquet (id=019d3e8c-d142-7302-b228-45cdc40cea6f)
- csv_to_parquet_example (id=019d63a9-aa5a-7711-b3fd-22e525fc032f)
- linear_fit (id=019daa71-96cf-7860-90a8-43f7ae158578)
- propagation_loss_from_cutback_spirals (id=019daa7e-51b1-7e53-b219-94d749a6406e)
- plot_ring_spectrum (id=019daa71-5d53-7fd0-9340-f3924672ba10)
- fsr_wafer_map (id=019daa7e-929e-7ef3-9de6-0be40dd97b35)
- aggregate_die_analyses (id=019daa76-4564-7592-b550-f9497b3c90ff)
- spirals_wafer_map (id=019daa7f-32d8-7c40-bf94-633ed7ef3167)
- plot_spiral_spectrum (id=019daa71-6236-7b12-a6cc-f9282d413ef5)
- ring_fsr (id=019daa71-c0c8-7f40-9cc8-e318ce916b28)
- fsr_for_radius_within_die (id=019daa78-c86c-70e1-b47c-09abe3d477e4)
- fsr_wafer_aggregation (id=019dd47e-bcf4-7dd1-b96f-f7d908a6c946)
- spiral_power_at_wavelength (id=019daa72-6b62-74a2-bf0a-c6d16c5e0620)
- find_common_tags (id=019d3e96-3049-7c30-b6aa-cb67cf5da712)
- cutback_die_analysis (id=019d3e96-0b6b-7fa3-8f5e-957818f6cfaa)
- die_sheet_resistance (id=019daa76-9580-7591-aeb4-43383905820a)
- ring_fsr_batch (id=019dd47a-0f23-7a42-8ec4-132596ec8af8)