Before you can query, analyze, or process anything in DataLab, you need to get your files in. add_file handles uploads from a local file path, a pandas DataFrame, or any in-memory binary stream. You can attach tags at upload time, so the file is immediately queryable with no separate tagging step. The returned object includes the file id, which later tutorials use to download the file, delete it, or feed it into a pipeline.
Setup¶
import gfhub
import pandas as pd
# Reads host and API key from ~/.gdsfactory/gdsfactoryplus.toml or environment variables.
# You can also pass them explicitly: gfhub.Client(host="http://...", api_key="...")
client = gfhub.Client()
Upload from a file path¶
If the file already exists on disk, whether it is a GDS layout, a raw measurement CSV, or a simulation output, just pass the path. DataLab preserves the original filename so there is nothing else to configure.
# Create a small example CSV for the tutorial. In real use you would already
# have a file on disk (a measurement dump, a GDS layout, etc.).
pd.DataFrame({
    "wavelength_nm": [1510, 1520, 1530, 1540, 1550],
    "insertion_loss_db": [0.42, 0.38, 0.35, 0.37, 0.40],
}).to_csv("measurement_example.csv", index=False)
result = client.add_file("measurement_example.csv")
print(result)
{'id': '019df3b5-7610-7a22-998a-58a8fb124f99', 'name': 'measurement_example.csv', 'original_name': 'measurement_example.csv', 'mime_type': 'text/csv', 'status': 'Available'}
Upload with tags¶
Tags are how you organize files so you can retrieve them later with query_files.
You will typically want to tag a file with what it represents: whether it is raw or processed data, which wafer it came from, which die, and which process run.
There are two kinds of tags.
- Simple tags are plain labels like "raw" or "reviewed".
- Parameter tags carry a value, written as "key:value", for example "wafer_id:wafer1" or "die_id:A3".
Extension tags like .csv or .gds are added automatically based on the file type, so you do not need to include those.
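The two kinds are distinguished purely by the presence of a colon. A minimal sketch of how a tag string splits into key and value; the helper below is illustrative only, not part of the gfhub API:

```python
def parse_tag(tag: str):
    """Split a tag into (key, value); simple tags have value None.

    Illustrative helper only -- not part of the gfhub API.
    """
    key, sep, value = tag.partition(":")
    return (key, value) if sep else (key, None)

for tag in ["raw", "wafer_id:wafer1", "die_id:A3"]:
    print(parse_tag(tag))
```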
result = client.add_file(
    "measurement_example.csv",
    tags=["raw", "wafer_id:wafer1", "die_id:A3"],
)
print(f"Uploaded: {result['original_name']}")
print(f"File ID: {result['id']}")
Uploaded: measurement_example.csv
File ID: 019df3b5-78c6-7052-94b5-b653864fa539
Upload a DataFrame¶
You will often have tabular data in memory, like measurement sweeps, extracted device parameters, or simulation results, that you want to store in DataLab without writing a file to disk first. Pass the DataFrame directly and DataLab converts it to Parquet automatically. You just need to provide a filename so the file has a meaningful name on the server.
df = pd.DataFrame({
    "wavelength_nm": [1510, 1520, 1530, 1540, 1550],
    "insertion_loss_db": [0.42, 0.38, 0.35, 0.37, 0.40],
})
result = client.add_file(
    df,
    filename="sweep_results",  # .parquet extension is added automatically
    tags=["processed", "wafer_id:wafer1"],
)
print(f"Uploaded: {result['original_name']}")
print(f"File ID: {result['id']}")
Uploaded: sweep_results.parquet
File ID: 019df3b5-7a04-7562-a985-fc72105b3e41
Upload from a file-like object¶
If you are generating file content programmatically and do not want to write it to disk first, you can pass any binary stream directly. This is useful when you are building automated workflows where files are produced in memory and should be stored immediately without touching the filesystem. Imagine we generate some simulation results for the losses of two ports:
import io
# Example: upload CSV content from memory
csv_content = b"port,loss_db\nout1,0.5\nout2,0.7\n"
buffer = io.BytesIO(csv_content)
result = client.add_file(
    buffer,
    filename="port_losses.csv",
    tags=["test"],
)
print(f"Uploaded: {result['original_name']}")
print(f"File ID: {result['id']}")
Uploaded: port_losses.csv
File ID: 019df3b5-7b46-77d2-bcfd-1c51600bcb44
The upload result¶
Every add_file call returns a lightweight object confirming the upload. It includes the id, name, original_name, mime_type, and status, but not the full file metadata such as tags or pipelines. To get the complete file object, query it by id with query_files after uploading.
{
"id": "019df3b5-7b46-77d2-bcfd-1c51600bcb44",
"name": "port_losses.csv",
"original_name": "port_losses.csv",
"mime_type": "application/octet-stream",
"status": "Available"
}
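In practice you usually just pull the id out of the returned mapping and keep it for later calls. A sketch, using a copy of the result shown above as a plain dict:

```python
# A copy of the upload result shown above, as a plain dict.
result = {
    "id": "019df3b5-7b46-77d2-bcfd-1c51600bcb44",
    "name": "port_losses.csv",
    "original_name": "port_losses.csv",
    "mime_type": "application/octet-stream",
    "status": "Available",
}

# Keep the id around: download, delete, and pipeline calls all take it.
file_id = result["id"]
assert result["status"] == "Available"
print(f"Stored {result['original_name']} as {file_id}")
```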
Delete a file¶
If you uploaded a file by mistake or want to clean up test data, you can delete it using its id. Deletion is permanent and the file cannot be recovered, so make sure you have the right id first.
Deleted file 019df3b5-7b46-77d2-bcfd-1c51600bcb44
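The call that produces the confirmation above is not shown. A minimal sketch, assuming the client exposes a delete_file method (the method name is an inference from the add_file and query_files naming pattern, so check your client's API before relying on it):

```python
def delete_by_id(client, file_id: str) -> str:
    """Permanently delete a DataLab file by id. Irreversible.

    Assumption: `client.delete_file` is an inferred method name,
    not confirmed gfhub API.
    """
    client.delete_file(file_id)
    return f"Deleted file {file_id}"
```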