# Common Patterns

Practical patterns for R2 Data Catalog with PyIceberg.

## PyIceberg Connection

```python
import os
from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.exceptions import NamespaceAlreadyExistsError

catalog = RestCatalog(
    name="r2_catalog",
    warehouse=os.getenv("R2_WAREHOUSE"),      # bucket name
    uri=os.getenv("R2_CATALOG_URI"),          # catalog endpoint
    token=os.getenv("R2_TOKEN"),              # API token
)

# Create namespace (idempotent)
try:
    catalog.create_namespace("default")
except NamespaceAlreadyExistsError:
    pass
```
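Before constructing the catalog, it can help to fail fast on missing configuration. This is a minimal sketch (not part of PyIceberg) that validates the three environment variables the snippet above assumes:

```python
import os

# The three variables assumed by the connection snippet above; fail fast with
# a clear message instead of an opaque connection error later.
REQUIRED_VARS = ("R2_WAREHOUSE", "R2_CATALOG_URI", "R2_TOKEN")

def check_r2_env() -> dict:
    missing = [v for v in REQUIRED_VARS if not os.getenv(v)]
    if missing:
        raise EnvironmentError(f"Missing R2 settings: {', '.join(missing)}")
    return {v: os.environ[v] for v in REQUIRED_VARS}
```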

## Pattern 1: Log Analytics Pipeline

Ingest logs incrementally, query by time/level.

```python
import pyarrow as pa
from datetime import datetime
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, TimestampType, StringType, IntegerType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform

# Create partitioned table (once)
schema = Schema(
    NestedField(1, "timestamp", TimestampType(), required=True),
    NestedField(2, "level", StringType(), required=True),
    NestedField(3, "service", StringType(), required=True),
    NestedField(4, "message", StringType(), required=False),
)

partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day")
)

try:
    catalog.create_namespace("logs")
except NamespaceAlreadyExistsError:
    pass
table = catalog.create_table(("logs", "app_logs"), schema=schema, partition_spec=partition_spec)

# Append logs (incremental)
data = pa.table({
    "timestamp": [datetime(2026, 1, 27, 10, 30, 0)],
    "level": ["ERROR"],
    "service": ["auth-service"],
    "message": ["Failed login"],
})
table.append(data)

# Query by time + level (row filters reference schema columns; the timestamp
# range prunes day partitions automatically)
scan = table.scan(
    row_filter="level = 'ERROR' AND timestamp >= '2026-01-27T00:00:00' AND timestamp < '2026-01-28T00:00:00'"
)
errors = scan.to_pandas()
```

## Pattern 2: Time-Travel Queries

```python
from datetime import datetime, timedelta

table = catalog.load_table(("logs", "app_logs"))

# Query specific snapshot
snapshot_id = table.current_snapshot().snapshot_id
data = table.scan(snapshot_id=snapshot_id).to_pandas()

# Query as of timestamp (yesterday): scan the latest snapshot at or before the cutoff
yesterday_ms = int((datetime.now() - timedelta(days=1)).timestamp() * 1000)
past = [s for s in table.snapshots() if s.timestamp_ms <= yesterday_ms]
if past:
    snap = max(past, key=lambda s: s.timestamp_ms)
    data = table.scan(snapshot_id=snap.snapshot_id).to_pandas()
```

## Pattern 3: Schema Evolution

```python
from pyiceberg.types import StringType

table = catalog.load_table(("users", "profiles"))

with table.update_schema() as update:
    update.add_column("email", StringType(), required=False)
    update.rename_column("name", "full_name")
# Old readers ignore new columns, new readers see nulls for old data
```

## Pattern 4: Partitioned Tables

```python
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform, IdentityTransform

# Partition by day + country (assumes schema field 1 is a timestamp and field 2 is country)
partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day"),
    PartitionField(source_id=2, field_id=1001, transform=IdentityTransform(), name="country"),
)
table = catalog.create_table(("events", "user_events"), schema=schema, partition_spec=partition_spec)

# Queries prune both partitions automatically; filter on the source columns
scan = table.scan(
    row_filter="country = 'US' AND timestamp >= '2026-01-27T00:00:00' AND timestamp < '2026-01-28T00:00:00'"
)
```

## Pattern 5: Table Maintenance

```python
from datetime import datetime, timedelta

table = catalog.load_table(("logs", "app_logs"))

# Compact → expire → cleanup (in order). Availability of these calls varies by
# PyIceberg version; some may need to run via an engine such as Spark instead.
table.rewrite_data_files(target_file_size_bytes=128 * 1024 * 1024)

seven_days_ms = int((datetime.now() - timedelta(days=7)).timestamp() * 1000)
table.expire_snapshots(older_than=seven_days_ms, retain_last=10)

three_days_ms = int((datetime.now() - timedelta(days=3)).timestamp() * 1000)
table.delete_orphan_files(older_than=three_days_ms)
```

See [api.md](api.md#table-maintenance) for detailed parameters.

## Pattern 6: Concurrent Writes with Retry

```python
from pyiceberg.exceptions import CommitFailedException
import time

def append_with_retry(table, data, max_retries=3):
    for attempt in range(max_retries):
        try:
            table.append(data)
            return
        except CommitFailedException:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
            table.refresh()  # reload metadata so the next attempt commits against the latest snapshot
```
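The same backoff idea can be packaged as a small engine-agnostic helper (a sketch, not part of PyIceberg): it retries any callable on a configurable exception type and adds jitter, which also makes it easy to unit-test without a live catalog.

```python
import random
import time

def retry_commit(fn, *, retries=3, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(); on a retryable failure, back off exponentially with jitter."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise
            # jittered exponential backoff avoids retry stampedes
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))

# With PyIceberg this would look like:
# retry_commit(lambda: table.append(data), retry_on=(CommitFailedException,))
```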

## Pattern 7: Upsert Simulation

```python
import pandas as pd
import pyarrow as pa

# Read → merge → overwrite (not atomic, use Spark MERGE INTO for production)
existing = table.scan().to_pandas()
new_data = pd.DataFrame({"id": [1, 3], "value": [100, 300]})
merged = pd.concat([existing, new_data]).drop_duplicates(subset=["id"], keep="last")
table.overwrite(pa.Table.from_pandas(merged))
```
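The `keep="last"` semantics above are worth seeing on concrete data: on an `id` clash the new row wins, and unmatched rows from both sides pass through. A self-contained mini-demo with pandas only:

```python
import pandas as pd

# id 1 exists on both sides; ids 2 and 3 exist on only one side each.
existing = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
new_data = pd.DataFrame({"id": [1, 3], "value": [100, 300]})

merged = (
    pd.concat([existing, new_data])
    .drop_duplicates(subset=["id"], keep="last")  # new rows win on clash
    .sort_values("id")
    .reset_index(drop=True)
)
# id 1 takes the new value (100); ids 2 and 3 pass through unchanged
```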

## Pattern 8: DuckDB Integration

```python
import duckdb

arrow_table = table.scan().to_arrow()
con = duckdb.connect()
con.register("logs", arrow_table)
result = con.execute("SELECT level, COUNT(*) FROM logs GROUP BY level").fetchdf()
```

## Pattern 9: Monitor Table Health

```python
tasks = list(table.scan().plan_files())  # FileScanTask objects; each wraps a DataFile
avg_mb = sum(t.file.file_size_in_bytes for t in tasks) / max(len(tasks), 1) / (1024**2)
print(f"Files: {len(tasks)}, Avg: {avg_mb:.1f}MB, Snapshots: {len(table.snapshots())}")

if avg_mb < 10 or len(tasks) > 1000:
    print("⚠️ Needs compaction")
```
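The thresholds above can be factored into a reusable, catalog-free check (a sketch; the 10MB/1000-file defaults are just this document's guidelines):

```python
def needs_compaction(file_sizes_bytes: list,
                     min_avg_mb: float = 10.0,
                     max_files: int = 1000) -> bool:
    """True when average file size is too small or file count too high."""
    if not file_sizes_bytes:
        return False
    avg_mb = sum(file_sizes_bytes) / len(file_sizes_bytes) / (1024 ** 2)
    return avg_mb < min_avg_mb or len(file_sizes_bytes) > max_files
```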

## Best Practices

| Area | Guideline |
|------|-----------|
| **Partitioning** | Use day/hour for time-series; 100-1000 partitions; avoid high cardinality |
| **File sizes** | Target 128-512MB; compact when avg <10MB or >10k files |
| **Schema** | Add columns as nullable (`required=False`); batch changes |
| **Maintenance** | Compact high-write daily/weekly; expire snapshots 7-30d; cleanup orphans after |
| **Concurrency** | Reads automatic; writes to different partitions safe; retry same partition |
| **Performance** | Filter on partitions; select only needed columns; batch appends 100MB+ |