Data Ingestion Guide

The KBase Data Lakehouse supports multiple mechanisms for ingesting biological and environmental data, ranging from terabyte-scale batch transfers to real-time event streams.

KBase Data Transfer Server (DTS)

The primary method for high-volume data ingestion is the KBase Data Transfer Server (DTS). DTS is optimized for reliability and performance over high-latency networks.

Features

Parallel Transfer: Uses multiple streams to maximize bandwidth usage (similar to Globus).
Checkpointing: Automatically resumes interrupted transfers.
Integrity Verification: Automatically calculates checksums (MD5/SHA-256) upon arrival.
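
To make the integrity check concrete, the following is a minimal sketch of computing a file checksum locally so that it can be compared with a value reported after a transfer. It uses only Python's standard `hashlib` module. The function names `file_checksum` and `verify_transfer` are hypothetical and are not part of DTS; how DTS exposes its computed checksums is not described here, so `remote_checksum` is simply assumed to be a hex string obtained by some other means.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    Reading in chunks keeps memory use flat for very large files.
    `algorithm` may be any name accepted by hashlib.new, e.g. "md5" or "sha256".
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(local_path, remote_checksum, algorithm="sha256"):
    """Compare a locally computed digest with a hex digest reported elsewhere.

    `remote_checksum` is an assumed input: a hex string from the receiving side.
    """
    return file_checksum(local_path, algorithm) == remote_checksum.strip().lower()
```

A mismatch would suggest the file changed or was corrupted in transit; SHA-256 is generally the safer choice of the two algorithms when both are available.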

Quick Start

To initiate a transfer using the CLI:

# Authenticate
kbase-dts auth login

# Upload a directory
kbase-dts cp -r /path/to/local/data dts://my-bucket/project-123/

Supported Data Formats

The Lakehouse is designed to handle common biological data formats. Automatic "Bronze-to-Silver" parsing pipelines are available for:

Genomics: FASTA, FASTQ, BAM, CRAM, VCF, GFF3
Metagenomics: BIOM, Sequence Read Archive (SRA)
Mass Spectrometry: mzML, mzXML
Environmental: CSV, NetCDF, GeoTIFF
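
As an illustration of how files might be sorted into the categories above before upload, here is a small sketch that classifies a file by its extension. The extension sets (for example `.fa`, `.fq`, `.nc`, `.tif`) are common conventions chosen for this example, and `FORMAT_CATEGORIES` and `classify` are hypothetical names. This is not the Lakehouse's actual routing logic, which may well inspect file contents rather than names.

```python
from pathlib import Path

# Assumed mapping from category to typical file extensions (illustrative only).
FORMAT_CATEGORIES = {
    "Genomics": {".fasta", ".fa", ".fastq", ".fq", ".bam", ".cram", ".vcf", ".gff3"},
    "Metagenomics": {".biom", ".sra"},
    "Mass Spectrometry": {".mzml", ".mzxml"},
    "Environmental": {".csv", ".nc", ".tif", ".tiff"},
}

# Compression suffixes that are ignored when looking for the format extension.
COMPRESSION_SUFFIXES = {".gz", ".bz2"}


def classify(filename):
    """Return the category for a filename, or None if the extension is unknown."""
    suffixes = [suffix.lower() for suffix in Path(filename).suffixes]
    # Drop trailing compression suffixes so "reads.fastq.gz" is seen as ".fastq".
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        return None
    extension = suffixes[-1]
    for category, extensions in FORMAT_CATEGORIES.items():
        if extension in extensions:
            return category
    return None
```

A pre-upload script could use a check like this to flag files that fall outside the listed formats and would therefore not be picked up by an automatic parsing pipeline.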

Streaming Ingestion

For real-time data collection (e.g., from instrument sensors), publish JSON messages to the Kafka ingress endpoint. An example message:

{
  "topic": "instrument-readings",
  "payload": {
    "sensor_id": "temp-34",
    "value": 23.5,
    "timestamp": "2024-10-24T10:00:00Z"
  }
}
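
The message above can be assembled programmatically. The sketch below builds the same structure and serializes it to UTF-8 JSON bytes, which is the form a Kafka producer client would typically send. The helper names `build_reading` and `encode` are hypothetical, and the actual producer call, broker address, and authentication are deliberately left out because they are not specified here.

```python
import json
from datetime import datetime, timezone


def build_reading(sensor_id, value, when=None, topic="instrument-readings"):
    """Build a message shaped like the example above.

    `when` should be a timezone-aware datetime; it defaults to the current UTC time.
    """
    when = when or datetime.now(timezone.utc)
    timestamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "topic": topic,
        "payload": {
            "sensor_id": sensor_id,
            "value": float(value),
            "timestamp": timestamp,
        },
    }


def encode(message):
    """Serialize a message to compact UTF-8 JSON bytes."""
    return json.dumps(message, separators=(",", ":")).encode("utf-8")
```

Normalizing timestamps to UTC with a trailing `Z` keeps readings from instruments in different time zones directly comparable.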

Streamed data is automatically compacted into Delta tables hourly.
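
To give a feel for what hourly windows mean for the example payloads, the sketch below groups readings by the hour of their timestamp. It illustrates hourly bucketing only and makes no claim about how the Lakehouse performs compaction or writes Delta tables; `hour_bucket` and `group_by_hour` are hypothetical helpers, and timestamps are assumed to use the `YYYY-MM-DDTHH:MM:SSZ` form shown in the example message.

```python
from collections import defaultdict
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def hour_bucket(timestamp):
    """Truncate a UTC timestamp string to the start of its hour."""
    parsed = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    return parsed.replace(minute=0, second=0).strftime(TIMESTAMP_FORMAT)


def group_by_hour(payloads):
    """Group payload dicts by the hour in which their timestamp falls."""
    buckets = defaultdict(list)
    for payload in payloads:
        buckets[hour_bucket(payload["timestamp"])].append(payload)
    return dict(buckets)
```

Under this assumption, readings stamped 10:00:00 through 10:59:59 share one bucket, so a reading may not appear in a compacted table until its hour has closed.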