Datasets API

REST API endpoints for managing datasets in Providence

The Datasets API provides endpoints for managing datasets within projects. Datasets are the core data sources that users can query using natural language.

Overview

Datasets in Providence can be:

  • File-based: CSV, JSON, Parquet files uploaded to the system
  • Database-connected: Live connections to external databases
  • Schema-indexed: Automatically indexed for semantic search

Authentication

All dataset endpoints require authentication via JWT token:

Authorization: Bearer <your-jwt-token>
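In practice the header is attached to every request. A minimal Python sketch (the token value is a placeholder; obtain a real JWT from your authentication flow):

```python
def auth_headers(jwt_token: str) -> dict:
    """Return the Authorization header required by all dataset endpoints."""
    return {"Authorization": f"Bearer {jwt_token}"}

# Placeholder token for illustration only:
headers = auth_headers("your-jwt-token")
```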

Endpoints

List Datasets

Get all datasets in a project.

GET /api/v1/projects/{project_id}/datasets

Parameters

Name        Type     Location  Required  Description
----------  -------  --------  --------  -------------------------------
project_id  string   path      Yes       Project ID
page        integer  query     No        Page number (default: 1)
limit       integer  query     No        Items per page (default: 20)
search      string   query     No        Search in dataset names
type        string   query     No        Filter by type (file, database)
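A sketch of assembling the request URL from these parameters (the base URL and project ID are illustrative, and `list_datasets_url` is a hypothetical helper, not part of the API):

```python
from urllib.parse import urlencode

def list_datasets_url(base_url: str, project_id: str, **params) -> str:
    """Build the List Datasets URL, dropping any unset query parameters."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base_url}/api/v1/projects/{project_id}/datasets"
    return f"{url}?{query}" if query else url

url = list_datasets_url("https://api.example.com", "proj_123",
                        page=1, limit=20, type="file", search=None)
```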

Response

{
  "datasets": [
    {
      "id": "ds_1234567890",
      "name": "sales_data_2024",
      "description": "Annual sales data for 2024",
      "type": "file",
      "source": {
        "type": "csv",
        "filename": "sales_2024.csv",
        "size": 1048576,
        "row_count": 10000,
        "column_count": 15
      },
      "schema": {
        "tables": [
          {
            "name": "sales_data_2024",
            "columns": [
              {
                "name": "order_id",
                "type": "VARCHAR",
                "nullable": false
              },
              {
                "name": "customer_id",
                "type": "VARCHAR",
                "nullable": false
              },
              {
                "name": "amount",
                "type": "DECIMAL(10,2)",
                "nullable": false
              }
            ]
          }
        ]
      },
      "created_at": "2024-01-15T10:30:00Z",
      "updated_at": "2024-01-15T10:30:00Z",
      "indexed_at": "2024-01-15T10:31:00Z",
      "status": "active"
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 20,
    "total": 45,
    "pages": 3
  }
}
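To walk all results, derive the page count from the pagination object and request each page in turn. A small sketch consistent with the sample above (total 45, limit 20 gives 3 pages):

```python
import math

def page_count(total: int, limit: int) -> int:
    """Number of pages implied by a pagination object."""
    return math.ceil(total / limit) if limit > 0 else 0

pagination = {"page": 1, "limit": 20, "total": 45, "pages": 3}
pages = page_count(pagination["total"], pagination["limit"])
```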

Create Dataset

Create a new dataset in a project.

POST /api/v1/projects/{project_id}/datasets

Request Body

{
  "name": "customer_data",
  "description": "Customer information and demographics",
  "type": "file",
  "source": {
    "type": "csv",
    "file_id": "file_abc123"
  }
}

The file_id value comes from the Upload File for Dataset endpoint.

Response

{
  "id": "ds_0987654321",
  "name": "customer_data",
  "description": "Customer information and demographics",
  "type": "file",
  "status": "processing",
  "created_at": "2024-01-20T14:00:00Z"
}
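Because a new dataset starts in the processing state, clients typically poll until it becomes active before querying it. A hedged sketch, where `fetch_status` stands in for a real GET on the dataset endpoint (here simulated with a stub):

```python
import time

def wait_until_ready(fetch_status, timeout_s: float = 60.0,
                     interval_s: float = 1.0) -> str:
    """Poll until the dataset leaves 'processing'; return the final status."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status != "processing":
            return status
        time.sleep(interval_s)
    raise TimeoutError("dataset still processing after timeout")

# Simulated status sequence standing in for repeated GET requests:
statuses = iter(["processing", "processing", "active"])
final = wait_until_ready(lambda: next(statuses), interval_s=0.01)
```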

Get Dataset

Get details of a specific dataset.

GET /api/v1/projects/{project_id}/datasets/{dataset_id}

Response

{
  "id": "ds_1234567890",
  "name": "sales_data_2024",
  "description": "Annual sales data for 2024",
  "type": "file",
  "source": {
    "type": "csv",
    "filename": "sales_2024.csv",
    "size": 1048576,
    "row_count": 10000,
    "column_count": 15,
    "s3_key": "projects/proj_123/datasets/ds_1234567890/data.csv"
  },
  "schema": {
    "tables": [
      {
        "name": "sales_data_2024",
        "description": "Main sales transactions table",
        "columns": [
          {
            "name": "order_id",
            "type": "VARCHAR",
            "nullable": false,
            "description": "Unique order identifier",
            "statistics": {
              "distinct_count": 10000,
              "null_count": 0
            }
          },
          {
            "name": "amount",
            "type": "DECIMAL(10,2)",
            "nullable": false,
            "description": "Order amount in USD",
            "statistics": {
              "min": 10.50,
              "max": 9999.99,
              "avg": 156.78,
              "null_count": 0
            }
          }
        ]
      }
    ]
  },
  "metadata": {
    "upload_time": "2024-01-15T10:30:00Z",
    "processing_time_ms": 3500,
    "index_time_ms": 1200,
    "vector_count": 45
  },
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:00Z",
  "indexed_at": "2024-01-15T10:31:00Z",
  "status": "active"
}

Update Dataset

Update dataset metadata.

PATCH /api/v1/projects/{project_id}/datasets/{dataset_id}

Request Body

{
  "name": "sales_data_2024_updated",
  "description": "Updated annual sales data for 2024 with corrections"
}

Delete Dataset

Delete a dataset and all associated data.

DELETE /api/v1/projects/{project_id}/datasets/{dataset_id}

Response

{
  "message": "Dataset deleted successfully",
  "deleted_at": "2024-01-20T15:30:00Z"
}

Upload File for Dataset

Upload a file to create a dataset.

POST /api/v1/projects/{project_id}/datasets/upload

Request

Multipart form data:

  • file: The file to upload (CSV, JSON, Parquet)
  • name: Dataset name
  • description: Dataset description (optional)

Response

{
  "file_id": "file_abc123",
  "filename": "customers.csv",
  "size": 524288,
  "type": "csv",
  "upload_url": "https://storage.providence.io/temp/file_abc123",
  "expires_at": "2024-01-20T16:00:00Z"
}
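The upload response feeds directly into dataset creation: the returned file_id becomes the source reference in the Create Dataset request body. A sketch of assembling that body (`create_body` is a hypothetical helper; the sample values mirror the responses above):

```python
upload_response = {"file_id": "file_abc123", "filename": "customers.csv",
                   "type": "csv"}

def create_body(name: str, upload: dict, description: str = "") -> dict:
    """Build the Create Dataset request body from an upload response."""
    return {
        "name": name,
        "description": description,
        "type": "file",
        "source": {"type": upload["type"], "file_id": upload["file_id"]},
    }

body = create_body("customer_data", upload_response,
                   description="Customer information and demographics")
```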

Get Dataset Schema

Get detailed schema information for a dataset.

GET /api/v1/projects/{project_id}/datasets/{dataset_id}/schema

Response

{
  "dataset_id": "ds_1234567890",
  "tables": [
    {
      "name": "sales_data_2024",
      "type": "table",
      "row_count": 10000,
      "columns": [
        {
          "name": "order_id",
          "type": "VARCHAR",
          "nullable": false,
          "is_primary_key": true,
          "description": "Unique order identifier",
          "sample_values": ["ORD-2024-0001", "ORD-2024-0002", "ORD-2024-0003"],
          "statistics": {
            "distinct_count": 10000,
            "null_count": 0,
            "completeness": 1.0
          }
        }
      ],
      "relationships": [
        {
          "type": "foreign_key",
          "from_column": "customer_id",
          "to_table": "customers",
          "to_column": "id"
        }
      ]
    }
  ],
  "metadata": {
    "last_analyzed": "2024-01-15T10:31:00Z",
    "analysis_version": "1.0"
  }
}

Preview Dataset

Get a preview of dataset contents.

GET /api/v1/projects/{project_id}/datasets/{dataset_id}/preview

Parameters

Name    Type     Location  Required  Description
------  -------  --------  --------  ------------------------------------
limit   integer  query     No        Number of rows (default: 100)
offset  integer  query     No        Row offset (default: 0)
table   string   query     No        Table name for multi-table datasets

Response

{
  "table": "sales_data_2024",
  "columns": ["order_id", "customer_id", "amount", "order_date"],
  "rows": [
    ["ORD-2024-0001", "CUST-001", 156.99, "2024-01-01"],
    ["ORD-2024-0002", "CUST-002", 89.50, "2024-01-01"],
    ["ORD-2024-0003", "CUST-001", 245.00, "2024-01-02"]
  ],
  "row_count": 3,
  "total_rows": 10000
}
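The preview returns column names and row arrays separately; zipping them together yields per-row records, which is often more convenient to work with. A small sketch using the sample data above:

```python
def preview_records(preview: dict) -> list:
    """Pair the preview's column names with each row to get dict records."""
    return [dict(zip(preview["columns"], row)) for row in preview["rows"]]

preview = {
    "columns": ["order_id", "customer_id", "amount", "order_date"],
    "rows": [
        ["ORD-2024-0001", "CUST-001", 156.99, "2024-01-01"],
        ["ORD-2024-0002", "CUST-002", 89.50, "2024-01-01"],
    ],
}
records = preview_records(preview)
```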

Refresh Dataset

Refresh a dataset (re-index schema, update statistics).

POST /api/v1/projects/{project_id}/datasets/{dataset_id}/refresh

Response

{
  "job_id": "job_xyz789",
  "status": "started",
  "started_at": "2024-01-20T16:00:00Z",
  "estimated_duration_seconds": 30
}

Database Datasets

Create Database Dataset

Create a dataset from a database connection.

POST /api/v1/projects/{project_id}/datasets/database

Request Body

{
  "name": "production_analytics",
  "description": "Production analytics database",
  "connection": {
    "type": "postgresql",
    "host": "analytics.db.company.com",
    "port": 5432,
    "database": "analytics",
    "username": "readonly_user",
    "password": "secure_password",
    "ssl_mode": "require"
  },
  "options": {
    "schemas": ["public", "analytics"],
    "exclude_tables": ["temp_*", "staging_*"],
    "sample_rows": 1000
  }
}
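Credentials should not be hard-coded in client code. A sketch of building the connection object with the password read from the environment (the variable name ANALYTICS_DB_PASSWORD is illustrative; the setdefault line is for demonstration only):

```python
import os

# Demo only: a real deployment would set this outside the program.
os.environ.setdefault("ANALYTICS_DB_PASSWORD", "secure_password")

connection = {
    "type": "postgresql",
    "host": "analytics.db.company.com",
    "port": 5432,
    "database": "analytics",
    "username": "readonly_user",
    "password": os.environ["ANALYTICS_DB_PASSWORD"],
    "ssl_mode": "require",
}
```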

Test Database Connection

Test a database connection before creating a dataset.

POST /api/v1/projects/{project_id}/datasets/database/test

Request Body

Same as the Create Database Dataset request body.

Response

{
  "success": true,
  "message": "Connection successful",
  "details": {
    "version": "PostgreSQL 14.5",
    "schemas_found": ["public", "analytics", "staging"],
    "table_count": 45,
    "accessible_tables": 42
  }
}

Error Responses

400 Bad Request

{
  "error": {
    "code": "INVALID_FILE_FORMAT",
    "message": "Unsupported file format. Supported formats: CSV, JSON, Parquet",
    "details": {
      "provided_format": "xlsx",
      "supported_formats": ["csv", "json", "parquet"]
    }
  }
}

404 Not Found

{
  "error": {
    "code": "DATASET_NOT_FOUND",
    "message": "Dataset not found",
    "details": {
      "dataset_id": "ds_nonexistent"
    }
  }
}

413 Payload Too Large

{
  "error": {
    "code": "FILE_TOO_LARGE",
    "message": "File size exceeds maximum allowed size",
    "details": {
      "provided_size": 5368709120,
      "max_size": 1073741824,
      "max_size_human": "1GB"
    }
  }
}
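All error responses share the code/message/details envelope, so a single helper can flatten any of them into a log-friendly line. A sketch (`describe_error` is a hypothetical client-side helper):

```python
def describe_error(payload: dict) -> str:
    """Flatten an error envelope into a single log line."""
    err = payload["error"]
    details = ", ".join(f"{k}={v}" for k, v in err.get("details", {}).items())
    return f"{err['code']}: {err['message']}" + (f" ({details})" if details else "")

msg = describe_error({
    "error": {
        "code": "DATASET_NOT_FOUND",
        "message": "Dataset not found",
        "details": {"dataset_id": "ds_nonexistent"},
    }
})
```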

Webhooks

Dataset events can trigger webhooks:

  • dataset.created
  • dataset.processing
  • dataset.ready
  • dataset.failed
  • dataset.deleted

See the Webhooks documentation for details.

Best Practices

  1. File Uploads: Use chunked uploads for large files
  2. Schema Indexing: Allow time for indexing to complete before querying
  3. Database Connections: Use read-only credentials when possible
  4. Refresh Strategy: Schedule regular refreshes for database datasets
  5. Error Handling: Implement exponential backoff for retries
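The exponential backoff recommended above can be sketched as a delay schedule that doubles on each retry, with a cap to keep waits bounded (the base and cap values are illustrative, not prescribed by the API):

```python
def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list:
    """Exponential backoff schedule: base * 2**attempt, capped at `cap`."""
    return [min(base * (2 ** attempt), cap) for attempt in range(retries)]

delays = backoff_delays(6)  # sleep this long before each retry
```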

Rate Limits

  • File uploads: 10 per hour per project
  • Dataset creation: 50 per hour per project
  • Schema refresh: 100 per hour per project

See Rate Limiting for more details.