# Datasets API

REST API endpoints for managing datasets in Providence.

The Datasets API provides endpoints for managing datasets within projects. Datasets are the core data sources that users can query using natural language.

## Overview

Datasets in Providence can be:
- File-based: CSV, JSON, Parquet files uploaded to the system
- Database-connected: Live connections to external databases
- Schema-indexed: Automatically indexed for semantic search
## Authentication

All dataset endpoints require authentication via a JWT token:

```
Authorization: Bearer <your-jwt-token>
```

## Endpoints
### List Datasets

Get all datasets in a project.

```
GET /api/v1/projects/{project_id}/datasets
```

#### Parameters
| Name | Type | Location | Required | Description |
|---|---|---|---|---|
| project_id | string | path | Yes | Project ID |
| page | integer | query | No | Page number (default: 1) |
| limit | integer | query | No | Items per page (default: 20) |
| search | string | query | No | Search in dataset names |
| type | string | query | No | Filter by type (file, database) |
#### Response

```json
{
  "datasets": [
    {
      "id": "ds_1234567890",
      "name": "sales_data_2024",
      "description": "Annual sales data for 2024",
      "type": "file",
      "source": {
        "type": "csv",
        "filename": "sales_2024.csv",
        "size": 1048576,
        "row_count": 10000,
        "column_count": 15
      },
      "schema": {
        "tables": [
          {
            "name": "sales_data_2024",
            "columns": [
              {
                "name": "order_id",
                "type": "VARCHAR",
                "nullable": false
              },
              {
                "name": "customer_id",
                "type": "VARCHAR",
                "nullable": false
              },
              {
                "name": "amount",
                "type": "DECIMAL(10,2)",
                "nullable": false
              }
            ]
          }
        ]
      },
      "created_at": "2024-01-15T10:30:00Z",
      "updated_at": "2024-01-15T10:30:00Z",
      "indexed_at": "2024-01-15T10:31:00Z",
      "status": "active"
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 20,
    "total": 45,
    "pages": 3
  }
}
```

### Create Dataset
Create a new dataset in a project.
```
POST /api/v1/projects/{project_id}/datasets
```

#### Request Body
```json
{
  "name": "customer_data",
  "description": "Customer information and demographics",
  "type": "file",
  "source": {
    "type": "csv",
    "file_id": "file_abc123"
  }
}
```

The `file_id` comes from the file upload endpoint.

#### Response

```json
{
  "id": "ds_0987654321",
  "name": "customer_data",
  "description": "Customer information and demographics",
  "type": "file",
  "status": "processing",
  "created_at": "2024-01-20T14:00:00Z"
}
```

### Get Dataset
Get details of a specific dataset.
```
GET /api/v1/projects/{project_id}/datasets/{dataset_id}
```

#### Response
```json
{
  "id": "ds_1234567890",
  "name": "sales_data_2024",
  "description": "Annual sales data for 2024",
  "type": "file",
  "source": {
    "type": "csv",
    "filename": "sales_2024.csv",
    "size": 1048576,
    "row_count": 10000,
    "column_count": 15,
    "s3_key": "projects/proj_123/datasets/ds_1234567890/data.csv"
  },
  "schema": {
    "tables": [
      {
        "name": "sales_data_2024",
        "description": "Main sales transactions table",
        "columns": [
          {
            "name": "order_id",
            "type": "VARCHAR",
            "nullable": false,
            "description": "Unique order identifier",
            "statistics": {
              "distinct_count": 10000,
              "null_count": 0
            }
          },
          {
            "name": "amount",
            "type": "DECIMAL(10,2)",
            "nullable": false,
            "description": "Order amount in USD",
            "statistics": {
              "min": 10.50,
              "max": 9999.99,
              "avg": 156.78,
              "null_count": 0
            }
          }
        ]
      }
    ]
  },
  "metadata": {
    "upload_time": "2024-01-15T10:30:00Z",
    "processing_time_ms": 3500,
    "index_time_ms": 1200,
    "vector_count": 45
  },
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:00Z",
  "indexed_at": "2024-01-15T10:31:00Z",
  "status": "active"
}
```

### Update Dataset
Update dataset metadata.
```
PATCH /api/v1/projects/{project_id}/datasets/{dataset_id}
```

#### Request Body
```json
{
  "name": "sales_data_2024_updated",
  "description": "Updated annual sales data for 2024 with corrections"
}
```

### Delete Dataset
Delete a dataset and all associated data.
```
DELETE /api/v1/projects/{project_id}/datasets/{dataset_id}
```

#### Response
```json
{
  "message": "Dataset deleted successfully",
  "deleted_at": "2024-01-20T15:30:00Z"
}
```

### Upload File for Dataset
Upload a file to create a dataset.
```
POST /api/v1/projects/{project_id}/datasets/upload
```

#### Request
Multipart form data:

- `file`: The file to upload (CSV, JSON, or Parquet)
- `name`: Dataset name
- `description`: Dataset description (optional)
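Most HTTP clients assemble multipart bodies for you; as a minimal standard-library sketch, the form fields and file part can be encoded by hand like this (the helper name and boundary handling are illustrative, not part of the API):

```python
import io
import uuid

def build_multipart(fields: dict[str, str], file_field: str,
                    filename: str, file_bytes: bytes) -> tuple[bytes, str]:
    """Assemble a multipart/form-data body and its Content-Type header value."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    # Plain text fields (name, description) come first.
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n"
                  f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                  f"{value}\r\n".encode())
    # Then the file part, with a filename and a binary content type.
    buf.write(f"--{boundary}\r\n"
              f'Content-Disposition: form-data; name="{file_field}"; '
              f'filename="{filename}"\r\n'
              f"Content-Type: application/octet-stream\r\n\r\n".encode())
    buf.write(file_bytes)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"

body, content_type = build_multipart(
    {"name": "customer_data", "description": "Customer demographics"},
    "file", "customers.csv", b"id,name\n1,Alice\n")
```

The returned `content_type` string must be sent as the request's `Content-Type` header so the server can locate the boundary.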
#### Response
```json
{
  "file_id": "file_abc123",
  "filename": "customers.csv",
  "size": 524288,
  "type": "csv",
  "upload_url": "https://storage.providence.io/temp/file_abc123",
  "expires_at": "2024-01-20T16:00:00Z"
}
```

### Get Dataset Schema
Get detailed schema information for a dataset.
```
GET /api/v1/projects/{project_id}/datasets/{dataset_id}/schema
```

#### Response
```json
{
  "dataset_id": "ds_1234567890",
  "tables": [
    {
      "name": "sales_data_2024",
      "type": "table",
      "row_count": 10000,
      "columns": [
        {
          "name": "order_id",
          "type": "VARCHAR",
          "nullable": false,
          "is_primary_key": true,
          "description": "Unique order identifier",
          "sample_values": ["ORD-2024-0001", "ORD-2024-0002", "ORD-2024-0003"],
          "statistics": {
            "distinct_count": 10000,
            "null_count": 0,
            "completeness": 1.0
          }
        }
      ],
      "relationships": [
        {
          "type": "foreign_key",
          "from_column": "customer_id",
          "to_table": "customers",
          "to_column": "id"
        }
      ]
    }
  ],
  "metadata": {
    "last_analyzed": "2024-01-15T10:31:00Z",
    "analysis_version": "1.0"
  }
}
```

### Preview Dataset
Get a preview of dataset contents.
```
GET /api/v1/projects/{project_id}/datasets/{dataset_id}/preview
```

#### Parameters
| Name | Type | Location | Required | Description |
|---|---|---|---|---|
| limit | integer | query | No | Number of rows (default: 100) |
| offset | integer | query | No | Row offset (default: 0) |
| table | string | query | No | Table name for multi-table datasets |
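The preview returns rows as positional arrays aligned with a `columns` list. A small sketch of pairing them back into records (the `preview` literal mirrors the example response; `rows_to_records` is an illustrative helper, not part of the API):

```python
def rows_to_records(columns: list[str], rows: list[list]) -> list[dict]:
    """Pair the columns array with each positional row array from a preview response."""
    return [dict(zip(columns, row)) for row in rows]

preview = {
    "columns": ["order_id", "customer_id", "amount", "order_date"],
    "rows": [
        ["ORD-2024-0001", "CUST-001", 156.99, "2024-01-01"],
        ["ORD-2024-0002", "CUST-002", 89.50, "2024-01-01"],
    ],
}
records = rows_to_records(preview["columns"], preview["rows"])
# records[0]["amount"] -> 156.99
```

To walk a large dataset, advance `offset` by `limit` on each call until `offset` reaches `total_rows`.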
#### Response
```json
{
  "table": "sales_data_2024",
  "columns": ["order_id", "customer_id", "amount", "order_date"],
  "rows": [
    ["ORD-2024-0001", "CUST-001", 156.99, "2024-01-01"],
    ["ORD-2024-0002", "CUST-002", 89.50, "2024-01-01"],
    ["ORD-2024-0003", "CUST-001", 245.00, "2024-01-02"]
  ],
  "row_count": 3,
  "total_rows": 10000
}
```

### Refresh Dataset
Refresh a dataset (re-index schema, update statistics).
```
POST /api/v1/projects/{project_id}/datasets/{dataset_id}/refresh
```

#### Response
```json
{
  "job_id": "job_xyz789",
  "status": "started",
  "started_at": "2024-01-20T16:00:00Z",
  "estimated_duration_seconds": 30
}
```

## Database Datasets
### Create Database Dataset
Create a dataset from a database connection.
```
POST /api/v1/projects/{project_id}/datasets/database
```

#### Request Body
```json
{
  "name": "production_analytics",
  "description": "Production analytics database",
  "connection": {
    "type": "postgresql",
    "host": "analytics.db.company.com",
    "port": 5432,
    "database": "analytics",
    "username": "readonly_user",
    "password": "secure_password",
    "ssl_mode": "require"
  },
  "options": {
    "schemas": ["public", "analytics"],
    "exclude_tables": ["temp_*", "staging_*"],
    "sample_rows": 1000
  }
}
```

### Test Database Connection
Test a database connection before creating a dataset.
```
POST /api/v1/projects/{project_id}/datasets/database/test
```

#### Request Body
Same as database dataset creation.
#### Response
```json
{
  "success": true,
  "message": "Connection successful",
  "details": {
    "version": "PostgreSQL 14.5",
    "schemas_found": ["public", "analytics", "staging"],
    "table_count": 45,
    "accessible_tables": 42
  }
}
```

## Error Responses
### 400 Bad Request
```json
{
  "error": {
    "code": "INVALID_FILE_FORMAT",
    "message": "Unsupported file format. Supported formats: CSV, JSON, Parquet",
    "details": {
      "provided_format": "xlsx",
      "supported_formats": ["csv", "json", "parquet"]
    }
  }
}
```

### 404 Not Found
```json
{
  "error": {
    "code": "DATASET_NOT_FOUND",
    "message": "Dataset not found",
    "details": {
      "dataset_id": "ds_nonexistent"
    }
  }
}
```

### 413 Payload Too Large
```json
{
  "error": {
    "code": "FILE_TOO_LARGE",
    "message": "File size exceeds maximum allowed size",
    "details": {
      "provided_size": 5368709120,
      "max_size": 1073741824,
      "max_size_human": "1GB"
    }
  }
}
```

## Webhooks
Dataset events can trigger webhooks:

- `dataset.created`
- `dataset.processing`
- `dataset.ready`
- `dataset.failed`
- `dataset.deleted`
See the Webhooks documentation for details.
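As an illustration, a receiver might dispatch on the event name. The payload shape used here (`event` plus `data`) is an assumption for the sketch; the actual schema is defined in the Webhooks documentation:

```python
# Hypothetical payload shape: {"event": "dataset.ready", "data": {...}}.
# See the Webhooks documentation for the real schema.
def handle_dataset_event(payload: dict) -> str:
    """Route a dataset webhook payload to an action based on its event name."""
    event = payload.get("event", "")
    dataset = payload.get("data", {})
    if event == "dataset.ready":
        return f"dataset {dataset.get('id')} is ready to query"
    if event == "dataset.failed":
        return f"dataset {dataset.get('id')} failed; investigate before retrying"
    if event in ("dataset.created", "dataset.processing", "dataset.deleted"):
        return f"ignoring {event}"
    return f"unknown event {event}"
```

Waiting for `dataset.ready` (rather than polling the dataset status) is the natural way to know when indexing has finished.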
## Best Practices
- File Uploads: Use chunked uploads for large files
- Schema Indexing: Allow time for indexing to complete before querying
- Database Connections: Use read-only credentials when possible
- Refresh Strategy: Schedule regular refreshes for database datasets
- Error Handling: Implement exponential backoff for retries
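The retry advice above can be sketched as a small helper (illustrative only; `with_backoff` and its parameters are not part of the API):

```python
import random
import time

def with_backoff(call, max_attempts: int = 5, base_delay: float = 0.5):
    """Invoke `call`, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            # Delay doubles each attempt: 0.5s, 1s, 2s, ... plus random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Jitter spreads out retries from many clients so they do not all hit the rate limits below at the same moment.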
## Rate Limits
- File uploads: 10 per hour per project
- Dataset creation: 50 per hour per project
- Schema refresh: 100 per hour per project
See Rate Limiting for more details.