File Upload Guide
This guide covers everything you need to know about uploading files to the Knowhere API using presigned URLs.
Overview
When you have a local file to parse, the upload process involves two steps:
- Create a job with source_type: "file" to get an upload URL
- Upload the file directly to cloud storage using the presigned URL
This approach offers several benefits:
- Files upload directly to storage (faster, more reliable)
- Large files don't timeout your API requests
- Secure, time-limited upload URLs
Step-by-Step Guide
Step 1: Create Job and Get Upload URL
curl -X POST https://api.knowhereto.ai/v1/jobs \
-H "Authorization: Bearer $KNOWHERE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"source_type": "file",
"file_name": "document.pdf"
}'
Response:
{
"job_id": "job_abc123",
"status": "waiting-file",
"upload_url": "https://storage.knowhereto.ai/uploads/...",
"upload_headers": {
"Content-Type": "application/pdf"
},
"created_at": "2025-01-15T10:30:00Z"
}
Step 2: Upload the File
Use an HTTP PUT request to upload the raw file bytes to the upload_url:
curl -X PUT "https://storage.knowhereto.ai/uploads/..." \
-H "Content-Type: application/pdf" \
--data-binary @document.pdf
You must use the headers provided in upload_headers. The presigned URL is configured to expect specific headers, and any mismatch will cause the upload to fail.
Step 3: Processing Begins Automatically
Once the upload completes (HTTP 200 or 204), processing starts automatically. The job status transitions from waiting-file to pending.
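If you want to confirm the transition in your own code, you can read the job back after the upload. The snippet below is a minimal sketch that assumes the job can be fetched with a GET request to /v1/jobs/{job_id} (see the Polling Guide for the full workflow):
import requests

def get_job_status(job_id: str, api_key: str) -> str:
    """Fetch the current status of a job (assumed endpoint: GET /v1/jobs/{job_id})."""
    response = requests.get(
        f"https://api.knowhereto.ai/v1/jobs/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    response.raise_for_status()
    return response.json()["status"]

# After a successful upload, the status should move from waiting-file to pending
print(get_job_status("job_abc123", "your_api_key"))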
Complete Examples
Python
import requests
from pathlib import Path
def upload_and_parse(file_path: str, api_key: str) -> str:
"""Upload a file and return the job_id."""
file_path = Path(file_path)
# Step 1: Create job
response = requests.post(
"https://api.knowhereto.ai/v1/jobs",
headers={
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
},
json={
"source_type": "file",
"file_name": file_path.name
}
)
response.raise_for_status()
job = response.json()
# Step 2: Upload file
with open(file_path, "rb") as f:
upload_response = requests.put(
job["upload_url"],
headers=job.get("upload_headers", {}),
data=f.read()
)
if upload_response.status_code not in [200, 204]:
raise Exception(f"Upload failed: {upload_response.status_code}")
print(f"File uploaded successfully. Job ID: {job['job_id']}")
return job["job_id"]
# Usage
job_id = upload_and_parse("report.pdf", "your_api_key")
Node.js
import fs from 'fs';
import path from 'path';
async function uploadAndParse(filePath, apiKey) {
const fileName = path.basename(filePath);
// Step 1: Create job
const response = await fetch('https://api.knowhereto.ai/v1/jobs', {
method: 'POST',
headers: {
'Authorization': `Bearer ${apiKey}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
source_type: 'file',
file_name: fileName
})
});
if (!response.ok) {
throw new Error(`Failed to create job: ${response.status}`);
}
const job = await response.json();
// Step 2: Upload file
const fileBuffer = fs.readFileSync(filePath);
const uploadResponse = await fetch(job.upload_url, {
method: 'PUT',
headers: job.upload_headers || {},
body: fileBuffer
});
if (!uploadResponse.ok) {
throw new Error(`Upload failed: ${uploadResponse.status}`);
}
console.log(`File uploaded successfully. Job ID: ${job.job_id}`);
return job.job_id;
}
// Usage
const jobId = await uploadAndParse('report.pdf', 'your_api_key');
Python with Streaming (Large Files)
For very large files, stream the upload to avoid loading the entire file into memory:
import requests
from pathlib import Path
def upload_large_file(file_path: str, api_key: str) -> str:
"""Upload a large file using streaming."""
file_path = Path(file_path)
file_size = file_path.stat().st_size
# Step 1: Create job
response = requests.post(
"https://api.knowhereto.ai/v1/jobs",
headers={
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
},
json={
"source_type": "file",
"file_name": file_path.name
}
)
response.raise_for_status()
job = response.json()
# Step 2: Stream upload
headers = job.get("upload_headers", {})
headers["Content-Length"] = str(file_size)
with open(file_path, "rb") as f:
upload_response = requests.put(
job["upload_url"],
headers=headers,
data=f # Stream the file
)
if upload_response.status_code not in [200, 204]:
raise Exception(f"Upload failed: {upload_response.status_code}")
return job["job_id"]
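Usage is the same as the basic example:
# Usage
job_id = upload_large_file("report.pdf", "your_api_key")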
Supported File Types
| Format | Extension | MIME Type | Max Size |
|---|---|---|---|
| PDF | .pdf | application/pdf | 100 MB |
| Word | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | 50 MB |
| Excel | .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | 50 MB |
| PowerPoint | .pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation | 100 MB |
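Since oversized or unsupported files are rejected at upload time, it can be worth validating them client-side before creating a job. The sketch below simply encodes the table above; treat the limits as illustrative and subject to change:
from pathlib import Path

# Size limits per extension, taken from the table above
MAX_SIZE_MB = {
    ".pdf": 100,
    ".docx": 50,
    ".xlsx": 50,
    ".pptx": 100,
}

def validate_file(file_path: str) -> None:
    """Raise ValueError if the file type or size is unsupported."""
    path = Path(file_path)
    limit = MAX_SIZE_MB.get(path.suffix.lower())
    if limit is None:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    size_mb = path.stat().st_size / (1024 * 1024)
    if size_mb > limit:
        raise ValueError(f"{path.name} is {size_mb:.1f} MB; the limit is {limit} MB")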
Common Issues
Upload URL Expired
Upload URLs expire after 1 hour. If you get a 403 Forbidden error:
{
"Code": "AccessDenied",
"Message": "Request has expired"
}
Solution: Create a new job to get a fresh upload URL.
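One way to handle expiry gracefully is to catch the 403 and create a fresh job before retrying the upload. This is a sketch, not part of the API: the create_job callable stands in for the Step 1 request, and the single retry is an arbitrary choice:
import requests

def upload_with_fresh_url(file_path: str, api_key: str, create_job) -> str:
    """Upload a file, requesting a new presigned URL if the first one has expired.

    create_job is any callable that performs the Step 1 POST and returns the job dict.
    """
    for attempt in range(2):  # at most one retry with a fresh URL
        job = create_job(file_path, api_key)
        with open(file_path, "rb") as f:
            response = requests.put(
                job["upload_url"],
                headers=job.get("upload_headers", {}),
                data=f,
            )
        if response.status_code in (200, 204):
            return job["job_id"]
        if response.status_code != 403:
            break  # not an expired URL; retrying with a new job won't help
    raise Exception(f"Upload failed: {response.status_code}")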
Wrong Content-Type
If the Content-Type header doesn't match what the presigned URL expects:
{
"Code": "AccessDenied",
"Message": "There was an error with the provided signature"
}
Solution: Use the exact headers from upload_headers.
File Too Large
If the file exceeds the size limit:
{
"Code": "EntityTooLarge",
"Message": "Your proposed upload exceeds the maximum allowed size"
}
Solution: Reduce file size or contact support for higher limits.
Connection Timeout
For large files on slow connections:
Solution:
- Use streaming uploads
- Increase client timeout settings (see the sketch below)
- Consider compressing the file if possible
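With the Python requests library, for instance, streaming and an explicit timeout can be combined on the upload call. The values below are arbitrary defaults, not recommendations from the API:
import requests

def upload_with_timeout(job: dict, file_path: str) -> None:
    """Stream the upload with explicit (connect, read) timeouts.

    job is the dict returned when the job was created in Step 1.
    """
    with open(file_path, "rb") as f:
        response = requests.put(
            job["upload_url"],
            headers=job.get("upload_headers", {}),
            data=f,              # stream the file instead of loading it into memory
            timeout=(10, 600),   # (connect, read) timeouts in seconds; tune as needed
        )
    if response.status_code not in (200, 204):
        raise Exception(f"Upload failed: {response.status_code}")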
Best Practices
- Always use upload_headers: The presigned URL requires specific headers
- Handle upload failures: Implement retry logic for transient errors (a sketch follows this list)
- Stream large files: Avoid loading large files entirely into memory
- Set appropriate timeouts: Large files may take longer to upload
- Verify upload success: Check for 200 or 204 status codes
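For transient failures, a simple retry loop with exponential backoff is usually enough. The sketch below is illustrative: the attempt count and backoff intervals are arbitrary, and 4xx responses are not retried because they indicate a problem (expired URL, wrong headers) that a retry alone won't fix:
import time
import requests

def put_with_retries(upload_url: str, upload_headers: dict, file_path: str,
                     max_attempts: int = 3) -> None:
    """Retry the upload PUT on transient errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            with open(file_path, "rb") as f:
                response = requests.put(upload_url, headers=upload_headers, data=f)
            if response.status_code in (200, 204):
                return
            if 400 <= response.status_code < 500:
                raise Exception(f"Upload rejected: {response.status_code}")
        except (requests.ConnectionError, requests.Timeout):
            pass  # transient network failure; fall through to the retry
        if attempt < max_attempts:
            time.sleep(2 ** attempt)  # back off: 2 s, 4 s, ...
    raise Exception("Upload failed after retries")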
Next Steps
- Job Lifecycle - Understand job states
- Polling Guide - Wait for job completion
- Result Handling - Process the results