File Upload Guide

This guide covers everything you need to know about uploading files to the Knowhere API using presigned URLs.

Overview

When you have a local file to parse, the upload process involves two steps:

  1. Create a job with source_type: "file" to get an upload URL
  2. Upload the file directly to cloud storage using the presigned URL

This approach offers several benefits:

  • Files upload directly to storage (faster, more reliable)
  • Large files don't timeout your API requests
  • Secure, time-limited upload URLs

Step-by-Step Guide

Step 1: Create Job and Get Upload URL

curl -X POST https://api.knowhereto.ai/v1/jobs \
  -H "Authorization: Bearer $KNOWHERE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "source_type": "file",
    "file_name": "document.pdf"
  }'

Response:

{
  "job_id": "job_abc123",
  "status": "waiting-file",
  "upload_url": "https://storage.knowhereto.ai/uploads/...",
  "upload_headers": {
    "Content-Type": "application/pdf"
  },
  "created_at": "2025-01-15T10:30:00Z"
}

Step 2: Upload the File

Use HTTP PUT to upload the raw file bytes to the upload_url:

curl -X PUT "https://storage.knowhereto.ai/uploads/..." \
  -H "Content-Type: application/pdf" \
  --data-binary @document.pdf

Use Exact Headers

You must use the headers provided in upload_headers. The presigned URL is configured to expect specific headers, and any mismatch will cause the upload to fail.
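Before uploading, it can help to sanity-check that the file you are about to send actually matches the Content-Type the presigned URL was signed for. The helper below is a small illustrative sketch (not part of the API) that compares the MIME type guessed from the file name against upload_headers:

```python
import mimetypes

def content_type_matches(file_name: str, upload_headers: dict) -> bool:
    """Compare the file's guessed MIME type against what the presigned URL expects."""
    expected = upload_headers.get("Content-Type")
    guessed, _ = mimetypes.guess_type(file_name)
    return guessed == expected
```

For example, `content_type_matches("document.pdf", {"Content-Type": "application/pdf"})` returns True, while passing a `.txt` file against the same headers returns False, letting you fail fast before the signature mismatch error.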

Step 3: Processing Begins Automatically

Once the upload completes (HTTP 200 or 204), processing starts automatically. The job status transitions from waiting-file to pending.
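If you want to confirm the transition programmatically, you can poll the job until it leaves waiting-file. The sketch below assumes a `GET /v1/jobs/{job_id}` status endpoint, which is not documented in this guide and may differ in your API version; the `fetch` parameter exists only to make the helper easy to test:

```python
import time
import requests

def wait_until_pending(job_id: str, api_key: str, fetch=requests.get,
                       poll_interval: float = 2.0, max_attempts: int = 30) -> str:
    """Poll the job until its status leaves 'waiting-file'; return the status seen.

    Assumes a GET /v1/jobs/{job_id} endpoint exists (an assumption, not
    confirmed by this guide).
    """
    for _ in range(max_attempts):
        resp = fetch(
            f"https://api.knowhereto.ai/v1/jobs/{job_id}",
            headers={"Authorization": f"Bearer {api_key}"},
        )
        resp.raise_for_status()
        status = resp.json()["status"]
        if status != "waiting-file":
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"Job {job_id} is still waiting for a file")
```

After a successful upload this should observe the waiting-file to pending transition described above.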

Complete Examples

import requests
from pathlib import Path

def upload_and_parse(file_path: str, api_key: str) -> str:
    """Upload a file and return the job_id."""

    file_path = Path(file_path)

    # Step 1: Create job
    response = requests.post(
        "https://api.knowhereto.ai/v1/jobs",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "source_type": "file",
            "file_name": file_path.name
        }
    )
    response.raise_for_status()
    job = response.json()

    # Step 2: Upload file
    with open(file_path, "rb") as f:
        upload_response = requests.put(
            job["upload_url"],
            headers=job.get("upload_headers", {}),
            data=f.read()
        )

    if upload_response.status_code not in [200, 204]:
        raise Exception(f"Upload failed: {upload_response.status_code}")

    print(f"File uploaded successfully. Job ID: {job['job_id']}")
    return job["job_id"]

# Usage
job_id = upload_and_parse("report.pdf", "your_api_key")

Python with Streaming (Large Files)

For very large files, stream the upload to avoid loading the entire file into memory:

import requests
from pathlib import Path

def upload_large_file(file_path: str, api_key: str) -> str:
    """Upload a large file using streaming."""

    file_path = Path(file_path)
    file_size = file_path.stat().st_size

    # Step 1: Create job
    response = requests.post(
        "https://api.knowhereto.ai/v1/jobs",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "source_type": "file",
            "file_name": file_path.name
        }
    )
    response.raise_for_status()
    job = response.json()

    # Step 2: Stream upload
    headers = job.get("upload_headers", {})
    headers["Content-Length"] = str(file_size)

    with open(file_path, "rb") as f:
        upload_response = requests.put(
            job["upload_url"],
            headers=headers,
            data=f  # Stream the file
        )

    if upload_response.status_code not in [200, 204]:
        raise Exception(f"Upload failed: {upload_response.status_code}")

    return job["job_id"]

Supported File Types

Format      Extension  MIME Type                                                                   Max Size
PDF         .pdf       application/pdf                                                             100 MB
Word        .docx      application/vnd.openxmlformats-officedocument.wordprocessingml.document    50 MB
Excel       .xlsx      application/vnd.openxmlformats-officedocument.spreadsheetml.sheet          50 MB
PowerPoint  .pptx      application/vnd.openxmlformats-officedocument.presentationml.presentation  100 MB
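The table above can be expressed as a small client-side validator. This is an illustrative sketch derived from the table, not part of the API; the function name and the mapping constant are hypothetical:

```python
from pathlib import Path

# Extension -> (expected MIME type, max size in bytes), taken from the table above.
SUPPORTED_TYPES = {
    ".pdf":  ("application/pdf", 100 * 1024 * 1024),
    ".docx": ("application/vnd.openxmlformats-officedocument.wordprocessingml.document", 50 * 1024 * 1024),
    ".xlsx": ("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", 50 * 1024 * 1024),
    ".pptx": ("application/vnd.openxmlformats-officedocument.presentationml.presentation", 100 * 1024 * 1024),
}

def validate_file(file_path: str) -> str:
    """Return the expected MIME type, or raise if the file is unsupported or too large."""
    path = Path(file_path)
    ext = path.suffix.lower()
    if ext not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {ext}")
    mime, max_size = SUPPORTED_TYPES[ext]
    if path.exists() and path.stat().st_size > max_size:
        raise ValueError(f"{path.name} exceeds the {max_size // (1024 * 1024)} MB limit")
    return mime
```

Validating locally avoids a round trip that would only fail later with EntityTooLarge.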

Common Issues

Upload URL Expired

Upload URLs expire after 1 hour. If you get a 403 Forbidden error:

{
  "Code": "AccessDenied",
  "Message": "Request has expired"
}

Solution: Create a new job to get a fresh upload URL.

Wrong Content-Type

If the Content-Type header doesn't match what the presigned URL expects:

{
  "Code": "AccessDenied",
  "Message": "There was an error with the provided signature"
}

Solution: Use the exact headers from upload_headers.

File Too Large

If the file exceeds the size limit:

{
  "Code": "EntityTooLarge",
  "Message": "Your proposed upload exceeds the maximum allowed size"
}

Solution: Reduce file size or contact support for higher limits.

Connection Timeout

For large files on slow connections:

Solution:

  • Use streaming uploads
  • Increase client timeout settings
  • Consider compressing the file if possible
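One way to size the client timeout is to derive it from the file size under a pessimistic bandwidth assumption. The helper below is a rough heuristic sketch, not an API recommendation; the default worst-case speed of 100 KB/s and the 60-second floor are arbitrary assumptions you should tune for your environment:

```python
def upload_timeout_seconds(file_size_bytes: int,
                           min_speed_bytes_per_s: int = 100_000,
                           floor: int = 60) -> int:
    """Estimate an upload timeout from file size, assuming a worst-case speed,
    and never go below a fixed floor."""
    return max(floor, file_size_bytes // min_speed_bytes_per_s)
```

You could then pass the result to the upload call, e.g. `requests.put(upload_url, data=f, timeout=upload_timeout_seconds(file_size))`.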

Best Practices

  1. Always use upload_headers: The presigned URL requires specific headers
  2. Handle upload failures: Implement retry logic for transient errors
  3. Stream large files: Avoid loading large files entirely into memory
  4. Set appropriate timeouts: Large files may take longer to upload
  5. Verify upload success: Check for 200 or 204 status codes
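Practices 2, 4, and 5 can be combined into a retry wrapper around the PUT step. This is a minimal sketch, not an official client: the retry policy (retry only connection errors and 5xx responses, with exponential backoff) is an assumption, and the `put` parameter exists only to make the helper easy to test:

```python
import time
import requests

def upload_with_retry(upload_url: str, headers: dict, file_path: str,
                      attempts: int = 3, base_delay: float = 1.0,
                      put=requests.put):
    """PUT the file to the presigned URL, retrying transient failures."""
    for attempt in range(attempts):
        try:
            with open(file_path, "rb") as f:
                resp = put(upload_url, headers=headers, data=f)
            if resp.status_code in (200, 204):
                return resp  # Success per the guide's status-code check
            if resp.status_code < 500:
                resp.raise_for_status()  # 4xx errors are not retryable
        except requests.ConnectionError:
            pass  # Transient network failure: fall through to retry
        if attempt < attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # Exponential backoff
    raise RuntimeError(f"Upload failed after {attempts} attempts")
```

Reopening the file on each attempt matters when streaming: a file object consumed by a failed attempt cannot simply be re-sent.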

Next Steps