File Upload Guide
This guide covers everything you need to know about uploading files to the Knowhere API using presigned URLs.
Overview
When you have a local file to parse, the upload process involves two steps:
- Create a job with source_type: "file" to get an upload URL
- Upload the file directly to cloud storage using the presigned URL
This approach offers several benefits:
- Files upload directly to storage (faster, more reliable)
- Large files don't timeout your API requests
- Secure, time-limited upload URLs
Step-by-Step Guide
Step 1: Create Job and Get Upload URL
curl -X POST https://api.knowhereto.ai/v1/jobs \
-H "Authorization: Bearer $KNOWHERE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"source_type": "file",
"file_name": "document.pdf"
}'
Response:
{
"job_id": "job_abc123",
"status": "waiting-file",
"upload_url": "https://storage.knowhereto.ai/uploads/...",
"upload_headers": {
"Content-Type": "application/pdf"
},
"created_at": "2025-01-15T10:30:00Z"
}
Step 2: Upload the File
Use an HTTP PUT request to upload the raw file bytes to the upload_url:
curl -X PUT "https://storage.knowhereto.ai/uploads/..." \
-H "Content-Type: application/pdf" \
--data-binary @document.pdf
You must use the headers provided in upload_headers. The presigned URL is configured to expect specific headers, and any mismatch will cause the upload to fail.
Step 3: Processing Begins Automatically
Once the upload completes (HTTP 200 or 204), processing starts automatically. The job status transitions from waiting-file to pending.
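If you want to confirm the transition in your own code, you can read the job back after the upload. The snippet below is a minimal sketch that assumes the job can be fetched with a GET request to /v1/jobs/{job_id} (see the Polling Guide for the full workflow):
import requests

def get_job_status(job_id: str, api_key: str) -> str:
    """Fetch the current status of a job (assumed endpoint: GET /v1/jobs/{job_id})."""
    response = requests.get(
        f"https://api.knowhereto.ai/v1/jobs/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    response.raise_for_status()
    return response.json()["status"]

# After a successful upload, the status should move from waiting-file to pending
print(get_job_status("job_abc123", "your_api_key"))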
Complete Examples
Python
import requests
from pathlib import Path
def upload_and_parse(file_path: str, api_key: str) -> str:
"""Upload a file and return the job_id."""
file_path = Path(file_path)
# Step 1: Create job
response = requests.post(
"https://api.knowhereto.ai/v1/jobs",
headers={
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
},
json={
"source_type": "file",
"file_name": file_path.name
}
)
response.raise_for_status()
job = response.json()
# Step 2: Upload file
with open(file_path, "rb") as f:
upload_response = requests.put(
job["upload_url"],
headers=job.get("upload_headers", {}),
data=f.read()
)
if upload_response.status_code not in [200, 204]:
raise Exception(f"Upload failed: {upload_response.status_code}")
print(f"File uploaded successfully. Job ID: {job['job_id']}")
return job["job_id"]
# Usage
job_id = upload_and_parse("report.pdf", "your_api_key")
Node.js
import fs from 'fs';
import path from 'path';
async function uploadAndParse(filePath, apiKey) {
const fileName = path.basename(filePath);
// Step 1: Create job
const response = await fetch('https://api.knowhereto.ai/v1/jobs', {
method: 'POST',
headers: {
'Authorization': `Bearer ${apiKey}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
source_type: 'file',
file_name: fileName
})
});
if (!response.ok) {
throw new Error(`Failed to create job: ${response.status}`);
}
const job = await response.json();
// Step 2: Upload file
const fileBuffer = fs.readFileSync(filePath);
const uploadResponse = await fetch(job.upload_url, {
method: 'PUT',
headers: job.upload_headers || {},
body: fileBuffer
});
if (!uploadResponse.ok) {
throw new Error(`Upload failed: ${uploadResponse.status}`);
}
console.log(`File uploaded successfully. Job ID: ${job.job_id}`);
return job.job_id;
}
// Usage
const jobId = await uploadAndParse('report.pdf', 'your_api_key');
Python with Streaming (Large Files)
For very large files, stream the upload to avoid loading the entire file into memory:
import requests
from pathlib import Path
def upload_large_file(file_path: str, api_key: str) -> str:
"""Upload a large file using streaming."""
file_path = Path(file_path)
file_size = file_path.stat().st_size
# Step 1: Create job
response = requests.post(
"https://api.knowhereto.ai/v1/jobs",
headers={
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
},
json={
"source_type": "file",
"file_name": file_path.name
}
)
response.raise_for_status()
job = response.json()
# Step 2: Stream upload
headers = job.get("upload_headers", {})
headers["Content-Length"] = str(file_size)
with open(file_path, "rb") as f:
upload_response = requests.put(
job["upload_url"],
headers=headers,
data=f # Stream the file
)
if upload_response.status_code not in [200, 204]:
raise Exception(f"Upload failed: {upload_response.status_code}")
return job["job_id"]
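Usage is the same as the basic example:
# Usage
job_id = upload_large_file("report.pdf", "your_api_key")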
Supported File Types
| Format | Extension | MIME Type | Max Size |
|---|---|---|---|
| PDF | .pdf | application/pdf | 100 MB |
| Word | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | 50 MB |
| Excel | .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | 50 MB |
| PowerPoint | .pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation | 100 MB |
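Since oversized or unsupported files are rejected at upload time, it can be worth validating them client-side before creating a job. The sketch below simply encodes the table above; treat the limits as illustrative and subject to change:
from pathlib import Path

# Size limits per extension, taken from the table above
MAX_SIZE_MB = {
    ".pdf": 100,
    ".docx": 50,
    ".xlsx": 50,
    ".pptx": 100,
}

def validate_file(file_path: str) -> None:
    """Raise ValueError if the file type or size is unsupported."""
    path = Path(file_path)
    limit = MAX_SIZE_MB.get(path.suffix.lower())
    if limit is None:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    size_mb = path.stat().st_size / (1024 * 1024)
    if size_mb > limit:
        raise ValueError(f"{path.name} is {size_mb:.1f} MB; the limit is {limit} MB")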
Common Issues
Upload URL Expired
Upload URLs expire after 1 hour. If you get a 403 Forbidden error:
{
"Code": "AccessDenied",
"Message": "Request has expired"
}
Solution: Create a new job to get a fresh upload URL.
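One way to handle expiry gracefully is to catch the 403 and create a fresh job before retrying the upload. This is a sketch, not part of the API: the create_job callable stands in for the Step 1 request, and the single retry is an arbitrary choice:
import requests

def upload_with_fresh_url(file_path: str, api_key: str, create_job) -> str:
    """Upload a file, requesting a new presigned URL if the first one has expired.

    create_job is any callable that performs the Step 1 POST and returns the job dict.
    """
    for attempt in range(2):  # at most one retry with a fresh URL
        job = create_job(file_path, api_key)
        with open(file_path, "rb") as f:
            response = requests.put(
                job["upload_url"],
                headers=job.get("upload_headers", {}),
                data=f,
            )
        if response.status_code in (200, 204):
            return job["job_id"]
        if response.status_code != 403:
            break  # not an expired URL; retrying with a new job won't help
    raise Exception(f"Upload failed: {response.status_code}")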
Wrong Content-Type
If the Content-Type header doesn't match what the presigned URL expects:
{
"Code": "AccessDenied",
"Message": "There was an error with the provided signature"
}
Solution: Use the exact headers from upload_headers.
File Too Large
If the file exceeds the size limit:
{
"Code": "EntityTooLarge",
"Message": "Your proposed upload exceeds the maximum allowed size"
}
Solution: Reduce file size or contact support for higher limits.
Connection Timeout
For large files on slow connections:
Solution:
- Use streaming uploads
- Increase client timeout settings (see the sketch below)
- Consider compressing the file if possible
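With the Python requests library, for instance, streaming and an explicit timeout can be combined on the upload call. The values below are arbitrary defaults, not recommendations from the API:
import requests

def upload_with_timeout(job: dict, file_path: str) -> None:
    """Stream the upload with explicit (connect, read) timeouts.

    job is the dict returned when the job was created in Step 1.
    """
    with open(file_path, "rb") as f:
        response = requests.put(
            job["upload_url"],
            headers=job.get("upload_headers", {}),
            data=f,              # stream the file instead of loading it into memory
            timeout=(10, 600),   # (connect, read) timeouts in seconds; tune as needed
        )
    if response.status_code not in (200, 204):
        raise Exception(f"Upload failed: {response.status_code}")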
Best Practices
- Always use upload_headers: The presigned URL requires specific headers
- Handle upload failures: Implement retry logic for transient errors (a sketch follows this list)
- Stream large files: Avoid loading large files entirely into memory
- Set appropriate timeouts: Large files may take longer to upload
- Verify upload success: Check for 200 or 204 status codes
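For transient failures, a simple retry loop with exponential backoff is usually enough. The sketch below is illustrative: the attempt count and backoff intervals are arbitrary, and 4xx responses are not retried because they indicate a problem (expired URL, wrong headers) that a retry alone won't fix:
import time
import requests

def put_with_retries(upload_url: str, upload_headers: dict, file_path: str,
                     max_attempts: int = 3) -> None:
    """Retry the upload PUT on transient errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            with open(file_path, "rb") as f:
                response = requests.put(upload_url, headers=upload_headers, data=f)
            if response.status_code in (200, 204):
                return
            if 400 <= response.status_code < 500:
                raise Exception(f"Upload rejected: {response.status_code}")
        except (requests.ConnectionError, requests.Timeout):
            pass  # transient network failure; fall through to the retry
        if attempt < max_attempts:
            time.sleep(2 ** attempt)  # back off: 2 s, 4 s, ...
    raise Exception("Upload failed after retries")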
Next Steps
- Job Lifecycle - Understand job states
- Polling Guide - Wait for job completion
- Result Handling - Process the results