What You’ll Build
Prerequisites
- An S3 bucket with PDF documents
- A SharePoint site with appropriate permissions
- Python 3.9+
Complete Pipeline
What Happens in SharePoint
Once exported, your SharePoint document library will have new columns populated with the extracted metadata:

| Document | Contract Type | Parties | Effective Date | Total Value |
|---|---|---|---|---|
| Acme-NDA-2024.pdf | NDA | Acme Corp, Beta LLC | 2024-01-15 | - |
| ServiceAgreement.pdf | MSA | TechCo, StartupXYZ | 2024-03-01 | $150,000 |
| Employment-JDoe.pdf | Employment | Jane Doe, Acme Corp | 2024-02-01 | $95,000 |

With these columns populated, you can:
- Filter and sort documents by contract type, date, or value
- Create views like “Expiring This Quarter” or “High-Value Contracts”
- Search using SharePoint’s native search with metadata facets
- Set up alerts for documents matching specific criteria
Understanding Export Options
| Option | Description | When to Use |
|---|---|---|
| `export_level="file"` | One record per document | SharePoint, document management |
| `export_level="chunk"` | One record per chunk | Vector databases, RAG |
| `export_level="both"` | Both file and chunk records | Hybrid use cases |
| `metadata_format="column_store"` | Metadata as separate columns | SharePoint, SQL databases |
| `metadata_format="json_store"` | Metadata as a JSON column | Flexible NoSQL storage |
| `export_tags=[...]` | Specific tags to export | Control which columns are created |
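
As a rough illustration of how these options combine for a SharePoint export, the sketch below collects them into a plain dictionary and rejects values that are not in the table. The option names and allowed values come from the table; the `build_export_options` helper and the tag names are hypothetical and are not part of any SDK.

```python
# Allowed values, taken from the options table.
VALID_EXPORT_LEVELS = {"file", "chunk", "both"}
VALID_METADATA_FORMATS = {"column_store", "json_store"}


def build_export_options(export_level, metadata_format, export_tags):
    """Hypothetical helper: validate export options and return them as a dict."""
    if export_level not in VALID_EXPORT_LEVELS:
        raise ValueError(f"unknown export_level: {export_level!r}")
    if metadata_format not in VALID_METADATA_FORMATS:
        raise ValueError(f"unknown metadata_format: {metadata_format!r}")
    return {
        "export_level": export_level,
        "metadata_format": metadata_format,
        "export_tags": list(export_tags),
    }


# SharePoint wants one record per document, with metadata as separate columns.
# The tag names below are illustrative assumptions.
sharepoint_options = build_export_options(
    export_level="file",
    metadata_format="column_store",
    export_tags=["contract_type", "parties", "effective_date", "total_value"],
)
```

Validating the values up front keeps a typo such as `"column-store"` from surfacing only after a long export has started.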
Production Tips
Handle Large Batches
For large document sets (100+ files), the export runs asynchronously. Poll the tracker:
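A minimal polling loop could be sketched as follows. It assumes a hypothetical `get_status` callable that returns the tracker's current state as a string; the state names `"completed"` and `"failed"` are also assumptions, so adjust them to whatever your tracker actually reports.

```python
import time


def wait_for_export(get_status, poll_interval=5.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until a terminal state is reached or the timeout expires.

    get_status is a hypothetical zero-argument callable returning a status string.
    """
    terminal_states = ("completed", "failed")  # assumed state names
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"export still {status!r} after {timeout} seconds")
        sleep(poll_interval)
```

The `sleep` and `clock` parameters exist only so the loop can be tested without real delays. In normal use you would pass just `get_status` and, if needed, a longer `timeout` for very large batches.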
Incremental Updates
Use Data Slices to process only new documents:
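The sketch below is not the Data Slices API. It only illustrates the idea behind an incremental run: keep the documents modified after the previous run and skip the rest. The record shape (`key`, `last_modified`) and the sample data are assumptions made for illustration.

```python
from datetime import datetime, timezone


def select_new_documents(documents, last_run):
    """Return only the documents modified strictly after last_run."""
    return [doc for doc in documents if doc["last_modified"] > last_run]


# Illustrative records; in practice these would come from your bucket listing.
last_run = datetime(2024, 3, 1, tzinfo=timezone.utc)
documents = [
    {"key": "contracts/old.pdf",
     "last_modified": datetime(2024, 2, 10, tzinfo=timezone.utc)},
    {"key": "contracts/new.pdf",
     "last_modified": datetime(2024, 3, 5, tzinfo=timezone.utc)},
]
new_documents = select_new_documents(documents, last_run)
```

Storing `last_run` after each successful export gives the next run its cutoff, so already-processed files are not sent through the pipeline again.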
Error Handling
Wrap operations in try-except for production robustness:
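One way to structure this is a small wrapper that logs each failure, retries a few times, and re-raises once the retries are used up. The `run_step` helper and its parameters are hypothetical. It catches `Exception` broadly to stay generic, whereas production code would normally catch the specific exception types your SDK raises.

```python
import logging
import time

logger = logging.getLogger("pipeline")


def run_step(name, step, retries=2, backoff=1.0, sleep=time.sleep):
    """Run step() with logging and simple retries (hypothetical helper).

    step is any zero-argument callable, for example a function that
    triggers the export. The last failure is re-raised to the caller.
    """
    for attempt in range(1, retries + 2):
        try:
            return step()
        except Exception as exc:  # narrow this to your SDK's exceptions
            logger.warning("%s failed on attempt %d: %s", name, attempt, exc)
            if attempt > retries:
                raise
            sleep(backoff * attempt)  # wait a little longer after each failure
```

Each pipeline stage can then be wrapped separately, for example `run_step("export", do_export)`, where `do_export` stands for your own function. That way the logs show which stage failed.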
Next Steps
PII Detection
Add sensitive data detection to your pipeline.
Custom Taxonomies
Use AI to generate custom taxonomies.
Data Slices
Learn to filter documents for targeted processing.
Destinations
Explore all supported export targets.

