Installation
Download and install the Python SDK wheel file:
Request Access: Contact us to get the SDK wheel file.
Supported Filetypes
Unstructured by Collibra can process a wide variety of document formats:
| Format | Extensions |
|---|---|
| PDF | .pdf |
| Microsoft Word | .docx |
| Microsoft Excel | .xls, .xlsx, .xlsm, .xlsb |
| Microsoft PowerPoint | .ppt, .pptx |
| OpenDocument | .odf, .ods, .odt |
| Data Formats | .json, .csv |
| Plain Text | Any other file (UTF-8) |
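As an illustration of the list above, here is a minimal sketch of a pre-flight check you might run locally before submitting files. It does not use the SDK; the names `SUPPORTED_EXTENSIONS`, `categorize`, and `is_utf8` are hypothetical helpers, and the "PDF" category label is an assumption, since the list gives only the extension. It assumes that a file with an unlisted extension is handled as plain text only if its bytes decode as UTF-8.

```python
from pathlib import Path

# Extension-to-category map copied from the supported filetypes list.
# The "PDF" label is assumed; the list names only the ".pdf" extension.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Microsoft Word",
    ".xls": "Microsoft Excel",
    ".xlsx": "Microsoft Excel",
    ".xlsm": "Microsoft Excel",
    ".xlsb": "Microsoft Excel",
    ".ppt": "Microsoft PowerPoint",
    ".pptx": "Microsoft PowerPoint",
    ".odf": "OpenDocument",
    ".ods": "OpenDocument",
    ".odt": "OpenDocument",
    ".json": "Data Formats",
    ".csv": "Data Formats",
}


def categorize(path: str) -> str:
    """Return the format category for a path, based on its extension.

    Any extension not in the list falls back to "Plain Text".
    """
    suffix = Path(path).suffix.lower()
    return SUPPORTED_EXTENSIONS.get(suffix, "Plain Text")


def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For a file that falls into the "Plain Text" category, calling `is_utf8(Path(p).read_bytes())` first is one way to catch binary or differently encoded files before they are sent for processing.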
Language Support
| Capability | Supported Languages |
|---|---|
| Document Processing | Multilingual (all UTF-8 supported languages) |
| LLM Classification | Multilingual (dependent on model) |
| PII Detection | English only |
Complete Example
Copy and run this script to extract metadata from your documents:
What Just Happened?
1. Connected to Your Data
   The Data Connector established a secure connection to your S3 bucket, allowing the platform to read your documents.
2. Defined What to Extract
   The Taxonomy and Tags told the platform what information to look for: document type, summary, and key dates.
3. Extracted Metadata
   The platform’s AI analyzed each document and extracted the structured metadata you defined.

