Get started with Unstructured by Collibra by running this complete example. You’ll connect to a data source, define what metadata to extract, and see results in minutes.

Installation

The Python SDK is distributed as a wheel file. Contact us to request access, then install the wheel with pip:
pip install unstructured_sdk-*.whl
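
To confirm that the wheel landed in the active Python environment, a quick import check can help. The sketch below uses only the standard library. It assumes the SDK's top-level module is named `unstructured`, as in the import used in the complete example on this page; the helper name `is_importable` is illustrative and not part of the SDK.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "unstructured" is assumed from the import in the complete example.
    if is_importable("unstructured"):
        print("SDK module found")
    else:
        print("SDK module not found; check which environment pip installed into")
```

A positive result only shows that some module with that name is importable, so running the complete example remains the real test.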

Supported Filetypes

Unstructured by Collibra can process a wide variety of document formats:

PDF

.pdf

Microsoft Word

.docx

Microsoft Excel

.xls, .xlsx, .xlsm, .xlsb

Microsoft PowerPoint

.ppt, .pptx

OpenDocument

.odf, .ods, .odt

Data Formats

.json, .csv

Plain Text

Any other file (UTF-8)
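
Before pointing a connector at a large bucket, it can be useful to see how your files map onto the categories above. The following is a minimal local sketch, not an SDK feature: the helper `format_category` is hypothetical, the extension sets are copied from the list above, and matching on a lowercased file extension is an assumption about how you might pre-sort files, not a description of how the platform detects formats.

```python
from pathlib import Path

# Extension lists copied from the Supported Filetypes section.
SUPPORTED_EXTENSIONS = {
    "PDF": {".pdf"},
    "Microsoft Word": {".docx"},
    "Microsoft Excel": {".xls", ".xlsx", ".xlsm", ".xlsb"},
    "Microsoft PowerPoint": {".ppt", ".pptx"},
    "OpenDocument": {".odf", ".ods", ".odt"},
    "Data Formats": {".json", ".csv"},
}


def format_category(file_name: str) -> str:
    """Map a file name to its category; unlisted files fall back to Plain Text."""
    suffix = Path(file_name).suffix.lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    # Per the list above, any other file is treated as UTF-8 plain text.
    return "Plain Text"
```

For example, `format_category("budget.xlsb")` returns `"Microsoft Excel"`, while a file such as `notes.md` falls into the plain-text category.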

Language Support

Capability | Supported Languages
Document Processing | Multilingual (all UTF-8 supported languages)
LLM Classification | Multilingual (dependent on model)
PII Detection | English only

Complete Example

Copy and run this script to extract metadata from your documents:
from unstructured import UnstructuredClient

# 1. Initialize the client
client = UnstructuredClient(
    username="your-username",
    password="your-password",
)

# 2. Create a data connector (S3 example)
connector = client.data_source.create(
    connector_name="my-s3-bucket",
    connector_body={
        "vector_db_type": "s3",
        "bucket_name": "my-documents",
        "aws_access_key_id": "YOUR_ACCESS_KEY",
        "aws_secret_access_key": "YOUR_SECRET_KEY",
        "region": "us-east-1",
    },
)
print(f"✓ Created connector: {connector.profile_id}")

# 3. Define a taxonomy with tags
taxonomy = client.taxonomy.upsert(
    taxonomy_name="document-classification",
    taxonomy_description="Classify and extract key info from documents",
    tags=[
        {
            "name": "document_type",
            "description": "Type of document (invoice, contract, report, etc.)",
            "output_type": "word",
        },
        {
            "name": "summary",
            "description": "A brief 2-3 sentence summary of the document",
            "output_type": "string",
        },
        {
            "name": "key_dates",
            "description": "Important dates mentioned in the document",
            "output_type": "list[date]",
        },
    ],
)
print(f"✓ Created taxonomy: {taxonomy.taxonomy_id}")

# 4. Extract metadata from documents
results = client.classify.generate_batch(
    connector_name="my-s3-bucket",
    taxonomy_name="document-classification",
)

# 5. View the results
for result in results.metadata:
    print(f"\nFile: {result.file_name}")
    print(f"  Type: {result.tags.get('document_type')}")
    print(f"  Summary: {result.tags.get('summary')}")
    print(f"  Key Dates: {result.tags.get('key_dates')}")
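
The script above hard-codes credentials for brevity. One common alternative is to read them from environment variables so that secrets stay out of source control. The sketch below is an assumption-laden illustration, not an SDK feature: the helper `s3_connector_body` and the variable name `S3_BUCKET_NAME` are hypothetical, and the dictionary keys are simply the ones used in the S3 example above.

```python
import os
from typing import Mapping


def s3_connector_body(env: Mapping[str, str] = os.environ) -> dict:
    """Build the S3 connector_body from environment variables.

    The variable names are illustrative assumptions; the dictionary keys
    match the ones used in the complete example.
    """
    return {
        "vector_db_type": "s3",
        "bucket_name": env["S3_BUCKET_NAME"],  # hypothetical variable name
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        # Fall back to the region used in the example when none is set.
        "region": env.get("AWS_REGION", "us-east-1"),
    }
```

The returned dictionary can then be passed as `connector_body` in step 2. A missing required variable raises `KeyError`, which surfaces a configuration mistake before any request is sent.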

What Just Happened?

1. Connected to Your Data

The Data Connector established a secure connection to your S3 bucket, allowing the platform to read your documents.

2. Defined What to Extract

The Taxonomy and Tags told the platform what information to look for: document type, summary, and key dates.

3. Extracted Metadata

The platform’s AI analyzed each document and extracted the structured metadata you defined.
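
Once the metadata is extracted, a natural follow-up is to organize files by one of the tags. The sketch below groups file names by `document_type`. It assumes each result exposes `file_name` and a dict-like `tags`, as the loop in step 5 of the example suggests; the helper `group_by_document_type` is hypothetical, and the sample objects are stand-ins for illustration, not real platform output.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_document_type(metadata):
    """Group file names by their 'document_type' tag."""
    groups = defaultdict(list)
    for result in metadata:
        # Results with a missing or empty tag are collected under "unknown".
        doc_type = result.tags.get("document_type") or "unknown"
        groups[doc_type].append(result.file_name)
    return dict(groups)


# Stand-in objects for illustration only; real results come from generate_batch.
sample = [
    SimpleNamespace(file_name="a.pdf", tags={"document_type": "invoice"}),
    SimpleNamespace(file_name="b.docx", tags={"document_type": "contract"}),
    SimpleNamespace(file_name="c.pdf", tags={"document_type": "invoice"}),
    SimpleNamespace(file_name="d.txt", tags={}),
]
print(group_by_document_type(sample))
```

With real results, you would pass `results.metadata` instead of `sample`.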

Next Steps

Explore Concepts

Learn how Data Connectors, Taxonomies, and Metadata work together.

S3 to SharePoint

Export enriched metadata to SharePoint.

API Reference

Explore all available endpoints and SDK methods.

PII Detection

Set up sensitive data detection for compliance.