Get started with Unstructured by Collibra by running this complete example. You’ll connect to a data source, define what metadata to extract, and see results in minutes.

Installation

The Python SDK is distributed as a wheel file. Contact us to request access to the wheel, then install it with pip:
pip install unstructured_sdk-*.whl
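
To confirm that the install succeeded before running the complete example, a quick check along these lines may help. It is a minimal sketch using only the standard library. The module name `unstructured` is taken from the import statement in the complete example, and `is_module_installed` is a hypothetical helper, not part of the SDK. It only confirms that a module with that name is importable, not which version or distribution provides it.

```python
from importlib.util import find_spec


def is_module_installed(module_name: str) -> bool:
    """Return True if `module_name` can be located in this environment."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised for malformed names or a missing parent package.
        return False


if __name__ == "__main__":
    # "unstructured" is the import name used in the complete example.
    if is_module_installed("unstructured"):
        print("SDK module found")
    else:
        print("SDK module not found; check the pip install step")
```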

Supported Filetypes

Unstructured by Collibra can process a wide variety of document formats:

PDF: .pdf
Microsoft Word: .docx
Microsoft Excel: .xls, .xlsx, .xlsm, .xlsb
Microsoft PowerPoint: .ppt, .pptx
OpenDocument: .odf, .ods, .odt
Data Formats: .json, .csv
Plain Text: any other file (UTF-8)
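
If you want to see how a folder of files would map onto the formats listed above before connecting a data source, a small local helper can do that. The following is a sketch, not part of the SDK: `SUPPORTED_EXTENSIONS` and `classify_file` are hypothetical names, and the fallback to "Plain Text" mirrors the "any other file" entry in the list without checking that the file is actually UTF-8.

```python
from pathlib import Path

# Extensions copied from the supported file types list above.
SUPPORTED_EXTENSIONS = {
    "PDF": {".pdf"},
    "Microsoft Word": {".docx"},
    "Microsoft Excel": {".xls", ".xlsx", ".xlsm", ".xlsb"},
    "Microsoft PowerPoint": {".ppt", ".pptx"},
    "OpenDocument": {".odf", ".ods", ".odt"},
    "Data Formats": {".json", ".csv"},
}


def classify_file(path: str) -> str:
    """Map a file path to a format category based on its extension."""
    suffix = Path(path).suffix.lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    # Anything else is treated as plain text (assumed to be UTF-8).
    return "Plain Text"
```

For example, `classify_file("reports/q3.xlsx")` falls under "Microsoft Excel", while a file with no recognized extension falls through to "Plain Text".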

Language Support

Capability | Supported Languages
Document Processing | Multilingual (all UTF-8 supported languages)
LLM Classification | Multilingual (dependent on model)
PII Detection | English only

Complete Example

Copy and run this script to extract metadata from your documents, replacing the placeholder credentials, bucket name, and region with your own:
from unstructured import UnstructuredClient

# 1. Initialize the client
client = UnstructuredClient(
    username="your-username",
    password="your-password",
)

# 2. Create a data connector (S3 example)
connector = client.data_source.create(
    connector_name="my-s3-bucket",
    connector_body={
        "vector_db_type": "s3",
        "bucket_name": "my-documents",
        "aws_access_key_id": "YOUR_ACCESS_KEY",
        "aws_secret_access_key": "YOUR_SECRET_KEY",
        "region": "us-east-1",
    },
)
print(f"✓ Created connector: {connector.profile_id}")

# 3. Define a taxonomy with tags
taxonomy = client.taxonomy.upsert(
    taxonomy_name="document-classification",
    taxonomy_description="Classify and extract key info from documents",
    tags=[
        {
            "name": "document_type",
            "description": "Type of document (invoice, contract, report, etc.)",
            "output_type": "word",
        },
        {
            "name": "summary",
            "description": "A brief 2-3 sentence summary of the document",
            "output_type": "string",
        },
        {
            "name": "key_dates",
            "description": "Important dates mentioned in the document",
            "output_type": "list[date]",
        },
    ],
)
print(f"✓ Created taxonomy: {taxonomy.taxonomy_id}")

# 4. Extract metadata from documents
results = client.classify.generate_batch(
    connector_name="my-s3-bucket",
    taxonomy_name="document-classification",
)

# 5. View the results
for result in results.metadata:
    print(f"\nFile: {result.file_name}")
    print(f"  Type: {result.tags.get('document_type')}")
    print(f"  Summary: {result.tags.get('summary')}")
    print(f"  Key Dates: {result.tags.get('key_dates')}")
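
The script above uses inline placeholder credentials. One common alternative is to read them from environment variables so they stay out of source control. The sketch below assumes hypothetical variable names (`UNSTRUCTURED_USERNAME`, `UNSTRUCTURED_PASSWORD`) and a hypothetical helper `load_credentials`; neither is defined by the SDK. The returned keys match the `username` and `password` arguments shown in step 1.

```python
import os
from typing import Mapping, Optional

# Hypothetical environment variable names; pick whatever suits your setup.
USERNAME_VAR = "UNSTRUCTURED_USERNAME"
PASSWORD_VAR = "UNSTRUCTURED_PASSWORD"


def load_credentials(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read client credentials from the environment (or a supplied mapping)."""
    source = os.environ if env is None else env
    missing = [name for name in (USERNAME_VAR, PASSWORD_VAR) if not source.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"username": source[USERNAME_VAR], "password": source[PASSWORD_VAR]}


# Possible usage with the client from step 1:
# client = UnstructuredClient(**load_credentials())
```

The same pattern could be applied to the AWS keys passed in `connector_body`.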

What Just Happened?

1. Connected to Your Data

The Data Connector established a secure connection to your S3 bucket, allowing the platform to read your documents.

2. Defined What to Extract

The Taxonomy and Tags told the platform what information to look for: document type, summary, and key dates.

3. Extracted Metadata

The platform’s AI analyzed each document and extracted the structured metadata you defined.
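
Once metadata has been extracted, a natural follow-up is to summarize it, for example by counting documents per `document_type`. The sketch below assumes each result exposes a dict-like `tags` attribute, as in step 5 of the script. `count_document_types` is a hypothetical helper, and the sample objects are made-up stand-ins for illustration, not real platform output.

```python
from collections import Counter
from types import SimpleNamespace


def count_document_types(results) -> Counter:
    """Tally results by their 'document_type' tag, using 'unknown' when absent."""
    return Counter(r.tags.get("document_type") or "unknown" for r in results)


# Made-up stand-ins shaped like the objects iterated over in step 5.
sample_results = [
    SimpleNamespace(file_name="a.pdf", tags={"document_type": "invoice"}),
    SimpleNamespace(file_name="b.docx", tags={"document_type": "contract"}),
    SimpleNamespace(file_name="c.pdf", tags={"document_type": "invoice"}),
    SimpleNamespace(file_name="d.txt", tags={}),
]

if __name__ == "__main__":
    for doc_type, count in count_document_types(sample_results).most_common():
        print(f"{doc_type}: {count}")
```

With real output, you would pass `results.metadata` from step 4 instead of `sample_results`.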

Next Steps