Quickstart

Get started with Unstructured by Collibra by running this complete example. You’ll connect to a data source, define what metadata to extract, and see results in minutes.

Installation

Request access to download the Python SDK wheel file, then install it:

pip install unstructured_sdk-*.whl
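To confirm the install worked before running the full example, a quick import check can help. This is a minimal sketch: it assumes the wheel exposes the top-level module name unstructured, as the import in the example script suggests, and sdk_installed is a hypothetical helper, not part of the SDK.

```python
import importlib.util


def sdk_installed(module_name: str = "unstructured") -> bool:
    """Return True if a top-level module with this name can be found."""
    # find_spec returns None when no importable module has the given name.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("SDK importable:", sdk_installed())
```

If the check prints False, the wheel was probably installed into a different Python environment than the one running the script.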

Supported Filetypes

Unstructured by Collibra can process a wide variety of document formats:

Format	Extensions
PDF	.pdf
Microsoft Word	.docx
Microsoft Excel	.xls, .xlsx, .xlsm, .xlsb
Microsoft PowerPoint	.ppt, .pptx
OpenDocument	.odf, .ods, .odt
Data Formats	.json, .csv
Plain Text	Any other file (UTF-8)
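When preparing a bucket, it can be useful to see which format family each file would fall into. The sketch below maps a file name to the families listed in this section by extension only. format_family and SUPPORTED_EXTENSIONS are hypothetical names for illustration; the platform's own detection logic is not documented here and may differ.

```python
from pathlib import Path

# Extensions copied from this section; matching is case-insensitive.
SUPPORTED_EXTENSIONS = {
    "PDF": {".pdf"},
    "Microsoft Word": {".docx"},
    "Microsoft Excel": {".xls", ".xlsx", ".xlsm", ".xlsb"},
    "Microsoft PowerPoint": {".ppt", ".pptx"},
    "OpenDocument": {".odf", ".ods", ".odt"},
    "Data Formats": {".json", ".csv"},
}


def format_family(filename: str) -> str:
    """Return the format family for a file name, based on its extension."""
    suffix = Path(filename).suffix.lower()
    for family, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return family
    # Everything else is treated as plain text, which must be UTF-8 encoded.
    return "Plain Text"


if __name__ == "__main__":
    for name in ("Report.PDF", "budget.xlsm", "notes.md"):
        print(name, "->", format_family(name))
```

Files that land in "Plain Text" are worth a second look, since that family only works for UTF-8 content.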

Language Support

Capability	Supported Languages
Document Processing	Multilingual (all UTF-8 supported languages)
LLM Classification	Multilingual (dependent on model)
PII Detection	English only

Complete Example

Copy this script, replace the placeholder URL, credentials, and bucket name with your own, and run it to extract metadata from your documents:

from unstructured import UnstructuredClient

# 1. Initialize the client (basic auth)
client = UnstructuredClient(
    base_url="https://unstructured.your-company.com/rest/unstructured",
    username="your-username",
    password="your-password",
)
# Alternatively, authenticate with an API token issued from the web UI:
#   client = UnstructuredClient(
#       base_url="https://unstructured.your-company.com/rest/unstructured",
#       api_token="your-api-token",
#       user_id="your-username",  # sent as the X-User-ID header
#   )

# 2. Create a data connector (S3 example)
connector = client.data_source.create(
    connector_name="my-s3-bucket",
    connector_body={
        "vector_db_type": "s3",
        "bucket_name": "my-documents",
        "aws_access_key_id": "YOUR_ACCESS_KEY",
        "aws_secret_access_key": "YOUR_SECRET_KEY",
        "region": "us-east-1",
    },
)
print(f"✓ Created connector: {connector.profile_id}")

# 3. Define a taxonomy with tags
taxonomy = client.taxonomy.upsert(
    taxonomy_name="document-classification",
    taxonomy_description="Classify and extract key info from documents",
    tags=[
        {
            "name": "document_type",
            "description": "Type of document (invoice, contract, report, etc.)",
            "output_type": "word",
        },
        {
            "name": "summary",
            "description": "A brief 2-3 sentence summary of the document",
            "output_type": "string",
        },
        {
            "name": "key_dates",
            "description": "Important dates mentioned in the document",
            "output_type": "list[date]",
        },
    ],
)
print(f"✓ Created taxonomy: {taxonomy.taxonomy_id}")

# 4. Extract metadata from documents
results = client.classify.generate_batch(
    connector_name="my-s3-bucket",
    taxonomy_name="document-classification",
)

# 5. View the results
for result in results.metadata:
    print(f"\nFile: {result.file_name}")
    print(f"  Type: {result.tags.get('document_type')}")
    print(f"  Summary: {result.tags.get('summary')}")
    print(f"  Key Dates: {result.tags.get('key_dates')}")

About client configuration

base_url is required and points to your Unstructured deployment (e.g. https://unstructured.your-company.com/rest/unstructured). There is no default.
Any constructor argument can be set via an environment variable instead: UNSTRUCTURED_CLIENT_BASE_URL, UNSTRUCTURED_USERNAME, UNSTRUCTURED_PASSWORD, UNSTRUCTURED_API_TOKEN, UNSTRUCTURED_USER_ID.
API tokens are long-lived and issued from your Unstructured deployment’s web UI. The SDK does not create, refresh, or revoke them.
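Since the client can be configured entirely through environment variables, a preflight check can catch missing settings before the client is constructed. The sketch below is an assumption-laden illustration: the pairing of each variable with a constructor argument is inferred from the names, the rule that either basic credentials or an API token must be present is inferred from the example script, and missing_client_settings is a hypothetical helper, not part of the SDK.

```python
import os
from typing import Mapping

# Assumed mapping from the documented environment variables to constructor
# argument names; the pairing is inferred from the names, not confirmed.
ENV_TO_ARG = {
    "UNSTRUCTURED_CLIENT_BASE_URL": "base_url",
    "UNSTRUCTURED_USERNAME": "username",
    "UNSTRUCTURED_PASSWORD": "password",
    "UNSTRUCTURED_API_TOKEN": "api_token",
    "UNSTRUCTURED_USER_ID": "user_id",
}


def missing_client_settings(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return a list of problems with the client environment (empty if none)."""
    present = {arg for name, arg in ENV_TO_ARG.items() if environ.get(name)}
    problems = []
    if "base_url" not in present:
        problems.append("UNSTRUCTURED_CLIENT_BASE_URL is not set (base_url has no default)")
    has_basic = {"username", "password"} <= present
    has_token = "api_token" in present
    if not (has_basic or has_token):
        problems.append(
            "set UNSTRUCTURED_USERNAME and UNSTRUCTURED_PASSWORD, or UNSTRUCTURED_API_TOKEN"
        )
    return problems


if __name__ == "__main__":
    for problem in missing_client_settings():
        print("Config problem:", problem)
```

Keeping credentials in the environment also avoids committing them to source control alongside the script.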

What Just Happened?

Connected to Your Data

The Data Connector established a secure connection to your S3 bucket, allowing the platform to read your documents.
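The example script hard-codes the AWS keys for brevity. A sketch of building the same connector_body from the standard AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION) follows. s3_connector_body is a hypothetical helper, the field names are copied from the example script, and the us-east-1 fallback is an assumption taken from that script rather than a documented default.

```python
import os
from typing import Mapping


def s3_connector_body(bucket_name: str, environ: Mapping[str, str] = os.environ) -> dict:
    """Build the S3 connector_body from standard AWS environment variables."""
    try:
        access_key = environ["AWS_ACCESS_KEY_ID"]
        secret_key = environ["AWS_SECRET_ACCESS_KEY"]
    except KeyError as exc:
        raise RuntimeError(f"Missing environment variable: {exc.args[0]}") from None
    return {
        "vector_db_type": "s3",  # value copied from the example script
        "bucket_name": bucket_name,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        # Assumed fallback: the region used in the example script.
        "region": environ.get("AWS_DEFAULT_REGION", "us-east-1"),
    }
```

The returned dictionary can be passed as connector_body to client.data_source.create in step 2 of the example script.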

Defined What to Extract

The Taxonomy and Tags told the platform what information to look for: document type, summary, and key dates.
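Each tag in the example script is a dictionary with a name, a description, and an output_type. A small local check before calling client.taxonomy.upsert can catch typos early. This is only a sketch: check_tags is a hypothetical helper, and the rules it enforces (all three keys present, names unique) are assumptions based on the example, not documented platform requirements. It deliberately does not validate output_type values, since only "word", "string", and "list[date]" appear in the example and the full set is not listed here.

```python
REQUIRED_TAG_KEYS = ("name", "description", "output_type")


def check_tags(tags: list[dict]) -> list[str]:
    """Return a list of problems found in tag definitions (empty if none)."""
    problems = []
    seen = set()
    for index, tag in enumerate(tags):
        # Flag keys that are absent or empty.
        for key in REQUIRED_TAG_KEYS:
            if not tag.get(key):
                problems.append(f"tag {index}: missing '{key}'")
        # Flag a name that an earlier tag already used.
        name = tag.get("name")
        if name in seen:
            problems.append(f"tag {index}: duplicate name '{name}'")
        if name:
            seen.add(name)
    return problems
```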

Extracted Metadata

The platform’s AI analyzed each document and extracted the structured metadata you defined.
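Once the metadata is back, it is ordinary Python data and can be summarized however you like. The sketch below groups file names by their extracted document_type. It assumes each result exposes file_name and a dict-like tags attribute, as used in step 5 of the example script; files_by_document_type is a hypothetical helper, and the sample objects are stand-ins rather than real SDK results.

```python
from collections import defaultdict
from types import SimpleNamespace


def files_by_document_type(metadata) -> dict:
    """Group file names by the value of the 'document_type' tag."""
    groups = defaultdict(list)
    for result in metadata:
        # Fall back to "unknown" when the tag is absent or empty.
        doc_type = result.tags.get("document_type") or "unknown"
        groups[doc_type].append(result.file_name)
    return dict(groups)


if __name__ == "__main__":
    # Stand-in objects shaped like the results in the example script.
    sample = [
        SimpleNamespace(file_name="a.pdf", tags={"document_type": "invoice"}),
        SimpleNamespace(file_name="b.docx", tags={"document_type": "contract"}),
        SimpleNamespace(file_name="c.pdf", tags={"document_type": "invoice"}),
        SimpleNamespace(file_name="d.txt", tags={}),
    ]
    print(files_by_document_type(sample))
```

With real results, pass results.metadata from step 4 instead of the sample list.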

Next Steps

Explore Concepts

Learn how Data Connectors, Taxonomies, and Metadata work together.

S3 to SharePoint

Export enriched metadata to SharePoint.

API Reference

Explore all available endpoints and SDK methods.

PII Detection

Set up sensitive data detection for compliance.
