Get started with Unstructured by Collibra by running this complete example. You’ll connect to a data source, define what metadata to extract, and see results in minutes.

Installation

Install the Python SDK:
pip install unstructured-sdk

Complete Example

Copy this script, replace the placeholder credentials, bucket name, and region with your own values, then run it to extract metadata from your documents:
from unstructured import UnstructuredClient

# 1. Initialize the client
client = UnstructuredClient(
    username="your-username",
    password="your-password",
)

# 2. Create a data connector (S3 example)
connector = client.data_source.create(
    connector_name="my-s3-bucket",
    connector_body={
        "vector_db_type": "s3",
        "bucket_name": "my-documents",
        "aws_access_key_id": "YOUR_ACCESS_KEY",
        "aws_secret_access_key": "YOUR_SECRET_KEY",
        "region": "us-east-1",
    },
)
print(f"✓ Created connector: {connector.profile_id}")

# 3. Define a taxonomy with tags
taxonomy = client.taxonomy.upsert(
    taxonomy_name="document-classification",
    taxonomy_description="Classify and extract key info from documents",
    tags=[
        {
            "name": "document_type",
            "description": "Type of document (invoice, contract, report, etc.)",
            "output_type": "word",
        },
        {
            "name": "summary",
            "description": "A brief 2-3 sentence summary of the document",
            "output_type": "string",
        },
        {
            "name": "key_dates",
            "description": "Important dates mentioned in the document",
            "output_type": "list[date]",
        },
    ],
)
print(f"✓ Created taxonomy: {taxonomy.taxonomy_id}")

# 4. Extract metadata from documents
results = client.classify.generate_batch(
    connector_name="my-s3-bucket",
    taxonomy_name="document-classification",
)

# 5. View the results
for result in results.metadata:
    print(f"\nFile: {result.file_name}")
    print(f"  Type: {result.tags.get('document_type')}")
    print(f"  Summary: {result.tags.get('summary')}")
    print(f"  Key Dates: {result.tags.get('key_dates')}")
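
The script above hard-codes the username and password for brevity. One common alternative is to read them from environment variables so they stay out of source control. The following is a minimal sketch, not part of the SDK: the helper `load_credentials` and the variable names `UNSTRUCTURED_USERNAME` and `UNSTRUCTURED_PASSWORD` are hypothetical names chosen for illustration.

```python
import os

# Hypothetical variable names chosen for this sketch; the SDK does not define them.
REQUIRED_VARS = ("UNSTRUCTURED_USERNAME", "UNSTRUCTURED_PASSWORD")


def load_credentials(env=None):
    """Return (username, password) read from environment variables.

    Raises RuntimeError listing every required variable that is unset or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[REQUIRED_VARS[0]], env[REQUIRED_VARS[1]]
```

With both variables exported in your shell, the returned pair can be passed to the `username` and `password` arguments of `UnstructuredClient` in step 1 in place of the literal strings.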

What Just Happened?

  1. Connected to Your Data
     The Data Connector established a secure connection to your S3 bucket, allowing the platform to read your documents.
  2. Defined What to Extract
     The Taxonomy and Tags told the platform what information to look for: document type, summary, and key dates.
  3. Extracted Metadata
     The platform’s AI analyzed each document and extracted the structured metadata you defined.
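
Once the metadata is extracted, a typical follow-up is to organize files by one of the tags. The sketch below groups file names by `document_type`. It assumes each result exposes `file_name` and a dict-like `tags`, as in step 5 of the script, and that the `document_type` value is a plain string. The helper `group_by_document_type` and the `SimpleNamespace` stand-ins are illustrative only and are not part of the SDK.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_document_type(results):
    """Map each document_type value to the list of file names tagged with it."""
    groups = defaultdict(list)
    for result in results:
        # Files with no document_type tag are collected under "unknown".
        doc_type = result.tags.get("document_type") or "unknown"
        groups[doc_type].append(result.file_name)
    return dict(groups)


# Stand-in objects that mimic the attributes used in the script above.
# In real use, pass results.metadata from client.classify.generate_batch(...).
sample = [
    SimpleNamespace(file_name="a.pdf", tags={"document_type": "invoice"}),
    SimpleNamespace(file_name="b.pdf", tags={"document_type": "contract"}),
    SimpleNamespace(file_name="c.pdf", tags={"document_type": "invoice"}),
    SimpleNamespace(file_name="d.pdf", tags={}),
]

for doc_type, files in group_by_document_type(sample).items():
    print(f"{doc_type}: {', '.join(files)}")
```

Falling back to "unknown" keeps untagged files visible in the output instead of silently dropping them.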

Development Setup

Want to contribute to the docs? Here’s how to run them locally:
  1. Install the Mintlify CLI:
pnpm add -g mintlify
  2. Run the development server:
mintlify dev