# Quickstart

> Go from zero to extracting metadata in under 5 minutes

Get started with Unstructured by Collibra by running one complete example: connect to a data source, define the metadata to extract, and view the results.

## Installation

The Python SDK is distributed as a wheel file. Request the file, then install it with `pip`:

<Card title="Request Access" icon="envelope" href="mailto:leonard.platzer@collibra.com?subject=Unstructured%20SDK%20Access%20Request">
  Contact us to get the SDK wheel file
</Card>

```bash
pip install unstructured_sdk-*.whl
```

## Supported Filetypes

Unstructured by Collibra can process a wide variety of document formats:

<CardGroup cols={3}>
  <Card title="PDF" icon="file-pdf">
    .pdf
  </Card>

  <Card title="Microsoft Word" icon="file-word">
    .docx
  </Card>

  <Card title="Microsoft Excel" icon="file-excel">
    .xls, .xlsx, .xlsm, .xlsb
  </Card>

  <Card title="Microsoft PowerPoint" icon="file-powerpoint">
    .ppt, .pptx
  </Card>

  <Card title="OpenDocument" icon="file-lines">
    .odf, .ods, .odt
  </Card>

  <Card title="Data Formats" icon="brackets-curly">
    .json, .csv
  </Card>

  <Card title="Plain Text" icon="file">
    Any other UTF-8 encoded file
  </Card>
</CardGroup>
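
To check ahead of time how a given file is likely to be treated, the format list above can be written as a small lookup. This is a minimal sketch, not SDK functionality: `FORMAT_BY_EXTENSION` and `format_for` are hypothetical names, and the table simply copies the extensions from the cards above.

```python
from pathlib import Path

# Extensions copied from the supported-format cards above.
# Illustrative only: this table is not part of the SDK.
FORMAT_BY_EXTENSION = {
    ".pdf": "PDF",
    ".docx": "Microsoft Word",
    ".xls": "Microsoft Excel",
    ".xlsx": "Microsoft Excel",
    ".xlsm": "Microsoft Excel",
    ".xlsb": "Microsoft Excel",
    ".ppt": "Microsoft PowerPoint",
    ".pptx": "Microsoft PowerPoint",
    ".odf": "OpenDocument",
    ".ods": "OpenDocument",
    ".odt": "OpenDocument",
    ".json": "Data Formats",
    ".csv": "Data Formats",
}


def format_for(file_name: str) -> str:
    """Return the format family for a file name, falling back to plain text."""
    suffix = Path(file_name).suffix.lower()  # case-insensitive match
    return FORMAT_BY_EXTENSION.get(suffix, "Plain Text")
```

Any extension that is not in the table falls into the "Plain Text" family, which mirrors the last card: such files are expected to be UTF-8 text.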

## Language Support

| Capability              | Supported Languages                          |
| :---------------------- | :------------------------------------------- |
| **Document Processing** | Multilingual (all UTF-8 supported languages) |
| **LLM Classification**  | Multilingual (dependent on model)            |
| **PII Detection**       | English only                                 |

## Complete Example

Copy this script, replace the placeholder URL, credentials, and bucket details with your own, and run it to extract metadata from your documents:

```python
from unstructured import UnstructuredClient

# 1. Initialize the client (basic auth)
client = UnstructuredClient(
    base_url="https://unstructured.your-company.com/rest/unstructured",
    username="your-username",
    password="your-password",
)
# Alternatively, authenticate with an API token issued from the web UI:
#   client = UnstructuredClient(
#       base_url="https://unstructured.your-company.com/rest/unstructured",
#       api_token="your-api-token",
#       user_id="your-username",  # sent as the X-User-ID header
#   )

# 2. Create a data connector (S3 example)
connector = client.data_source.create(
    connector_name="my-s3-bucket",
    connector_body={
        "vector_db_type": "s3",
        "bucket_name": "my-documents",
        "aws_access_key_id": "YOUR_ACCESS_KEY",
        "aws_secret_access_key": "YOUR_SECRET_KEY",
        "region": "us-east-1",
    },
)
print(f"✓ Created connector: {connector.profile_id}")

# 3. Define a taxonomy with tags
taxonomy = client.taxonomy.upsert(
    taxonomy_name="document-classification",
    taxonomy_description="Classify and extract key info from documents",
    tags=[
        {
            "name": "document_type",
            "description": "Type of document (invoice, contract, report, etc.)",
            "output_type": "word",
        },
        {
            "name": "summary",
            "description": "A brief 2-3 sentence summary of the document",
            "output_type": "string",
        },
        {
            "name": "key_dates",
            "description": "Important dates mentioned in the document",
            "output_type": "list[date]",
        },
    ],
)
print(f"✓ Created taxonomy: {taxonomy.taxonomy_id}")

# 4. Extract metadata from documents
results = client.classify.generate_batch(
    connector_name="my-s3-bucket",
    taxonomy_name="document-classification",
)

# 5. View the results
for result in results.metadata:
    print(f"\nFile: {result.file_name}")
    print(f"  Type: {result.tags.get('document_type')}")
    print(f"  Summary: {result.tags.get('summary')}")
    print(f"  Key Dates: {result.tags.get('key_dates')}")
```

<Note>
  **About client configuration**

  * **`base_url`** is required and points to your Unstructured deployment (e.g. `https://unstructured.your-company.com/rest/unstructured`). There is no default.
  * Any constructor argument can be set via an environment variable instead: `UNSTRUCTURED_CLIENT_BASE_URL`, `UNSTRUCTURED_USERNAME`, `UNSTRUCTURED_PASSWORD`, `UNSTRUCTURED_API_TOKEN`, `UNSTRUCTURED_USER_ID`.
  * **API tokens** are long-lived and issued from your Unstructured deployment's web UI. The SDK does not create, refresh, or revoke them.
</Note>
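
If you configure the client through environment variables, a quick pre-flight check can report which ones are still unset before the script runs. The sketch below is an illustration rather than part of the SDK: `missing_settings` is a hypothetical helper, and it assumes that one complete pair (username and password, or API token and user ID) is enough to authenticate.

```python
import os
from typing import Mapping

# Variable names taken from the note above.
BASE_URL_VAR = "UNSTRUCTURED_CLIENT_BASE_URL"
BASIC_AUTH_VARS = ("UNSTRUCTURED_USERNAME", "UNSTRUCTURED_PASSWORD")
TOKEN_AUTH_VARS = ("UNSTRUCTURED_API_TOKEN", "UNSTRUCTURED_USER_ID")


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the variables that still need a value; an empty list means ready."""
    missing = [] if env.get(BASE_URL_VAR) else [BASE_URL_VAR]
    basic_missing = [name for name in BASIC_AUTH_VARS if not env.get(name)]
    token_missing = [name for name in TOKEN_AUTH_VARS if not env.get(name)]
    # Assumption: either auth pair is sufficient on its own.
    if basic_missing and token_missing:
        # Report the pair that is closer to complete (ties go to basic auth).
        missing += min(basic_missing, token_missing, key=len)
    return missing
```

Calling `missing_settings()` with no arguments inspects the current process environment. Passing a plain dictionary instead makes the check easy to try without touching real credentials.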

## What Just Happened?

<Steps>
  <Step title="Connected to Your Data">
    The Data Connector established a secure connection to your S3 bucket, allowing the platform to read your documents.
  </Step>

  <Step title="Defined What to Extract">
    The Taxonomy and its Tags told the platform what to look for: document type, summary, and key dates.
  </Step>

  <Step title="Extracted Metadata">
    `client.classify.generate_batch` ran the taxonomy against the documents in your connector, and the platform's AI returned the extracted tag values for each file in `results.metadata`.
  </Step>
</Steps>
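
To keep the extracted values for later use, the loop from step 5 can write rows instead of printing them. This is a sketch under assumptions: it relies only on the `file_name` and `tags` attributes used in the example script, `results_to_csv` is a hypothetical helper, and a `SimpleNamespace` object with made-up sample values stands in for a real SDK result so the snippet runs without a deployment.

```python
import csv
import io
from types import SimpleNamespace


def results_to_csv(items, tag_names):
    """Render result-like objects as CSV text with one column per tag."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file_name", *tag_names])
    for item in items:
        row = [item.file_name]
        for name in tag_names:
            value = item.tags.get(name)
            if isinstance(value, (list, tuple)):
                # Flatten list-valued tags into a single cell.
                value = "; ".join(str(v) for v in value)
            row.append("" if value is None else value)
        writer.writerow(row)
    return buffer.getvalue()


# Made-up stand-in for one entry of results.metadata (not real output).
sample = [
    SimpleNamespace(
        file_name="invoice-001.pdf",
        tags={"document_type": "invoice", "key_dates": ["2024-01-15", "2024-02-14"]},
    )
]
csv_text = results_to_csv(sample, ["document_type", "summary", "key_dates"])
print(csv_text)
```

With a real run, pass `results.metadata` in place of `sample`. Tags that are missing for a file become empty cells, so every row keeps the same columns.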

## Next Steps

<CardGroup cols={2}>
  <Card title="Explore Concepts" icon="book" href="/concepts/overview">
    Learn how Data Connectors, Taxonomies, and Metadata work together.
  </Card>

  <Card title="S3 to SharePoint" icon="microsoft" href="/cookbooks/s3-to-sharepoint">
    Export enriched metadata to SharePoint.
  </Card>

  <Card title="API Reference" icon="brackets-curly" href="/api-reference">
    Explore all available endpoints and SDK methods.
  </Card>

  <Card title="PII Detection" icon="shield" href="/cookbooks/pii-detection">
    Set up sensitive data detection for compliance.
  </Card>
</CardGroup>
