Metadata - Unstructured by Collibra

Metadata is the set of values extracted by applying Tags to your documents. Tags define what to extract; Metadata is the extracted data itself.

Metadata Properties

Property	Description	Example
Values	The extracted or classified value(s)	`["NDA", "Non-Disclosure Agreement"]`
Evidence	Text snippet supporting the extraction	“This Non-Disclosure Agreement is entered into…”
Confidence	AI confidence score (0-1)	`0.95`

The Evidence field shows exactly where the AI found the information, making it easy to verify extractions and understand the source.
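As a rough illustration of how the Evidence field can be used for verification, the sketch below checks whether an evidence snippet actually occurs in the source text. The `MetadataRecord` class and the `evidence_in_source` helper are hypothetical names introduced for this example and are not part of the SDK; only the three property names come from the table above.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class MetadataRecord:
    """Hypothetical container mirroring the three documented properties."""
    values: List[str]
    evidence: str
    confidence: float


def _squash(text: str) -> str:
    # Collapse runs of whitespace and lowercase so line breaks do not matter.
    return re.sub(r"\s+", " ", text).strip().lower()


def evidence_in_source(record: MetadataRecord, source_text: str) -> bool:
    """Return True if the evidence snippet appears in the source text."""
    # Drop a trailing ellipsis ("..." or "…") that marks a truncated snippet.
    snippet = record.evidence.rstrip(".…").strip()
    return bool(snippet) and _squash(snippet) in _squash(source_text)


record = MetadataRecord(
    values=["NDA", "Non-Disclosure Agreement"],
    evidence="This Non-Disclosure Agreement is entered into…",
    confidence=0.95,
)
source = "This  Non-Disclosure Agreement\nis entered into as of January 1, 2024."
found = evidence_in_source(record, source)
```

A check like this only confirms that the snippet is present; it does not confirm that the extracted value is correct.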

Metadata Levels

Level	Description	Use Case
File-Level	Aggregated metadata for the entire document	Document classification, search filters
Chunk-Level	Granular metadata per text segment	Precise evidence location, RAG retrieval

Metadata Standardization

The platform includes AI-powered standardization to clean and normalize extracted values:

Feature	Description
Deduplication	Merge similar values (e.g., “Inc.” and “Incorporated”)
Normalization	Standardize formats (dates, currencies, names)
Bulk Standardization	Apply standardization across multiple tags

Standardization helps ensure consistency across your metadata, making it easier to search, filter, and analyze your documents.
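To make deduplication and normalization concrete, here is a minimal rule-based sketch. The platform's standardization is described as AI-powered, so this is a simplified stand-in and not how the product works internally; the alias table, the date formats, and all function names are assumptions made for the example.

```python
from datetime import datetime
from typing import Callable, Dict, List

# Hypothetical alias table mapping company suffix variants to one form.
SUFFIX_ALIASES = {
    "incorporated": "Inc.",
    "inc": "Inc.",
    "corporation": "Corp.",
    "corp": "Corp.",
}

# Input date formats this sketch recognizes (an assumption, not exhaustive).
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y")


def normalize_company(name: str) -> str:
    """Rewrite a trailing company suffix to its canonical spelling."""
    words = name.strip().split()
    if not words:
        return ""
    key = words[-1].rstrip(".,").lower()
    if key in SUFFIX_ALIASES:
        words[-1] = SUFFIX_ALIASES[key]
    return " ".join(words)


def normalize_date(value: str) -> str:
    """Convert a recognized date string to ISO 8601; leave others unchanged."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value


def deduplicate(values: List[str], normalizer: Callable[[str], str]) -> List[str]:
    """Merge values that normalize to the same case-insensitive form."""
    seen: Dict[str, str] = {}
    for value in values:
        canonical = normalizer(value)
        seen.setdefault(canonical.lower(), canonical)
    return list(seen.values())


companies = deduplicate(
    ["Acme Inc.", "Acme Incorporated", "ACME inc"], normalize_company
)
```

Unrecognized dates are returned unchanged on purpose, so a value is never silently replaced by a wrong guess.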

How Metadata Generation Works

Document Processing

Documents are chunked and prepared for analysis.

Tag Application

The AI applies your Tags to extract or classify information from each chunk.

Evidence Capture

The system captures the text snippet that supports each extraction.

Aggregation

Chunk-level metadata is aggregated to create file-level metadata.

Standardization

Optional normalization and deduplication clean the results.
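The aggregation step above can be sketched as follows. The documentation does not say how chunk-level results are combined, so the strategy here (union of values in first-seen order, with evidence and confidence taken from the highest-confidence chunk) is an assumption, and `ChunkMetadata` and `aggregate_to_file_level` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ChunkMetadata:
    """Hypothetical chunk-level result for a single tag."""
    chunk_id: int
    tag: str
    values: List[str]
    evidence: str
    confidence: float


def aggregate_to_file_level(chunks: List[ChunkMetadata]) -> Dict[str, dict]:
    """Combine chunk-level results into one file-level entry per tag."""
    by_tag: Dict[str, List[ChunkMetadata]] = defaultdict(list)
    for chunk in chunks:
        by_tag[chunk.tag].append(chunk)

    file_level: Dict[str, dict] = {}
    for tag, items in by_tag.items():
        # Union of values, preserving the order in which they first appear.
        values: List[str] = []
        for item in items:
            for value in item.values:
                if value not in values:
                    values.append(value)
        # Keep the evidence from the most confident chunk (assumed strategy).
        best = max(items, key=lambda c: c.confidence)
        file_level[tag] = {
            "values": values,
            "evidence": best.evidence,
            "confidence": best.confidence,
        }
    return file_level


chunks = [
    ChunkMetadata(0, "Contract Type", ["NDA"], "This Non-Disclosure Agreement...", 0.97),
    ChunkMetadata(3, "Contract Type", ["NDA", "Confidentiality"], "confidentiality obligations...", 0.72),
    ChunkMetadata(1, "Effective Date", ["2024-01-01"], "as of January 1, 2024", 0.91),
]
file_metadata = aggregate_to_file_level(chunks)
```

Keeping the chunk-level records alongside the aggregate preserves the precise evidence location that the chunk level is meant to provide.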

Example Metadata Output

For a contract document with a “Contract Type” classification tag:

{
  "tag": "Contract Type",
  "values": ["NDA"],
  "evidence": "This Non-Disclosure Agreement ('Agreement') is entered into as of January 1, 2024...",
  "confidence": 0.97
}
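A record like the one above can be consumed with the standard `json` module. The sketch below parses it and flags low-confidence results for manual review; the `needs_review` helper and the 0.9 default threshold are illustrative assumptions, not platform defaults.

```python
import json

# The example record from above, as a JSON string.
RAW = '''{
  "tag": "Contract Type",
  "values": ["NDA"],
  "evidence": "This Non-Disclosure Agreement ('Agreement') is entered into as of January 1, 2024...",
  "confidence": 0.97
}'''


def needs_review(record: dict, threshold: float = 0.9) -> bool:
    """Flag a record whose confidence falls below the chosen threshold."""
    return record["confidence"] < threshold


record = json.loads(RAW)
print(f"{record['tag']}: {record['values']} (review: {needs_review(record)})")
```

The right threshold depends on how costly a wrong extraction is for your use case, so treat the value here as a placeholder.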

Python SDK

The examples below cover four tasks: generating metadata, batch processing, listing metadata, and upserting and deleting metadata.

from unstructured import UnstructuredClient

client = UnstructuredClient(
    username="your-username",
    password="your-password",
)

# Generate metadata for a single document
result = client.classify.generate(
    file_path="s3://my-bucket/contract.pdf",
    taxonomy_name="contract-analysis",
)

print(f"Document: {result.file_name}")
for tag, value in result.tags.items():
    print(f"  {tag}: {value}")

# Generate metadata for all documents in a connector
results = client.classify.generate_batch(
    connector_name="my-s3-bucket",
    taxonomy_name="contract-analysis",
)

print(f"Processed {len(results.metadata)} documents")
for doc in results.metadata:
    print(f"\n{doc.file_name}:")
    print(f"  Type: {doc.tags.get('contract_type')}")
    print(f"  Value: ${doc.tags.get('total_value', 0):,.2f}")

# List metadata for documents
metadata = client.metadata.list(
    connector_name="my-s3-bucket",
    tag_names=["contract_type", "effective_date"],
)

for doc in metadata.documents:
    print(f"{doc.file_name}: {doc.tags}")

# Paginated listing for large datasets
page = client.metadata.list_paginated(
    connector_name="my-s3-bucket",
    page_size=100,
    page_number=1,
)
print(f"Page 1 of {page.total_pages}")

# Manually upsert metadata
client.metadata.upsert(
    file_name="contract.pdf",
    connector_name="my-s3-bucket",
    tags={
        "contract_type": "NDA",
        "reviewed": True,
        "reviewer": "John Doe",
    },
)

# Delete metadata for specific files
client.metadata.delete(
    connector_name="my-s3-bucket",
    file_names=["old-contract.pdf"],
)
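For large datasets, the paginated listing shown above can be wrapped in a small loop that walks every page. This is a sketch under stated assumptions: it assumes `page_number` is 1-based and that each page exposes `total_pages`, as the example above suggests. The `iter_pages` helper is a hypothetical name, and `fake_fetch` is a stand-in that replaces the real SDK call so the code runs offline.

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterator


def iter_pages(fetch_page: Callable[[int], Any]) -> Iterator[Any]:
    """Yield pages 1..total_pages, calling fetch_page(page_number) for each."""
    page_number = 1
    while True:
        page = fetch_page(page_number)
        yield page
        # Stop once the last page reported by the server has been yielded.
        if page_number >= page.total_pages:
            break
        page_number += 1


# Stand-in for the SDK call (hypothetical data, three pages in total).
def fake_fetch(page_number: int) -> SimpleNamespace:
    return SimpleNamespace(page_number=page_number, total_pages=3)


visited = [page.page_number for page in iter_pages(fake_fetch)]
```

With a real client, the callable would presumably be something like `lambda n: client.metadata.list_paginated(connector_name="my-s3-bucket", page_size=100, page_number=n)`, reusing the parameters from the example above.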

API Reference

Generate Metadata

Generate metadata for documents

Generate Batch

Generate metadata for multiple documents

Upsert Metadata

Create or update metadata

List Metadata

List metadata for documents

List Paginated

Paginated metadata listing

Delete Metadata

Remove metadata