Define what metadata to extract from your documents
Tags are the metadata attributes you want to extract or classify from your documents. Taxonomies organize tags into hierarchical structures that define parent-child relationships.
Classification is best when you have a known set of categories. Extraction is best for unpredictable values like names, dates, or amounts. Pattern is best for structured data like phone numbers or SSNs.
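To make the Pattern case concrete, here is a minimal sketch using only Python's standard `re` module. The two regular expressions and the `find_matches` helper are illustrative assumptions, not patterns or functions shipped by the product, and real SSN and phone formats have more edge cases than these cover.

```python
import re

# Illustrative patterns (assumptions, not the product's built-ins).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def find_matches(pattern: re.Pattern, text: str) -> list[str]:
    """Return every non-overlapping match of `pattern` in `text`."""
    return pattern.findall(text)


sample = "Reach Jane at 555-867-5309. SSN on file: 123-45-6789."
ssns = find_matches(SSN_PATTERN, sample)
phones = find_matches(PHONE_PATTERN, sample)
print(ssns)
print(phones)
```

Because the shape of the value is fixed, a regular expression can recognize it directly. Free-form values such as names or amounts have no such fixed shape, which is why they fall under Extraction instead.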
A Taxonomy enables hierarchical organization and conditional extraction. Child tags only get generated when parent conditions are met:
Document Type (Classification: Contract | Invoice | Report)
├── Contract
│   ├── Contract Value (Extraction)
│   ├── Parties Involved (Extraction)
│   ├── Effective Date (Extraction)
│   └── Termination Date (Extraction)
├── Invoice
│   ├── Invoice Amount (Extraction)
│   ├── Due Date (Extraction)
│   └── Vendor Name (Extraction)
└── Report
    ├── Report Category (Classification: Financial | Operational | Compliance)
    └── Report Period (Extraction)
In this taxonomy, the AI first classifies the document type, then extracts only the child tags that belong to that type. A Contract won’t have “Invoice Amount” extracted, which saves time and cost.
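The conditional behavior can be sketched in plain Python. The `TAXONOMY` dictionary and the `tags_to_extract` function below are hypothetical names that mirror the tree above for illustration only; they are not part of the SDK and say nothing about how the service implements this internally.

```python
# Hypothetical in-memory mirror of the taxonomy above; not an SDK structure.
TAXONOMY = {
    "Contract": [
        "Contract Value",
        "Parties Involved",
        "Effective Date",
        "Termination Date",
    ],
    "Invoice": ["Invoice Amount", "Due Date", "Vendor Name"],
    "Report": ["Report Category", "Report Period"],
}


def tags_to_extract(document_type: str) -> list[str]:
    """Return the child tags that apply to a classified document type."""
    # Unknown types have no child tags, so nothing is extracted for them.
    return list(TAXONOMY.get(document_type, []))


contract_tags = tags_to_extract("Contract")
print(contract_tags)
```

Only the branch selected by the parent classification is visited, so the tags under Invoice and Report are never requested for a Contract.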
from unstructured import UnstructuredClient

client = UnstructuredClient(
    username="your-username",
    password="your-password",
)

# Create a taxonomy with tags
taxonomy = client.taxonomy.upsert(
    taxonomy_name="contract-analysis",
    taxonomy_description="Extract key data from legal contracts",
    tags=[
        {
            "name": "contract_type",
            "description": "Type of contract (NDA, MSA, SLA, etc.)",
            "output_type": "word",
        },
        {
            "name": "parties",
            "description": "Names of all parties involved",
            "output_type": "list[string]",
        },
        {
            "name": "effective_date",
            "description": "When the contract becomes effective",
            "output_type": "date",
        },
        {
            "name": "total_value",
            "description": "Total monetary value in USD",
            "output_type": "float",
        },
    ],
)
print(f"Created taxonomy: {taxonomy.taxonomy_id}")
# Create or update a single tag
tag = client.tags.upsert(
    tag_name="document_type",
    tag_description="Classify document type",
    output_type="word",
    available_values=["Contract", "Invoice", "Report", "Letter"],
)
print(f"Created tag: {tag.tag_id}")
# Get AI-suggested taxonomy for your domain
suggestions = client.suggest.schema(
    description="Medical patient records with diagnoses and prescriptions",
)
print("Suggested tags:")
for tag in suggestions.tags:
    print(f"  - {tag.name}: {tag.description}")

# Get AI-suggested regex patterns
pattern = client.suggest.patterns(
    description="US Social Security Number (XXX-XX-XXXX)",
)
print(f"Suggested pattern: {pattern.regex}")
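A suggested regex is worth checking against known examples before it is attached to a tag. The helper below is a hypothetical sketch using only Python's standard `re` module. `check_pattern` and the sample strings are assumptions for illustration, and the literal `candidate` regex stands in for whatever a suggestion call returns.

```python
import re


def check_pattern(
    regex: str, should_match: list[str], should_not_match: list[str]
) -> list[str]:
    """Return a list of problems; an empty list means every sample passed."""
    try:
        compiled = re.compile(regex)
    except re.error as exc:
        # A regex that does not compile cannot be tested any further.
        return [f"invalid regex: {exc}"]

    problems = []
    for sample in should_match:
        if compiled.fullmatch(sample) is None:
            problems.append(f"expected a match: {sample!r}")
    for sample in should_not_match:
        if compiled.fullmatch(sample) is not None:
            problems.append(f"unexpected match: {sample!r}")
    return problems


candidate = r"\d{3}-\d{2}-\d{4}"  # stand-in for a suggested pattern
problems = check_pattern(candidate, ["123-45-6789"], ["123456789", "12-345-6789"])
print(problems or "pattern passed all samples")
```

Using `fullmatch` rather than `search` keeps the check strict: a sample passes only if the whole string fits the pattern, which catches suggestions that are too permissive.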
# List all taxonomies
taxonomies = client.taxonomy.list()
for t in taxonomies.taxonomies:
    print(f"{t.taxonomy_name}: {t.tag_count} tags")

# List all tags
tags = client.tags.list()
for tag in tags.tags:
    print(f"{tag.tag_name} ({tag.output_type})")

# Delete a taxonomy
client.taxonomy.delete(taxonomy_name="old-taxonomy")

# Delete a tag
client.tags.delete(tag_name="unused-tag")