week 150 - Cortex (AI / ML)

In real-world contracts, headers often carry subtle variations: numbering, punctuation, or word orderbut they refer to the same section. Your mission: unify these headers across multiple companies using AI-powered similarity and canonicalization, so downstream comparisons all speak the same language.

What you’ll do this week

  • Write a Snowflake SQL pipeline that discovers “almost duplicates” among section headers using AI_SIMILARITY, then clusters them into stable groups.
  • For each group, use AI_COMPLETE to choose one crisp canonical header (3–5 words, no numbering or punctuation).
  • Map every original header in your dataset to its normalized label.
  • Run diagnostics: show how many unique raw headers existed vs how many normalized header groups you produced, and present a few examples where messy variants collapse cleanly.
  • All of this should run hands-off—no external tools, just Snowpark-native AI in Snowflake.
CREATE OR REPLACE SCHEMA FROSTY_DB.week_148;
CREATE OR REPLACE TABLE week_148.HEADERS_SAMPLE (COMPANY STRING,DOC_ID STRING,HEADER_ORDER INTEGER,HEADER_TEXT STRING);
INSERT INTO week_148.HEADERS_SAMPLE (COMPANY, DOC_ID, HEADER_ORDER, HEADER_TEXT) VALUES
-- Company A
('AIG', 'A1', 1, '1. Coverage Overview'),
('AIG', 'A1', 2, '1.1 Policy Benefits'),
('AIG', 'A1', 3, '2. Exclusions & Limitations'),
('AIG', 'A1', 4, '3. Claims Process'),
-- Company B (variants)
('Zurich', 'Z1', 1, 'Coverage - Overview'),
('Zurich', 'Z1', 2, 'Policy Benefit(s)'),
('Zurich', 'Z1', 3, 'Exclusions and Limitations'),
('Zurich', 'Z1', 4, 'Claims – Process'),
-- Company C (more variants)
('Allianz', 'AL1', 1, '01) COVERAGE OVERVIEW'),
('Allianz', 'AL1', 2, 'Policy: Benefits'),
('Allianz', 'AL1', 3, 'Exclusions / Limitations'),
('Allianz', 'AL1', 4, 'Claims Processing'),
-- Company D (edge cases)
('AXA', 'AX1', 1, 'Coverage overview & scope'),
('AXA', 'AX1', 2, 'Benefit of Policy'),
('AXA', 'AX1', 3, 'Limitations and Exclusions'),
('AXA', 'AX1', 4, 'Filing a Claim');
SELECT * FROM week_148.HEADERS_SAMPLE ORDER BY COMPANY, DOC_ID, HEADER_ORDER;

Your objective :

Previous
Previous

Week 151 - Cortex Agents

Next
Next

Week 149 - SQL & Query Techniques