week 150 - Cortex (AI / ML)
In real-world contracts, headers often carry subtle variations: numbering, punctuation, or word orderbut they refer to the same section. Your mission: unify these headers across multiple companies using AI-powered similarity and canonicalization, so downstream comparisons all speak the same language.
What you’ll do this week
- Write a Snowflake SQL pipeline that discovers “almost duplicates” among section headers using AI_SIMILARITY, then clusters them into stable groups.
- For each group, use AI_COMPLETE to choose one crisp canonical header (3–5 words, no numbering or punctuation).
- Map every original header in your dataset to its normalized label.
- Run diagnostics: show how many unique raw headers existed vs how many normalized header groups you produced, and present a few examples where messy variants collapse cleanly.
- All of this should run hands-off—no external tools, just Snowpark-native AI in Snowflake.
CREATE OR REPLACE SCHEMA FROSTY_DB.week_148;CREATE OR REPLACE TABLE week_148.HEADERS_SAMPLE (COMPANY STRING,DOC_ID STRING,HEADER_ORDER INTEGER,HEADER_TEXT STRING);INSERT INTO week_148.HEADERS_SAMPLE (COMPANY, DOC_ID, HEADER_ORDER, HEADER_TEXT) VALUES
-- Company A
('AIG', 'A1', 1, '1. Coverage Overview'),
('AIG', 'A1', 2, '1.1 Policy Benefits'),
('AIG', 'A1', 3, '2. Exclusions & Limitations'),
('AIG', 'A1', 4, '3. Claims Process'),-- Company B (variants)
('Zurich', 'Z1', 1, 'Coverage - Overview'),
('Zurich', 'Z1', 2, 'Policy Benefit(s)'),
('Zurich', 'Z1', 3, 'Exclusions and Limitations'),
('Zurich', 'Z1', 4, 'Claims – Process'),-- Company C (more variants)
('Allianz', 'AL1', 1, '01) COVERAGE OVERVIEW'),
('Allianz', 'AL1', 2, 'Policy: Benefits'),
('Allianz', 'AL1', 3, 'Exclusions / Limitations'),
('Allianz', 'AL1', 4, 'Claims Processing'),-- Company D (edge cases)
('AXA', 'AX1', 1, 'Coverage overview & scope'),
('AXA', 'AX1', 2, 'Benefit of Policy'),
('AXA', 'AX1', 3, 'Limitations and Exclusions'),
('AXA', 'AX1', 4, 'Filing a Claim');SELECT * FROM week_148.HEADERS_SAMPLE ORDER BY COMPANY, DOC_ID, HEADER_ORDER;Your objective :
