feat(typing): add databricks type inference for REGEXP_INSTR, REGEXP_LIKE, REGEXP_SUBSTR, REGR_AVGX #7815

Open
fivetran-amrutabhimsenayachit wants to merge 2 commits into main from type-inference-batch-2

@fivetran-amrutabhimsenayachit fivetran-amrutabhimsenayachit commented Jun 29, 2026

Collaborator

Summary

Adds Databricks type inference support for REGEXP_INSTR (INT), REGEXP_LIKE (BOOLEAN via fixture), REGEXP_SUBSTR (VARCHAR via parser mapping to RegexpExtract), and REGR_AVGX (DOUBLE), plus fixture coverage for all four functions.

Tickets

  • RD-1229633 (REGEXP_INSTR) — new INT entry in typing/databricks.py
  • RD-1229634 (REGEXP_LIKE) — fixture entry only (inherited from base typing)
  • RD-1229635 (REGEXP_REPLACE) — already complete, no changes
  • RD-1229636 (REGEXP_SUBSTR) — parser mapping REGEXP_SUBSTR→RegexpExtract + fixture entry
  • RD-1229637 (REGR_AVGX) — new DOUBLE entry in typing/databricks.py

Test plan

  • make style — PASS
  • make unit — PASS

github-actions Bot commented Jun 29, 2026

Contributor

SQLGlot Integration Test Results

✅ All tests passed

Comparing:

  • this branch (sqlglot:type-inference-batch-2 @ sqlglot 2195638)
  • baseline (main @ sqlglot fd6d4d6)

By Dialect

| dialect | main | feature branch | transitions | links |
| --- | --- | --- | --- | --- |
| databricks -> databricks | 9982/11820 passed (84.5%) | 9998/11820 passed (84.6%) | 16 fail -> pass | full result / delta |

Overall

main: 192428 total, 153523 passed (pass rate: 79.8%)

sqlglot:type-inference-batch-2: 180234 total, 142394 passed (pass rate: 79.0%)

Transitions:
16 fail -> pass

Dialect pair changes: 0 previous results not found, 3 current results not found

✅ All tests passed

@geooo109 geooo109 self-assigned this Jun 30, 2026
VARCHAR;

# dialect: databricks
REGR_AVGX(tbl.double_col, tbl.double_col);

Collaborator

Let's also add tests covering ALL, DISTINCT, and REGR_AVGX(...) OVER (PARTITION BY 1).
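
One way to draft the requested cases is sketched below. It only formats text in the layout visible in the diff above (dialect comment, SQL statement, expected type). The helper name `fixture_entry` is hypothetical, and `DOUBLE` as the expected type for every variant is an assumption: the PR states it only for the plain call, so each entry would still need to be confirmed by running the tests.

```python
# Sketch: build fixture-style entries for the REGR_AVGX variants requested in review.
# EXPECTED_TYPE is an assumption; the PR only states DOUBLE for the plain call.
EXPECTED_TYPE = "DOUBLE"
ARGS = "tbl.double_col, tbl.double_col"

VARIANTS = [
    f"REGR_AVGX(ALL {ARGS})",
    f"REGR_AVGX(DISTINCT {ARGS})",
    f"REGR_AVGX({ARGS}) OVER (PARTITION BY 1)",
]


def fixture_entry(sql: str, expected_type: str = EXPECTED_TYPE) -> str:
    """Format one entry: dialect comment, SQL statement, expected type."""
    return f"# dialect: databricks\n{sql};\n{expected_type};\n"


if __name__ == "__main__":
    # Print the three entries separated by blank lines.
    print("\n".join(fixture_entry(sql) for sql in VARIANTS))
```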

Comment on lines +480 to +481
class RegexpSubstr(Expression, Func):
    arg_types = {"this": True, "expression": True}

Collaborator

Can you check if we can reuse exp.RegexpExtract for this? The default group should be 0 here. Check the semantics of regexp_extract vs. regexp_substr: if they match (for group 0), we can reuse the existing expression. If they don't match, let's keep this addition and add round-trip tests.

Minimal example:
SELECT regexp_extract('Order: 100-200', '(\\d+)-(\\d+)', 0)
> 100-200

SELECT regexp_substr('Order: 100-200', '(\\d+)-(\\d+)')
> 100-200
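
The group-0 equivalence in question can be illustrated with Python's `re` module as a stand-in. This is not the Databricks engine, and regex dialects differ in places, so it only shows the general idea that group 0 of a match is the whole matched substring. The helper names `extract_group` and `first_match` are hypothetical.

```python
import re

PATTERN = r"(\d+)-(\d+)"
TEXT = "Order: 100-200"


def extract_group(text: str, pattern: str, group: int = 0):
    """Mimic an extract-style call: return one capture group of the first match."""
    match = re.search(pattern, text)
    return match.group(group) if match else None


def first_match(text: str, pattern: str):
    """Mimic a substr-style call: return the first substring matching the pattern."""
    match = re.search(pattern, text)
    return match.group() if match else None


if __name__ == "__main__":
    # With group 0 the two helpers agree; any other group returns only that capture.
    print(extract_group(TEXT, PATTERN, 0))
    print(first_match(TEXT, PATTERN))
    print(extract_group(TEXT, PATTERN, 1))
```

If the two functions agree on Databricks in the same way, reusing the existing expression with group 0 would preserve the semantics. Round-trip tests against the real dialect remain the actual check.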

2 participants