Merge pull request #292 from microsoft/Delta-Analyzer

Adding Delta Analyzer to main Microsoft AS Repo
dax-tips 2024-09-06 10:01:21 +12:00 committed by GitHub
Parents 553ba8d8af ec630325ac
Commit 72710f628f
No known key found for this signature
GPG key ID: B5690EEEBB952194
2 changed files: 38 additions and 0 deletions

File diff suppressed because one or more lines are too long

DeltaAnalyzer/README.md Normal file

@@ -0,0 +1,37 @@
# DeltaAnalyzer
Instructions:
- Connect Notebook to Lakehouse (does not have to be in same Workspace)
- Update the deltaTable parameter immediately below the instructions with the name of the table to analyze
- Review other parameters (append vs overwrite)
- Run
- Review the four output tables whose names start with "zz_n_DeltaAnalyzerOutput" (n is 1 to 4)
zz_1_DeltaAnalyzerOutput_parquetFiles
This table has one row per Parquet file
Ideally, there should not be thousands of these
This table only uses parquet file metadata and should be quick to populate
zz_2_DeltaAnalyzerOutput_rowRowgroups
This table has one row per rowgroup and shows rowgroups for every parquet file
Look at the number of rows per rowgroup. Ideally this should be 1M to 16M rows (the higher the better)
This table only uses parquet file metadata and should be quick to populate
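
The rowgroup size guidance above can be checked with a few lines of Python. This is a minimal sketch, not part of the notebook: the function names and the sample row counts are hypothetical, and it assumes the rows-per-rowgroup values have already been read from the rowgroups output table into a plain list.

```python
# Target range quoted above: 1M to 16M rows per rowgroup.
MIN_ROWS = 1_000_000
MAX_ROWS = 16_000_000


def classify_rowgroup(row_count: int) -> str:
    """Label one rowgroup by its row count against the 1M-16M target."""
    if row_count < MIN_ROWS:
        return "small"
    if row_count > MAX_ROWS:
        return "large"
    return "ok"


def summarize_rowgroups(row_counts: list[int]) -> dict[str, int]:
    """Count how many rowgroups fall into each bucket."""
    summary = {"small": 0, "ok": 0, "large": 0}
    for count in row_counts:
        summary[classify_rowgroup(count)] += 1
    return summary


# Hypothetical row counts, as they might be read from the rowgroups output table.
sample_counts = [250_000, 1_048_576, 8_000_000, 40_000]
print(summarize_rowgroups(sample_counts))
```

A high share of "small" rowgroups suggests the table may benefit from the OPTIMIZE step described further down.
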
zz_3_DeltaAnalyzerOutput_columnChunks
One row per column chunk within each rowgroup
Produces a large number of rows, with much more detail about dictionaries and compression
This table only uses parquet file metadata and should be quick to populate
zz_4_DeltaAnalyzerOutput_columns
One row per column of the table
Check how many unique values each column has. If a column uses floating point, consider rewriting the parquet files to use DECIMAL(17,4)
This table runs a compute query against the Delta table, so it may take time depending on the size of the Delta table
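
The floating-point advice above rests on the idea that a fixed scale can reduce the number of unique values in a column. The sketch below illustrates that effect with Python's standard `decimal` module only. The sample values and the helper name `to_fixed_scale` are hypothetical, and rounding to four places stands in for what a DECIMAL(17,4) column would store.

```python
from decimal import Decimal, ROUND_HALF_UP

# Four decimal places, matching the scale of DECIMAL(17,4).
SCALE = Decimal("0.0001")


def to_fixed_scale(value: float) -> Decimal:
    """Round a float to four decimal places and return it as a Decimal."""
    return Decimal(str(value)).quantize(SCALE, rounding=ROUND_HALF_UP)


# Hypothetical values that differ only by floating-point noise.
raw_values = [0.1 + 0.2, 0.3, 1.25, 1.25000001, 2.5]

distinct_raw = len(set(raw_values))
distinct_fixed = len({to_fixed_scale(v) for v in raw_values})
print(distinct_raw, distinct_fixed)
```

Fewer unique values generally make a column easier to dictionary-encode, which is why the unique count in this output table is worth reviewing.
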
To optimize the table, run the following in a notebook cell:
%%sql
OPTIMIZE tablename vorder
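
If the statement needs to be issued from a Python cell instead of a `%%sql` cell, one possible approach is to build the string first and validate the table name. This is a sketch under assumptions: the helper `build_optimize_statement` is hypothetical, the identifier check is deliberately strict, and a Spark session named `spark` is assumed to exist in the notebook.

```python
import re

# Assumption: plain identifiers only, optionally schema-qualified (schema.table).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")


def build_optimize_statement(table_name: str) -> str:
    """Return the OPTIMIZE statement for a table, rejecting unsafe names."""
    if not _IDENTIFIER.match(table_name):
        raise ValueError(f"Unexpected table name: {table_name!r}")
    return f"OPTIMIZE {table_name} VORDER"


statement = build_optimize_statement("tablename")
print(statement)

# In a notebook with a Spark session named `spark` (assumed), this could be run as:
# spark.sql(statement)
```
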
Footnote:
Useful doc
https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/latest/index.html