Fantastic presentation, Paul/Denny. I really enjoyed it.
@FnordFandango
3 years ago
Excellent presentation. Thank you.
@TheFoccer
3 years ago
Excellent demo for the case where the delta changes come from the source. What about the use case where the source transmits a full-refresh dataset and the target is a Delta Lake table spanning many partitions? Is there an efficient way to identify the insert/update/delete records in this scenario?
@amansehgal9917
2 years ago
For a full-refresh dataset, the upsert job should take care of it. A way to track inserts/updates/deletes is to split your upsert out into a three-part process.
Setup. Step 1: Include an update-timestamp field in your target table. Step 2: In your incoming refresh dataset, add an update-timestamp column and set it to the current timestamp.
UPDATE count. Step 3: Run the upsert job with just the whenMatchedUpdate clause, then count the rows whose update timestamp matches the current run.
INSERT count. Step 4: Get the row count of the table (RC1). Step 5: Run the upsert job with just the whenNotMatchedInsert clause and get the row count after the insert (RC2). INSERT count = RC2 - RC1.
DELETE count. Step 6: Get the row count of the table (RC1). Step 7: Run the upsert job with just a clause that deletes target rows not matched by the source (whenNotMatchedBySourceDelete in current Delta Lake) and get the row count after the delete (RC2). DELETE count = RC1 - RC2.
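A minimal PySpark sketch of the three-part process above, assuming a key column id, a target table named target_table, an incoming DataFrame incoming_df, and a Delta Lake version recent enough for whenNotMatchedBySourceDelete; all names are illustrative:

```python
from datetime import datetime

from delta.tables import DeltaTable
from pyspark.sql import functions as F

batch_ts = datetime.utcnow()                         # one fixed stamp per run
target = DeltaTable.forName(spark, "target_table")   # assumed table name
source = incoming_df.withColumn("updated_at", F.lit(batch_ts))

# UPDATE count: update-only merge, then count rows stamped in this run
# (note this counts every matched row, changed or not)
(target.alias("t").merge(source.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll().execute())
update_count = (spark.table("target_table")
    .filter(F.col("updated_at") == F.lit(batch_ts)).count())

# INSERT count: row count before vs. after an insert-only merge
rc1 = spark.table("target_table").count()
(target.alias("t").merge(source.alias("s"), "t.id = s.id")
    .whenNotMatchedInsertAll().execute())
insert_count = spark.table("target_table").count() - rc1

# DELETE count: row count before vs. after a delete-only merge that removes
# target rows absent from the full-refresh source
rc1 = spark.table("target_table").count()
(target.alias("t").merge(source.alias("s"), "t.id = s.id")
    .whenNotMatchedBySourceDelete().execute())
delete_count = rc1 - spark.table("target_table").count()
```

On Databricks, DESCRIBE HISTORY also exposes operationMetrics such as numTargetRowsUpdated, numTargetRowsInserted, and numTargetRowsDeleted after a merge, which can provide the same counts from a single merge.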
@semiclean
3 years ago
How is CDC at the file level handled if we partition the Delta table? Let's say you group customer IDs by location, so you decide to partition your Delta table by location. Under the hood the file structure will live in different folders: tablename/partition_number/file.parquet. If a customer belonging to a specific location is updated, will the number of records (files impacted) be limited (at most) to the ones in that location partition's folder?
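A hedged sketch of how partitioning bounds the rewrite: if the merge condition includes the partition column, Delta can prune to the matching partition so only files under that folder are candidates for rewriting. The table and column names (customers, location, customer_id) are illustrative:

```python
from delta.tables import DeltaTable

# Write the table partitioned by location
(df.write.format("delta")
   .partitionBy("location")
   .saveAsTable("customers"))

# Including the partition column in the merge condition lets Delta prune
# the scan and the rewrites to the matching location partition(s)
customers = DeltaTable.forName(spark, "customers")
(customers.alias("t")
    .merge(updates.alias("s"),
           "t.location = s.location AND t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```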
@jameshsieh2682
3 years ago
Would you share the link to the blog post?
@neelred10
2 years ago
Can't find more details about the table_changes function. Can the version/timestamp parameters be dynamic? This is crucial if we are trying to automate getting the CDF for our automated ETL processing.
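A hedged sketch of making the parameters dynamic, assuming you track the last processed version yourself (the watermark bookkeeping here is illustrative): interpolate the SQL string from Python, or use the equivalent DataFrame reader options:

```python
last_processed_version = 5   # e.g. read back from your own watermark table

# table_changes takes literal arguments, so build the statement dynamically
changes = spark.sql(
    f"SELECT * FROM table_changes('my_table', {last_processed_version + 1})"
)

# Equivalent DataFrame reader form, which also accepts computed values
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", last_processed_version + 1)
           .table("my_table"))
```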
@mdfarooii
4 years ago
Thanks, very nice. I tried out inserts/updates, but I can't seem to make delete work, as deleted rows are not propagated from the source table to the stream. Waiting for the notebooks.
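A hedged aside, since the feature postdates this talk: a plain Delta streaming read does not emit deleted rows, but on Delta/DBR versions that support the Change Data Feed, enabling delta.enableChangeDataFeed on the table and reading with readChangeFeed surfaces deletes as _change_type = 'delete'. The table name is illustrative:

```python
# Requires delta.enableChangeDataFeed = true on the source table
deletes = (spark.readStream.format("delta")
           .option("readChangeFeed", "true")
           .table("source_table")
           .filter("_change_type = 'delete'"))
```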
@dennyglee
4 years ago
You can find the notebooks at github.com/databricks/tech-talks/tree/master/2020-04-30%20%7C%20Capturing%20Change%20Data%20from%20Delta. HTH!
@osamam.bahgat8962
4 years ago
@@dennyglee Getting this error when importing the HTML files: "Import failed with error: Could not deserialize: Did not find a Databricks notebook in the HTML file." Can you please advise, since I couldn't find an answer in the Databricks forums?
@dennyglee
4 years ago
@@osamam.bahgat8962 Could you try to download the HTML files directly into your local folder (either via GitHub desktop or clicking RAW) and then uploading them directly?
@osamam.bahgat8962
4 years ago
@@dennyglee Thanks for your reply; downloading the RAW file and uploading manually works fine.
@ashokgupta-om7nb
4 years ago
We are working on a warehouse use case where our source data gets updated once a day. Can we have a similar implementation for batch processing instead of streaming?
@ShivamSingh-sm2oy
2 years ago
You can use Sqoop.
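Another hedged option that stays in Spark, if the goal is batch-style runs of the same pipeline: run the streaming query on a schedule with a one-shot trigger, so it processes everything available and then stops (Trigger.AvailableNow on Spark 3.3+, Trigger.Once on older versions). The table names and checkpoint path are illustrative:

```python
# Process all data available at job start, then shut down until the next run
(spark.readStream.format("delta")
    .table("source_table")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/daily_job")
    .trigger(availableNow=True)   # use .trigger(once=True) on older Spark
    .toTable("target_table"))
```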
@karthikmuthyala1251
4 years ago
The GitHub link contains only HTML, not notebooks. Can you upload the notebooks?
@dennylee4934
4 years ago
These are Databricks HTML notebooks - you can upload them directly into Databricks Community Edition (free) and work within that environment. If you prefer the Python notebook, please create a GitHub issue so we can track it. HTH!
@sid0000009
4 years ago
Waiting for the notebooks to get my hands dirty :)
@dennyglee
4 years ago
Here they are: github.com/databricks/tech-talks/tree/master/2020-04-30%20%7C%20Capturing%20Change%20Data%20from%20Delta
@dennyglee
4 years ago
@Wasim Ismail Oh sorry, we do not have access to the zoom chat any longer - could you provide the context of which links we were referring to (or what time stamp in the video) and I'll go find them.
@alexhelvig6105
3 years ago
@@dennyglee At 25:54 in the video you start talking about 3 video links you'll post that go more in depth regarding internals. If you could find those links and place in the description that would be great. Thank you!
@dennyglee
3 years ago
@@alexhelvig6105 Oh sure, you can find the video series at: databricks.com/discover/diving-into-delta-lake-talks. HTH!
@avnish.dixit_
3 years ago
It would be better if you provided more content in Python, because 68 percent of people use Databricks with Python, and in the future this will increase.
@ew3995
3 years ago
There's a much easier way to do CDC with Delta Lake than this.
Comments: 30