In this video I demo how to join a streaming Spark DataFrame to a static DataFrame and have updates to the static DataFrame automatically loaded into the in-memory lookup data. See below for more details, and watch the video to see how it's done.
When joining a Spark stream to a batch (static) dataset for a lookup, most batch sources will not refresh in memory. If you update the batch data once a night, you have to restart the streaming query for the lookup data (static dataset) to be refreshed. That isn't too bad if it happens once a day, but when data trickles in throughout the day we need a better solution.
With the Delta Lake format, the static dataset updates in memory without restarting the stream. The video in this post shows an example of this in action. Delta Lake supports updates via the MERGE statement, so you can keep the data up to date in your file system and Spark will also refresh its in-memory dataset.
Related article: https://dustinvannoy.com/2021/06/09/s...
More from Dustin:
Website: dustinvannoy.com
Twitter: @dustinvannoy
Github: https://github.com/datakickstart