Thank you for all the great tips about Delta, Hive, Synapse and Databricks. It gave me a different view on data platforms and data lakes in the cloud and what the advantages and possibilities are with these technologies.
@TheMLaskowsky
2 years ago
You are talking a little bit too fast, but after watching a second time everything was pretty clear. Great material, thank you for that! ✌
@fenderbender28
2 years ago
love this excellent, approachable explanation!
@amjds1341
2 years ago
Great sessions
@AdvancingAnalytics
2 years ago
Thanks
@cweymouth1
3 years ago
We use this at my organization to have read-only versions of our production tables in a separate Databricks workspace. Working with Hive is super nice in this way.
@AdvancingAnalytics
3 years ago
Without HIVE, so much of the way we build lakes doesn't quite work. Don't know why it took me so long to pop a quick vid together (I think I assumed I had already covered it!)
@tomaszrevi5693
3 years ago
hey! thanks! love your channel. Do you usually use the local workspace hive metastore, or rather try to create an external one (e.g. on Azure SQL DB), so you can share that across several workspaces? Or is there a way to reuse a workspace-scoped hive metastore in another workspace?
@AdvancingAnalytics
3 years ago
We usually use the local workspace metastore out of sheer simplicity. We loved the idea of an external metastore when it was first rolled out, but it had too many limitations (originally you had to hardcode your connection string rather than use secrets...). They're a lot better now, but I think there are still a couple of limitations. I'd LOVE a better way to maintain local workspaces (so you can have some differences between environments) but selectively sync chosen databases across workspaces!
@EduardoSantos-gh1dk
3 years ago
Thank you for the awesome videos Simon! You cannot do query folding from Power BI to the Hive "tables" right? Do you know if this will eventually be available? It would be awesome to use incremental refresh, or even just do query folding on big tables!
@akhilannan
3 years ago
I think query folding already works with the Databricks connector in Power BI.
@AdvancingAnalytics
3 years ago
Yep, pretty sure this already works, but we can test it to find out! SQL Analytics includes a query history, so if we apply a few Power BI steps we should see the query get updated as it goes back to Databricks with the new folded clauses. I'll try it out...
@EduardoSantos-gh1dk
3 years ago
My bad, I already tried it out and it does, in fact, fold the query. I can see that in the Spark UI. It does not, however, let you see the native query in the Power Query steps in Power BI.
@aymanalneser8802
3 years ago
Thanks for the great video as always, Simon. If I create a Hive table from partitioned CSV, should I point out that the data is partitioned when creating the table?
@AdvancingAnalytics
3 years ago
If you're creating a Hive table on top of data that is already partitioned, then it will automatically infer the partitioning for you (in some scenarios you need to run MSCK REPAIR TABLE if it doesn't pick it up properly!). If you are creating a blank table to insert data into, then your create table script should include the "PARTITIONED BY" clause!
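For anyone following along, here's a minimal sketch of both cases - the table names, paths and columns are invented for illustration, so treat it as a starting point rather than a definitive recipe:

    # Case 1: table over CSV data that is already partitioned on disk.
    # Spark infers the partition columns from the folder structure; if the
    # partitions don't show up, MSCK REPAIR TABLE registers them in Hive.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_raw
        USING CSV
        OPTIONS (header 'true')
        LOCATION '/mnt/datalake/raw/sales/'
    """)
    spark.sql("MSCK REPAIR TABLE sales_raw")  # only needed if partitions weren't picked up

    # Case 2: blank table you will insert into later - declare the partitioning up front.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_curated (
            order_id   STRING,
            amount     DOUBLE,
            order_date DATE
        )
        USING DELTA
        PARTITIONED BY (order_date)
    """)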
@akhilannan
3 years ago
After registering a delta table, how do we keep the hive metastore updated with structural changes in delta lake, like adding/removing columns or changing datatype?
@phy2sll
3 years ago
That's all recorded in the Delta transaction log. No need to update Hive.
@AdvancingAnalytics
3 years ago
Depends on the Databricks runtime - with 7.5 they added an automated refresh to keep Hive in sync. With earlier versions, you can run REFRESH TABLE to manually force a re-sync (or MSCK REPAIR TABLE if it's partition metadata)!
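A rough sketch of those manual commands on an older runtime, assuming a table already registered in Hive (the name "my_db.sales" is just an example):

    # Force Spark to refresh its cached metadata/schema for the table
    spark.sql("REFRESH TABLE my_db.sales")

    # For non-Delta, partitioned tables only: re-register partition folders in the metastore
    spark.sql("MSCK REPAIR TABLE my_db.sales")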
@frederikl8254
3 years ago
Great video! I'm still a bit confused about when data is stored in DBFS vs. when only the metadata (schema) is stored. From the video, I get the impression we can only let Databricks handle the metadata if we declare the table via SQL? Could you please help clarify? Or is it simply a matter of whether you specify a path for your table that determines whether it is managed or unmanaged (also in PySpark)?
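In case it helps later readers, a hedged illustration of the usual distinction (the table and path names are invented): omitting a path gives a managed table whose data Databricks stores under the metastore root in DBFS, while supplying a path gives an unmanaged (external) table where only the metadata is registered - and it works the same from PySpark as from SQL:

    # Any DataFrame will do for illustration
    df = spark.createDataFrame([(1, "a")], ["id", "value"])

    # Managed table: Databricks owns the data (stored under the metastore root in DBFS);
    # dropping the table deletes the data too.
    df.write.format("delta").saveAsTable("my_db.managed_table")

    # Unmanaged/external table: the data stays at the path you give, Hive only stores metadata;
    # dropping the table leaves the files in place.
    df.write.format("delta") \
        .option("path", "/mnt/datalake/curated/my_table") \
        .saveAsTable("my_db.external_table")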
@bankoftrustnwobot3218
3 years ago
I noticed that Spark/Hive does not support changing a column type from double to decimal ("Operation not supported", ... hardcoded in Spark). Does this mean I need to recreate the table and use MSCK to recover all partitions (the data is stored in HDFS as Parquet)?
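Not answered in the thread, but a hedged sketch of that recreate-and-repair approach for an external Parquet table (the names, path and columns are invented, and it's worth testing whether Parquet files written as double actually read back cleanly as decimal):

    # Dropping an external table removes only the metastore entry; the Parquet files stay put.
    spark.sql("DROP TABLE IF EXISTS my_db.measurements")

    # Recreate the definition over the same location with the new column type.
    spark.sql("""
        CREATE TABLE my_db.measurements (
            sensor_id STRING,
            reading   DECIMAL(18, 6)
        )
        PARTITIONED BY (reading_date DATE)
        STORED AS PARQUET
        LOCATION 'hdfs:///data/measurements/'
    """)

    # Re-register the existing partition folders in the metastore.
    spark.sql("MSCK REPAIR TABLE my_db.measurements")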
@shamsuddinjunaid30
3 years ago
Hey Simon, can we get the Hive notebook?
@marcocaviezel2672
3 years ago
Hi Simon! Great video again! I want to copy a Hive structure from one Databricks workspace to another. Do you know if this can be achieved with an existing function, or do I have to write it manually? The biggest problem I've encountered so far is that I can't query the location/path on ADLS Gen2 where my Delta files are located. Thanks and best regards! Marco
@AdvancingAnalytics
3 years ago
Hey! Sooo it's awkward currently, honestly, unless you use an external Hive metastore, which has its own complications. We've built something in the past to do this, by running the "describe detail [table]" command and pulling out the folder location - if you've mounted the storage, then you'll also need to use dbutils.fs.mounts() to translate the mount reference into the actual storage reference. Basically... it's possible but it's a pain, one that I'm hoping Unity Catalog will make steps towards solving when we see it out in preview! Simon
@marcocaviezel2672
3 years ago
@AdvancingAnalytics Hi Simon! Thanks for the valuable hints. Since the mount points were the same in both workspaces, I didn't need to translate them. If anybody encounters the same problem I had, here is my code snippet:

    df = spark.sql("SHOW TABLES FROM databaseName")
    for tbl in df.collect():
        database = tbl.database
        table = tbl.tableName
        describeQuery = "DESCRIBE DETAIL {database}.{table}".format(database = database, table = table)
        # print(describeQuery)
        df2 = spark.sql(describeQuery)
        location = df2.first()['location']
        print("CREATE TABLE IF NOT EXISTS {database}.{table} USING delta LOCATION '{location}';".format(database = database, table = table, location = location))

Cheers, Marco
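Building on Simon's hint above, a rough sketch of the mount-translation step for anyone whose mount points differ between workspaces (the helper function name is made up):

    # Map each mount point to its underlying storage URI, e.g.
    # "/mnt/datalake" -> "abfss://container@account.dfs.core.windows.net/"
    mounts = {m.mountPoint: m.source for m in dbutils.fs.mounts()}

    def to_storage_path(location):
        # DESCRIBE DETAIL typically returns paths like "dbfs:/mnt/datalake/curated/table1";
        # swap the mount prefix for the real storage URI so the CREATE TABLE works in another workspace.
        path = location.replace("dbfs:", "", 1)
        for mount_point, source in mounts.items():
            if path.startswith(mount_point):
                return path.replace(mount_point, source, 1)
        return location  # not a mounted path - use it as-is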
@Sangeethsasidharanak
3 years ago
It would be really great if you could push your notebooks to Git.