AWS Tutorials - Using Amazon EMR with AWS Glue Catalog

Views: 6,866

Comments: 23

@VamsiKrishna-vf5gm
2 years ago
Wonderful video, really great presentation. Thanks a lot for sharing your knowledge.
@AWSTutorialsOnline
2 years ago
Glad you liked it!
@geronimojordan759
2 years ago
Excellent tutorial. Keep explaining like this; it's an excellent job. Thanks for your time!
@AWSTutorialsOnline
2 years ago
Thanks, will do!
@klzo4785
3 years ago
Good job, Dojo! I have a question: could you please give some use cases for why we would use an EMR step for ETL processing?
@AWSTutorialsOnline
3 years ago
In a data lake, people generally keep data in three stages: raw, cleansed, and consumable/harmonized. You can use an ETL step to move data between these stages. EMR is especially useful when you want to use open-source frameworks such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto in the step for the data transformation. Hope it helps.
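Editor's note: the reply above can be illustrated with a minimal sketch of an EMR step that runs a Spark script to move data from the raw stage to the cleansed stage. The bucket name, script path, stage prefixes, and the helper `build_etl_step` are hypothetical and do not come from the video; only `command-runner.jar` and the shape of the step definition follow the EMR API.

```python
# Sketch only: builds the step definition for an EMR Spark ETL step.
# Bucket, script path, and stage prefixes are hypothetical placeholders.

def build_etl_step(script_uri, source_uri, target_uri, name="raw-to-cleansed"):
    """Return one EMR step definition that runs a PySpark script via spark-submit."""
    return {
        "Name": name,
        # Keep the cluster running if this step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_uri,   # the PySpark ETL script stored in S3
                source_uri,   # passed to the script as its first argument
                target_uri,   # passed to the script as its second argument
            ],
        },
    }

step = build_etl_step(
    "s3://example-bucket/scripts/cleanse.py",
    "s3://example-bucket/raw/",
    "s3://example-bucket/cleansed/",
)

# To submit the step (requires boto3, AWS credentials, and a real cluster id):
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

The same helper could be called once per stage transition (raw to cleansed, cleansed to consumable), each with its own script.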
@BhanuNatva
5 months ago
Sir, I have a quick question: what if the first record in a CSV file in S3 is not a header record? Will the crawler still be able to derive the column data types from the data?
@renatobibiano698
3 years ago
Very good tutorial, but one part is missing: what if I want to update a table in the Glue catalog after saving those JSON files (last part of step 8)? How can I do that?
@AWSTutorialsOnline
3 years ago
Do you want to update the table in the database or in the S3 bucket? Also, do you want to update the entire dataset or only a few rows?
@saikumar-vw4le
3 years ago
I have a question, though I don't know whether it is related to this video. I follow your videos on EMR, but how do we run a transient cluster? Can you please make a video on it if possible?
@AWSTutorialsOnline
3 years ago
By transient cluster, do you mean that you launch a cluster to complete a step or task and then the cluster terminates itself? Please confirm.
@saikumar-vw4le
3 years ago
@AWSTutorialsOnline Exactly. Can you please do a video on it?
@AWSTutorialsOnline
3 years ago
sure
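Editor's note: until such a video exists, here is a rough sketch of what a transient cluster request might look like, assuming the boto3 `run_job_flow` call. Setting `KeepJobFlowAliveWhenNoSteps` to `False` makes the cluster terminate once its steps finish. The cluster name, instance types and counts, S3 paths, and the helper `build_transient_cluster_request` are hypothetical; the `spark-hive-site` classification is the documented way to point Spark on EMR at the Glue Data Catalog, but it is included here as an assumption about what this thread needs.

```python
# Sketch only: request body for a transient EMR cluster.
# Names, instance sizes, and S3 paths are hypothetical placeholders.

def build_transient_cluster_request(script_uri, log_uri, release_label="emr-5.32.0"):
    """Return keyword arguments for the EMR RunJobFlow API (boto3: run_job_flow)."""
    step = {
        "Name": "run-etl-script",
        # Shut the cluster down if the step fails.
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }
    return {
        "Name": "transient-etl-cluster",
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        # Use the AWS Glue Data Catalog as the metastore for Spark SQL.
        "Configurations": [{
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            },
        }],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # False means the cluster terminates itself after the last step.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [step],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_transient_cluster_request(
    "s3://example-bucket/scripts/cleanse.py",
    "s3://example-bucket/emr-logs/",
)

# To launch (requires boto3, AWS credentials, and the default EMR roles):
# import boto3
# boto3.client("emr").run_job_flow(**request)
```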
@vivekb405
3 years ago
Great tutorial! Thank you. Since a few data types are not supported by PySpark, I am getting this error with int and timestamp columns: "Parquet type not supported: INT32". What would be the best practice to handle this?
@AWSTutorialsOnline
3 years ago
This is a tricky one, as it is not very well documented. Ask AWS Support for help.
@JhonOlivares
2 years ago
How can I use the Glue Data Catalog with a custom PySpark cluster on EC2, without EMR?
@jonathanduran2921
1 year ago
Why is the header also appearing as the first row?
@keshavamugulursrinivasiyen5502
3 years ago
I got a kernel error while opening the Jupyter notebook in the EMR section. The problem is the same on releases 5.32.0, 5.30.0, and 5.29.0. What should I do now? I need your help.
@AWSTutorialsOnline
3 years ago
What error did you get?
@michellesantos435
3 years ago
When running Step 5 of the code in the notebook, I got a permission error: An error was encountered: 'Insufficient Lake Formation permission(s) on DB (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 84c14fd9-ae52-46a9-ad03-938e149e2a6a; Proxy: null);'
@AWSTutorialsOnline
3 years ago
EMR runs with an IAM role. Grant this role permissions in Lake Formation for the catalog.
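Editor's note: one way to apply that advice is through the Lake Formation `grant_permissions` API. The sketch below only builds the request arguments. The role ARN, database name, table name, and the helper `build_grant_requests` are hypothetical, and the exact permissions needed depend on what the notebook reads or writes.

```python
# Sketch only: Lake Formation grant requests for the role EMR runs with.
# The role ARN, database name, and table name are hypothetical placeholders.

def build_grant_requests(role_arn, database, table=None):
    """Return a list of keyword-argument dicts for lakeformation.grant_permissions."""
    principal = {"DataLakePrincipalIdentifier": role_arn}
    requests = [{
        # Lets the role see the database in the catalog.
        "Principal": principal,
        "Resource": {"Database": {"Name": database}},
        "Permissions": ["DESCRIBE"],
    }]
    if table is not None:
        requests.append({
            # Lets the role read the table's metadata and data.
            "Principal": principal,
            "Resource": {"Table": {"DatabaseName": database, "Name": table}},
            "Permissions": ["SELECT", "DESCRIBE"],
        })
    return requests

grants = build_grant_requests(
    "arn:aws:iam::123456789012:role/EMR_EC2_DefaultRole",
    "example_db",
    table="example_table",
)

# To apply (requires boto3 and a caller with Lake Formation admin rights):
# import boto3
# lf = boto3.client("lakeformation")
# for kwargs in grants:
#     lf.grant_permissions(**kwargs)
```

The same grants can also be made in the Lake Formation console under data permissions, which may be simpler for a one-off tutorial setup.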