A demonstration and explanation of how to use Cloudera DataFlow Functions in AWS to set up an AWS Lambda function that is triggered by files landing in an S3 bucket and pushes the files' data into an Iceberg table in Cloudera Data Warehouse (CDW) in Public Cloud.
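On the trigger side, the Lambda is invoked by an S3 event notification on the landing bucket. A sketch of that notification configuration follows; the function ARN, account ID, and prefix are placeholders for your own values:

```json
{
  "LambdaFunctionConfigurations": [
    {
      "Id": "invoke-dff-on-new-files",
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:111111111111:function:my-dff-function",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            { "Name": "prefix", "Value": "landing/" }
          ]
        }
      }
    }
  ]
}
```

This can be applied with `aws s3api put-bucket-notification-configuration` once the bucket has permission to invoke the function.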
Resources:
Cloudera DataFlow Functions - docs.cloudera.com/dataflow/cloud/functions.html
Cloudera DFF in AWS - docs.cloudera.com/dataflow/cloud/aws-lambda-functions/topics/cdf-create-aws-lambda-function.html
Things to keep in mind for this use case:
- Deploy the Lambda in the VPC of the CDP environment, and provide private subnets with access to the internet
- Provide the Kerberos configuration via a layer with a krb5.conf file (updated with the FreeIPA IPs), and add the corresponding environment variable
- Provide the hive-site.xml file with the layer (replace _HOST with the FQDN of the DataLake master nodes in the HMS Kerberos principal)
- Provide the core-site.xml file with the layer (change the RAZ server hosts to the DataLake ones so the function does not depend on Data Hub clusters)
- Add the truststore to the layer, and add the corresponding environment variable
- Update the security group to allow traffic coming from the elastic IP attached to the NAT gateway
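For the Kerberos bullet above, a minimal krb5.conf shipped in the layer could look like the sketch below. The realm name and the KDC IPs are placeholders; use your FreeIPA realm and server IPs:

```ini
[libdefaults]
    default_realm = EXAMPLE.CLOUDERA.SITE
    dns_lookup_kdc = false

[realms]
    EXAMPLE.CLOUDERA.SITE = {
        kdc = 10.0.1.10
        kdc = 10.0.2.10
        admin_server = 10.0.1.10
    }
```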
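The hive-site.xml and core-site.xml substitutions can be illustrated as below. Take the actual property names and values from your DataLake's own configuration files; the FQDNs, realm, and RAZ URL here are placeholders only:

```xml
<!-- hive-site.xml: replace _HOST with the FQDN of a DataLake master node -->
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>hive/master0.example.cloudera.site@EXAMPLE.CLOUDERA.SITE</value>
</property>

<!-- core-site.xml: point the RAZ client at the DataLake RAZ servers
     (property name as found in your DataLake's core-site.xml) -->
<property>
  <name>fs.s3a.ext.raz.rest.host.url</name>
  <value>https://master0.example.cloudera.site:6082/</value>
</property>
```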
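The files in the layer are then wired up through the function's environment variables. Lambda extracts layer contents under /opt. `KRB5_CONFIG` is the standard MIT Kerberos variable; the truststore variable name below is a placeholder, as the exact variable names expected by DataFlow Functions are listed in the DFF docs:

```ini
# Lambda environment variables (illustrative; check the DFF docs for exact names)
KRB5_CONFIG=/opt/krb5.conf          ; points at the krb5.conf shipped in the layer
TRUSTSTORE=/opt/truststore.jks      ; placeholder name for the truststore path variable
```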
Something I forgot to mention at the end of the video: I created a secret in AWS Secrets Manager with the same name as the parameter context associated with my flow definition, and I updated the permissions attached to my Lambda's role to allow read access to the secret's values. That's where I store the sensitive parameters of my flow.
Hope you enjoyed the video; feel free to comment or ask questions! You can also suggest what the next one should be about...
Thanks for watching!
S3 to Iceberg tables in CDW - Cloudera DataFlow Functions - AWS Lambda