Soumil Shah
  • 1 708
  • 6 555 688

Videos

How to Use Hudi Streamer on New EMR 7.1.0 Spark 3.5.1 and Hudi 0.14.1 | Hands-on Labs
35 views · 1 day ago
Exercise files: github.com/soumilshah1995/Hudi-streamer-emr-7.1.0/blob/main/README.md 📚 Want to learn DeltaStreamer? Dive into my 14-part series that will teach you everything you need to know about DeltaStreamer! We cover data ingestion from various sources like Parquet, JSON, CSV, Kafka, Pulsar, and much more...
How to Use Hudi Streamer with Hudi version 0.15.0 | Hands on Guide |
38 views · 1 day ago
Download the sample dataset: drive.google.com/drive/folders/1BwNEK649hErbsWcYLZhqCWnaXFX3mIsg?usp=share_link Download JAR files: mvnrepository.com/artifact/org.apache.hudi/hudi-utilities-slim-bundle_2.12/0.15.0 Spark submit: github.com/soumilshah1995/apache-hudi-delta-streamer-labs/blob/main/E1/Submit Spark Job Want to Learn Delt...
How to Execute Postgres Stored procedures in Spark | Hands on Guide
63 views · 1 day ago
Exercise Files soumilshah1995.blogspot.com/2024/06/how-to-execute-postgres-stored.html
Learn How to Ingest Data from Hudi Incrementally hudi table changes into Postgres Using Spark
70 views · 1 day ago
If you're interested in learning how to ingest data from Hudi incrementally into Postgres using Spark, you're in the right place! We've prepared a detailed guide and exercises to help you understand and implement this process effectively. Exercise files: soumilshah1995.blogspot.com/2024/06/learn-how-to-ingest-data-from-hudi.html If you're curious about how to fetch Hudi commit time, check out ...
Universal Datalakes: Interoperability with Hudi, Iceberg, and Delta Tables with AWS Glue Notebooks
93 views · 1 day ago
Exercise files: soumilshah1995.blogspot.com/2024/06/universal-datalakes-interoperability.html Apache Hudi, Apache XTable (Incubating), Onehouse
4 Different Ways to fetch Apache Hudi Commit time in Python and PySpark
44 views · 2 days ago
Step by Step instructions www.linkedin.com/pulse/4-different-ways-fetch-apache-hudi-commit-time-python-soumil-shah-qapqf/?trackingId=3CGYlbkQSVqg5IIXSslvJA
OneTable to translate a Hudi table to Iceberg format and sync with Glue Catalog
57 views · 2 days ago
Download JAR files: iceberg-aws-1.3.1.jar (repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws/1.3.1/iceberg-aws-1.3.1.jar), bundle-2.23.9.jar (mvnrepository.com/artifact/software.amazon.awssdk/bundle/2.23.9), utilities-0.1.0-beta1-bundled.jar (github.com/apache/incubator-xtable/packages/1986830). Detailed Blogs with Steps ...
Learn How to Run Apache X Table Sync Command on AWS Cloud Shell | Interoperate Hudi Iceberg delta
24 views · 2 days ago
Interoperate Hudi, Iceberg & Delta. Steps: github.com/soumilshah1995/apache-x-table-sync-aws-cloud-shell
Learn How to Ingest XML files with AWS Glue into Hudi Datalakes | Step by Step guide
81 views · 2 days ago
Exercise files: soumilshah1995.blogspot.com/2024/06/learn-how-to-process-xml-data-files-and.html
Hudi with Spark SQL for Beginners | Insert| Updates | Delete | incremental Query | Stored procedures
86 views · 14 days ago
Exercise files: github.com/soumilshah1995/Hudi-spark-sql-minio/blob/main/README.md
How we Utilized Hudi's Time Travel Query to Investigate Bid and Spend | Going Back in Time with Hudi
65 views · 14 days ago
Read the blog: www.linkedin.com/pulse/how-jobtarget-utilized-apache-hudis-time-travel-query-soumil-shah-slooe/?trackingId=ZtqHlFPaQTy4jmuL8ywANw Sample labs to try: soumilshah1995.blogspot.com/2024/06/hudi-time-travel-in-action.html Join the Hudi Slack channel: zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g
Hudi Cleaning Process | hoodie.keep.min.commits and hoodie.keep.max.commits Explained
45 views · 14 days ago
Exercise files soumilshah1995.blogspot.com/2024/06/hudi-cleaning-process.html
AWS Glue Tutorial: How to Filter and Exclude S3 Files while reading as Glue Dynamic Frame
124 views · 21 days ago
Code to generate fake data: github.com/soumilshah1995/code-snippets/blob/main/generatefake_data.py Sample code to read the data: github.com/soumilshah1995/code-snippets/blob/main/read_data_glue_df_filter.py
How to Read S3 Partitioned Data as Columns in AWS Glue DF
132 views · 21 days ago
Exercise files: github.com/soumilshah1995/code-snippets/tree/main
Multiple Spark Writers to Hudi tables | Hands on Labs
85 views · 21 days ago
Learn How to Ingest data from pulsar Topic into Hudi with DeltaStreamer | Hands on Labs
53 views · 1 month ago
Build Hudi Date Dimension in Minutes with Spark SQL Minio and Query with Trino
121 views · 1 month ago
Hudi Streamer implementing Slowly Changing Dimension Type 2 and Query Real Time Trino | Hands on
139 views · 1 month ago
Demo Video : Hudi Delta Streamer Implementing Slowly Changing Dimension and Query that using Trino
47 views · 1 month ago
DeltaStreamer with incremental ETL and Broadcast Joins for Faster ETL
143 views · 1 month ago
Learn How to use Cloudwatch metrics with Hudi AWS Glue Jobs
118 views · 1 month ago
Tips to Feel Valued at Work: Overcoming Unappreciation
139 views · 1 month ago
How to Use Spark 3.5.1 on Kubernetes running locally | Step by Step Guide using Helm
106 views · 1 month ago
Learn how to Spinup Trino on Kubernetes running Locally on Windows | Mac machine | Simple Guide
94 views · 1 month ago
Mastering ETL and Data Warehousing with AWS Glue
92 views · 1 month ago
Mastering Elasticsearch Your Comprehensive Guide to Shards, Performance Tuning, and More
96 views · 1 month ago
Unleashing the Power of Serverless: Serving Gold Hudi Tables with AWS Lambda
144 views · 1 month ago
#1 Stay Motivated and Learn: Strategies and Tips to Keep Going
69 views · 1 month ago
#1 Unlocking the Future of Data Management: Introducing OneTable by OneHouse
42 views · 1 month ago

COMMENTS

  • @CartoonFlexTube 21 hours ago

    hello brother, i need your help in something, can i get your ig or something to chat?

  • @CASLOAcademy 22 hours ago

    bro you dont even know what you are teaching....you are reading all the stuff from another screen,....

  • @mikitaarabei 2 days ago

    Appreciate the energy of the guy :)

  • @shivendrakaulwar 2 days ago

    how to do with mongodb ?

  • @abdullahsaleem4768 2 days ago

    Brother, I was able to install pyttsx3 on my M1 MacBook Air. All you need to do is run 'pip install py3-tts', since this version contains pyobjc 9.0.1. I feel dumb commenting on a 5-year-old video like this😅, but it might help someone else.

  • @AyaanKhan-rh5vx 3 days ago

    I have a CSV file, and when I use the concat function it automatically names them unnamed group 1, 2, 3... Also, the alignment gets messy with a single line of code. How do I fix it?

  • @syedirfanalichannel 3 days ago

    Pass parameters from c# to python and parse the result to c# class object

  • @andriifadieiev9757 5 days ago

    Thank you for update, keep going!

  • @Pillalurameeru 5 days ago

    GitHub link please?

  • @juanestebanagudeloagudelo9003 7 days ago

    Greetings Mr Soumil. I want to congratulate you on this interesting video. I didn't know about Hudi's streaming capability and it's awesome. I have to say your example is a little confusing, because in most common DE processes we go from a log to a concrete table, I mean there's an aggregation. I was expecting the total reverse (that new fields were inserted in the origin Hudi log-like table and then aggregated into the dimensional table to keep just one record for each customer). I need to read more about Hudi streaming. You rock!!

    • @SoumilShah 6 days ago

      Thanks, glad you enjoyed it

  • @mr.av_ff 9 days ago

    Please help me to implement this project.

  • @oleng99 10 days ago

    thank you so much, this is very helpful. keep doing what youre doing 🫡 hoping to see more long videos and comprehensive projects

    • @SoumilShah 10 days ago

      Thank you very very much

  • @miraf267 11 days ago

    Can you help me with something?

  • @himanshumahajan765 13 days ago

    have you done oracle cdc through apache flink

    • @SoumilShah 12 days ago

      Nope, I assume the process is similar

    • @himanshumahajan765 12 days ago

      @SoumilShah I went through the docs but it's not working at all

  • @miguelgranica5085 14 days ago

    Hi Soumil, thanks a lot for this amazing content. I am starting in the world of data streaming and it’s a really useful case! I am facing an error accessing your code in GitHub. Do you know what could be the cause of it? Best

  • @mugilvannank392 15 days ago

    CDC part is missing. please add

  • @hallielam 15 days ago

    route 53 failover with primary and secondary resources is active-passive, not active-active

  • @sm1le_with_me 15 days ago

    In multithreading, creating multiple threads increases the chance of using more CPU cores (assuming each task takes 1 millisecond). However, Python's Global Interpreter Lock (GIL) prevents true parallelism for CPU-bound tasks within a single process. Only one thread can execute Python bytecode at a time. For I/O-bound tasks, a thread that is waiting on I/O releases the GIL, so other threads can execute bytecode in the meantime, which gives concurrency without parallelism. However, managing multiple threads can be resource-intensive due to memory overhead. In contrast, async utilizes a single thread with an event loop. You submit tasks to the event loop, and it runs them one at a time. When a task encounters an I/O operation, the event loop tracks its progress. The main thread then executes other tasks in the queue, if any. Once an I/O task completes, the main thread continues executing the remaining CPU-bound part of the task.

  • @ZTAnderson88 15 days ago

    Appreciated

  • @juanfelipeamayaramirez3455 16 days ago

    Very interesting. Just curious on how can we manage the costs of cloudwatch. since cloudwatch is one of those services that at the beginning is quite cheap. but with time (thousands of tables in my case) those costs add up to quite a lot

  • @juanfelipeamayaramirez3455 16 days ago

    Helpful video, Soumil! But it's still not really clear what the relation is between the 3 configurations. The max is understood, but what about the min and commits.retained? ATM I am only using commits retained, and I'm exploring whether I need to add these configurations or not.

  • @UtkarshKoppikar 16 days ago

    Love your videos Soumil❤

  • @hansinibogade1372 16 days ago

    thanks a lott mann

  • @christoptimist 19 days ago

    Really helpful for those who want to learn airflow.

  • @sharemomentsindian 19 days ago

    @Soumil :- your voice is quite clear which software you used to record and show these end to end

    • @SoumilShah 19 days ago

      I use a tool called OBS. The microphone I use is a Blue Yeti.

  • @ronemchowry180 20 days ago

    did this nibba just pronounced chile as cha-aisle

  • @Ayanshedipelly2312 20 days ago

    How to do interpolation for categorical variable

  • @techaisolution 22 days ago

    Hi, this setup spiked my billing very high. The setup was to build a Lambda function that reads the latest file from the S3 dir, applies a transformation, and finally writes to the S3 target dir. The Python script has to run once the S3 notification tells the Lambda function that a file just arrived in S3. But it went into a loop and made the S3 and Lambda billing spike. Let me know what the issue in my setup is that I didn't notice at first while running this Python script in Lambda.

  • @user-vt2hi6hw9j 22 days ago

    This is the greatest video I have ever found! I need to build a log tracker for my team with Kinesis/Firehose/S3/Athena as you did, but I have a question: can we connect S3 with Athena directly, or do we need AWS Glue to do so?

  • @hyoshi7138 23 days ago

    be more delicate with your keyboard

  • @Young-Prof 23 days ago

    This is amazing. I learned a lot. I want to come to India to study Data Analytics

  • @maxwellcyrus5828 25 days ago

    You want to use alias record for the route 53 as it’s free for any number of invocations for any aws resource. Instead of typical A record.

  • @shankhabhattacharya7617 25 days ago

    How is duplication handled? If I have 2 source tables, order and orderDetails, and changes are made in both tables, how do I combine the data and save it? Or is it not required to combine, and only at the Athena level or in a Glue job can we combine the 2 tables and write a single record to S3? Can you please explain? Somewhere I read that, from a Kafka topic, a Kafka stream consumer can be used to perform the join operation before sending it to another topic.

  • @1____-____1 26 days ago

    Amazing FREE content, but when covering the code, we don't need to see your wonderful face. We need to see the code.

  • @TheSarfarazahmed 26 days ago

    How to find out the top 10 records?

  • @jean-pierrefortin3190 26 days ago

    After 20 minutes - I get: Error Category: UNCLASSIFIED_ERROR; An error occurred while calling o98.purgeS3Path. Unable to execute HTTP request: The target server failed to respond. I have been trying to figure it out but it happens everytime after 22 minutes. I was able to run the job earlier and it worked on 2 million objects.

  • @TheSarfarazahmed 26 days ago

    Fantastic👌

  • @KishorKumar-lw4ep 28 days ago

    Try using the cloud itself instead of the local Elastic

  • @KishorKumar-lw4ep 28 days ago

    How to make the query dynamically passed instead of a single query, and how to make the filter inside the query dynamic as well?

  • @KishorKumar-lw4ep 28 days ago

    where is the link ?

  • @rasmusandreasson1548 29 days ago

    Heeey Soumil! Been following you for a while now and think your channel is great! I am using dbt with a thriftserver, on my local server, trying to push data to Azure ADLS Gen2 as Delta tables. But struggling to get it to work. Would be awesome if you could do a video about this!

  • @serenad5565 29 days ago

    Very Informative. QQ- is this table to be re-created every time underlying new S3 files are added in data lake or it captures new data automatically?

  • @dev_monu 29 days ago

    tnxxx buddy :)

  • @masteradvisor594 29 days ago

    Lol, did he just click through it all? I don't mind if you take some time to explain
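
The contrast @sm1le_with_me draws above between threads and a single-threaded event loop can be sketched in a few lines of Python. This is a minimal illustration and not code from any of the videos: `fake_io`, `main`, and `run_demo` are hypothetical names, and `asyncio.sleep` stands in for a real I/O wait such as a network call.

```python
import asyncio
import time


async def fake_io(name: str, delay: float) -> str:
    # asyncio.sleep is a stand-in for a real I/O wait; while this task
    # is suspended here, the event loop is free to run other tasks.
    await asyncio.sleep(delay)
    return name


async def main() -> list[str]:
    # gather schedules all three coroutines on the same event loop
    # (one thread) and returns their results in argument order.
    return await asyncio.gather(
        fake_io("a", 0.2),
        fake_io("b", 0.2),
        fake_io("c", 0.2),
    )


def run_demo() -> tuple[list[str], float]:
    # Measure wall time for the whole batch of simulated I/O tasks.
    start = time.perf_counter()
    results = asyncio.run(main())
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(results, f"{elapsed:.2f}s")
```

Because the three waits overlap on one thread, the total wall time should be close to the longest single wait rather than the sum of all three.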