![Soumil Shah](/img/default-banner.jpg)
Soumil Shah
United States
Joined Nov 16, 2012
I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable, high-performance software applications in Python. I run a YouTube channel where I teach Data Science, Machine Learning, Elasticsearch, and AWS. I work as the Data Collection and Processing Team Lead at JobTarget, where I spend most of my time developing an ingestion framework and building microservices and scalable architecture on AWS. I have worked with massive amounts of data, including building data lakes (1.2T) and optimizing data lake queries through partitioning, the right file formats, and compression. I have also built a streaming application that ingests real-time streams via Kinesis and Firehose into Elasticsearch.
Hudi Using Spark SQL on AWS S3: Insert, Update, Deletes, Stored Procedures on AWS Glue Notebooks
soumilshah1995.blogspot.com/2024/06/apache-hudi-using-spark-sql-on-aws-s3.html
Views: 47
Videos
How to Use Hudi Streamer on New EMR 7.1.0 Spark 3.5.1 and Hudi 0.14.1 | Hands-on Labs
35 views · 1 day ago
How to Use Hudi Streamer on New EMR 7.1.0 Spark 3.5.1 and Hudi 0.14.1 | Hands-on Labs Exercise Files github.com/soumilshah1995/Hudi-streamer-emr-7.1.0/blob/main/README.md 📚 Want to Learn DeltaStreamer? Dive into my 14-part series that will teach you everything you need to know about DeltaStreamer! We cover data ingestion from various sources like Parquet, JSON, CSV, Kafka, Pulsar, and much more...
How to Use Hudi Streamer with Hudi version 0.15.0 | Hands on Guide |
38 views · 1 day ago
How to Use Hudi Streamer with Hudi version 0.15.0 | Hands on Guide | Download The sample Dataset drive.google.com/drive/folders/1BwNEK649hErbsWcYLZhqCWnaXFX3mIsg?usp=share_link Download Jar Files mvnrepository.com/artifact/org.apache.hudi/hudi-utilities-slim-bundle_2.12/0.15.0 Spark Submit github.com/soumilshah1995/apache-hudi-delta-streamer-labs/blob/main/E1/Submit Spark Job Want to Learn Delt...
How to Execute Postgres Stored procedures in Spark | Hands on Guide
63 views · 1 day ago
Exercise Files soumilshah1995.blogspot.com/2024/06/how-to-execute-postgres-stored.html
Learn How to Incrementally Ingest Hudi Table Changes into Postgres Using Spark
70 views · 1 day ago
If you're interested in learning how to ingest data from Hudi incrementally into Postgres using Spark, you're in the right place! We've prepared a detailed guide and exercises to help you understand and implement this process effectively. Exercises files: soumilshah1995.blogspot.com/2024/06/learn-how-to-ingest-data-from-hudi.html If you're curious about how to fetch Hudi commit time, check out ...
Universal Datalakes: Interoperability with Hudi, Iceberg, and Delta Tables with AWS Glue Notebooks
93 views · 1 day ago
Universal Datalakes: Interoperability with Hudi, Iceberg, and Delta Tables with AWS Glue Notebooks Exercise Files soumilshah1995.blogspot.com/2024/06/universal-datalakes-interoperability.html Apache Hudi Apache XTable (Incubating) Onehouse
4 Different Ways to fetch Apache Hudi Commit time in Python and PySpark
44 views · 2 days ago
Step by Step instructions www.linkedin.com/pulse/4-different-ways-fetch-apache-hudi-commit-time-python-soumil-shah-qapqf/?trackingId=3CGYlbkQSVqg5IIXSslvJA
OneTable to translate a Hudi table to Iceberg format and sync with Glue Catalog
57 views · 2 days ago
OneTable to translate a Hudi table to Iceberg format and sync with Glue Catalog Download Jar Files iceberg-aws-1.3.1.jar repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws/1.3.1/iceberg-aws-1.3.1.jar bundle-2.23.9.jar mvnrepository.com/artifact/software.amazon.awssdk/bundle/2.23.9 utilities-0.1.0-beta1-bundled.jar github.com/apache/incubator-xtable/packages/1986830 Detailed Blogs with Steps ...
Learn How to Run Apache X Table Sync Command on AWS Cloud Shell | Interoperate Hudi Iceberg delta
24 views · 2 days ago
Interoperate Hudi, Iceberg & Delta. Learn How to Run the Apache XTable Sync Command on AWS CloudShell. Steps: github.com/soumilshah1995/apache-x-table-sync-aws-cloud-shell
Learn How to Ingest XML files with AWS Glue into Hudi Datalakes | Step by Step guide
81 views · 2 days ago
Learn How to Ingest XML files with AWS Glue into Hudi Datalakes | Step by Step guide Exercises Files soumilshah1995.blogspot.com/2024/06/learn-how-to-process-xml-data-files-and.html
Hudi with Spark SQL for Beginners | Insert| Updates | Delete | incremental Query | Stored procedures
86 views · 14 days ago
Exercises files github.com/soumilshah1995/Hudi-spark-sql-minio/blob/main/README.md
How we Utilized Hudi's Time Travel Query to Investigate Bid and Spend | Going Back in Time with Hudi
65 views · 14 days ago
How we Utilized Apache Hudi's Time Travel Query to Investigate Bid and Spend | Going Back in Time with Hudi Read Blog www.linkedin.com/pulse/how-jobtarget-utilized-apache-hudis-time-travel-query-soumil-shah-slooe/?trackingId=ZtqHlFPaQTy4jmuL8ywANw Sample Labs to try soumilshah1995.blogspot.com/2024/06/hudi-time-travel-in-action.html Join Hudi Slack Channel zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g
Hudi Cleaning Process | hoodie.keep.min.commits and hoodie.keep.max.commits Explained
45 views · 14 days ago
Exercise files soumilshah1995.blogspot.com/2024/06/hudi-cleaning-process.html
AWS Glue Tutorial: How to Filter and Exclude S3 Files while reading as Glue Dynamic Frame
124 views · 21 days ago
Generate fake data: github.com/soumilshah1995/code-snippets/blob/main/generatefake_data.py Sample code to read the data: github.com/soumilshah1995/code-snippets/blob/main/read_data_glue_df_filter.py
How to Read S3 Partitioned Data as Columns in AWS Glue DF
132 views · 21 days ago
How to Read S3 Partitioned Data as Columns in AWS Glue DF Exercise Files github.com/soumilshah1995/code-snippets/tree/main
Multiple Spark Writers to Hudi tables | Hands on Labs
85 views · 21 days ago
Learn How to Ingest data from pulsar Topic into Hudi with DeltaStreamer | Hands on Labs
53 views · 1 month ago
Build Hudi Date Dimension in Minutes with Spark SQL Minio and Query with Trino
121 views · 1 month ago
Hudi Streamer implementing Slowly Changing Dimension Type 2 and Query Real Time Trino | Hands on
139 views · 1 month ago
Demo Video : Hudi Delta Streamer Implementing Slowly Changing Dimension and Query that using Trino
47 views · 1 month ago
DeltaStreamer with incremental ETL and Broadcast Joins for Faster ETL
143 views · 1 month ago
Learn How to use Cloudwatch metrics with Hudi AWS Glue Jobs
118 views · 1 month ago
Tips to Feel Valued at Work: Overcoming Unappreciation
139 views · 1 month ago
How to Use Spark 3.5.1 on Kubernetes running locally | Step by Step Guide using Helm
106 views · 1 month ago
Learn how to Spinup Trino on Kubernetes running Locally on Windows | Mac machine | Simple Guide
94 views · 1 month ago
Mastering ETL and Data Warehousing with AWS Glue
92 views · 1 month ago
Mastering Elasticsearch Your Comprehensive Guide to Shards, Performance Tuning, and More
96 views · 1 month ago
Unleashing the Power of Serverless: Serving Gold Hudi Tables with AWS Lambda
144 views · 1 month ago
#1 Stay Motivated and Learn: Strategies and Tips to Keep Going
69 views · 1 month ago
#1 Unlocking the Future of Data Management: Introducing OneTable by OneHouse
42 views · 1 month ago
Hello brother, I need your help with something. Can I get your IG or something to chat?
Bro, you don't even know what you are teaching... you are reading all the stuff from another screen...
Appreciate the energy of the guy :)
how to do with mongodb ?
Brother, I have been able to install pyttsx3 on my M1 MacBook Air. All you need to do is run 'pip install py3-tts', since this version contains pyobjc 9.0.1. I feel dumb commenting on a 5-year-old video like this 😅, but it might help someone else.
I have a CSV file, and when I use the concat function it automatically names columns Unnamed 1, 2, 3... Also the alignment gets messy with a single line of code. How do I fix it?
Pass parameters from C# to Python and parse the result into a C# class object.
Thank you for update, keep going!
GitHub link please?
Greetings, Mr. Soumil. I want to congratulate you on this interesting video. I didn't know about Hudi's streaming capability, and it's awesome. I have to say your example is a little confusing, because in most common DE processes we go from a log to a concrete table, i.e., there's an aggregation. I was expecting the complete reverse: new fields inserted into the origin Hudi table, log-like, and then aggregated into the dimensional table to keep just one record per customer. I need to read more about Hudi streaming. You rock!!
Thanks, glad you enjoyed it
Please help me to implement this project.
Thank you so much, this is very helpful. Keep doing what you're doing 🫡 Hoping to see more long videos and comprehensive projects.
Thank you very very much
Can you help me with something?
have you done oracle cdc through apache flink
Nope, but I assume the process is similar
@SoumilShah I went through the docs, but it's not working at all
Hi Soumil, thanks a lot for this amazing content. I am starting out in the world of data streaming and it's a really useful case! I am facing an error accessing your code on GitHub. Do you know what the cause could be? Best
CDC part is missing. please add
route 53 failover with primary and secondary resources is active-passive, not active-active
In multithreading, creating multiple threads increases the chance of using more CPU cores (assuming each task takes 1 millisecond). However, Python's Global Interpreter Lock (GIL) prevents true parallelism for CPU-bound tasks within a single process. Only one thread can execute Python bytecode at a time. For I/O-bound tasks, other threads can utilize other CPU and execute bytecode meanwhile, creating the illusion of concurrency. However, managing multiple threads can be resource-intensive due to memory overhead. In contrast, async utilizes a single thread with an event loop. You submit tasks to the event loop, and it executes them sequentially. When a task encounters an I/O operation, the event loop tracks its progress. The main thread then executes other tasks in the queue, if any. Once an I/O task completes, the main thread continues executing the remaining CPU-bound part of the task.
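The trade-off described above can be sketched in a few lines: both a thread pool and an asyncio event loop overlap I/O waits (simulated here with sleeps), while the GIL keeps CPU-bound bytecode serialized in either case. Function names are illustrative, not from any of the videos.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_io(task_id):
    time.sleep(0.1)  # stands in for an I/O wait; sleeping releases the GIL
    return task_id

async def async_io(task_id):
    await asyncio.sleep(0.1)  # the event loop runs other tasks during the wait
    return task_id

def run_threads(n):
    # n threads overlap their waits, so the batch takes ~0.1 s, not n * 0.1 s,
    # but each thread carries its own stack and scheduling overhead.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(blocking_io, range(n)))

def run_async(n):
    # One thread, one event loop: the same overlap with far less overhead.
    async def main():
        return list(await asyncio.gather(*(async_io(i) for i in range(n))))
    return asyncio.run(main())
```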
Appreciated
Very interesting. Just curious how we can manage the costs of CloudWatch, since CloudWatch is one of those services that is quite cheap at the beginning, but over time (thousands of tables in my case) those costs add up to quite a lot.
Helpful video, Soumil! But it's still not really clear what the relation is between all 3 configurations. The max is understood, but what about the min and commits.retained? ATM I am only using commits.retained, and I'm exploring these configurations to see whether I need to add them or not.
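For readers with the same question: as I understand the Hudi docs, the cleaner and the archiver are separate processes. `hoodie.cleaner.commits.retained` controls how many completed commits the cleaner keeps old data-file versions for, while `hoodie.keep.min.commits` / `hoodie.keep.max.commits` bound how many commit instants stay on the active timeline before older ones are archived. Hudi expects min.commits to be greater than commits.retained, so a commit is never archived before it is cleaned. A sketch of a consistent set of write options (the numbers are illustrative):

```python
# Illustrative Hudi write options; tune the numbers for your workload.
hudi_options = {
    # Cleaner: keep data-file versions for the last 10 completed commits
    # (this is your incremental-query / time-travel window).
    "hoodie.cleaner.commits.retained": "10",
    # Archiver: keep between 20 and 30 instants on the active timeline;
    # older instants move to the archived timeline.
    "hoodie.keep.min.commits": "20",
    "hoodie.keep.max.commits": "30",
}

# The ordering Hudi expects: retained < keep.min <= keep.max.
retained = int(hudi_options["hoodie.cleaner.commits.retained"])
keep_min = int(hudi_options["hoodie.keep.min.commits"])
keep_max = int(hudi_options["hoodie.keep.max.commits"])
assert retained < keep_min <= keep_max
```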
Love your videos Soumil❤
Thanks mate
thanks a lott mann
Really helpful for those who want to learn airflow.
@Soumil: your voice is quite clear. Which software do you use to record and show these end to end?
I use a tool called OBS. The microphone I use is a Blue Yeti.
Did this guy just pronounce "chile" as "cha-aisle"?
How to do interpolation for a categorical variable?
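Numeric interpolation does not carry over to categorical variables; the usual substitutes are forward-filling the last seen category or imputing the most frequent one. A minimal pure-Python sketch (function names are mine, not from the video):

```python
from collections import Counter

def forward_fill(values, default=None):
    """Carry the last seen category forward over missing entries (None)."""
    filled, last = [], default
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def mode_impute(values):
    """Replace missing entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]
```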
Hi, this setup spiked my billing very high. The setup was a Lambda function that reads the latest file from the S3 source dir, transforms it, and writes it to the S3 target dir, triggered by the S3 notification to the Lambda function each time a file arrives in S3. But it went into a loop and made the S3 and Lambda billing spike. Let me know what issue in my setup I didn't notice at first while running this Python script in Lambda.
This is the best video I have found! I need to build a log tracker for my team with Kinesis/Firehose/S3/Athena as you did, but I have a question: can we connect S3 with Athena directly, or do we need AWS Glue to do so?
be more delicate with your keyboard
This is amazing. I learned a lot. I want to come to India to study Data Analytics
You want to use an alias record in Route 53 instead of a typical A record, as it's free for any number of queries to any AWS resource.
How is duplication handled? If I have 2 source tables, order and orderDetails, then when changes happen in both tables, how do we combine the data and save it? Or is combining not required, and we can join the 2 tables only at the Athena level or in a Glue job and write a single record to S3? Can you please explain? Somewhere I read that from a Kafka topic, a Kafka Streams consumer can be used to perform the join operation before sending the result to another topic.
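On the join question above: whether it happens in a Glue job, in Athena, or in a Kafka Streams consumer, the core operation is a keyed join of the two change sets. A hedged batch sketch in plain Python (the order/orderDetails field names follow the comment, everything else is illustrative; a stream processor does the same join continuously with state stores):

```python
def join_order_changes(orders, order_details):
    """Inner-join order and orderDetails change records on order_id,
    producing one combined record per matching detail row."""
    orders_by_id = {o["order_id"]: o for o in orders}
    combined = []
    for detail in order_details:
        order = orders_by_id.get(detail["order_id"])
        if order is not None:
            # Merge the two records; detail fields win on key collisions.
            combined.append({**order, **detail})
    return combined
```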
Amazing FREE content, but when covering the code, we don't need to see your wonderful face. We need to see the code.
How do I find the top 10 records?
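A common answer to the question above is `ORDER BY <column> DESC LIMIT 10` in whatever SQL engine sits over the data (Athena, Spark SQL, Trino), or a heap in plain Python. A runnable sketch using stdlib sqlite3 as a stand-in engine (table and column names are illustrative):

```python
import heapq
import sqlite3

def top_n_sql(rows, n=10):
    """Top-n via SQL: ORDER BY score DESC LIMIT n. The same statement works
    in Athena, Spark SQL, or Trino; sqlite3 is used so this runs anywhere."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE records (name TEXT, score REAL)")
    con.executemany("INSERT INTO records VALUES (?, ?)", rows)
    cur = con.execute(
        "SELECT name, score FROM records ORDER BY score DESC LIMIT ?", (n,)
    )
    return cur.fetchall()

def top_n_python(rows, n=10):
    """Same result in plain Python, without sorting the whole list."""
    return heapq.nlargest(n, rows, key=lambda r: r[1])
```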
After 20 minutes I get: Error Category: UNCLASSIFIED_ERROR; An error occurred while calling o98.purgeS3Path. Unable to execute HTTP request: The target server failed to respond. I have been trying to figure it out, but it happens every time after 22 minutes. I was able to run the job earlier and it worked on 2 million objects.
Wonderful 👌
Try using the cloud itself instead of the local Elasticsearch
How do I pass the query dynamically instead of the one fixed query, and how do I make the filter inside the query dynamic as well?
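On making the query and its filter dynamic: bind the filter values as query parameters instead of concatenating strings, and allowlist any dynamic identifiers, since identifiers cannot be bound. A sketch using stdlib sqlite3 (placeholder syntax varies by driver: `?` here, `%s` for psycopg2); the table and column names are illustrative:

```python
import sqlite3

def run_filtered_query(con, table, min_price):
    # Identifiers (table/column names) can't be bound as parameters,
    # so validate them against an allowlist before interpolating.
    allowed_tables = {"products", "orders"}
    if table not in allowed_tables:
        raise ValueError(f"unexpected table: {table}")
    # Values ARE bound as parameters, never string-concatenated,
    # which keeps the query dynamic and safe from injection.
    sql = f"SELECT name FROM {table} WHERE price >= ?"
    return [row[0] for row in con.execute(sql, (min_price,))]
```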
where is the link ?
Heeey Soumil! Been following you for a while now and think your channel is great! I am using dbt with a Thrift server on my local machine, trying to push data to Azure ADLS Gen2 as Delta tables, but I'm struggling to get it to work. Would be awesome if you could do a video about this!
Very informative. QQ: does this table need to be re-created every time new underlying S3 files are added to the data lake, or does it capture new data automatically?
tnxxx buddy :)
Lol, did he just click through all of it? I don't mind if you take some time to explain.