Introduction to aws EMR

August 21, 2018

—

Recently I have to deal with data pipeline with large volumn of data. I was using python script first to make plenty of sessions query for each row of data because of cascading query between different tables. This caused bottleneck for only 400,000 rows of data which required almost 2 hrs of processing time. I have used different approches like batch querying the dimension table or even async programming to separate the compute and insert operation. But both turns out to be a not quite well performance. So I decided to turn to spark.

What you will learn from this blog?

I will take a note on how to use aws EMR for quick set up cluster for computing purpose.
some basic computing skill with spark in scala.

What is EMR?

EMR stands for Elastic Mapreduce.

How ?

You will have 3 options for submitting your jobs on the cluster.

via the web-ui
using the awscli
restful api

What can we do in this platform?

Streaming? ETL? Machine Learning?

Choosing instances

Master: coordinating different components inside the cluster.
Core:
Task:

instance type:

c4.xxxx: compute optimized
- Batch processing workloads
- Media transcoding
- High-performance web servers
- High-performance computing (HPC)
- Scientific modeling
- Dedicated gaming servers and ad serving engines Machine learning inference and other compute-intensive applications
d2.xxxx
i3.xxxx: storage optimized.
m4.xxxx: general purpose
r4.xxxx

https://amazonaws-china.com/ec2/instance-types/

Pricing

Spot instance

Price will be adjusted due to the current highest price in the current area.(Take about 5 to 10 minutes to set up)

On demand instance

Price is static.(Set up immediately)

https://docs.aws.amazon.com/zh_cn/AWSEC2/latest/UserGuide/using-spot-instances.html

how to run a script when boostraping?

add a custom jar reside at s3://region.elasticmapreduce/libs/script-runner/script-runner.jar and set the arguments.

Some thoughts that I have during using aws EMR?

how does data map from s3 to the instance(aka how does aws handle data stored in s3 to be read from instance)?
In the world of python, data should be better arraged in the form of dictionary while in the context of spark, data should be better thought of as RDD in some sense.
How does spot fleets work?
HBase on Amazon