Spark and Scala Introduction
Rainbow Training Institute provides the best Apache Spark and Scala online training course with certification. We offer Spark and Scala classroom training and Scala online training in Hyderabad, delivered as 100% practical, real-time Spark and Scala project training, along with a complete suite of Spark and Scala training videos.
What is Apache Spark? An Introduction
Spark is an Apache project advertised as “lightning fast cluster computing”. It has a thriving open-source community and is the most active Apache project at the moment.
Spark provides a faster and more general data processing platform. Spark lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. Last year, Spark took over Hadoop by completing the 100 TB Daytona GraySort contest 3x faster on one tenth the number of machines and it also became the fastest open source engine for sorting a petabyte.
Spark also makes it possible to write code more quickly as you have over 80 high-level operators at your disposal. To demonstrate this, let’s have a look at the “Hello World!” of Big Data: the Word Count example. Written in Java for MapReduce, it takes around 50 lines of code, whereas in Spark (and Scala) you can do it as simply as this:
sparkContext.textFile("hdfs://...")
.flatMap(line => line.split(" "))
.map(word => (word, 1)).reduceByKey(_ + _)
.saveAsTextFile("hdfs://...")
Another important aspect when learning how to use Apache Spark is the interactive shell (REPL) which it provides out of the box. Using the REPL, one can test the outcome of each line of code without first needing to code and execute the entire job. The path to working code is thus much shorter, and ad-hoc data analysis is made possible.
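As a quick, hedged sketch (assuming the sc SparkContext that spark-shell pre-builds and a README.md file in the working directory), the word count above can be pasted into the shell and inspected step by step:
// Inside spark-shell, sc is already provided as a SparkContext.
val counts = sc.textFile("README.md")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
// Peek at a few results before writing anything out.
counts.take(5).foreach(println)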
Additional key features of Spark include:
Currently provides APIs in Scala, Java, and Python, with support for other languages (such as R) on the way
Integrates well with the Hadoop ecosystem and data sources (HDFS, Amazon S3, Hive, HBase, Cassandra, etc.)
Can run on clusters managed by Hadoop YARN or Apache Mesos, and can also run standalone
The Spark core is complemented by a set of powerful, higher-level libraries which can be seamlessly used in the same application. These libraries currently include SparkSQL, Spark Streaming, MLlib (for machine learning), and GraphX, each of which is further detailed in this article. Additional Spark libraries and extensions are currently under development as well.
[Figure: Spark libraries and extensions]
Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:
memory management and fault recovery
scheduling, distributing and monitoring jobs on a cluster
interacting with storage systems
Spark introduces the concept of an RDD (Resilient Distributed Dataset), an immutable fault-tolerant, distributed collection of objects that can be operated on in parallel. An RDD can contain any type of object and is created by loading an external dataset or distributing a collection from the driver program.
RDDs support two types of operations:
Transformations are operations (such as map, filter, join, union, and so on) that are performed on an RDD and which yield a new RDD containing the result.
Actions are operations (such as reduce, count, first, and so on) that return a value after running a computation on an RDD.
Transformations in Spark are “lazy”, meaning that they do not compute their results right away. Instead, they just “remember” the operation to be performed and the dataset (e.g., a file) on which the operation is to be performed. The transformations are only actually computed when an action is called and the result is returned to the driver program. This design enables Spark to run more efficiently. For example, if a big file was transformed in various ways and passed to a first() action, Spark would only process and return the result for the first line, rather than do the work for the entire file.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist or cache method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
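A minimal sketch of these ideas (the log file path and the ERROR filter are hypothetical, and sc is assumed to be an existing SparkContext) might look like this:
// Nothing is read yet: textFile and filter are lazy transformations.
val lines = sc.textFile("hdfs://.../access.log")
val errors = lines.filter(line => line.contains("ERROR"))
// Keep the filtered RDD in memory for repeated use.
errors.cache()
// Actions trigger the actual computation; the second action reuses the cached data.
val totalErrors = errors.count()
val firstError = errors.first()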
SparkSQL
SparkSQL is a Spark component that supports querying data either via SQL or via the Hive Query Language. It originated as the Apache Hive port to run on top of Spark (in place of MapReduce) and is now integrated with the Spark stack. In addition to providing support for various data sources, it makes it possible to weave SQL queries with code transformations, which results in a very powerful tool. Below is an example of a Hive-compatible query:
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
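As a hedged illustration of weaving SQL with ordinary code transformations (assuming Spark 1.3+ where sql() returns a DataFrame, and reusing the src table created above), the query result can flow straight into RDD operations:
// Turn the query result into an RDD of typed pairs and keep only large keys.
val bigKeys = sqlContext.sql("SELECT key, value FROM src")
  .rdd
  .map(row => (row.getInt(0), row.getString(1)))
  .filter { case (key, _) => key > 100 }
bigKeys.take(5).foreach(println)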
Spark Streaming
Spark Streaming supports real-time processing of streaming data, such as production web server log files (e.g. via Apache Flume and HDFS/S3), social media like Twitter, and various messaging queues like Kafka. Under the hood, Spark Streaming receives the input data streams and divides the data into batches. These batches are then processed by the Spark engine to generate the final stream of results, also in batches, as depicted below.
[Figure: Spark Streaming]
The Spark Streaming API closely matches that of the Spark Core, making it easy for programmers to work in the worlds of both batch and streaming data.
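For example, a hedged streaming version of the word count (the socket source on localhost:9999 and the 10-second batch interval are assumptions) mirrors the batch code almost line for line:
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Build a StreamingContext on top of the existing SparkContext with 10-second batches.
val ssc = new StreamingContext(sc, Seconds(10))
// Any text stream would do; here we assume a plain socket source.
val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()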
MLlib
MLlib is a machine learning library that provides various algorithms designed to scale out on a cluster for classification, regression, clustering, collaborative filtering, and so on (check out Toptal’s article on machine learning for more information on that topic). Some of these algorithms also work with streaming data, such as linear regression using ordinary least squares or k-means clustering (and more on the way). Apache Mahout (a machine learning library for Hadoop) has already turned away from MapReduce and joined forces on Spark MLlib.
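As one hedged example, the classic MLlib k-means flow (the feature file path and the cluster count are assumptions, and sc is an existing SparkContext) looks like this:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Parse a hypothetical file of space-separated numeric features into vectors.
val rawData = sc.textFile("hdfs://.../features.txt")
val parsedData = rawData.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()
// Cluster the data into 2 groups using 20 iterations of k-means.
val clusters = KMeans.train(parsedData, 2, 20)
// Evaluate the clustering with the within-set sum of squared errors.
val wssse = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + wssse)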
GraphX
[Figure: GraphX]
GraphX is a library for manipulating graphs and performing graph-parallel operations. It provides a uniform tool for ETL, exploratory analysis and iterative graph computations. Apart from built-in operations for graph manipulation, it provides a library of common graph algorithms such as PageRank.
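A hedged PageRank sketch (the edge-list file is an assumption; GraphLoader expects one "srcId dstId" pair per line) could look like this:
import org.apache.spark.graphx.GraphLoader
// Load a graph from a hypothetical edge-list file.
val graph = GraphLoader.edgeListFile(sc, "hdfs://.../followers.txt")
// Run PageRank until convergence to the given tolerance.
val ranks = graph.pageRank(0.0001).vertices
// Show the ten highest-ranked vertices.
ranks.sortBy(_._2, ascending = false).take(10).foreach(println)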
How to Use Apache Spark: Event Detection Use Case
Now that we have answered the question “What is Apache Spark?”, let’s think of what kind of problems or challenges it could be used for most effectively.
I came across an article recently about an experiment to detect an earthquake by analyzing a Twitter stream. Interestingly, it was shown that this technique was likely to inform you of an earthquake in Japan quicker than the Japan Meteorological Agency. Even though they used different technology in their article, I think it is a great example to see how we could put Spark to use with simplified code snippets and without the glue code.
First, we would have to filter tweets which seem relevant, like “earthquake” or “shaking”. We could easily use Spark Streaming for that purpose as follows:
TwitterUtils.createStream(...)
  .filter(status => status.getText.contains("earthquake") || status.getText.contains("shaking"))
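For context, TwitterUtils comes from the separate spark-streaming-twitter module and needs a StreamingContext plus Twitter4J credentials; a hedged setup sketch (the batch interval and credential mechanism are assumptions) might be:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
// Twitter4J picks up OAuth credentials from system properties or twitter4j.properties.
val ssc = new StreamingContext(sc, Seconds(10))
val relevantTweets = TwitterUtils.createStream(ssc, None)
  .filter(status => status.getText.contains("earthquake") || status.getText.contains("shaking"))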
Then, we would have to run some semantic analysis on the tweets to determine if they appear to be referencing a current earthquake occurrence. Tweets like “Earthquake!” or “Now it is shaking”, for example, would be considered positive matches, whereas tweets like “Attending an Earthquake Conference” or “The earthquake yesterday was scary” would not. The authors of the paper used a support vector machine (SVM) for this purpose. We’ll do the same here, but can also try a streaming version. A resulting code example from MLlib would look like the following:
// We would prepare some earthquake tweet data and load it in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "sample_earthquate_tweets.txt")
// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// Run the training algorithm to build the model.
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
// Clear the default threshold.
model.clearThreshold()
// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
If we are happy with the prediction rate of the model, we could move on to the next stage and react whenever we detect an earthquake. To detect one, we need a certain number (i.e., density) of positive tweets in a defined time window (as described in the article). Note that, for tweets with Twitter location services enabled, we would also extract the location of the earthquake. Armed with this knowledge, we could use SparkSQL and query an existing Hive table (storing users interested in receiving earthquake notifications) to retrieve their email addresses and send them a personalized warning email, as follows:
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// sendEmail is a custom function
sqlContext.sql("FROM earthquake_warning_users SELECT firstName, lastName, city, email")
  .collect().foreach(sendEmail)
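The sendEmail function above is left as a custom function; a purely hypothetical stub, just to make the snippet self-contained, might look like this:
import org.apache.spark.sql.Row
// Hypothetical helper: format one Row from the query above into a warning message.
def sendEmail(user: Row): Unit = {
  val firstName = user.getString(0)
  val city = user.getString(2)
  val email = user.getString(3)
  // A real implementation would hand this off to an email service.
  println(s"Sending earthquake warning to $email ($firstName, $city)")
}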
Other Apache Spark Use Cases
Potential use cases for Spark extend far beyond detection of earthquakes, of course.
Here’s a quick (but certainly nowhere near exhaustive!) sampling of other use cases that require dealing with the speed, variety and volume of Big Data, for which Spark is so well suited:
In the gaming industry, processing and discovering patterns from the potential firehose of real-time in-game events and being able to respond to them immediately is a capability that could yield a lucrative business, for purposes such as player retention, targeted advertising, auto-adjustment of complexity level, and so on.
In the e-commerce industry, real-time transaction information could be passed to a streaming clustering algorithm like k-means or collaborative filtering like ALS. Results could then even be combined with other unstructured data sources, such as customer comments or product reviews, and used to constantly improve and adapt recommendations over time with new trends.
In the finance or security industry, the Spark stack could be applied to a fraud or intrusion detection system or risk-based authentication. It could achieve top-notch results by harvesting huge amounts of archived logs, combining them with external data sources like information about data breaches and compromised accounts, and information from the connection/request such as IP geolocation or time.
Conclusion
To sum up, Spark helps to simplify the challenging and computationally intensive task of processing high volumes of real-time or archived data, both structured and unstructured, seamlessly integrating relevant complex capabilities such as machine learning and graph algorithms. Spark brings Big Data processing to the masses.