Apache Spark 
In-Memory Data Processing 
September 2014 Meetup 
Organized by Big Data Hyderabad Meetup Group. 
http://www.meetup.com/Big-Data-Hyderabad/ 
Rahul Jain 
@rahuldausa
Agenda 
• Why Spark 
• Introduction 
• Basics 
• Hands-on 
– Installation 
– Examples 
Quick Questionnaire 
How many people know/work on Scala? 
How many people know/work on Python? 
How many people know of, have heard of, or are using Spark?
Why Spark? 
• Most machine learning algorithms are iterative, because each iteration 
can improve the results 
• With a disk-based approach, each iteration's output is written to disk, 
making it slow 
Hadoop execution flow 
Spark execution flow 
http://www.wiziq.com/blog/hype-around-apache-spark/
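As an illustration (a minimal sketch, not part of the original deck), Spark lets an iterative job keep its working set in memory, so only the first pass pays the cost of reading from disk; the input file name here is hypothetical: 

// hypothetical input of comma-separated numbers; cache() keeps the parsed data in memory 
val data = sc.textFile("points.txt").map(_.split(",").map(_.toDouble)).cache() 
var w = 0.0 
for (i <- 1 to 10) { 
  // each iteration reuses the cached RDD instead of re-reading from disk 
  w += data.map(_.sum).reduce(_ + _) * 0.001 
} 
println(w) 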
About Apache Spark 
• Initially started at UC Berkeley in 2009 
• Fast and general purpose cluster computing system 
• 10x (on disk) to 100x (in-memory) faster than Hadoop MapReduce 
• Most popular for running iterative Machine Learning algorithms. 
• Provides high level APIs in 
• Java 
• Scala 
• Python 
• Integrates with Hadoop and its ecosystem, and can read existing Hadoop data. 
• http://spark.apache.org/
Spark Stack 
• Spark SQL 
– For SQL and structured data processing 
• MLlib 
– Machine Learning Algorithms 
• GraphX 
– Graph Processing 
• Spark Streaming 
– Stream processing of live data streams 
http://spark.apache.org
Execution Flow 
http://spark.apache.org/docs/latest/cluster-overview.html
Terminology 
• Application Jar 
– User program and its dependencies (except the Hadoop & Spark jars) bundled into a 
jar file 
• Driver Program 
– The process that starts the execution (runs the application's main() function) 
• Cluster Manager 
– An external service to manage resources on the cluster (standalone manager, 
YARN, Apache Mesos) 
• Deploy Mode 
– cluster : Driver runs inside the cluster 
– client : Driver runs outside the cluster
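For illustration (not from the original slides), the deploy mode is usually selected with the --deploy-mode flag of spark-submit; the master URL and jar name below are placeholders: 

./bin/spark-submit --master spark://host:7077 --deploy-mode cluster --class "SimpleApp" app.jar 
./bin/spark-submit --master spark://host:7077 --deploy-mode client --class "SimpleApp" app.jar 

In cluster mode the driver process runs on one of the cluster's nodes; in client mode it runs on the machine that submitted the job.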
Terminology (contd.) 
• Worker Node : Node that runs the application code in the cluster 
• Executor 
– Process launched on a worker node that runs Tasks 
– Keeps data in memory or on disk storage 
• Task : A unit of work that will be sent to an executor 
• Job 
– Consists of multiple tasks 
– Created in response to an Action 
• Stage : Each Job is divided into smaller sets of tasks, called Stages, that run sequentially 
and depend on each other 
• SparkContext : 
– represents the connection to a Spark cluster, and can be used to create RDDs, 
accumulators and broadcast variables on that cluster.
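A minimal sketch (illustrative, not from the deck) of a SparkContext creating an RDD, an accumulator and a broadcast variable, using the Spark 1.x API: 

import org.apache.spark.{SparkConf, SparkContext} 

val conf = new SparkConf().setAppName("TerminologyDemo").setMaster("local[2]") 
val sc = new SparkContext(conf) 
val nums = sc.parallelize(1 to 100, 4)   // an RDD with 4 partitions 
val total = sc.accumulator(0)            // counter aggregated back on the driver 
val factor = sc.broadcast(10)            // read-only value shipped once to each executor 
nums.foreach(n => total += n * factor.value) 
println(total.value)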
Resilient Distributed Dataset (RDD) 
• Resilient Distributed Dataset (RDD) is the basic abstraction in Spark 
• An immutable, partitioned collection of elements that can be operated on in parallel 
• Basic Operations 
– map 
– filter 
– persist 
• Multiple implementations 
– PairRDDFunctions : operations on RDDs of key-value pairs (groupByKey, join, ...) 
– DoubleRDDFunctions : operations on RDDs of double values 
– SequenceFileRDDFunctions : operations for RDDs that can be saved as SequenceFiles 
• RDD main characteristics: 
– A list of partitions 
– A function for computing each split 
– A list of dependencies on other RDDs 
– Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned) 
– Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file) 
• Custom RDDs can also be implemented (by overriding these functions)
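To illustrate the pair-RDD operations named above (an illustrative sketch, not from the original slides): 

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))) 
val grouped = pairs.groupByKey()                 // ("a", [1, 3]), ("b", [2]) 
val other = sc.parallelize(Seq(("a", "x"), ("b", "y"))) 
val joined = pairs.join(other)                   // ("a", (1, "x")), ("a", (3, "x")), ("b", (2, "y")) 
joined.persist()                                 // keep the result around for further queries 
joined.collect().foreach(println)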
Cluster Deployment 
• Standalone Deploy Mode 
– simplest way to deploy Spark on a private cluster 
• Amazon EC2 
– EC2 scripts are available 
– Very quick to launch a new cluster 
• Apache Mesos 
• Hadoop YARN
Monitoring
Monitoring – Stages
Monitoring – Stages
Let’s try some examples…
Spark Shell 
./bin/spark-shell --master local[2] 
The --master option specifies the master URL for a distributed cluster, or local to run 
locally with one thread, or local[N] to run locally with N threads. You should start by 
using local for testing. 
./bin/run-example SparkPi 10 
This runs the SparkPi example with 10 slices (partitions) to estimate the value of Pi
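Roughly speaking, SparkPi estimates Pi with a Monte Carlo simulation spread over those 10 slices; a simplified sketch of the same idea (not the exact example source) looks like this: 

val slices = 10 
val n = 100000 * slices 
val hits = sc.parallelize(1 to n, slices).map { _ => 
  val x = math.random * 2 - 1 
  val y = math.random * 2 - 1 
  if (x * x + y * y < 1) 1 else 0   // did the random point land inside the unit circle? 
}.reduce(_ + _) 
println("Pi is roughly " + 4.0 * hits / n)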
Basic operations… 
scala> val textFile = sc.textFile("README.md") 
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3 
scala> textFile.count() // Number of items in this RDD 
res0: Long = 126 
scala> textFile.first() // First item in this RDD 
res1: String = # Apache Spark 
scala> val linesWithSpark = textFile.filter(line => 
line.contains("Spark")) 
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09 
Simpler - single liner: 
scala> textFile.filter(line => line.contains("Spark")).count() 
// How many lines contain "Spark"? 
res3: Long = 15
Map - Reduce 
scala> textFile.map(line => line.split(" ").size).reduce((a, b) 
=> if (a > b) a else b) 
res4: Int = 15 
scala> import java.lang.Math 
scala> textFile.map(line => line.split(" ").size).reduce((a, b) 
=> Math.max(a, b)) 
res5: Int = 15 
scala> val wordCounts = textFile.flatMap(line => line.split(" 
")).map(word => (word, 1)).reduceByKey((a, b) => a + b) 
wordCounts: spark.RDD[(String, Int)] = 
spark.ShuffledAggregatedRDD@71f027b8 
wordCounts.collect()
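To see the most frequent words, the counts can be sorted by value (an illustrative addition to the deck's example): 

scala> wordCounts.map { case (word, count) => (count, word) }.sortByKey(false).take(10)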
With Caching… 
scala> linesWithSpark.cache() 
res7: spark.RDD[String] = spark.FilteredRDD@17e51082 
scala> linesWithSpark.count() 
res8: Long = 15 
scala> linesWithSpark.count() 
res9: Long = 15
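For reference (not in the original slides), cache() is shorthand for persist() with the MEMORY_ONLY storage level, and a cached RDD can be released with unpersist(): 

scala> import org.apache.spark.storage.StorageLevel 
scala> val inMemory = textFile.persist(StorageLevel.MEMORY_ONLY)  // same effect as textFile.cache() 
scala> linesWithSpark.unpersist()                                 // drop the cached lines when no longer needed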
With HDFS… 
val lines = sc.textFile("hdfs://...") 
val errors = lines.filter(line => line.startsWith("ERROR")) 
println("Total errors: " + errors.count())
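The classic log-mining pattern (an illustrative extension of this slide, not from the original deck) caches the error lines so that repeated queries over them run from memory; the search term below is hypothetical: 

errors.cache() 
val timeoutErrors = errors.filter(line => line.contains("timeout")).count() 
println("Timeout errors: " + timeoutErrors)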
Standalone (Scala) 
/* SimpleApp.scala */ 
import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ 
import org.apache.spark.SparkConf 
object SimpleApp { 
def main(args: Array[String]) { 
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your 
system 
val conf = new SparkConf().setAppName("Simple Application") 
.setMaster("local") 
val sc = new SparkContext(conf) 
val logData = sc.textFile(logFile, 2).cache() 
val numAs = logData.filter(line => line.contains("a")).count() 
val numBs = logData.filter(line => line.contains("b")).count() 
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs)) 
} 
}
Standalone (Java) 
/* SimpleApp.java */ 
import org.apache.spark.api.java.*; 
import org.apache.spark.SparkConf; 
import org.apache.spark.api.java.function.Function; 
public class SimpleApp { 
public static void main(String[] args) { 
String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system 
SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local"); 
JavaSparkContext sc = new JavaSparkContext(conf); 
JavaRDD<String> logData = sc.textFile(logFile).cache(); 
long numAs = logData.filter(new Function<String, Boolean>() { 
public Boolean call(String s) { return s.contains("a"); } 
}).count(); 
long numBs = logData.filter(new Function<String, Boolean>() { 
public Boolean call(String s) { return s.contains("b"); } 
}).count(); 
System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs); 
} 
}
Standalone (Python) 
"""SimpleApp.py""" 
from pyspark import SparkContext 
logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your 
system 
sc = SparkContext("local", "Simple App") 
logData = sc.textFile(logFile).cache() 
numAs = logData.filter(lambda s: 'a' in s).count() 
numBs = logData.filter(lambda s: 'b' in s).count() 
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
Job Submission 
$SPARK_HOME/bin/spark-submit \
--class "SimpleApp" \
--master local[4] \
target/scala-2.10/simple-project_2.10-1.0.jar
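The same jar can be submitted to a real cluster by changing the master URL; the host name below is a placeholder, and yarn-cluster is the Spark 1.x syntax for YARN: 

# Standalone cluster 
$SPARK_HOME/bin/spark-submit \
--class "SimpleApp" \
--master spark://master-host:7077 \
target/scala-2.10/simple-project_2.10-1.0.jar 

# Hadoop YARN 
$SPARK_HOME/bin/spark-submit \
--class "SimpleApp" \
--master yarn-cluster \
target/scala-2.10/simple-project_2.10-1.0.jar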
Configuration 
val conf = new SparkConf() 
.setMaster("local") 
.setAppName("CountingSheep") 
.set("spark.executor.memory", "1g") 
val sc = new SparkContext(conf)
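The same settings can also be supplied at submit time instead of being hard-coded in the application (an illustrative variant; the class and jar names are taken from the earlier slide): 

$SPARK_HOME/bin/spark-submit \
--class "SimpleApp" \
--master local \
--conf spark.executor.memory=1g \
target/scala-2.10/simple-project_2.10-1.0.jar 

Values set explicitly in SparkConf take precedence over flags passed to spark-submit.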
Questions? 
Thanks! 
@rahuldausa on twitter and slideshare 
http://www.linkedin.com/in/rahuldausa 
Join us for Solr, Lucene, Elasticsearch, Machine Learning and IR: 
http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/ 
http://www.meetup.com/DataAnalyticsGroup/ 
Join us for Hadoop, Spark, Cascading, Scala, NoSQL, Crawlers and all cutting-edge technologies: 
http://www.meetup.com/Big-Data-Hyderabad/
