WordCount program using Spark DataFrame

I wanted to figure out how to write a word count program using the Spark DataFrame API, so I followed these steps. First, import org.apache.spark.sql.functions._, which provides the built-in column functions used in the steps that follow (split, explode and desc):

import org.apache.spark.sql.functions._

Create a DataFrame by reading README.md. When you read a text file this way, Spark creates a DataFrame with a single column named value, and each row holds one line of the file.

val df = sqlContext.read.text("README.md")
df.show(10,truncate=false)
Next, split each line into words using the split function. This creates a new DataFrame with a single words column, where each row holds the array of words for that line.

val wordsDF = df.select(split(df("value")," ").alias("words"))
wordsDF.show(10,truncate=false)
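Because split takes a regular expression as its pattern, a single-space pattern yields empty strings wherever a line contains consecutive spaces. The sketch below splits on runs of whitespace instead. It assumes Spark 2.x or later, and it uses a local SparkSession and made-up sample lines in place of sqlContext and README.md:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split, trim}

// Assumption: Spark 2.x+ is on the classpath; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("split-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample lines standing in for the contents of README.md.
val lines = Seq("Apache  Spark", "fast\tengine").toDF("value")

// "\\s+" matches one or more whitespace characters (spaces, tabs, ...),
// so a double space or a tab no longer produces an empty array element.
val wordsDF = lines.select(split(trim(col("value")), "\\s+").alias("words"))
wordsDF.show(truncate = false)
```

An empty line still produces an array holding one empty string, so filtering out empty words after explode remains useful.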
Next, use the explode function to turn the words array into a DataFrame with a word column, one row per word. This is the equivalent of calling flatMap() on an RDD.

val wordDF = wordsDF.select(explode(wordsDF("words")).alias("word"))
wordDF.show(10,truncate=false)
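To make the flatMap comparison concrete, here is a minimal sketch that produces the same words both ways. It assumes Spark 2.x or later; the local SparkSession and the sample lines are stand-ins for sqlContext and README.md:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

// Assumption: Spark 2.x+ is on the classpath; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample lines standing in for README.md.
val lines = Seq("to be or", "not to be").toDF("value")

// DataFrame route: explode emits one output row per array element.
val viaExplode = lines.select(explode(split(col("value"), " ")).alias("word"))

// RDD route: flatMap flattens the per-line arrays into a single RDD of words.
val viaFlatMap = lines.rdd.flatMap(row => row.getString(0).split(" "))

viaExplode.show()
viaFlatMap.collect().foreach(println)
```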
Now you have a DataFrame in which each row contains a single word from the file. Group the DataFrame by word and count the occurrences of each word.

val wordCountDF = wordDF.groupBy("word").count
wordCountDF.show(truncate=false)
To find the 20 most frequent words in the file, sort by count in descending order; show() prints the first 20 rows by default.

wordCountDF.orderBy(desc("count")).show(truncate=false)
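The steps can also be chained into a single expression with the row limit made explicit instead of relying on the default of show(). This is only a sketch under a few assumptions: Spark 2.x or later, a local SparkSession, made-up sample lines in place of README.md, and a hypothetical topN value. The empty-word filter and the alphabetical tie-break are additions, not part of the original steps:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{asc, col, desc, explode, split}

// Assumption: Spark 2.x+ is on the classpath; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("wordcount-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample lines; swap in spark.read.text("README.md") for a real file.
val lines = Seq("to be or", "not to be").toDF("value")
val topN = 20 // hypothetical parameter: how many words to keep

val topWords = lines
  .select(explode(split(col("value"), " ")).alias("word"))
  .filter(col("word") =!= "")          // drop empty tokens from blank lines or double spaces
  .groupBy("word")
  .count()
  .orderBy(desc("count"), asc("word")) // most frequent first, ties broken alphabetically
  .limit(topN)

topWords.show(truncate = false)
```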

12 comments:

App Development Company said...

The word count program using the Spark DataFrame API has been explained in a very convenient way, so every visitor will easily understand it.

Unknown said...

Let's say that after explode you had data like this:

word      count
Module,   1
Module    2
Module:   3
Module-   1

Although the word here is just "Module", you are counting without stripping special characters. In that case this solution doesn't seem complete, no?
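A minimal sketch of one way to handle the case this comment describes: strip non-alphanumeric characters from each token before grouping. It assumes Spark 2.x or later and uses a local SparkSession with made-up sample lines. The character class and the lowercasing are illustrative choices, not something the post prescribes:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, lower, regexp_replace, split}

// Assumption: Spark 2.x+ is on the classpath; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("normalize-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample lines containing the variants from the comment.
val raw = Seq("Module, Module", "Module: Module-").toDF("value")

val cleaned = raw
  .select(explode(split(col("value"), "\\s+")).alias("token"))
  // Remove every character that is not an ASCII letter or digit, then lowercase
  // (case folding is an extra assumption; drop lower() to keep case distinctions).
  .select(lower(regexp_replace(col("token"), "[^A-Za-z0-9]", "")).alias("word"))
  .filter(col("word") =!= "") // tokens that were pure punctuation become empty

val counts = cleaned.groupBy("word").count()
counts.show(truncate = false)
```

With this normalization, "Module,", "Module:" and "Module-" all fall into the same group as "Module".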

Unknown said...

Worthwhile Spark tutorial. I appreciate you taking the pains to write such quality content on Spark training. I just watched this similar Spark tutorial, and I think it will add to other visitors' knowledge for sure. Thanks anyway: https://www.youtube.com/watch?v=dMDQz82FCqE

Joe said...

Extra-Ordinary piece of work. Interesting concepts to read. Very much informative. Thanks for sharing. Waiting for your future posts.

Abhi said...

Thanks for info....

siva sreedhar said...

nice post on Spark Training

Adams Young said...

I’ve read some good stuff here. Definitely worth bookmarking for revisiting. I'm surprised at how much effort you put into creating such a great, informative website.

Geevi said...

This blog is very informative! We find these technology-related topics very useful. Thanks for the post!

Unknown said...

I just came across your blog post and must say that it’s a great piece of information that you have shared.