Getting Started with PySpark: Creating SparkContext, Parallelizing Data, and Basic DataFrame Operations
This tutorial demonstrates how to initialize a SparkContext in PySpark, perform simple parallel computations such as temperature conversion and reduction, create a SparkSession to read CSV data, and apply common DataFrame operations like selecting columns, adding new columns, filtering, grouping, and aggregating.
Spark applications run as independent processes coordinated by a SparkContext.
You can create a SparkContext in PySpark using:
from pyspark import SparkContext
sc = SparkContext('local[2]', 'Spark 101')

To use all available cores, initialize with:

sc = SparkContext('local[*]', 'Spark 101')

After creating the context, a simple Celsius‑to‑Kelvin conversion can be parallelized:
temp_c = [10, 3, -5, 25, 1, 9, 29, -10, 5]
rdd_temp_c = sc.parallelize(temp_c)
rdd_temp_K = rdd_temp_c.map(lambda x: x + 273.15).collect()
print(rdd_temp_K)

The result is a list of Kelvin temperatures:

[283.15, 276.15, 268.15, 298.15, 274.15, 282.15, 302.15, 263.15, 278.15]

Besides map, reduce can combine elements, e.g.:
numbers = [1, 4, 6, 2, 9, 10]
rdd_numbers = sc.parallelize(numbers)
rdd_reduce = rdd_numbers.reduce(lambda x, y: '(' + str(x) + ', ' + str(y) + ')')
print(rdd_reduce)

PySpark is often used for machine‑learning pipelines, and it provides rich data‑processing capabilities.
First, create a SparkSession:
from pyspark.sql import SparkSession
session = SparkSession.builder.appName('data_processing').getOrCreate()

Read a CSV file into a DataFrame and show its schema:
df = session.read.csv('iris.csv', inferSchema=True, header=True)
df.show()
df.printSchema()

You can select specific columns:

df.select('sepal_width', 'petal_width').show(3)

Add a new column derived from an existing one:

df.withColumn('new_col_name', df['sepal_width'] * 10).show(6)

Filter rows based on conditions:

df.filter(df['species'] == 'setosa').show(9)
df.filter((df['species'] == 'setosa') | (df['sepal_width'] < 3.0)).show(9)

Get distinct values of a column:

df.select('species').distinct().show()

Perform aggregations such as counting occurrences per value:

df.groupBy('petal_length').count().show()

Sort the aggregated results descending by count:

df.groupBy('petal_length').count().orderBy('count', ascending=False).show()

Aggregate other columns, e.g., the sum of sepal_length per species:

df.groupBy('species').agg({'sepal_length': 'sum'}).show()

These examples cover basic data‑processing tasks in PySpark; the framework offers many more advanced features.