Getting Started with PySpark: Creating SparkContext, Parallelizing Data, and Basic DataFrame Operations
This tutorial demonstrates how to initialize a SparkContext in PySpark, perform simple parallel computations such as temperature conversion and reduction, create a SparkSession to read CSV data, and apply common DataFrame operations like selecting columns, adding new columns, filtering, grouping, and aggregating.
Spark applications run as independent processes coordinated by a SparkContext.
You can create a SparkContext in PySpark using:
from pyspark import SparkContext
sc = SparkContext('local[2]', 'Spark 101')

To use all available cores, initialize with:

sc = SparkContext('local[*]', 'Spark 101')

After creating the context, a simple Celsius‑to‑Kelvin conversion can be parallelized:
temp_c = [10, 3, -5, 25, 1, 9, 29, -10, 5]
rdd_temp_c = sc.parallelize(temp_c)
rdd_temp_K = rdd_temp_c.map(lambda x: x + 273.15).collect()
print(rdd_temp_K)

The result is a list of Kelvin temperatures:

[283.15, 276.15, 268.15, 298.15, 274.15, 282.15, 302.15, 263.15, 278.15]

Besides map, reduce can combine elements, e.g.:
numbers = [1, 4, 6, 2, 9, 10]
rdd_numbers = sc.parallelize(numbers)
rdd_reduce = rdd_numbers.reduce(lambda x, y: '(' + str(x) + ', ' + str(y) + ')')
print(rdd_reduce)

PySpark is often used for machine‑learning pipelines, and it provides rich data‑processing capabilities.
First, create a SparkSession:
from pyspark.sql import SparkSession
session = SparkSession.builder.appName('data_processing').getOrCreate()

Read a CSV file into a DataFrame and show its schema:
df = session.read.csv('iris.csv', inferSchema=True, header=True)
df.show()
df.printSchema()

You can select specific columns:

df.select('sepal_width', 'petal_width').show(3)

Add a new column derived from an existing one:

df.withColumn('new_col_name', df['sepal_width'] * 10).show(6)

Filter rows based on conditions:

df.filter(df['species'] == 'setosa').show(9)
df.filter((df['species'] == 'setosa') | (df['sepal_width'] < 3.0)).show(9)

Get distinct values of a column:

df.select('species').distinct().show()

Perform aggregations such as counting occurrences per value:

df.groupBy('petal_length').count().show()

Sort the aggregated results descending by count:

df.groupBy('petal_length').count().orderBy('count', ascending=False).show()

Aggregate other columns, e.g., the sum of sepal_length per species:

df.groupBy('species').agg({'sepal_length': 'sum'}).show()

These examples cover basic data‑processing tasks in PySpark; the framework offers many more advanced features.