learnxinyminutes-docs/spark.html.markdown at 08846f6c2087323a3209f8f2c06a69120e871da4


mirror of https://github.com/adambard/learnxinyminutes-docs.git synced 2025-04-26 07:03:57 +00:00

scronge 08846f6c20

Update spark.html.markdown

2024-11-09 13:50:37 -06:00


| language | category | tool | filename | contributors |
|----------|----------|------|----------|--------------|
| Spark | tool | Spark | learnspark.spark | Scronge (https://github.com/Scronge) |

Apache Spark is an open-source framework for distributed, large-scale data processing across clusters. This guide covers the basics of Spark using PySpark, its Python API.

# Setting Up Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ExampleApp") \
    .getOrCreate()

# Working with DataFrames
data = [("Alice", 30), ("Bob", 40)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)
df.show()
# +-----+---+
# | Name|Age|
# +-----+---+
# |Alice| 30|
# |  Bob| 40|
# +-----+---+

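# Transformations and Actions

# filter() is a transformation: it is lazy and returns a new DataFrame
df_filtered = df.filter(df.Age > 35)
# show() is an action: it triggers the computation and prints the result
df_filtered.show()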
# +----+---+
# |Name|Age|
# +----+---+
# | Bob| 40|
# +----+---+

# SQL Queries

df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE Age > 30").show()
# +----+---+
# |Name|Age|
# +----+---+
# | Bob| 40|
# +----+---+

# Reading and Writing Files

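# Read a CSV file, using the first row as column names and inferring column types
csv_df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
# Write a DataFrame as Parquet; by default this fails if the output path already exists
df.write.parquet("output_path")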

# RDD Basics

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

squared_rdd = rdd.map(lambda x: x ** 2)
print(squared_rdd.collect())
# Output: [1, 4, 9, 16]

# Ending the Spark Session

spark.stop()
