Distributed Computing and Apache Spark
Algorithms for Big Data
Systems for Big Data Processing
A single computer with 512 GB RAM and a fast SSD is powerful, but modern companies’ data amounts to petabytes. Distributed computing systems allow this data to be processed on clusters of thousands of machines. Understanding these systems is critical for an ML engineer working with real-world data.
MapReduce: Parallel Computing Paradigm
Idea (Dean & Ghemawat, Google, 2004): break the computation into two stages, each of which is parallelized across the data.
Map(k, v) → list of (k', v'): applied independently to each record. Example (word count): Map(_, "hello world") → [("hello", 1), ("world", 1)]; the input key (for instance a byte offset) is ignored.
Shuffle: the system automatically groups all (k', v') pairs by key k'. This is the most expensive step, because the data moves across the network.
Reduce(k', list v') → list (k'',v''): aggregates all values for one key. Example: Reduce("hello", [1,1,1]) → ("hello", 3).
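To make the three stages concrete, here is a minimal single-machine sketch in plain Python; the function names (map_fn, shuffle, reduce_fn) are illustrative, not a real MapReduce API:

```python
from collections import defaultdict

def map_fn(record):
    # Map: emit one (word, 1) pair per word in the input line
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Shuffle: group all values by key (on a real cluster, the network-heavy step)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce: aggregate all counts for one word
    return key, sum(values)

lines = ["hello world", "hello spark"]
mapped = [pair for line in lines for pair in map_fn(line)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(mapped).items())
print(counts)  # {'hello': 2, 'world': 1, 'spark': 1}
```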
PageRank via MapReduce: iterative matrix-vector multiplication, with one MapReduce job per iteration. Disk problem: every iteration writes its intermediate state to HDFS and reads it back, so I/O dominates.
Hadoop HDFS: distributed file system. Files are split into 128 MB blocks, each replicated 3×. The master (NameNode) stores the metadata; DataNodes store the blocks. Fault tolerance: if a DataNode fails, the NameNode schedules new replicas of its blocks.
Apache Spark: In-Memory Computation
Key difference from MapReduce: data remains in memory between operations (RDD.cache()). For iterative algorithms (ML, PageRank) this gives a 10–100× speedup.
RDD (Resilient Distributed Dataset): immutable, distributed collection of elements. Partitioned across cluster nodes. Lineage graph (dependency graph) provides fault tolerance: on failure, only lost partitions are recomputed.
Transformations (lazy): map, filter, flatMap, join, groupByKey, reduceByKey. They are not executed immediately; they only build up the computation graph (DAG).
Actions (eager): collect, count, reduce, saveAsTextFile; these trigger execution of the DAG. In the DataFrame API, the Catalyst optimizer analyzes the query plan and applies optimizations (predicate pushdown, column pruning, join reordering).
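A minimal PySpark sketch of the lazy/eager split, assuming a local SparkSession (the input strings are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(["hello world", "hello spark"])

# Transformations: nothing runs yet, Spark only extends the DAG
words = rdd.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Action: triggers execution of the whole DAG
print(counts.collect())  # [('hello', 2), ('world', 1), ('spark', 1)], order may vary
```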
DataFrame API: columnar in-memory representation, SQL compatibility, optimization via Catalyst. Example: spark.read.parquet("data").filter("age > 25").groupBy("city").agg(avg("salary")). Internally, Catalyst searches for an efficient execution plan; a runnable version is sketched below.
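The sketch assumes a Parquet dataset at the path "data" with columns age, city, and salary (these are this example's assumptions); explain() prints the plans Catalyst produced:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()

df = (spark.read.parquet("data")   # hypothetical dataset with age, city, salary
        .filter("age > 25")        # candidate for predicate pushdown into Parquet
        .groupBy("city")
        .agg(avg("salary").alias("avg_salary")))

df.explain(True)  # show the logical and physical plans chosen by Catalyst
df.show()
```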
Spark MLlib: scalable ML algorithms. LinearRegression, LogisticRegression, RandomForest, GBT, KMeans, ALS (Alternating Least Squares) for collaborative filtering. ALS powers Netflix-style recommendations on billions of ratings; see the sketch below.
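A hedged ALS sketch on MovieLens-style data: the hyperparameters come from the assignment at the end of this article, and the CSV path and column names assume the standard MovieLens 20M layout.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.getOrCreate()

# MovieLens 20M ratings: columns userId, movieId, rating, timestamp
ratings = spark.read.csv("ml-20m/ratings.csv", header=True, inferSchema=True)
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

als = ALS(rank=50, maxIter=10, regParam=0.1,
          userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")  # drop predictions for users/items unseen in training
model = als.fit(train)

rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction").evaluate(model.transform(test))
print(f"RMSE = {rmse:.3f}")
```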
GPU Computing: Core-Level Parallelism
GPU vs CPU. CPU: 8–64 cores, large caches, optimized for low-latency sequential execution. GPU: 5,000–10,000+ cores, small caches, optimized for massively parallel throughput.
CUDA model: a grid of N blocks × M threads per block yields N×M parallel threads, each working on a small piece of the data. Warps (groups of 32 threads) are the basic execution unit.
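Staying in Python, the blocks × threads mapping can be sketched with Numba's CUDA JIT (Numba itself is an assumption, not mentioned in the text); each thread handles one array element:

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(x, out, alpha):
    i = cuda.grid(1)   # global index = blockIdx.x * blockDim.x + threadIdx.x
    if i < x.size:     # guard: the grid may be larger than the data
        out[i] = alpha * x[i]

x = np.arange(1_000_000, dtype=np.float32)
out = np.empty_like(x)
threads_per_block = 256                                # 8 warps of 32 threads
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](x, out, 2.0)          # launch N blocks × M threads
```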
Tensor Cores (NVIDIA Volta+): FP16/BF16 matrix multiplications with roughly 8× the throughput of FP32 CUDA cores. Peak spec-sheet numbers (with structured sparsity): A100: 312 TFLOPS (TF32), 1248 TOPS (INT8); H100: about 2000 TFLOPS (BF16).
torch.compile (PyTorch 2.0): compiles the Python/PyTorch graph into optimized CUDA kernels via TorchInductor + Triton. Typically up to ~2× speedup "for free" on transformer workloads.
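Basic usage is a one-line wrapper around an existing model; the transformer below is an arbitrary example, and the first call is slow because it triggers compilation:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=6).to(device)

compiled = torch.compile(model)  # TorchInductor traces the graph and emits fused kernels

x = torch.randn(32, 128, 512, device=device)
with torch.no_grad():
    y = compiled(x)  # first call compiles; subsequent calls reuse the compiled kernels
```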
Efficient Inference
Problem: a trained LLaMA-70B model requires 140 GB of GPU memory just for its FP16 weights. Most tasks do not require full precision.
Quantization: FP16/BF16 instead of FP32 (2× smaller), INT8 (4×), INT4 (8×). GPTQ: post-training quantization of weights to INT4 with minimal loss of quality. AWQ: activation-aware weight quantization.
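A back-of-the-envelope check of these factors for a 70B-parameter model (weights only, ignoring activations and the KV cache):

```python
params = 70e9  # LLaMA-70B-class model

bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "INT8": 1, "INT4": 0.5}
for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt:>10}: {params * nbytes / 1e9:5.0f} GB")
# FP32: 280 GB, FP16/BF16: 140 GB (the figure above), INT8: 70 GB, INT4: 35 GB
```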
KV-cache: in autoregressive generation, the keys and values of previous tokens are cached. Without the cache, generating n tokens requires reprocessing O(n²) token positions; with the cache, only O(n). Bottleneck for long context: a 32K-token cache for a 70B-class model runs to tens of gigabytes, depending on the attention layout and precision (estimate below).
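A rough cache-size estimate; the layer and head counts below are assumed, illustrative 70B-class configurations, not official model specs:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_el=2):
    # Two tensors (K and V) per layer; one (n_kv_heads * head_dim) vector per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el / 1e9

# FP16 cache for a 32K-token context:
print(kv_cache_gb(80, 64, 128, 32_768))  # full multi-head attention: ~86 GB
print(kv_cache_gb(80, 8, 128, 32_768))   # grouped-query attention:  ~10.7 GB
```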
vLLM: PagedAttention manages the KV cache like pages in OS virtual memory. Continuous batching: requests of different lengths are batched together efficiently. Up to 24× higher throughput than naive serving.
Numerical Example
Apache Spark WordCount on a 10-node cluster, 1 TB text file:
- MapReduce (Hadoop): ~45 minutes (disk I/O dominates)
- Spark (RDD, in-memory): ~8 minutes (6× speedup)
- Spark (DataFrame + Parquet): ~3 minutes (columnar storage + Catalyst)
PageRank calculation for a graph of 1B nodes:
- MapReduce: 60 iterations × 5 min = 5 hours
- Spark with cached RDDs: 60 iterations × 30 sec = 30 minutes (see the sketch below)
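A PySpark PageRank sketch in the spirit of the official Spark example; the toy edge list stands in for a real billion-node graph. cache() keeps the link structure in memory across all iterations, which is exactly where the speedup over MapReduce comes from:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def contribs(neighbors, rank):
    # Spread a node's rank evenly over its outgoing links
    for dst in neighbors:
        yield dst, rank / len(neighbors)

edges = sc.parallelize([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
links = edges.groupByKey().cache()      # reused every iteration: pin it in memory
ranks = links.mapValues(lambda _: 1.0)

for _ in range(10):
    joined = links.join(ranks)          # (node, (neighbors, rank))
    ranks = (joined.flatMap(lambda kv: contribs(kv[1][0], kv[1][1]))
                   .reduceByKey(lambda a, b: a + b)
                   .mapValues(lambda r: 0.15 + 0.85 * r))

print(sorted(ranks.collect()))
```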
Assignment: (1) Implement Word2Vec with PySpark MLlib on a text corpus (a 1 GB Wikipedia dump). Measure training time and quality (analogies like king − man + woman). (2) Train ALS for collaborative filtering on MovieLens 20M. Hyperparameters: rank=50, maxIter=10, regParam=0.1. Compute the RMSE. (3) Profile the execution in the Spark UI: which stage takes the most time, and how can it be improved?