Big Data Interview Drill: Data Skew, Spark OOM, PySpark, MySQL, and Storage Engines

Published: 2025-08-29

This review is built around the kinds of questions that come up again and again in big data development interviews, especially for internship roles. The focus is practical rather than theoretical: data skew, Spark troubleshooting, sorting algorithms, database connections, common Python tooling, PySpark hands-on work, and the differences between storage systems such as ClickHouse, Elasticsearch, and Hive.

Data skew: what it is, how to spot it, and how to handle it

What data skew actually means

Data skew is one of the most common performance problems in distributed computing systems such as Spark and MapReduce. It happens when data is distributed very unevenly across nodes. A few tasks or executors end up processing far more data than the rest—for example, one node handling ten million records while others only process one hundred thousand.

When that happens, the overloaded node becomes the bottleneck. Typical symptoms include very slow execution, executor OOM, and in severe cases a failed job. Even if the rest of the cluster finishes quickly, the whole task still waits for the slowest skewed partition.

The usual signs are easy to recognize:

a job gets stuck around 99% in the reduce phase
only a small number of executors run out of memory
CPU and memory usage differ drastically across nodes

How to detect data skew

There are three common ways to find it.

Use monitoring tools

In Spark UI, usually on port 4040, check task execution details for each stage. The Input Size/Records column is especially useful. If one task is processing dozens of times more data than the others, that is a strong signal of skew. Executor-level CPU and memory metrics can also help; skewed executors usually show abnormally high load.

Read the logs

Execution logs often reveal symptoms such as Task is running too slow or OOM. If the errors repeatedly show up on only a few task IDs, and key distribution statistics from a group by or join field confirm that some keys are heavily overrepresented, the skewed keys can be identified.

Check distribution before the job runs

For important fields used in group by or joins, it is often worth running a frequency check first, such as countByKey(). If a handful of keys appear far more often than the rest—for example, one key appearing a million times while most others stay below ten thousand—you can predict skew before it turns into a runtime problem.

How to fix data skew

At its core, data skew is a distribution problem. Fixes usually target one of three layers:

the map stage, where data can be filtered, reduced, or redirected before shuffle
the reduce stage, where skew actually becomes the bottleneck
supporting optimizations, such as adaptive execution or resource tuning

Map-side strategies

1. Enable map-side pre-aggregation

A common way to cut down shuffle pressure is to aggregate early.

In Spark, spark.sql.mapAggregateBytesLimit controls map-side aggregation behavior, with a default of 10485760 bytes. More importantly, in key-based aggregation scenarios, prefer reduceByKey over groupByKey, because reduceByKey includes map-side combining and can significantly reduce shuffle volume.

2. Filter or split skewed keys before shuffle

If skewed keys are invalid data—such as null values or test records—remove them before shuffle. If the skewed keys are valid but limited in number, isolate them and process them separately on the map side, then store the partial result and keep them out of the main reduce path.

3. Broadcast the small table for map-side joins

In join scenarios, if one side is small enough, broadcasting it can eliminate shuffle entirely. A small table under about 100 MB is a typical candidate. Spark can broadcast it to all executors so that the large table joins locally on the map side.

The broadcast threshold can be adjusted through spark.sql.autoBroadcastJoinThreshold, and in Hive-style syntax a map join can also be forced manually, for example with /*+ MAPJOIN(table_name) */.

Reduce-side strategies

1. Add random prefixes to skewed keys

This is one of the most classic solutions.

If a specific key causes skew in a group by or join, add a random prefix to split it into multiple subkeys before reduce. For example, user10086 can be transformed into 0_user10086, 1_user10086, and so on. That spreads the records across multiple reducers. After partial aggregation, remove the prefixes and perform a second aggregation to recover the final result.

2. Split the skewed keys in the large table and expand the small table

This is specific to joins where both sides are large, but only one table contains skewed keys.

The large table’s hot keys are randomized with prefixes, and the other table is expanded by duplicating matching records with the same prefix range. Then the join can be distributed evenly. After the join, the prefixes are removed so the result remains correct.

3. Increase shuffle parallelism

More shuffle partitions can reduce the chance that a single reducer gets too much data.

The relevant Spark parameter is spark.sql.shuffle.partitions, which defaults to 200. In larger jobs, values around 1000 to 2000 are often used so that work is spread more evenly across reducers.

Supporting optimizations

1. Turn on adaptive execution

With spark.sql.adaptive.enabled=true, Spark can adjust partitioning and execution plans based on actual runtime statistics. It can merge very small tasks, split overly large ones, and reduce the amount of manual tuning needed in mildly skewed workloads.

2. Preprocess upstream data

Sometimes the best fix is before Spark even starts. High-frequency keys can be sharded in storage, and special values such as null can be replaced with legal randomized keys where appropriate, reducing skew risk at the source.

3. Improve resource allocation

If a skewed reducer cannot be avoided completely, stronger executors may help it survive the workload. Typical settings include something like --executor-memory 8g --executor-cores 4, along with spark.executor.memoryOverhead=2g to expand off-heap memory and reduce the risk of memory-related failure.

Spark job troubleshooting: especially OOM

If a Spark job fails, how should the investigation start?

A practical troubleshooting path usually has three steps.

Step 1: identify the error type and where it happened

Start with the logs.

Driver logs are typically available on the node where the job was submitted.
Executor logs can be accessed in Spark UI under Executors → Logs.

The first goal is to determine whether it is a Driver OOM or an Executor OOM, and which task ID or stage ID is involved.

Spark UI helps narrow it further:

In Stages, look at task runtime, input size, and memory behavior.
In Storage, check whether too many RDDs are cached.

Step 2: analyze the likely root cause

Common reasons for Executor OOM include:

data skew causing one task to process far too much data
too much RDD caching, or cached data not being released
misuse of operators such as collect() that create memory pressure
executor memory that is simply too small for the job

Common reasons for Driver OOM include:

using collect() or take() to pull too much data back to the driver
oversized broadcast variables
driver memory being underconfigured

Step 3: verify the hypothesis

If skew is suspected, inspect key distribution with something like countByKey().
If caching is suspected, use the Storage page to see which RDDs occupy memory.
If operators are the problem, review code for actions and transformations that may create large in-memory datasets, especially collect() and potentially heavy aggregations.

If the issue is OOM, what can actually be optimized?

Main optimization directions

Solve skew first

If the real issue is one reducer processing a disproportionate amount of data, memory tuning alone will not be enough. The skew handling methods above—key randomization, higher shuffle parallelism, map-side joins—are the first things to consider.

Improve memory usage

Cache only what is necessary.
Prefer persist(StorageLevel.MEMORY_AND_DISK_SER) when memory is tight, because it spills to disk and stores serialized data, using less memory than the default MEMORY_ONLY.
Avoid pulling large datasets to the driver. Unless the result is tiny, do not use collect(). If inspection is needed, takeSample() is usually safer.
Choose more efficient operators. reduceByKey is generally better than groupByKey because it reduces shuffle data volume.
Filter or repartition large tables before joins and aggregations where possible.

Adjust resource configuration

Executor memory, CPU cores, and overhead settings should match the workload size and complexity.

Useful parameters for OOM scenarios

Executor-side settings

--executor-memory 8g sets memory per executor; the default is usually 1g, and larger jobs often need 8g to 16g.
--executor-cores 4 increases parallel processing capability within an executor.
spark.executor.memoryOverhead=2g increases off-heap memory, which can prevent OOM caused by insufficient overhead space.

Driver-side settings

--driver-memory 4g increases driver memory, especially important if the job uses collect() or large broadcasts.
spark.driver.memoryOverhead=1g adjusts driver off-heap memory.

Shuffle and storage settings

spark.sql.shuffle.partitions=1000 increases shuffle parallelism and often helps with skew-related OOM.
spark.storage.memoryFraction=0.6 controls the share of memory used for caching. If the job does not rely heavily on cached data, lowering this can free more execution memory.

Other helpful parameters

spark.task.maxFailures=4 controls task retry count. In mild cases, retries may let a task succeed.
spark.sql.adaptive.enabled=true enables adaptive execution for dynamic plan and partition adjustments.

Sorting algorithms that are commonly asked in interviews

In interviews, you are usually expected to know the major sorting algorithms, their complexity, whether they are stable, and when they are a good fit.

<table> <thead> <tr> <th>Sorting Algorithm</th> <th>Average Time</th> <th>Worst Time</th> <th>Space Complexity</th> <th>Stable</th> <th>Typical Characteristics and Use Cases</th> </tr> </thead> <tbody> <tr> <td>Bubble Sort</td> <td>O(n²)</td> <td>O(n²)</td> <td>O(1)</td> <td>Yes</td> <td>Very intuitive, sorts through adjacent swaps; suitable only for small datasets because it is inefficient.</td> </tr> <tr> <td>Selection Sort</td> <td>O(n²)</td> <td>O(n²)</td> <td>O(1)</td> <td>No</td> <td>Repeatedly selects the minimum or maximum and places it in position; useful when swaps are expensive and data size is small.</td> </tr> <tr> <td>Insertion Sort</td> <td>O(n²)</td> <td>O(n²)</td> <td>O(1)</td> <td>Yes</td> <td>Inserts each element into an already sorted portion; performs well on small or nearly sorted data.</td> </tr> <tr> <td>Quick Sort</td> <td>O(nlogn)</td> <td>O(n²)</td> <td>O(logn) recursive stack</td> <td>No</td> <td>Divide-and-conquer approach based on pivot partitioning; widely used in practice and efficient on large datasets.</td> </tr> <tr> <td>Merge Sort</td> <td>O(nlogn)</td> <td>O(nlogn)</td> <td>O(n)</td> <td>Yes</td> <td>Divide-and-conquer with splitting and merging; requires extra space but is stable and good for large-scale or external sorting.</td> </tr> <tr> <td>Heap Sort</td> <td>O(nlogn)</td> <td>O(nlogn)</td> <td>O(1)</td> <td>No</td> <td>Based on heap structure, repeatedly extracts the heap top; efficient and space-saving for large datasets.</td> </tr> <tr> <td>Radix Sort</td> <td>O(d*(n+r))</td> <td>O(d*(n+r))</td> <td>O(n+r)</td> <td>Yes</td> <td>Sorts digit by digit; efficient for fixed-length integers or strings at scale.</td> </tr> </tbody> </table>

If the interview goes deeper, quick sort, merge sort, and heap sort are usually the ones worth explaining in more detail:

Quick sort: pivot choice directly affects efficiency
Merge sort: stable and well suited to external sorting
Heap sort: good time complexity with constant auxiliary space

Ways to connect to MySQL from a server

There are three standard connection methods depending on the situation: command line, code, or GUI tools.

Command-line connection

This is the simplest method for quick checks.

It requires a MySQL client such as mysql-client, and the command format is:

mysql -h <host_ip> -P <port> -u <username> -p <database>

Parameter notes:

-h: MySQL server IP address; for local access it can be omitted or replaced with localhost or 127.0.0.1
-P: port number, default is 3306
-u: login user, such as root
-p: prompts for a password; there should be no space after -p
<database>: optional; if provided, login goes directly into that database

Example:

mysql -u root -p test

Connecting through code

Python with `pymysql`

The normal flow is to install the library with pip install pymysql, then establish a connection, create a cursor, run SQL, read the result, and close resources.

import pymysql

# 建立连接
conn = pymysql.connect(
    host='localhost',  # 服务器IP
    port=3306,         # 端口号
    user='root',       # 用户名
    password='123456', # 密码
    database='test'    # 数据库名
)

# 创建游标（执行SQL的工具）
cursor = conn.cursor()

# 执行SQL
cursor.execute("SELECT * FROM user LIMIT 10")
result = cursor.fetchall()  # 获取查询结果
print(result)

# 关闭游标和连接（避免资源泄露）
cursor.close()
conn.close()

Java with JDBC

The usual setup is to import the MySQL JDBC driver package and connect through DriverManager.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MySQLConnection {
    public static void main(String[] args) {
        String url = "jdbc:mysql://localhost:3306/test?useSSL=false&serverTimezone=UTC";
        String user = "root";
        String password = "123456";

        try {
            // 加载驱动（MySQL 8.0+可省略）
            Class.forName("com.mysql.cj.jdbc.Driver");
            // 建立连接
            Connection conn = DriverManager.getConnection(url, user, password);
            // 创建Statement
            Statement stmt = conn.createStatement();
            // 执行SQL
            ResultSet rs = stmt.executeQuery("SELECT * FROM user LIMIT 10");
            // 处理结果
            while (rs.next()) {
                System.out.println(rs.getString("username"));
            }
            // 关闭资源
            rs.close();
            stmt.close();
            conn.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Connecting with GUI tools

Tools such as Navicat and DBeaver are common in operations and analysis scenarios.

The steps are straightforward:

Create a new connection and choose MySQL.
Fill in host IP, port, username, password, and database name.
Test the connection.
If it succeeds, you can query or manage data visually.

One practical detail: if a firewall is enabled on the server, the MySQL port—usually 3306—must be opened for the relevant external IPs.

Core Python libraries for data analysis and processing

A solid answer usually groups these tools by purpose rather than listing names without context.

Reading and basic processing

Pandas

This is the core library for structured data analysis in Python. Its DataFrame is the standard table-like structure used for reading and cleaning data from CSV, Excel, JSON, and databases. Common operations include missing-value handling, deduplication, type conversion, filtering, aggregation with groupby, and joins with merge.

NumPy

NumPy is the base numerical library in Python. It provides the ndarray structure and efficient array operations for arithmetic, matrix computation, and statistical functions. Many other libraries, including Pandas, depend on it internally.

Visualization

Matplotlib

The foundational plotting library in Python. It supports line charts, bar charts, histograms, scatter plots, pie charts, and other basic visualizations, with detailed control over styles, labels, colors, and fonts.

Seaborn

Built on top of Matplotlib, Seaborn offers cleaner syntax and more polished statistical charts, such as heatmaps, box plots, and violin plots. It is often used for quick and visually appealing analysis output.

Plotly

An interactive visualization library suitable for dashboards and web-based presentation. Charts can be zoomed, hovered, and explored dynamically, and it supports maps and 3D visualization as well.

Big data processing

PySpark

The Python API for Spark. It supports distributed data processing and offers a DataFrame API similar in style to Pandas, but designed for large-scale workloads including reading, cleaning, analysis, and even machine learning.

Database interaction

PyMySQL

A Python library for connecting to MySQL, executing SQL, and reading results.

SQLAlchemy

An ORM library that supports multiple databases such as MySQL, PostgreSQL, and SQLite. It allows database interaction through Python objects and is better suited to more complex application-level data access.

Other practical built-in tools

Datetime

Useful for parsing dates, formatting time values, and calculating time differences.

Python’s regular expression module, used for string matching, extraction, and replacement, such as pulling phone numbers or emails from raw text.

A simple PySpark analysis demo: from raw behavior logs to results

A common interview format is not just “do you know PySpark,” but “describe how you would solve a simple business problem.”

The requirement

Suppose there is user behavior data containing actions such as clicks, views, and orders. The task is to compute:

daily click volume
the top 3 product categories by click count
the click preference of users who placed orders, meaning the product category each such user clicked most often

Implementation idea

A clear verbal explanation can be structured like this:

Prepare the environment

Make sure Spark is configured, whether in local mode or cluster mode, and import the required PySpark classes such as SparkSession and DataFrame functions.

Read the data

Assume the source is a CSV file with fields like user_id, action_time, action_type, product_id, and product_category.

Clean the data

filter out invalid records such as missing user_id
reject invalid timestamps
convert action_time into a date field
keep only relevant actions such as clicks and orders

Perform the analysis

group by date to count daily clicks
group by product category and rank categories to get the top 3
identify users who placed orders, join back to their click records, and determine the most-clicked category per user

Write the output

Save the result into CSV or MySQL for later use.

Stop the SparkSession

Release resources after the job finishes.

Code example

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, count, row_number
from pyspark.sql.window import Window

# 1. 初始化SparkSession（入口）
spark = SparkSession.builder \
    .appName("UserBehaviorAnalysis") \
    .master("local[*]")  # 本地模式，集群模式可省略
    .getOrCreate()

# 2. 读取数据（CSV格式）
df = spark.read.csv(
    path="user_behavior.csv",
    header=True,  # 第一行为列名
    inferSchema=True  # 自动推断数据类型
)

# 3. 数据清洗
# 3.1 过滤无效数据
df_clean = df.filter(
    col("user_id").isNotNull() &
    col("action_time").isNotNull() &
    col("action_type").isin(["click", "order"])  # 只保留点击和下单行为
)

# 3.2 转换日期格式，提取日期字段
df_clean = df_clean.withColumn("action_date", to_date(col("action_time"), "yyyy-MM-dd"))

# 4. 数据分析
# 4.1 统计每日用户点击量
daily_click = df_clean.filter(col("action_type") == "click") \
    .groupBy("action_date") \
    .agg(count("user_id").alias("daily_click_count")) \
    .orderBy("action_date")

# 4.2 各商品类别的点击top3
category_click = df_clean.filter(col("action_type") == "click") \
    .groupBy("product_category") \
    .agg(count("user_id").alias("click_count")) \
    .orderBy(col("click_count").desc())

# 使用窗口函数取top3
window = Window.orderBy(col("click_count").desc())
category_click_top3 = category_click.withColumn("rank", row_number().over(window)) \
    .filter(col("rank") <= 3) \
    .drop("rank")

# 4.3 下单用户的点击偏好
# 先获取下单用户ID
order_users = df_clean.filter(col("action_type") == "order") \
    .select("user_id").distinct()

# 关联下单用户的点击数据，统计每个用户点击最多的商品类别
user_click_preference = df_clean.filter(col("action_type") == "click") \
    .join(order_users, on="user_id", how="inner") \
    .groupBy("user_id", "product_category") \
    .agg(count("*").alias("click_count"))

# 窗口函数取每个用户点击最多的类别（若有并列取第一个）
window_user = Window.partitionBy("user_id").orderBy(col("click_count").desc())
user_click_preference = user_click_preference.withColumn("rank", row_number().over(window_user)) \
    .filter(col("rank") == 1) \
    .drop("rank", "click_count")

# 5. 结果输出（写入CSV，本地模式）
daily_click.write.csv("daily_click_result", header=True, mode="overwrite")
category_click_top3.write.csv("category_click_top3", header=True, mode="overwrite")
user_click_preference.write.csv("user_click_preference", header=True, mode="overwrite")

# 6. 关闭SparkSession
spark.stop()

ClickHouse, Elasticsearch, and how they differ from Hive

What ClickHouse is good at

ClickHouse is an open-source columnar database originally developed by Yandex. It is built for OLAP workloads, where the goal is fast analytical querying over large volumes of data.

Its strengths are:

fast query performance
support for large-scale storage
strong aggregation capability

Typical use cases include log analysis, user behavior analysis, and reporting systems. In many big data platforms, it serves as the query engine for analytics.

What Elasticsearch is good at

Elasticsearch is a distributed search engine built on Lucene and based on an inverted index structure. Its strengths are different from ClickHouse:

powerful full-text search
real-time data ingestion and querying
strong scalability

It is a natural fit for log retrieval, product search, and monitoring or alerting systems. It is also commonly used as part of the ELK stack with Logstash and Kibana.

ClickHouse vs Hive

Although both appear in data platform discussions, their positioning is quite different.

<table> <thead> <tr> <th>Dimension</th> <th>ClickHouse</th> <th>Hive</th> </tr> </thead> <tbody> <tr> <td>Storage structure</td> <td>Columnar storage, so queries read only required columns and reduce IO</td> <td>Traditionally row-oriented at the table level, though it can use columnar file formats such as ORC and Parquet on HDFS</td> </tr> <tr> <td>Primary goal</td> <td>OLAP analytics with low-latency aggregation and statistical queries</td> <td>Offline batch processing, ETL, and warehouse-style analytics</td> </tr> <tr> <td>Data location</td> <td>Local disks across nodes in a distributed cluster</td> <td>HDFS-based storage with Hadoop file system reliability</td> </tr> <tr> <td>Query engine</td> <td>Comes with its own execution engine and SQL support</td> <td>Relies on external engines such as MapReduce, Spark, or Tez; essentially a SQL layer rather than a standalone compute engine</td> </tr> <tr> <td>Real-time capability</td> <td>Supports real-time writes, batch or single-row, and data is queryable immediately</td> <td>Mainly for offline storage; newly added data may require metadata repair operations before querying, so real-time performance is poor</td> </tr> <tr> <td>Typical scenarios</td> <td>Real-time analytics, logs, user behavior analysis, low-latency reporting</td> <td>Offline ETL, batch workloads, and data warehouse construction</td> </tr> <tr> <td>Consistency</td> <td>Some engines support ACID-style transactional guarantees</td> <td>No built-in transaction support; consistency is often enforced at the business logic level</td> </tr> </tbody> </table>

Why ClickHouse queries are so fast

Its performance comes from several design choices working together.

1. Columnar storage

Data is stored by column rather than by row. If a query touches only a few columns, ClickHouse reads only those columns, greatly reducing disk IO. Since values in the same column share the same type, compression is also much more effective, with algorithms such as LZ4 and ZSTD further reducing storage size and read cost.

2. Efficient indexing

ClickHouse supports multiple index mechanisms:

primary key index for faster range scans based on sorted storage
skip indexes for low-cardinality columns such as gender or region, helping skip large blocks of irrelevant data
Bloom filters for quick existence checks that avoid unnecessary reads

3. Vectorized execution

Instead of processing one row at a time, ClickHouse processes data in batches. This reduces function call overhead and CPU context switching, making better use of CPU resources.

4. Partitioning and pre-aggregation

Data can be partitioned by dimensions such as date, which lets queries scan only relevant partitions. It also supports materialized views, which store precomputed aggregates so repeated analytical queries do not need to recalculate everything from raw data.

5. Distributed architecture

In a cluster, data can be sharded across nodes and queried in parallel. Replicas improve both availability and concurrent query capacity.

6. Other execution optimizations

Additional gains come from vectorized SQL functions, reduced unnecessary data copying, and a query optimizer that chooses efficient execution plans.

For interview purposes, the key is not just memorizing definitions. It helps more to answer from a practical angle: what problem the technology solves, how you would identify the issue in a real system, and what trade-offs matter when choosing one approach over another.