
Why Spark Jobs Keep Running After You Kill Them: Daemon Threads and Driver Behavior

This article investigates why Spark tasks that appear killed in the Web UI can keep running on the driver. It analyzes the role of daemon versus non-daemon threads and the SparkContext shutdown mechanism, reproduces the issue with sample code, and offers practical fixes such as using daemon threads and checking SparkContext status.

Data Thinking Notes

Problem Phenomenon

Two tasks killed via the Spark Web UI still have backend processes running, causing the driver machine's CPU usage to remain high.

The backend process was still visible in the process list (screenshot omitted).

Problem Reproduction

Test Program

The demo program consists of three parts:

(1) rdd.map operation – simulates a Spark distributed computation task.

<code>//section 1:
//do something for a long time
val rdd = sparkSession.sparkContext.parallelize(List(1,2,3,4,5)).repartition(4)
val rdd2 = rdd.map(x => {
  for (i <- 1 to 100) {
    for (j <- 1 to 999999999) {
    }
    if (i % 10 == 0) {
      println(i + " rdd map process running!")
    }
  }
  x * 2
})
rdd2.take(10).foreach(println)
</code>

(2) Nested loop on the driver – comparable to rdd.collect in that the work runs on the driver.

<code>//section 2:
//do something for a long time in driver
for (i <- 1 to 100) {
  for (j <- 1 to 999999999) {
  }
  if (i % 10 == 0) {
    println(i + " main process running!")
  }
}
</code>

(3) Multi‑threaded operation – tests manually created threads on the driver.

<code>//section 3: multi-thread
import java.util.concurrent.TimeUnit

def runThread() = {
  val t = new Thread(new Runnable {
    override def run(): Unit = {
      while (true) {
        println("Running something!")
        TimeUnit.SECONDS.sleep(1)
      }
    }
  })
  // daemonFlag is the program's first argument; "1" marks the thread as a daemon
  if (daemonFlag.equals("1")) {
    t.setDaemon(true)
  }
  t.start()
}
</code>

The program accepts two arguments: daemonFlag (0 = non-daemon, 1 = daemon) and multi (0 = do not start a thread, 1 = start a thread).
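The flag handling might be wired up as in the following sketch. The object name KillDemo, the argument order, and the default values are assumptions for illustration; they are not taken from the original program.

<code>object KillDemo {
  def main(args: Array[String]): Unit = {
    // args(0): daemonFlag ("1" = mark the manual thread as a daemon)
    // args(1): multi      ("1" = start the manual thread at all)
    val daemonFlag = if (args.length > 0) args(0) else "0"
    val multi      = if (args.length > 1) args(1) else "0"

    // ... build the SparkSession, then run sections 1 and 2 ...

    if (multi.equals("1")) {
      // runThread() from section 3 consults daemonFlag before t.start()
    }
  }
}
</code>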

Test Results

The tests show whether the backend process remains after the job is killed, under the different flag combinations:

With no manually started thread (multi = 0), killing an rdd.map job stops the backend process, but killing a driver-side computation leaves it alive.

With a non-daemon thread (multi = 1, daemonFlag = 0), the backend persists regardless of when the job is killed.

With a daemon thread (multi = 1, daemonFlag = 1), the behavior matches the first case (the backend stops).

Problem Cause

Manual Thread Start

A daemon thread provides background services (e.g., garbage collection) and does not prevent the JVM from exiting. The program will terminate only when all non‑daemon threads finish, which explains why a non‑daemon thread keeps the driver process alive after a kill.
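The rule above can be seen in a minimal, Spark-free sketch; DaemonDemo is a name introduced here purely for illustration.

<code>object DaemonDemo {
  def main(args: Array[String]): Unit = {
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        while (true) Thread.sleep(1000) // endless background "work"
      }
    })
    t.setDaemon(true) // remove this line and the process never terminates
    t.start()
    println("main finished") // JVM exits here: only a daemon thread remains
  }
}
</code>

With setDaemon(true) the JVM exits as soon as main returns; without it, the endless non-daemon thread keeps the process alive forever, which is exactly what happens to the driver after a kill.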

In Spark's source, stopInNewThread also creates a daemon thread:

<code>private[spark] def stopInNewThread(): Unit = {
  new Thread("stop-spark-context") {
    setDaemon(true)
    override def run(): Unit = {
      try {
        SparkContext.this.stop()
      } catch {
        case e: Throwable =>
          logError(e.getMessage, e)
          throw e
      }
    }
  }.start()
}
</code>

Solution: When the driver program manually starts threads, set them as daemon threads.

Driver‑Side Program

The Web UI kill action eventually calls StandaloneSchedulerBackend.dead, which invokes sc.stopInNewThread() in the finally block of the method:

<code>override def dead(reason: String) {
  notifyContext()
  if (!stopping.get) {
    launcherBackend.setState(SparkAppHandle.State.KILLED)
    logError("Application has been killed. Reason: " + reason)
    try {
      scheduler.error(reason)
    } finally {
      // Ensure the application terminates, as we can no longer run jobs.
      sc.stopInNewThread()
    }
  }
}
</code>

The TaskSchedulerImpl.error method further propagates the error:

<code>def error(message: String) {
  synchronized {
    if (taskSetsByStageIdAndAttempt.nonEmpty) {
      for {
        attempts <- taskSetsByStageIdAndAttempt.values
        manager <- attempts.values
      } {
        try {
          manager.abort(message)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    } else {
      throw new SparkException(s"Exiting due to error from cluster scheduler: $message")
    }
  }
}
</code>

Because sc.stopInNewThread() runs in a newly created daemon thread, the driver-side computation may continue after the UI kill.

Solution: Add explicit checks for SparkContext state in driver code, e.g.:

<code>//section 2 (modified): bail out once the SparkContext has been stopped
for (i <- 1 to 100 if !sparkSession.sparkContext.isStopped) {
  for (j <- 1 to 999999999 if !sparkSession.sparkContext.isStopped) {
    // busy loop; the guard above aborts it after the context stops
  }
  if (i % 10 == 0) {
    println(i + " main process running!")
  }
}
</code>

These checks ensure the driver loop exits promptly when the Spark application is killed.
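Instead of guarding every loop, a single watchdog can end the driver JVM once the context is gone. The following is a sketch, not part of the original article: startShutdownWatcher is a hypothetical helper, and it assumes the SparkContext is passed in by the caller.

<code>// Hypothetical helper: a daemon watchdog that terminates the driver JVM
// once the SparkContext has been stopped (e.g. after a Web UI kill).
def startShutdownWatcher(sc: org.apache.spark.SparkContext): Unit = {
  val watcher = new Thread(new Runnable {
    override def run(): Unit = {
      while (!sc.isStopped) {
        java.util.concurrent.TimeUnit.SECONDS.sleep(1)
      }
      // The context is gone; no further driver work is meaningful.
      sys.exit(0)
    }
  })
  watcher.setDaemon(true) // a daemon, so it never keeps the JVM alive itself
  watcher.start()
}
</code>

The watchdog must itself be a daemon thread; otherwise it would reintroduce the very problem it is meant to solve.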

Written by Data Thinking Notes

Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.
