How to remote debug using software failure replay

Undo Bytes
3 min read · May 7, 2020

With new social distancing measures and most people now being advised to work from home, what do you do when a key customer reports a failure with your software in production?

Traditional production troubleshooting

Today, your debugging workflow might look like this:

  • Get initial description of problem from customer
  • If issue cannot be resolved remotely, send a Field Application Engineer (FAE) or developer to the customer site
  • FAE or dev reviews issue, using artefacts such as logs or core dumps
  • If issue cannot be resolved at this point, an R&D engineer is assigned
  • Logs, core dumps and/or other information are fed back to the R&D engineer for debugging
  • R&D engineer makes a new build containing a speculative fix based on the limited information available and waits for feedback from customer
  • FAE or dev deploys the new build at the customer site
  • Depending on customer feedback, the issue is resolved or another iteration of the workflow takes place
  • FAE or dev confirms fix has resolved the issue and closes support ticket

The guesswork involved in speculative bug fixing of this kind is time-consuming and frustrating for everyone involved.

A new world order

In the current climate, you can no longer send your engineers to customer sites. So how are you meant to diagnose and resolve production issues if you can’t access the live environment?

Software failure replay systems offer a solution. Imagine a CCTV camera recording an event. You don’t have to be in the same place at the same time as the recorded event to see what happened. A recording file is produced which can be subsequently replayed and reviewed to figure out what happened.

LiveRecorder is based on this next-gen diagnostic technology. It can capture a failed process (down to instruction level) for offline replay.

Here is how it works when LiveRecorder is embedded in your deployed product:

  • When things go wrong in production, ask your customer to activate LiveRecorder to capture the failed process in a recording (note: the recording serves as a standalone reproducible test case)
  • The customer sends you the recording file, which you replay to debug the failure as it happened in the original environment
  • After analyzing the recording by stepping forwards and backwards in the code execution, you can determine the root cause of the issue, develop a fix, and deploy that fix!
  • Go get yourself a well-deserved coffee and get on with developing the feature you were working on before you got interrupted.

Figure: Record. Replay. Resolve. Diagram illustrating this workflow.

The beauty of a recording is that you don’t have to waste time trying to reproduce the problem. The recording represents a 100% reproducible test case. And that recording can be debugged offline, in the comfort of your own home/office. Once the failed process is recorded, you can relive the defect — without impacting the live environment.
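
To make the idea of a recording as a reproducible test case more concrete, here is a toy sketch in Python. It is not how LiveRecorder is implemented (LiveRecorder captures a process down to instruction level); it only illustrates the general record-and-replay principle of logging nondeterministic inputs once and feeding them back later. All names here (`Recorder`, `Replayer`, `handle_request`) are hypothetical and exist only for this illustration.

```python
import json
import random
import time


class Recorder:
    """Live mode: call the real source and log every value it returns."""

    def __init__(self):
        self.log = []

    def call(self, name, fn):
        value = fn()
        self.log.append([name, value])
        return value


class Replayer:
    """Offline mode: return logged values instead of calling the real source."""

    def __init__(self, log):
        self._events = iter(log)

    def call(self, name, fn):
        logged_name, value = next(self._events)
        if logged_name != name:
            raise RuntimeError(
                f"replay diverged: expected {logged_name!r}, got {name!r}"
            )
        return value


def handle_request(env):
    """Hypothetical application logic with two nondeterministic inputs."""
    started = env.call("clock", time.time)
    sample = env.call("rng", random.random)
    # Stand-in for an intermittent failure that depends on runtime conditions.
    status = "failed" if sample < 0.5 else "ok"
    return {"started": started, "sample": sample, "status": status}


def record_run():
    """Run once 'in production' and return the result plus a portable recording."""
    recorder = Recorder()
    result = handle_request(recorder)
    return result, json.dumps(recorder.log)


def replay_run(recording):
    """Re-run the same logic elsewhere, driven only by the recording."""
    return handle_request(Replayer(json.loads(recording)))
```

In this sketch, `record_run()` would happen on the customer side and `replay_run(recording)` on the developer's machine. Every replay takes the same path as the original run, whatever the clock or random number generator would return locally, which is the property that removes the need to reproduce the problem by hand.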

LiveRecorder enables engineering teams to debug remotely and debug faster than with traditional debugging methods — allowing business to continue for both software vendors and their customers.

How to create a recording for reverse debugging. Watch this demo video.

Originally published at https://undo.io.

Undo is the time travel debugging company for Linux. We equip developers with the technology to understand complex code and fix bugs faster. https://undo.io