
Chapter 5: Eradication and Recovery

Introduction

Eradication and recovery shift incident response from containing the adversary to actively removing them. While containment limits damage and prevents spread, eradication removes adversary presence from the environment, and recovery restores normal business operations with an improved security posture. These phases give the organization the opportunity not merely to return to its pre-incident state, but to emerge more resilient.

This chapter explores the methodologies, techniques, and decision frameworks required to completely eliminate threats, systematically restore systems, validate security, and return to normal operations while preventing recurrence.

Root Cause Analysis

Effective eradication requires understanding not just what happened, but why it happened. Root cause analysis identifies the fundamental vulnerability or control failure that enabled the incident.

The Five Whys Technique

A simple but powerful approach to identifying root causes:

Five Whys Example: Ransomware Incident

Problem: Ransomware encrypted file server

  1. Why did ransomware encrypt the file server? Because a workstation was infected and the ransomware spread laterally.

  2. Why did the ransomware spread from the workstation? Because the workstation had SMB access to file servers.

  3. Why did the workstation have unrestricted SMB access? Because network segmentation was not implemented.

  4. Why was network segmentation not implemented? Because it was not prioritized in the security roadmap.

  5. Why was network segmentation not prioritized? Because risk assessment did not adequately evaluate lateral movement risks.

Root Cause: Inadequate risk assessment process failing to identify and prioritize lateral movement prevention controls.

Vulnerability Identification

Identify specific weaknesses exploited:

Technical Vulnerabilities:

  • Unpatched software (CVE identification)
  • Misconfigured security controls
  • Weak or default credentials
  • Excessive user privileges

Process Vulnerabilities:

  • Inadequate security awareness training
  • Missing security controls (MFA, EDR, network segmentation)
  • Insufficient monitoring and logging
  • Delayed patch management

Architectural Vulnerabilities:

  • Flat network architecture enabling lateral movement
  • Single points of failure without redundancy
  • Inadequate separation between production and administrative networks
  • Cloud misconfigurations

Document Everything

Comprehensive root cause analysis documentation informs eradication strategy, prevents recurrence, supports lessons learned, and may be required for regulatory reporting or litigation.

Malware Removal

Complete eradication requires removing all malicious software and adversary-controlled code from the environment.

Removal Strategies

System Reimaging (Preferred):

  • Wipe and rebuild systems from clean sources
  • Removes all malware on disk, including unknown components (firmware-level implants such as UEFI bootkits can survive a reimage and need separate remediation)
  • Most time-consuming but most thorough

Targeted Removal:

  • Remove identified malicious files, processes, and registry entries
  • Faster, but risks missing unknown persistence mechanisms
  • Appropriate only when the complete compromise scope is understood

Antivirus/EDR-Based Removal:

  • Use security tools to quarantine and remove malware
  • Convenient but may miss sophisticated threats
  • Should be validated through additional analysis

Rebuild vs. Remediate

For critical systems and sophisticated threats such as advanced persistent threats (APTs), reimaging is strongly recommended. Targeted removal risks leaving adversary footholds that enable re-compromise.
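To make the decision rule concrete, the following Python sketch encodes the guidance in this section as a function. The function name, its three boolean inputs, and reducing the decision to flags at all are illustrative assumptions; a real decision also weighs downtime, forensic preservation needs, and the availability of clean backups.

```python
from enum import Enum


class Strategy(Enum):
    REIMAGE = "reimage"
    TARGETED_REMOVAL = "targeted_removal"


def choose_strategy(is_critical: bool, sophisticated_threat: bool,
                    scope_fully_understood: bool) -> Strategy:
    """Hypothetical encoding of the rebuild-vs-remediate guidance.

    Reimaging is the default; targeted removal is chosen only when the
    system is not critical, the threat is not sophisticated (e.g. APT),
    and the full compromise scope is understood.
    """
    if is_critical or sophisticated_threat:
        return Strategy.REIMAGE
    if not scope_fully_understood:
        # Unknown scope means unknown persistence: rebuild instead.
        return Strategy.REIMAGE
    return Strategy.TARGETED_REMOVAL
```

Keeping reimaging as the fall-through default mirrors the text: targeted removal has to earn its place, rather than reimaging needing justification.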

Persistence Mechanism Elimination

Adversaries establish multiple persistence mechanisms—all must be identified and removed:

Windows Persistence Locations:

  • Registry Run keys (HKCU/HKLM\Software\Microsoft\Windows\CurrentVersion\Run)
  • Scheduled tasks
  • Windows services
  • WMI event subscriptions
  • DLL hijacking
  • Bootkit/rootkit (MBR or UEFI modification)
  • Account creation (backdoor accounts)

Linux Persistence Locations:

  • Cron jobs
  • systemd services
  • .bashrc, .bash_profile, .profile modifications
  • /etc/rc.local modifications
  • Kernel modules
  • SSH authorized_keys

Cross-Platform Persistence:

  • Compromised legitimate credentials
  • Web shells on web servers
  • Backdoored software or scripts
  • Cloud service accounts and API keys
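As an illustration of sweeping the Linux locations above, this Python sketch lists files that exist at common persistence paths under a given root, such as / on a live host or the mount point of a forensic image. The glob patterns are an assumed, non-exhaustive selection (for example, they skip /usr/lib/systemd/system and distribution-specific cron paths), and a file's presence is only a lead for manual review, not proof of compromise.

```python
from pathlib import Path

# Assumed, non-exhaustive glob patterns (relative to a filesystem root)
# for the Linux persistence locations listed above.
PERSISTENCE_GLOBS = {
    "cron": ["etc/crontab", "etc/cron.d/*", "var/spool/cron/**/*"],
    "systemd": ["etc/systemd/system/**/*.service",
                "home/*/.config/systemd/user/*.service"],
    "shell_profile": ["root/.bashrc", "root/.bash_profile", "root/.profile",
                      "home/*/.bashrc", "home/*/.bash_profile", "home/*/.profile"],
    "rc_local": ["etc/rc.local"],
    "kernel_modules": ["etc/modules", "etc/modules-load.d/*.conf"],
    "ssh_keys": ["root/.ssh/authorized_keys", "home/*/.ssh/authorized_keys"],
}


def enumerate_persistence(root: str) -> dict[str, list[str]]:
    """Return existing files at known persistence locations under `root`,
    grouped by category, as POSIX-style paths relative to `root`."""
    base = Path(root)
    found: dict[str, list[str]] = {}
    for category, patterns in PERSISTENCE_GLOBS.items():
        # A set removes duplicates if overlapping patterns hit the same file.
        hits = sorted({
            p.relative_to(base).as_posix()
            for pattern in patterns
            for p in base.glob(pattern)
            if p.is_file()
        })
        if hits:
            found[category] = hits
    return found
```

Diffing this output against the same listing from a known-clean golden image narrows review to files that should not be there.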

System Restoration

After malware removal, systems must be restored to operational status with improved security posture.

Restoration Approaches

Clean Rebuild:

  1. Backup Validation:
      • Verify backup integrity (checksums, test restores)
      • Confirm backups predate initial compromise
      • Scan backups for malware before restoration

  2. System Reinstallation:
      • Install OS from verified clean media
      • Apply all security patches before network connection
      • Install only necessary applications
      • Harden configuration (disable unnecessary services, apply security baselines)

  3. Data Restoration:
      • Restore business data from clean backups
      • Scan restored data for malware
      • Validate data integrity and completeness

  4. Application Configuration:
      • Reinstall and reconfigure applications
      • Apply principle of least privilege
      • Document all configuration changes
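The backup validation step can be sketched in Python as follows. It assumes the expected SHA-256 hash and the backup timestamp were recorded in a backup catalog at backup time (a hypothetical input here), because file timestamps on the backup media could have been altered by the adversary. The initial compromise time comes from the investigation; given uncertainty about dwell time, a conservative (earlier) value is safer. Malware scanning remains a separate step.

```python
import hashlib
from datetime import datetime
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Stream the file through SHA-256 so large backups are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_backup(path: Path, expected_sha256: str, backup_taken: datetime,
                    initial_compromise: datetime) -> list[str]:
    """Return a list of problems; an empty list means the backup passed the
    integrity and timing checks (it still needs a malware scan)."""
    problems = []
    if sha256_file(path) != expected_sha256.lower():
        problems.append("checksum mismatch")
    if backup_taken >= initial_compromise:
        problems.append("backup does not predate initial compromise")
    return problems
```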

Golden Images

Maintain pre-hardened system images (golden images) that can be quickly deployed during recovery, incorporating security best practices and approved configurations.

Configuration Hardening

Implement security improvements during restoration:

Operating System Hardening:

  • Disable unnecessary services and features
  • Apply security baselines (CIS Benchmarks, DISA STIGs)
  • Enable logging and auditing
  • Configure host-based firewall
  • Implement application whitelisting

Authentication Strengthening:

  • Enforce strong password policies
  • Implement multi-factor authentication (MFA)
  • Disable local administrator accounts where possible
  • Implement privileged access management (PAM)

Network Security:

  • Implement network segmentation
  • Deploy EDR on all endpoints
  • Enable Windows Firewall or iptables
  • Configure DNS filtering

Patch Management

Addressing the vulnerabilities that enabled the incident is critical to preventing recurrence.

Emergency Patching

Apply patches addressing exploited vulnerabilities immediately:

Prioritization:

  1. Vulnerabilities actively exploited in the incident
  2. Other critical vulnerabilities on affected systems
  3. Related vulnerabilities across similar systems
  4. Remaining security updates based on risk
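One way to turn these four tiers into a patch queue is sketched below in Python. The Finding fields are hypothetical flags an analyst would set during the investigation, and treating "critical" as a CVSS base score of 9.0 or higher (the CVSS v3 Critical band) is an assumption; organizations may use their own severity scale.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    cve: str
    host: str
    cvss: float
    # Hypothetical analyst-set flags mapping to the tiers above.
    exploited_in_incident: bool = False
    host_was_affected: bool = False
    related_on_similar_system: bool = False


def priority_tier(f: Finding) -> int:
    """Map a finding to tiers 1-4 (1 = patch first)."""
    if f.exploited_in_incident:
        return 1
    if f.host_was_affected and f.cvss >= 9.0:  # assumed "critical" threshold
        return 2
    if f.related_on_similar_system:
        return 3
    return 4


def patch_order(findings: list[Finding]) -> list[Finding]:
    """Sort by tier, then by CVSS score (highest first) within each tier."""
    return sorted(findings, key=lambda f: (priority_tier(f), -f.cvss))
```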

Accelerated Process:

  • Emergency change control approval
  • Abbreviated testing (test on representative systems, not full test cycle)
  • Coordinated deployment to minimize operational impact
  • Validation of successful patch application

Balance Speed and Stability

While urgency is high, patches must still be tested to avoid introducing system instability. Focus testing on business-critical functions to accelerate while managing risk.

Systematic Patch Remediation

Beyond emergency patching, address broader patch gaps:

Patch Audit:

  • Scan all systems for missing patches
  • Identify systems with outdated software
  • Prioritize based on criticality and exposure

Ongoing Patch Management:

  • Establish regular patch cycles (monthly for standard updates, emergency for critical)
  • Implement automated patch deployment where feasible
  • Maintain patch testing environments
  • Track patch compliance metrics
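A patch audit and a simple compliance metric can be sketched as follows. The inventory format (host to installed package versions) and the table of minimum fixed versions are assumptions standing in for the output of a real vulnerability or patch-management scanner, and the version parser only handles plain dotted numbers.

```python
def parse_version(v: str) -> tuple[int, ...]:
    """Parse dotted numeric versions such as '2.4.58'. Vendor formats with
    letters, epochs, or build suffixes would need a proper parser."""
    return tuple(int(part) for part in v.split("."))


def audit_patches(inventory: dict[str, dict[str, str]],
                  minimum: dict[str, str]) -> tuple[dict[str, list[str]], float]:
    """Compare installed versions against minimum patched versions.

    inventory: host -> {package: installed_version}
    minimum:   package -> lowest version containing the fix
    Returns (host -> outdated packages, percent of hosts fully compliant).
    """
    gaps: dict[str, list[str]] = {}
    for host, packages in inventory.items():
        outdated = sorted(
            pkg for pkg, installed in packages.items()
            if pkg in minimum and parse_version(installed) < parse_version(minimum[pkg])
        )
        if outdated:
            gaps[host] = outdated
    compliant = len(inventory) - len(gaps)
    pct = 100.0 * compliant / len(inventory) if inventory else 100.0
    return gaps, pct
```

Tracking the returned percentage over successive patch cycles gives the compliance trend the list above calls for.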

Validation and Verification

Before declaring systems recovered, validate complete threat removal and security restoration.

Malware Removal Validation

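Multi-Scanner Validation:

  • Scan with multiple antivirus/EDR products, since different vendors detect different malware variants
  • Look up file hashes on VirusTotal (uploading the files themselves shares them with VirusTotal's subscribers and can tip off an adversary watching for its own samples)
  • Memory scanning for runtime detection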

Behavioral Monitoring:

  • Monitor system behavior for malicious activity
  • Network traffic analysis for C2 communication
  • Process execution monitoring
  • File system changes

IOC Sweeps:

  • Search for known indicators of compromise across the environment
  • Check for file hashes, domain names, IP addresses, registry keys
  • Use threat intelligence on adversary TTPs
  • YARA rules for malware family detection
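For the file-hash portion of an IOC sweep, a minimal Python sketch might walk a directory tree and compare SHA-256 hashes against known-bad values. The function name and inputs are illustrative; exact-hash matching misses recompiled or repacked samples, which is why YARA rules and behavioral indicators complement it.

```python
import hashlib
from pathlib import Path


def sweep_for_hashes(root: str, ioc_sha256: set[str]) -> list[str]:
    """Walk `root` and return paths of files whose SHA-256 matches an IOC."""
    iocs = {h.lower() for h in ioc_sha256}  # normalize feed formatting
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # read_bytes() loads the whole file; stream in chunks for very large files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
        except OSError:
            continue  # locked or permission denied: record and revisit separately
        if digest in iocs:
            matches.append(str(path))
    return matches
```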

Security Control Validation

Technical Control Testing:

  • Verify EDR is installed, updated, and reporting
  • Confirm firewall rules are properly configured
  • Test MFA functionality
  • Validate logging and SIEM ingestion
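The logging check lends itself to a small script. This sketch assumes a per-host "last event received" timestamp has been exported from the SIEM (the export mechanism is not shown) and flags restored hosts that are silent; the 15-minute threshold is a placeholder to tune per log source.

```python
from datetime import datetime, timedelta


def silent_hosts(restored: list[str], last_event: dict[str, datetime],
                 now: datetime,
                 max_gap: timedelta = timedelta(minutes=15)) -> list[str]:
    """Return restored hosts that have never sent logs to the SIEM, or
    whose most recent event is older than max_gap (placeholder default)."""
    return sorted(
        host for host in restored
        if host not in last_event or now - last_event[host] > max_gap
    )
```

A silent host may simply have a broken log forwarder, but after an incident it can also mean logging was tampered with, so each hit deserves a look.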

Configuration Verification:

  • Compare system configuration to security baselines
  • Validate hardening settings applied correctly
  • Check for unnecessary services or accounts
  • Verify patch levels
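Baseline comparison reduces to a dictionary diff once settings are collected. In this sketch, how the actual values are gathered (an agent, a configuration scanner, or parsing files) is left as an assumption, and the function name is hypothetical. It reports only settings the baseline defines; extra settings on the host are not flagged.

```python
from typing import Optional


def baseline_drift(actual: dict[str, str],
                   baseline: dict[str, str]) -> dict[str, tuple[str, Optional[str]]]:
    """Return settings whose collected value differs from the baseline.

    Each entry maps setting -> (expected, actual); actual is None when the
    setting is missing from the collected configuration.
    """
    return {
        key: (expected, actual.get(key))
        for key, expected in baseline.items()
        if actual.get(key) != expected
    }
```

For example, using OpenSSH sshd_config options for illustration, a host reporting {"PermitRootLogin": "yes"} against a baseline of {"PermitRootLogin": "no"} yields {"PermitRootLogin": ("no", "yes")}.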

Credential Reset Confirmation:

  • All affected accounts have new passwords
  • Service account credentials rotated
  • API keys and tokens regenerated
  • Cached credentials cleared
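Confirming password resets can be automated by comparing each affected account's last password change against the date the reset effort began. In this Python sketch, the pwd_last_set mapping is assumed to be exported from the directory service (in Active Directory, the pwdLastSet attribute records this), and the function name is hypothetical. API keys, tokens, and cached credentials need separate checks.

```python
from datetime import datetime


def accounts_not_reset(pwd_last_set: dict[str, datetime],
                       affected: set[str], reset_start: datetime) -> list[str]:
    """Return affected accounts whose password was not changed on or after
    reset_start, including accounts missing from the export entirely."""
    return sorted(
        acct for acct in affected
        if acct not in pwd_last_set or pwd_last_set[acct] < reset_start
    )
```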

Recovery Acceptance Criteria

Define specific criteria for declaring recovery complete:

| Criterion | Validation Method | Owner |
|---|---|---|
| All malware removed | Multi-scanner clean + 48-hour monitoring | IR Team |
| Vulnerabilities patched | Vulnerability scan showing remediation | IT Operations |
| Systems hardened | CIS Benchmark compliance scan | Security Team |
| Credentials reset | Account audit showing password change dates | IT Operations |
| Monitoring operational | SIEM showing log ingestion from restored systems | SOC |
| Business functions restored | User acceptance testing | Business Units |
| No recurrence indicators | 7-day monitoring period with no alerts | IR Team |
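The acceptance criteria can also be tracked programmatically. The sketch below is a hypothetical model: a Criterion record per table row, a helper for the time-based monitoring criteria (48 hours and 7 days), and a function listing what is still outstanding. It supports, but does not replace, formal stakeholder sign-off.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Criterion:
    name: str
    owner: str
    validated: bool = False


def quiet_period_met(start: datetime, now: datetime, alerts: list[datetime],
                     required: timedelta) -> bool:
    """True once `required` has elapsed since `start` with no relevant alert
    in between (e.g. required=timedelta(days=7) for the recurrence check)."""
    elapsed_enough = now - start >= required
    return elapsed_enough and not any(start <= t <= now for t in alerts)


def outstanding(criteria: list[Criterion]) -> list[str]:
    """Unmet criteria with their owners; recovery is declared only when empty."""
    return [f"{c.name} ({c.owner})" for c in criteria if not c.validated]
```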

Document Acceptance

Formal sign-off from IR team lead, IT operations, and business stakeholders confirms recovery is complete and systems can return to production.

Phased Recovery Approach

Systematic phased recovery minimizes business disruption and enables early detection of incomplete eradication.

Recovery Phases

Phase 1: Critical Systems Pilot

  • Restore small subset of critical systems first
  • Enhanced monitoring on pilot systems
  • Rapid detection of any recurrence
  • Validate restoration process before scaling

Phase 2: Staged Restoration

  • Restore systems in logical groups
  • Prioritize by business criticality
  • Monitor each group before proceeding
  • Adjust process based on lessons from early groups

Phase 3: Full Environment Recovery

  • Complete restoration of remaining systems
  • Maintain enhanced monitoring across environment
  • User communication and support
  • Documentation of recovery activities

Phase 4: Return to Normal Operations

  • Transition from incident response to normal operations
  • Reduce enhanced monitoring to sustainable levels
  • Update incident response plans based on lessons learned
  • Archive incident documentation
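The pilot-then-waves pattern behind Phases 1 through 3 can be expressed as a short Python sketch. The criticality scores, wave sizes, and the restore and wave_is_clean callbacks are hypothetical stand-ins for an organization's rebuild tooling and monitoring review; the point is the control flow, which stops expanding as soon as a wave shows signs of recurrence.

```python
from typing import Callable


def plan_waves(systems: dict[str, int], pilot_size: int,
               wave_size: int) -> list[list[str]]:
    """Order systems by criticality (1 = most critical, ties broken by name)
    and split them into a small pilot wave followed by fixed-size waves."""
    ordered = sorted(systems, key=lambda name: (systems[name], name))
    waves = [ordered[:pilot_size]]
    rest = ordered[pilot_size:]
    waves += [rest[i:i + wave_size] for i in range(0, len(rest), wave_size)]
    return [w for w in waves if w]  # drop an empty pilot if pilot_size == 0


def run_recovery(waves: list[list[str]], restore: Callable[[str], None],
                 wave_is_clean: Callable[[list[str]], bool]) -> list[str]:
    """Restore wave by wave; halt if monitoring flags the latest wave.

    Returns the systems restored before any halt."""
    restored: list[str] = []
    for wave in waves:
        for system in wave:
            restore(system)
        restored.extend(wave)
        if not wave_is_clean(wave):
            break  # possible incomplete eradication: investigate before continuing
    return restored
```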

Monitoring During Recovery

Enhanced monitoring during recovery enables early detection of incomplete eradication or adversary counter-response.

Enhanced Monitoring Activities

Threat Hunting:

  • Proactive searching for indicators of adversary presence
  • Focus on TTPs used by the adversary
  • Assume-compromise mentality
  • Query across all systems, not just previously affected ones

Network Traffic Analysis:

  • Monitor for C2 communication patterns
  • Analyze outbound connections from recovered systems
  • Look for data exfiltration indicators
  • Identify lateral movement attempts
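One widely used heuristic for spotting C2 beaconing is interval regularity: implants that check in on a timer produce unusually evenly spaced connections. The Python sketch below computes the coefficient of variation (standard deviation divided by mean) of the gaps between connections for each source and destination pair. The input format and both thresholds are assumptions, and adversaries who add jitter to their check-ins can evade a strict threshold, so treat hits as leads for investigation.

```python
from collections import defaultdict
from statistics import mean, pstdev


def beacon_candidates(connections: list[tuple[float, str, str]],
                      min_events: int = 10,
                      max_cv: float = 0.1) -> list[tuple[str, str, float]]:
    """Flag (source, destination) pairs with very regular connection intervals.

    connections: (unix_timestamp, source_host, destination) tuples.
    Returns (source, destination, mean_interval_seconds) for flagged pairs.
    """
    times: dict[tuple[str, str], list[float]] = defaultdict(list)
    for ts, src, dst in connections:
        times[(src, dst)].append(ts)
    flagged = []
    for (src, dst), stamps in times.items():
        if len(stamps) < min_events:
            continue  # too few events to judge regularity
        stamps.sort()
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        avg = mean(gaps)
        # Low coefficient of variation = timer-like, machine-driven traffic.
        if avg > 0 and pstdev(gaps) / avg <= max_cv:
            flagged.append((src, dst, avg))
    return sorted(flagged)
```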

Endpoint Behavior Monitoring:

  • New process executions
  • PowerShell and command-line activity
  • Unsigned or unusual binaries
  • Privilege escalation attempts
  • Persistence mechanism creation
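For the PowerShell activity above, one common check is decoding encoded commands so analysts can read what actually ran. PowerShell's -EncodedCommand parameter takes a Base64-encoded UTF-16LE string, which this Python sketch decodes. The regular expression is a deliberately loose, hypothetical heuristic (any parameter beginning with -e followed by a long Base64 token) and will need tuning against real command-line telemetry.

```python
import base64
import binascii
import re
from typing import Optional

# Hypothetical heuristic: a powershell command line containing a parameter
# that starts with "-e" (e.g. -e, -ec, -enc, -EncodedCommand) followed by
# a Base64 token of at least 16 characters.
ENCODED_ARG = re.compile(
    r"(?i)\bpowershell(?:\.exe)?\b.*?\s[-/]e[a-z]*\s+([A-Za-z0-9+/=]{16,})"
)


def decode_encoded_powershell(command_line: str) -> Optional[str]:
    """Return the decoded script if the command line carries an encoded
    command, else None. -EncodedCommand expects Base64 of UTF-16LE text."""
    match = ENCODED_ARG.search(command_line)
    if not match:
        return None
    try:
        return base64.b64decode(match.group(1), validate=True).decode("utf-16-le")
    except (binascii.Error, UnicodeDecodeError):
        return None  # token looked like Base64 but did not decode cleanly
```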

Monitoring Duration

Initial Intensive Period: 7-14 days of enhanced monitoring with dedicated analyst attention

Extended Surveillance: 30-90 days of automated monitoring with regular review

Permanent Improvements: Incorporate high-value detections into ongoing SOC operations

Leverage Automation

Use SOAR platforms to automate routine monitoring tasks, enabling analysts to focus on complex threat hunting and investigation.

Conclusion

Eradication and recovery move the organization from compromised victim to recovered entity with an improved security posture. Success requires thoroughness (complete threat removal), a systematic approach (phased restoration with validation), and vigilance (enhanced monitoring to detect incomplete eradication).

Organizations that execute eradication and recovery effectively achieve multiple benefits: (1) complete removal of adversary presence, (2) improved security posture reducing future risk, (3) validated system integrity, and (4) organizational learning that strengthens resilience.

However, the incident response lifecycle does not end with recovery. The next chapter explores post-incident activity—the critical phase where organizations capture lessons learned, improve processes, and ensure regulatory compliance, transforming painful incidents into organizational strength.

Key Takeaways

  • Root cause analysis identifies why incidents occur, not just what happened
  • System rebuild is preferred over in-place remediation for thorough eradication
  • Eliminate all persistence mechanisms—adversaries establish multiple footholds
  • Apply security patches addressing exploited vulnerabilities immediately
  • Validate complete malware removal through multi-scanner checks and behavioral monitoring
  • Use phased recovery approach to detect incomplete eradication early
  • Enhanced monitoring during recovery detects adversary counter-response
  • Balance recovery speed with thoroughness—rushing increases recurrence risk