Chapter 5: Eradication and Recovery

Introduction

Eradication and recovery shift incident response from defensive containment to active remediation. Containment limits damage and prevents spread; eradication removes the adversary's presence from the environment, and recovery restores normal business operations with an improved security posture. Together, these phases give the organization the opportunity not merely to return to its pre-incident state but to emerge more resilient.

This chapter explores the methodologies, techniques, and decision frameworks required to completely eliminate threats, systematically restore systems, validate security, and return to normal operations while preventing recurrence.

Root Cause Analysis

Effective eradication requires understanding not just what happened, but why it happened. Root cause analysis identifies the fundamental vulnerability or control failure that enabled the incident.

The Five Whys Technique

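A simple but powerful approach to identifying root causes is to ask "why" repeatedly. Each answer becomes the subject of the next question, and the questioning continues until it reaches a process or control failure rather than a symptom: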

Five Whys Example: Ransomware Incident

Problem: Ransomware encrypted file server

Why did ransomware encrypt the file server? Because a workstation was infected and the ransomware spread laterally.
Why did the ransomware spread from the workstation? Because the workstation had SMB access to file servers.
Why did the workstation have unrestricted SMB access? Because network segmentation was not implemented.
Why was network segmentation not implemented? Because it was not prioritized in the security roadmap.
Why was network segmentation not prioritized? Because risk assessment did not adequately evaluate lateral movement risks.

Root Cause: Inadequate risk assessment process failing to identify and prioritize lateral movement prevention controls.

Vulnerability Identification

Identify specific weaknesses exploited:

Technical Vulnerabilities:
- Unpatched software (CVE identification)
- Misconfigured security controls
- Weak or default credentials
- Excessive user privileges

Process Vulnerabilities:
- Inadequate security awareness training
- Missing security controls (MFA, EDR, network segmentation)
- Insufficient monitoring and logging
- Delayed patch management

Architectural Vulnerabilities:
- Flat network architecture enabling lateral movement
- Single points of failure without redundancy
- Inadequate separation between production and administrative networks
- Cloud misconfigurations

Document Everything

Comprehensive root cause analysis documentation informs eradication strategy, prevents recurrence, supports lessons learned, and may be required for regulatory reporting or litigation.

Malware Removal

Complete eradication requires removing all malicious software and adversary-controlled code from the environment.

Removal Strategies

System Reimaging (Preferred):
- Wipe and rebuild systems from clean sources
- Guarantees removal of all malware including unknown components
- Most time-consuming but most thorough

Targeted Removal:
- Remove identified malicious files, processes, and registry entries
- Faster but risks missing unknown persistence mechanisms
- Appropriate only when complete compromise scope is understood

Antivirus/EDR-Based Removal:
- Use security tools to quarantine and remove malware
- Convenient but may miss sophisticated threats
- Should be validated through additional analysis

Rebuild vs. Remediate

For critical systems and sophisticated threats such as advanced persistent threats (APTs), reimaging is strongly recommended. Targeted removal risks leaving adversary footholds that enable re-compromise.

Persistence Mechanism Elimination

Adversaries often establish multiple persistence mechanisms, and all of them must be identified and removed:

Windows Persistence Locations:
- Registry Run keys (HKCU/HKLM\Software\Microsoft\Windows\CurrentVersion\Run)
- Scheduled tasks
- Windows services
- WMI event subscriptions
- DLL hijacking
- Bootkit/rootkit (MBR or UEFI modification)
- Account creation (backdoor accounts)

Linux Persistence Locations:
- Cron jobs
- systemd services
- .bashrc, .bash_profile, .profile modifications
- /etc/rc.local modifications
- Kernel modules
- SSH authorized_keys
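Part of the Linux review can be scripted. The sketch below makes two assumptions: the suspect filesystem is mounted read-only at a known root, and the investigation has produced an estimated compromise time. Under those assumptions, it lists files in common persistence locations whose modification time falls on or after the compromise time. The path list is an illustrative subset, not an exhaustive one. Kernel modules need separate checks, such as comparing loaded modules against the distribution's packages. Adversaries can also forge timestamps, so an empty result is not proof of a clean system.

```python
from datetime import datetime, timezone
from pathlib import Path

# Illustrative (not exhaustive) persistence locations, relative to the filesystem root.
PERSISTENCE_GLOBS = [
    "etc/crontab",
    "etc/cron.d/*",
    "etc/cron.hourly/*",
    "etc/cron.daily/*",
    "var/spool/cron/**/*",
    "etc/systemd/system/**/*.service",
    "etc/systemd/system/**/*.timer",
    "etc/rc.local",
    "root/.bashrc",
    "root/.bash_profile",
    "root/.profile",
    "root/.ssh/authorized_keys",
    "home/*/.bashrc",
    "home/*/.bash_profile",
    "home/*/.profile",
    "home/*/.ssh/authorized_keys",
]


def recently_modified_persistence(root: str, compromised_at: datetime) -> list[tuple[str, datetime]]:
    """Return (relative path, mtime) for persistence files modified at or after `compromised_at`.

    `compromised_at` must be timezone-aware, because mtimes are converted to UTC.
    """
    base = Path(root)
    hits = []
    for pattern in PERSISTENCE_GLOBS:
        for path in base.glob(pattern):
            if not path.is_file():
                continue
            mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            if mtime >= compromised_at:
                hits.append((str(path.relative_to(base)), mtime))
    # Oldest first, so the earliest post-compromise change is reviewed first.
    return sorted(hits, key=lambda hit: hit[1])
```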

Cross-Platform Persistence:
- Compromised legitimate credentials
- Web shells on web servers
- Backdoored software or scripts
- Cloud service accounts and API keys

System Restoration

After malware removal, systems must be restored to operational status with improved security posture.

Restoration Approaches

Clean Rebuild:

1. Backup Validation:
   - Verify backup integrity (checksums, test restores)
   - Confirm backups predate the initial compromise
   - Scan backups for malware before restoration
2. System Reinstallation:
   - Install OS from verified clean media
   - Apply all security patches before connecting to the production network (for example, from an isolated staging network or offline media)
   - Install only necessary applications
   - Harden configuration (disable unnecessary services, apply security baselines)
3. Data Restoration:
   - Restore business data from clean backups
   - Scan restored data for malware
   - Validate data integrity and completeness
4. Application Configuration:
   - Reinstall and reconfigure applications
   - Apply principle of least privilege
   - Document all configuration changes
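The first two backup-validation checks lend themselves to automation. This minimal sketch assumes a hypothetical manifest, recorded when the backup was taken, that maps each file's relative path to its SHA-256 digest. It also takes the earliest compromise time established during the investigation. It does not replace test restores or malware scanning.

```python
import hashlib
from datetime import datetime
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large backup files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_backup(backup_dir: str, manifest: dict[str, str],
                    backup_time: datetime, first_compromise: datetime) -> list[str]:
    """Return a list of problems; an empty list means the backup passed these checks.

    `manifest` is a hypothetical {relative_path: expected_sha256} mapping
    captured when the backup was created.
    """
    problems = []
    # A backup taken after the earliest known compromise may contain the adversary's tooling.
    if backup_time >= first_compromise:
        problems.append("backup does not predate the earliest known compromise")
    root = Path(backup_dir)
    for rel_path, expected in manifest.items():
        path = root / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_file(path) != expected.lower():
            problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

If the compromise date is uncertain, setting `first_compromise` conservatively early trades a larger data-loss window for lower re-infection risk.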

Golden Images

Maintain pre-hardened system images (golden images) that can be quickly deployed during recovery, incorporating security best practices and approved configurations.

Configuration Hardening

Implement security improvements during restoration:

Operating System Hardening:
- Disable unnecessary services and features
- Apply security baselines (CIS Benchmarks, DISA STIGs)
- Enable logging and auditing
- Configure host-based firewall
- Implement application whitelisting

Authentication Strengthening:
- Enforce strong password policies
- Implement multi-factor authentication (MFA)
- Disable local administrator accounts where possible
- Implement privileged access management (PAM)

Network Security:
- Implement network segmentation
- Deploy EDR on all endpoints
- Enable Windows Firewall or iptables
- Configure DNS filtering

Patch Management

Addressing the vulnerabilities that enabled the incident is critical to preventing recurrence.

Emergency Patching

Apply patches addressing exploited vulnerabilities immediately:

Prioritization:
1. Vulnerabilities actively exploited in the incident
2. Other critical vulnerabilities on affected systems
3. Related vulnerabilities across similar systems
4. Remaining security updates based on risk
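The four-tier ordering can be expressed as a sort key. This sketch is illustrative. The `Finding` record and the three input sets (exploited CVEs, affected hosts, and hosts similar to them) are hypothetical stand-ins for scanner and investigation output. The sketch also makes three interpretive assumptions:

- An exploited CVE gets top priority on every host where it appears.
- "Critical" means a CVSS v3 base score of 9.0 or higher.
- Tier 3 covers any vulnerability on a host that shares a role or build with an affected system.

```python
from dataclasses import dataclass

CRITICAL_CVSS = 9.0  # assumed definition of "critical": the CVSS v3 Critical band


@dataclass(frozen=True)
class Finding:
    cve: str
    host: str
    cvss: float  # CVSS base score, 0.0 to 10.0


def priority_tier(f: Finding, exploited_cves: set[str],
                  affected_hosts: set[str], similar_hosts: set[str]) -> int:
    """Map a finding to the four prioritization tiers (1 = patch first)."""
    if f.cve in exploited_cves:
        return 1  # vulnerability actively exploited in this incident, on any host
    if f.host in affected_hosts and f.cvss >= CRITICAL_CVSS:
        return 2  # other critical vulnerabilities on affected systems
    if f.host in similar_hosts:
        return 3  # vulnerabilities on systems similar to the affected ones
    return 4      # remaining updates, ordered by score below


def patch_queue(findings: list[Finding], exploited_cves: set[str],
                affected_hosts: set[str], similar_hosts: set[str]) -> list[Finding]:
    """Sort findings by tier, then by descending CVSS within a tier."""
    return sorted(
        findings,
        key=lambda f: (priority_tier(f, exploited_cves, affected_hosts, similar_hosts), -f.cvss),
    )
```

An organization with exploit-likelihood data could substitute it for CVSS as the secondary key.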

Accelerated Process:
- Emergency change control approval
- Abbreviated testing (test on representative systems, not full test cycle)
- Coordinated deployment to minimize operational impact
- Validation of successful patch application

Balance Speed and Stability

While urgency is high, patches must still be tested to avoid introducing system instability. Focus testing on business-critical functions to accelerate while managing risk.

Systematic Patch Remediation

Beyond emergency patching, address broader patch gaps:

Patch Audit:
- Scan all systems for missing patches
- Identify systems with outdated software
- Prioritize based on criticality and exposure
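At its core, a patch audit compares what is installed against the minimum fixed version for each package. The sketch below assumes a hypothetical inventory export (host, then package, then version) and plain dotted numeric versions. Real version strings, such as Debian epochs, RPM release tags, or Windows KB identifiers, need the platform's own comparison rules or a vulnerability scanner.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Parse a dotted numeric version such as '3.0.13' into (3, 0, 13)."""
    return tuple(int(part) for part in text.split("."))


def missing_patches(inventory: dict[str, dict[str, str]],
                    minimum: dict[str, str]) -> dict[str, list[str]]:
    """Report, per host, the packages installed below their minimum fixed version.

    `inventory` maps host -> {package: installed_version};
    `minimum` maps package -> first version containing the fix.
    """
    gaps = {}
    for host, packages in inventory.items():
        behind = [
            f"{name} {installed} < {minimum[name]}"
            for name, installed in sorted(packages.items())
            # Tuple comparison orders numerically: (3, 0, 2) < (3, 0, 13).
            if name in minimum and parse_version(installed) < parse_version(minimum[name])
        ]
        if behind:
            gaps[host] = behind
    return gaps
```

Comparing integer tuples rather than raw strings avoids the classic mistake of treating "3.0.2" as newer than "3.0.13".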

Ongoing Patch Management:
- Establish regular patch cycles (monthly for standard updates, emergency for critical)
- Implement automated patch deployment where feasible
- Maintain patch testing environments
- Track patch compliance metrics

Validation and Verification

Before declaring systems recovered, validate complete threat removal and security restoration.

Malware Removal Validation

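Multi-Scanner Validation:
- Scan with multiple antivirus/EDR products
- Different vendors detect different malware variants
- VirusTotal hash lookups for suspicious files (uploading a sample shares it with other VirusTotal users, which can expose sensitive data and tip off an adversary who monitors the service)
- Memory scanning for runtime detection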

Behavioral Monitoring:
- Monitor system behavior for malicious activity
- Network traffic analysis for C2 communication
- Process execution monitoring
- File system changes

IOC Sweeps:
- Search for known indicators of compromise across environment
- Check for file hashes, domain names, IP addresses, registry keys
- Use threat intelligence on adversary TTPs
- YARA rules for malware family detection
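File-hash matching is the simplest IOC sweep. This sketch walks a directory tree and reports files whose SHA-256 digest appears in a supplied indicator set. The indicator set is assumed to come from the investigation or a threat intelligence feed. Recompiling or repacking malware easily evades hash matches, which is why the checklist pairs them with network indicators, registry keys, YARA rules, and TTP-based hunting.

```python
import hashlib
import os
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks to keep memory use flat on large files."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_sweep(root: str, bad_hashes: set[str]) -> list[str]:
    """Walk `root` and return paths whose SHA-256 matches a known-bad hash."""
    wanted = {h.lower() for h in bad_hashes}  # feeds mix upper- and lower-case hex
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath, name)
            try:
                if path.is_file() and sha256_of(path) in wanted:
                    matches.append(str(path))
            except OSError:
                continue  # unreadable or vanished file; record it separately in practice
    return sorted(matches)
```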

Security Control Validation

Technical Control Testing:
- Verify EDR is installed, updated, and reporting
- Confirm firewall rules are properly configured
- Test MFA functionality
- Validate logging and SIEM ingestion

Configuration Verification:
- Compare system configuration to security baselines
- Validate hardening settings applied correctly
- Check for unnecessary services or accounts
- Verify patch levels
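Baseline comparison reduces to diffing actual settings against expected ones. The sketch below uses OpenSSH server settings as the example. The baseline excerpt in the usage is hypothetical, not quoted from a CIS Benchmark. The parser follows sshd's rule that the first value obtained for a keyword wins, but it ignores `Match` blocks and `Include` directives. Feeding it the output of `sshd -T`, which prints the effective configuration including defaults, avoids most of those gaps.

```python
from typing import Optional


def parse_sshd_config(text: str) -> dict[str, str]:
    """Parse 'Keyword value' lines; like sshd, keep the first value seen per keyword.

    Simplifications: Match blocks, Include directives, and 'Keyword=value'
    syntax are not handled. Keywords are lowercased (sshd treats them case-insensitively).
    """
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(maxsplit=1)
        value = parts[1].strip() if len(parts) > 1 else ""
        settings.setdefault(parts[0].lower(), value)
    return settings


def config_drift(actual: dict[str, str],
                 baseline: dict[str, str]) -> dict[str, tuple[Optional[str], str]]:
    """Return {setting: (actual, expected)} for each baseline setting that is not met.

    An actual value of None means the setting is absent, so sshd's default applies.
    """
    return {
        key: (actual.get(key), expected)
        for key, expected in baseline.items()
        if actual.get(key) != expected
    }
```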

Credential Reset Confirmation:
- All affected accounts have new passwords
- Service account credentials rotated
- API keys and tokens regenerated
- Cached credentials cleared
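Password-change confirmation can be checked against a directory export. This sketch assumes a mapping of account names to password-last-set times (in Active Directory, the `pwdLastSet` attribute). It also assumes a cutoff, such as the moment containment completed. Any in-scope account whose password was changed at or before the cutoff is flagged, as is any account missing from the export. API keys, tokens, and cached credentials need their own inventories.

```python
from datetime import datetime


def stale_credentials(password_last_set: dict[str, datetime],
                      in_scope: set[str], reset_after: datetime) -> list[str]:
    """Return in-scope accounts whose password was not changed after `reset_after`.

    Accounts absent from the export are flagged too: absence is not proof of a reset.
    """
    return sorted(
        account for account in in_scope
        if account not in password_last_set or password_last_set[account] <= reset_after
    )
```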

Recovery Acceptance Criteria

Define specific criteria for declaring recovery complete:

Criterion	Validation Method	Owner
All malware removed	Multi-scanner clean + 48hr monitoring	IR Team
Vulnerabilities patched	Vulnerability scan showing remediation	IT Operations
Systems hardened	CIS Benchmark compliance scan	Security Team
Credentials reset	Account audit showing password change dates	IT Operations
Monitoring operational	SIEM showing log ingestion from restored systems	SOC
Business functions restored	User acceptance testing	Business Units
No recurrence indicators	7-day monitoring period with no alerts	IR Team
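Time-boxed criteria such as the 48-hour and 7-day monitoring periods are easy to misjudge under pressure, so computing them can help. This sketch uses hypothetical `Criterion` records that mirror the table's rows. `window_clean` treats a monitoring period as satisfied only once the full duration has elapsed with no alert inside it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Criterion:
    name: str
    owner: str
    passed: bool


def window_clean(start: datetime, now: datetime, required: timedelta,
                 alerts: list[datetime]) -> bool:
    """True once `required` time has elapsed since `start` with no alert in the window."""
    elapsed = now - start >= required
    return elapsed and not any(start <= alert <= now for alert in alerts)


def outstanding(criteria: list[Criterion]) -> list[str]:
    """List unmet criteria with their owners; recovery is complete when this is empty."""
    return [f"{c.name} ({c.owner})" for c in criteria if not c.passed]
```

The list from `outstanding` doubles as the agenda for the sign-off meeting, since each entry names the accountable owner.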

Document Acceptance

Formal sign-off from IR team lead, IT operations, and business stakeholders confirms recovery is complete and systems can return to production.

Phased Recovery Approach

Systematic phased recovery minimizes business disruption and enables early detection of incomplete eradication.

Recovery Phases

Phase 1: Critical Systems Pilot

Restore a small subset of critical systems first
Enhanced monitoring on pilot systems
Rapid detection of any recurrence
Validate restoration process before scaling

Phase 2: Staged Restoration

Restore systems in logical groups
Prioritize by business criticality
Monitor each group before proceeding
Adjust process based on lessons from early groups

Phase 3: Full Environment Recovery

Complete restoration of remaining systems
Maintain enhanced monitoring across environment
User communication and support
Documentation of recovery activities

Phase 4: Return to Normal Operations

Transition from incident response to normal operations
Reduce enhanced monitoring to sustainable levels
Update incident response plans based on lessons learned
Archive incident documentation
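The gate between waves ("monitor each group before proceeding") can be sketched as a loop that halts at the first sign of recurrence. `restore` and `observe` are hypothetical hooks standing in for the organization's rebuild tooling and its SIEM or EDR queries. In practice, observation spans each wave's monitoring window rather than a single call.

```python
from typing import Callable, Optional


def phased_restore(waves: list[tuple[str, list[str]]],
                   restore: Callable[[str], None],
                   observe: Callable[[list[str]], list[str]]) -> tuple[list[str], Optional[str]]:
    """Restore waves in order (pilot first), halting at the first wave showing recurrence.

    `restore(system)` rebuilds one system; `observe(systems)` returns any alerts
    seen on that wave. Both are hypothetical integration points.
    Returns (names of completed waves, name of the halting wave or None).
    """
    completed = []
    for name, systems in waves:
        for system in systems:
            restore(system)
        if observe(systems):
            return completed, name  # recurrence: stop scaling and return to eradication
        completed.append(name)
    return completed, None
```

Ordering the `waves` list by business criticality, with the pilot group first, encodes Phases 1 and 2 directly in the input.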

Monitoring During Recovery

Enhanced monitoring during recovery enables early detection of incomplete eradication or adversary counter-response.

Enhanced Monitoring Activities

Threat Hunting:
- Proactive searching for indicators of adversary presence
- Focus on TTPs used by adversary
- Adopt an assume-compromise mindset
- Query across all systems, not just previously affected ones

Network Traffic Analysis:
- Monitor for C2 communication patterns
- Analyze outbound connections from recovered systems
- Look for data exfiltration indicators
- Identify lateral movement attempts
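One widely used way to spot C2 check-ins in connection logs is to measure how regular the intervals between connections are. This sketch groups (source, destination, timestamp) records, computes the coefficient of variation of the gaps, and flags highly regular pairs. The thresholds (`min_connections=10`, `max_cv=0.1`) are illustrative assumptions. Many implants add random jitter, and legitimate software (update checkers, telemetry, NTP) also beacons, so results need allowlisting and analyst review.

```python
from collections import defaultdict
from statistics import mean, pstdev


def likely_beacons(connections: list[tuple[str, str, float]],
                   min_connections: int = 10,
                   max_cv: float = 0.1) -> list[tuple[str, str, float]]:
    """Flag (source, destination) pairs whose connection intervals are highly regular.

    `connections` holds (source_host, destination, unix_timestamp) tuples, e.g. from
    firewall or proxy logs. The score is the coefficient of variation (stdev / mean)
    of the gaps between connections: automated check-ins score near 0.
    """
    times = defaultdict(list)
    for src, dst, ts in connections:
        times[(src, dst)].append(ts)
    flagged = []
    for (src, dst), stamps in times.items():
        if len(stamps) < min_connections:
            continue  # too few samples to judge regularity
        stamps.sort()
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        avg = mean(gaps)
        if avg <= 0:
            continue  # all connections at the same instant; not a periodic pattern
        cv = pstdev(gaps) / avg
        if cv <= max_cv:
            flagged.append((src, dst, round(cv, 3)))
    return sorted(flagged, key=lambda item: item[2])  # most regular first
```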

Endpoint Behavior Monitoring:
- New process executions
- PowerShell and command-line activity
- Unsigned or unusual binaries
- Privilege escalation attempts
- Persistence mechanism creation
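Encoded PowerShell commands are a common way to hide command-line activity from casual review. The sketch below assumes command lines collected by EDR or process-creation logging, for example Windows event 4688 with command-line auditing enabled, or Sysmon event ID 1. It detects the `-e`, `-ec`, and `-enc`-style abbreviations of `-EncodedCommand` and decodes the Base64 payload, which PowerShell expects to be UTF-16LE text. The regular expression is a heuristic, not a complete parser of PowerShell's argument rules.

```python
import base64
import binascii
import re
from typing import Optional

# Heuristic: -e, -ec, or -en... prefixes of -EncodedCommand, then a Base64 blob.
# "-ExecutionPolicy" does not match because "x" follows the "e".
ENCODED_ARG = re.compile(r"(?:^|\s)[-/]e(?:c|n[a-z]*)?\s+([A-Za-z0-9+/]+={0,2})", re.IGNORECASE)


def decode_encoded_powershell(command_line: str) -> Optional[str]:
    """Return the decoded script if the command line carries an encoded command, else None."""
    match = ENCODED_ARG.search(command_line)
    if not match:
        return None
    try:
        # -EncodedCommand takes Base64 of a UTF-16LE string.
        return base64.b64decode(match.group(1), validate=True).decode("utf-16-le")
    except (binascii.Error, UnicodeDecodeError):
        return None  # not valid Base64/UTF-16LE; treat as unresolved and review manually
```

Decoded scripts can then be fed through the same IOC and YARA checks as files on disk.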

Monitoring Duration

Initial Intensive Period: 7-14 days of enhanced monitoring with dedicated analyst attention

Extended Surveillance: 30-90 days of automated monitoring with regular review

Permanent Improvements: Incorporate high-value detections into ongoing SOC operations

Leverage Automation

Use SOAR platforms to automate routine monitoring tasks, enabling analysts to focus on complex threat hunting and investigation.

Conclusion

Eradication and recovery transform the organization from victim to recovered entity with improved security posture. Success requires thoroughness (complete threat removal), systematic approach (phased restoration with validation), and vigilance (enhanced monitoring to detect incomplete eradication).

Organizations that execute eradication and recovery effectively achieve multiple benefits: (1) complete removal of adversary presence, (2) improved security posture reducing future risk, (3) validated system integrity, and (4) organizational learning that strengthens resilience.

However, the incident response lifecycle does not end with recovery. The next chapter explores post-incident activity—the critical phase where organizations capture lessons learned, improve processes, and ensure regulatory compliance, transforming painful incidents into organizational strength.

Key Takeaways

Root cause analysis identifies why incidents occur, not just what happened
System rebuild is preferred over in-place remediation for thorough eradication
Eliminate all persistence mechanisms—adversaries establish multiple footholds
Apply security patches addressing exploited vulnerabilities immediately
Validate complete malware removal through multi-scanner checks and behavioral monitoring
Use phased recovery approach to detect incomplete eradication early
Enhanced monitoring during recovery detects adversary counter-response
Balance recovery speed with thoroughness—rushing increases recurrence risk