
Over the last several years, the stealer ecosystem has evolved in several aspects, from the malware families involved to the platforms used to distribute logs. Close to a year ago, we began monitoring several of these platforms and building a system to ingest the shared data, in the hope of helping victims.
This research documents our findings and the technical challenges we faced while processing millions of credentials daily.
Stealer malware represents one of the most pervasive threats in today's cybersecurity landscape. These malicious programs are designed to extract sensitive information from infected systems, including:

- Saved browser passwords and autofill data
- Session cookies and authentication tokens
- Cryptocurrency wallet files
- System information such as IP address, country, and installed applications
The journey of stolen credentials follows a predictable pattern: a machine is infected, the stealer exfiltrates its data as a "log," the logs are aggregated and shared or sold on platforms such as Telegram, and the credentials are finally abused or traded downstream.
Building a system to process this volume of data required careful architectural decisions. We needed to handle:

- Millions of raw credential pairs per day
- Hundreds of Telegram channels monitored in parallel
- Log formats that vary between malware families
- Duplicate entries resurfacing across channels
```python
# Daily ingestion statistics (approximate)
daily_logs = {
    "raw_credential_pairs": 2_500_000,
    "unique_domains": 150_000,
    "telegram_channels": 500,
    "processing_time_hours": 4,
}
```
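These figures imply a sustained ingest rate that sized our parser and deduplication layers. A quick back-of-the-envelope calculation:

```python
# Throughput implied by the daily statistics above:
# 2.5M credential pairs processed in a 4-hour window.
daily_pairs = 2_500_000
window_seconds = 4 * 3600

pairs_per_second = daily_pairs / window_seconds
print(f"{pairs_per_second:.0f} credential pairs/second")  # ~174
```

Every component downstream of the crawler therefore has to comfortably sustain a few hundred records per second, with headroom for bursts when large dumps land.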
Our system consists of several key components:
| Component | Purpose | Technology |
|---|---|---|
| Crawler | Telegram channel monitoring | Python, Telethon |
| Parser | Log file extraction | Rust |
| Deduplication | Remove duplicates | Redis, BloomFilter |
| Storage | Credential indexing | PostgreSQL, Elasticsearch |
| API | Query interface | FastAPI |
Telegram has become a primary distribution channel for stealer logs. We implemented a crawler that monitors hundreds of channels in real-time.
> The scale of the problem is staggering. In a single day, we observed over 50 channels sharing fresh credential dumps, each containing anywhere from 1,000 to 100,000 entries.
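Because bandwidth is the bottleneck, the crawler pre-filters attachments before downloading anything. The sketch below is illustrative, not our exact production rules: the extension list, size threshold, and function name are assumptions. In production, a check like this runs inside a Telethon `NewMessage` event handler.

```python
# Illustrative pre-filter: decide whether a Telegram attachment is worth
# downloading as a potential stealer-log dump. The extension list and
# size threshold here are examples, not our exact production rules.
DUMP_EXTENSIONS = {".zip", ".rar", ".7z", ".txt"}

def looks_like_dump(filename: str, size_bytes: int) -> bool:
    """Cheap heuristic applied before spending bandwidth on a download."""
    name = filename.lower()
    has_dump_ext = any(name.endswith(ext) for ext in DUMP_EXTENSIONS)
    # Dumps with 1,000+ entries are rarely tiny; skip obvious noise.
    return has_dump_ext and size_bytes > 10_000

# In production this sits inside a Telethon handler, roughly:
#   @client.on(events.NewMessage(chats=monitored_channels))
#   async def handler(event):
#       if event.message.document and looks_like_dump(name, size):
#           await event.message.download_media()
```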
Key technical challenges included:

- Monitoring hundreds of channels in real-time without falling behind
- Downloading and unpacking dumps ranging from 1,000 to 100,000 entries
- Deduplicating credentials that resurface across channels weeks apart
A typical stealer log contains structured data like this:
```
URL: https://example.com/login
Username: user@email.com
Password: P@ssw0rd123!
Application: Chrome
IP: 192.168.1.1
Country: United States
```
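Records in this `Key: Value` layout can be parsed with a few lines of Python. This is a minimal sketch; real logs vary between malware families, which is why the production parser (written in Rust) handles far more edge cases:

```python
def parse_record(block: str) -> dict:
    """Parse one 'Key: Value' stealer-log record into a dict.

    Splits on the first ':' only, so URLs containing ':' stay intact.
    """
    record = {}
    for line in block.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip().lower()] = value.strip()
    return record

sample = """\
URL: https://example.com/login
Username: user@email.com
Password: P@ssw0rd123!
Application: Chrome
"""
print(parse_record(sample)["url"])  # https://example.com/login
```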
One of our biggest challenges was deduplication. The same credentials often appear across multiple channels, sometimes weeks apart.
We implemented a multi-layer deduplication strategy:
```python
from pybloom_live import ScalableBloomFilter
from redis import Redis


class DeduplicationPipeline:
    def __init__(self):
        # Layer 1: in-memory Bloom filter -- fast membership test,
        # false positives possible but no false negatives.
        self.bloom_filter = ScalableBloomFilter(
            initial_capacity=10_000_000,
            error_rate=0.001,
        )
        # Layer 2: Redis holds recently seen entries.
        self.redis_cache = Redis(decode_responses=True)

    def is_duplicate(self, credential_hash: str) -> bool:
        # First check: Bloom filter (fast, may have false positives).
        # A miss is authoritative: the hash has never been seen.
        if credential_hash not in self.bloom_filter:
            return False
        # Second check: Redis for recent entries.
        if self.redis_cache.exists(f"cred:{credential_hash}"):
            return True
        # Third check: database lookup for historical entries, needed
        # because the Bloom filter can report false positives.
        return self.db_check(credential_hash)
```
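The pipeline keys on a hash of the normalized credential rather than the raw log line, so the same credential hashes identically even when dumps format it differently. The scheme below is one plausible normalization, assuming host, username, and password together identify a credential; the exact canonical form we use differs:

```python
import hashlib
from urllib.parse import urlsplit

def credential_hash(url: str, username: str, password: str) -> str:
    """Derive a stable dedup key: the same credential yields the same
    hash even when the URL path or query differs between dumps."""
    host = urlsplit(url).netloc.lower()
    canonical = f"{host}|{username.strip().lower()}|{password}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = credential_hash("https://example.com/login", "User@Email.com", "P@ssw0rd123!")
b = credential_hash("https://example.com/account?next=/", "user@email.com", "P@ssw0rd123!")
print(a == b)  # True
```

Note that the password is deliberately left untouched: case and whitespace are significant there, unlike in hostnames and most usernames.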
Discovering new distribution channels is an ongoing effort. We use several techniques, including following cross-posts and invite links shared in channels we already monitor and searching for keywords commonly attached to fresh dumps.
The final architecture processes data in real-time with the following flow:
```
Telegram Channels → Crawler → Parser → Deduplication → Storage → API
                                                          ↓
                                               Notification Service
                                                          ↓
                                              Affected Organizations
```
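Before fan-out, the Notification Service groups fresh hits by affected domain so each organization receives one notification covering all of its exposed accounts. A minimal sketch of that grouping step (the record shape is illustrative):

```python
from collections import defaultdict

def group_by_domain(records: list[dict]) -> dict[str, list[dict]]:
    """Bucket parsed credential records by domain so one notification
    per organization covers all of its exposed accounts."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for rec in records:
        buckets[rec["domain"]].append(rec)
    return dict(buckets)

hits = [
    {"domain": "example.com", "username": "alice"},
    {"domain": "example.org", "username": "bob"},
    {"domain": "example.com", "username": "carol"},
]
grouped = group_by_domain(hits)
print({d: len(v) for d, v in grouped.items()})  # {'example.com': 2, 'example.org': 1}
```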
After six months of operation, the results are summarized in the metrics table below.
The stealer log ecosystem continues to evolve, with new malware families and distribution methods emerging regularly. Our system has proven effective at processing large volumes of data, but the cat-and-mouse game between defenders and attackers shows no signs of slowing.
| Metric | Value |
|---|---|
| Daily Processing Capacity | 5M credentials |
| Storage Used | 2.5 TB |
| API Queries/Day | 50,000 |
| Uptime | 99.9% |
This research was conducted for defensive purposes to help organizations identify compromised credentials and protect their users.