
Over the last several years, the stealer ecosystem has evolved in several aspects, from the malware families involved to the platforms used to distribute logs. Close to a year ago, we began monitoring several of these platforms and building a system to ingest the shared data, in the hope of helping victims.
This research documents our findings and the technical challenges we faced while processing millions of credentials daily.
Stealer malware represents one of the most pervasive threats in today's cybersecurity landscape. These malicious programs are designed to extract sensitive information from infected systems, including:

- Saved browser passwords and autofill data
- Session cookies and authentication tokens
- Cryptocurrency wallet files
- System information such as IP address, country, and installed applications
The journey of stolen credentials follows a predictable pattern: a machine is infected, the stealer exfiltrates its data as a "log," the logs are aggregated and shared or sold on platforms such as Telegram, and the credentials are finally abused or traded downstream.
Building a system to process this volume of data required careful architectural decisions. We needed to handle:

- Millions of raw credential pairs per day
- Hundreds of Telegram channels monitored in parallel
- Log formats that vary between malware families
- Duplicate entries resurfacing across channels
```python
# Daily ingestion statistics (approximate)
daily_logs = {
    "raw_credential_pairs": 2_500_000,
    "unique_domains": 150_000,
    "telegram_channels": 500,
    "processing_time_hours": 4,
}
```
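These figures imply a sustained ingest rate that sized our parser and deduplication layers. A quick back-of-the-envelope calculation:

```python
# Throughput implied by the daily statistics above:
# 2.5M credential pairs processed in a 4-hour window.
daily_pairs = 2_500_000
window_seconds = 4 * 3600

pairs_per_second = daily_pairs / window_seconds
print(f"{pairs_per_second:.0f} credential pairs/second")  # ~174
```

Every component downstream of the crawler therefore has to comfortably sustain a few hundred records per second, with headroom for bursts when large dumps land.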
Our system consists of several key components:
| Component | Purpose | Technology |
|---|---|---|
| Crawler | Telegram channel monitoring | Python, Telethon |
| Parser | Log file extraction | Rust |
| Deduplication | Remove duplicates | Redis, BloomFilter |
| Storage | Credential indexing | PostgreSQL, Elasticsearch |
| API | Query interface | FastAPI |
Telegram has become a primary distribution channel for stealer logs. We implemented a crawler that monitors hundreds of channels in real-time.
> The scale of the problem is staggering. In a single day, we observed over 50 channels sharing fresh credential dumps, each containing anywhere from 1,000 to 100,000 entries.
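Because bandwidth is the bottleneck, the crawler pre-filters attachments before downloading anything. The sketch below is illustrative, not our exact production rules: the extension list, size threshold, and function name are assumptions. In production, a check like this runs inside a Telethon `NewMessage` event handler.

```python
# Illustrative pre-filter: decide whether a Telegram attachment is worth
# downloading as a potential stealer-log dump. The extension list and
# size threshold here are examples, not our exact production rules.
DUMP_EXTENSIONS = {".zip", ".rar", ".7z", ".txt"}

def looks_like_dump(filename: str, size_bytes: int) -> bool:
    """Cheap heuristic applied before spending bandwidth on a download."""
    name = filename.lower()
    has_dump_ext = any(name.endswith(ext) for ext in DUMP_EXTENSIONS)
    # Dumps with 1,000+ entries are rarely tiny; skip obvious noise.
    return has_dump_ext and size_bytes > 10_000

# In production this sits inside a Telethon handler, roughly:
#   @client.on(events.NewMessage(chats=monitored_channels))
#   async def handler(event):
#       if event.message.document and looks_like_dump(name, size):
#           await event.message.download_media()
```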
Key technical challenges included:

- Monitoring hundreds of channels in real-time without falling behind
- Downloading and unpacking dumps ranging from 1,000 to 100,000 entries
- Deduplicating credentials that resurface across channels weeks apart
A typical stealer log contains structured data like this:
```
URL: https://example.com/login
Username: user@email.com
Password: P@ssw0rd123!
Application: Chrome
IP: 192.168.1.1
Country: United States
```
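Records in this `Key: Value` layout can be parsed with a few lines of Python. This is a minimal sketch; real logs vary between malware families, which is why the production parser (written in Rust) handles far more edge cases:

```python
def parse_record(block: str) -> dict:
    """Parse one 'Key: Value' stealer-log record into a dict.

    Splits on the first ':' only, so URLs containing ':' stay intact.
    """
    record = {}
    for line in block.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip().lower()] = value.strip()
    return record

sample = """\
URL: https://example.com/login
Username: user@email.com
Password: P@ssw0rd123!
Application: Chrome
"""
print(parse_record(sample)["url"])  # https://example.com/login
```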
One of our biggest challenges was deduplication. The same credentials often appear across multiple channels, sometimes weeks apart.
We implemented a multi-layer deduplication strategy:
```python
from pybloom_live import ScalableBloomFilter
from redis import Redis


class DeduplicationPipeline:
    def __init__(self):
        # Layer 1: in-memory Bloom filter -- fast membership test,
        # false positives possible but no false negatives.
        self.bloom_filter = ScalableBloomFilter(
            initial_capacity=10_000_000,
            error_rate=0.001,
        )
        # Layer 2: Redis holds recently seen entries.
        self.redis_cache = Redis(decode_responses=True)

    def is_duplicate(self, credential_hash: str) -> bool:
        # First check: Bloom filter (fast, may have false positives).
        # A miss is authoritative: the hash has never been seen.
        if credential_hash not in self.bloom_filter:
            return False
        # Second check: Redis for recent entries.
        if self.redis_cache.exists(f"cred:{credential_hash}"):
            return True
        # Third check: database lookup for historical entries, needed
        # because the Bloom filter can report false positives.
        return self.db_check(credential_hash)
```
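The pipeline keys on a hash of the normalized credential rather than the raw log line, so the same credential hashes identically even when dumps format it differently. The scheme below is one plausible normalization, assuming host, username, and password together identify a credential; the exact canonical form we use differs:

```python
import hashlib
from urllib.parse import urlsplit

def credential_hash(url: str, username: str, password: str) -> str:
    """Derive a stable dedup key: the same credential yields the same
    hash even when the URL path or query differs between dumps."""
    host = urlsplit(url).netloc.lower()
    canonical = f"{host}|{username.strip().lower()}|{password}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = credential_hash("https://example.com/login", "User@Email.com", "P@ssw0rd123!")
b = credential_hash("https://example.com/account?next=/", "user@email.com", "P@ssw0rd123!")
print(a == b)  # True
```

Note that the password is deliberately left untouched: case and whitespace are significant there, unlike in hostnames and most usernames.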
Discovering new distribution channels is an ongoing effort. We use several techniques, including following cross-posts and invite links shared in channels we already monitor and searching for keywords commonly attached to fresh dumps.
The final architecture processes data in real-time with the following flow:
```
Telegram Channels → Crawler → Parser → Deduplication → Storage → API
                                                          ↓
                                               Notification Service
                                                          ↓
                                              Affected Organizations
```
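Before fan-out, the Notification Service groups fresh hits by affected domain so each organization receives one notification covering all of its exposed accounts. A minimal sketch of that grouping step (the record shape is illustrative):

```python
from collections import defaultdict

def group_by_domain(records: list[dict]) -> dict[str, list[dict]]:
    """Bucket parsed credential records by domain so one notification
    per organization covers all of its exposed accounts."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for rec in records:
        buckets[rec["domain"]].append(rec)
    return dict(buckets)

hits = [
    {"domain": "example.com", "username": "alice"},
    {"domain": "example.org", "username": "bob"},
    {"domain": "example.com", "username": "carol"},
]
grouped = group_by_domain(hits)
print({d: len(v) for d, v in grouped.items()})  # {'example.com': 2, 'example.org': 1}
```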
After six months of operation, the results are summarized in the metrics table below.
The stealer log ecosystem continues to evolve, with new malware families and distribution methods emerging regularly. Our system has proven effective at processing large volumes of data, but the cat-and-mouse game between defenders and attackers shows no signs of slowing.
| Metric | Value |
|---|---|
| Daily Processing Capacity | 5M credentials |
| Storage Used | 2.5 TB |
| API Queries/Day | 50,000 |
| Uptime | 99.9% |
This research was conducted for defensive purposes to help organizations identify compromised credentials and protect their users.