ClamAV on Steroids: 35,000 YARA Rules and a Lot of Attitude

You can test it here: av.sandkiste.io

Introduction

If you’re anything like me, you’ve probably had one of those random late-night thoughts:

What if I built a scalable cluster of ClamAV instances, loaded it up with 35,000 YARA rules, and used it to really figure out what a file is capable of, whether it’s actually a virus or just acting suspicious?

It’s the kind of idea that starts as a “wouldn’t it be cool” moment and then slowly turns into “well… now I have to build it.”

And if that thought has never crossed your mind, that’s fine – because I’m going to walk you through it anyway.

How it Started

Like many of my projects, this one was born out of pure anger.

I was told, with a straight face, that scaling our ClamAV cluster into something actually usable would take multiple people, several days, extra resources, and probably outside help.

I told them I would do this in an afternoon, fully working, with a REST API and a frontend.

They laughed.

That same afternoon, I shipped the app.

How It’s Going

Step one: You upload a file.

The scanner gets to work and you wait for it to finish.

Once it’s done, you can dive straight into the results.

That first result was pretty boring.

So, I decided to spice things up by testing the Windows 11 Download Helper tool, straight from Microsoft’s own website.

You can see it’s clean, but it does have a few “invasive” features.

Most of these are perfectly normal for installer tools.

This isn’t a sandbox in the traditional sense. YARA rules match string and byte patterns (or combinations of them) in a file’s contents, and the scanner infers possible capabilities from whatever matches. Nothing is ever executed. A lot of the time, that’s enough to give you interesting insights, but it’s not a replacement for a full sandbox if you really want to see what the file does in action.
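To make the "patterns in, capabilities out" idea concrete, here’s a toy sketch in plain Python. It is not YARA and not how Malcontent works internally; the `ToyRule` class, the rule names, and the patterns are all made up for illustration. It only shows the principle: count which byte patterns appear in a file and report a capability once enough of them match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRule:
    """A heavily simplified stand-in for a YARA rule (hypothetical, not YARA syntax)."""
    capability: str
    patterns: tuple[bytes, ...]
    min_matches: int  # how many patterns must be present, similar to "2 of them"


# Hypothetical rules for illustration only; real rule sets are far more precise.
TOY_RULES = (
    ToyRule("net/download", (b"URLDownloadToFile", b"InternetOpenUrl", b"http://"), 2),
    ToyRule("persistence/registry-run", (b"CurrentVersion\\Run", b"RegSetValueEx"), 2),
)


def infer_capabilities(data: bytes, rules: tuple[ToyRule, ...] = TOY_RULES) -> list[str]:
    """Return the capabilities whose patterns occur often enough in the raw bytes."""
    found = []
    for rule in rules:
        hits = sum(1 for pattern in rule.patterns if pattern in data)
        if hits >= rule.min_matches:
            found.append(rule.capability)
    return found


sample = b"MZ...InternetOpenUrlA...http://example.invalid/setup.exe"
print(infer_capabilities(sample))  # ['net/download']
```

Notice that the sample is never run. A perfectly harmless installer that downloads things would match the same rule, which is exactly why a clean file can still show "invasive" features.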

The Setup

Here’s what you need to get this running:

  • HAProxy: for TCP load balancing
  • 2 ClamAV instances: plus a third dedicated to updating definitions
  • Malcontent: YARA Scanner
  • Database: to store scan results

You’ll also need a frontend and an API… but we’ll get to that part soon.

YAML
services:

  haproxy:
    image: haproxy:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:3310:3310"
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
    networks:
      - clam-net
    depends_on:
      - clamd1
      - clamd2

  clamd1:
    image: clamav/clamav-debian:latest
    restart: unless-stopped
    networks:
      - clam-net
    volumes:
      - ./tmp/uploads:/scandir
      - clamav-db:/var/lib/clamav
    command: ["clamd", "--foreground=true"]

  clamd2:
    image: clamav/clamav-debian:latest
    restart: unless-stopped
    networks:
      - clam-net
    volumes:
      - ./tmp/uploads:/scandir
      - clamav-db:/var/lib/clamav
    command: ["clamd", "--foreground=true"]

  freshclam:
    image: clamav/clamav-debian:latest
    restart: unless-stopped
    networks:
      - clam-net
    volumes:
      - clamav-db:/var/lib/clamav
    command: ["freshclam", "-d", "--foreground=true", "--checks=24"]

  mariadb:
    image: mariadb:latest
    restart: unless-stopped
    environment:
      MARIADB_ROOT_PASSWORD: SECREEEEEEEET
      MARIADB_DATABASE: avscanner
      MARIADB_USER: avuser
      MARIADB_PASSWORD: SECREEEEEEEET2
    volumes:
      - mariadb-data:/var/lib/mysql
    ports:
      - "127.0.0.1:3306:3306"

volumes:
  mariadb-data:
  clamav-db:

networks:
  clam-net:

Here’s my haproxy.cfg:

haproxy.cfg
global
    daemon
    maxconn 256

defaults
    mode tcp
    timeout connect 5s
    timeout client  50s
    timeout server  50s

frontend clamscan
    bind *:3310
    default_backend clamd_pool

backend clamd_pool
    balance roundrobin
    server clamd1 clamd1:3310 check
    server clamd2 clamd2:3310 check

Now you’ve got yourself a fully functioning ClamAV cluster, yay 🦄🎉!
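If you want to check that the cluster actually answers through HAProxy, here’s a minimal sketch using nothing but the standard library. It sends clamd’s PING command and expects PONG back. The function names (`is_pong`, `clamd_ping`, `check_cluster`) are my own for this example, and the default host and port assume the compose file above, where HAProxy is published on 127.0.0.1:3310.

```python
import socket


def is_pong(reply: bytes) -> bool:
    """True if the raw reply is clamd's answer to PING."""
    return reply.rstrip(b"\0\n") == b"PONG"


def clamd_ping(host: str = "127.0.0.1", port: int = 3310, timeout: float = 5.0) -> bool:
    """Open one TCP connection, send PING, and check for PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # The "z" prefix asks clamd for NUL-terminated commands and replies.
            sock.sendall(b"zPING\0")
            reply = sock.recv(64)
    except OSError:
        # Connection refused, timeout, reset, ...: treat it as "not healthy".
        return False
    return is_pong(reply)


def check_cluster(attempts: int = 4) -> list[bool]:
    """Ping several times; each call opens a new connection for HAProxy to balance."""
    return [clamd_ping() for _ in range(attempts)]
```

Calling `check_cluster()` on the Docker host should give you a list of `True` values once both clamd containers have finished loading their signatures. Since the backend uses `balance roundrobin`, consecutive connections should alternate between clamd1 and clamd2.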

FastAPI

I’m not going to dive deep into setting up an API with FastAPI (their docs cover that really well), but here’s the code I use:

Python
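import shutil
import uuid
from pathlib import Path
from typing import List

from fastapi import FastAPI, File, UploadFile

app = FastAPI()
# Assumption: the directory the clamd containers mount as /scandir.
UPLOAD_DIR = Path("tmp/uploads")


@app.post("/upload")
async def upload_and_scan(files: List[UploadFile] = File(...)):
    results = []

    for file in files:
        upload_id = str(uuid.uuid4())
        # Keep only the base name so a crafted filename cannot escape UPLOAD_DIR.
        safe_name = Path(file.filename or "upload").name
        temp_path = UPLOAD_DIR / f"{upload_id}_{safe_name}"

        try:
            # Stream to disk instead of reading the whole upload into memory.
            with temp_path.open("wb") as f_out:
                shutil.copyfileobj(file.file, f_out)

            # scan_and_store_file lives elsewhere in the app (not shown here).
            result = scan_and_store_file(
                file_path=temp_path,
                original_filename=file.filename,
            )
            results.append(result)
        finally:
            # Always remove the upload, even if writing or scanning fails.
            temp_path.unlink(missing_ok=True)

    return {"success": True, "data": {"result": results}}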

There’s a lot more functionality in other functions, but here’s the core flow:

  1. Save the uploaded file to a temporary path
  2. Check if the file’s hash is already in the database (if yes, return cached results)
  3. Use pyclamd to submit the file to our ClamAV cluster
  4. Run Malcontent as the YARA scanner
  5. Store the results in the database
  6. Delete the file
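The flow above, minus the real scanners, can be sketched roughly like this. To be clear, this is not my actual `scan_and_store_file`: a plain dict stands in for the database, the two scanners are passed in as callables, and names like `scan_with_cache`, `fake_av`, and `fake_yara` exist only for this example. The point is the hash-first shortcut.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Any, Callable

Scanner = Callable[[Path], Any]


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so large uploads never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan_with_cache(
    path: Path,
    cache: dict[str, dict[str, Any]],  # stand-in for the results database
    av_scan: Scanner,                  # stand-in for the ClamAV call
    yara_scan: Scanner,                # stand-in for the Malcontent call
) -> dict[str, Any]:
    """Return cached results for a known hash, otherwise scan and store."""
    file_hash = sha256_of_file(path)
    if file_hash in cache:
        return {**cache[file_hash], "cached": True}

    result = {
        "sha256": file_hash,
        "clamav": av_scan(path),
        "capabilities": yara_scan(path),
    }
    cache[file_hash] = result
    return {**result, "cached": False}


# Tiny demo with fake scanners that only record that they were called.
calls: list[str] = []


def fake_av(path: Path) -> None:
    calls.append("av")
    return None


def fake_yara(path: Path) -> dict[str, Any]:
    calls.append("yara")
    return {}


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    cache: dict[str, dict[str, Any]] = {}
    first = scan_with_cache(sample, cache, fake_av, fake_yara)
    second = scan_with_cache(sample, cache, fake_av, fake_yara)

print(first["cached"], second["cached"], calls)  # False True ['av', 'yara']
```

The second call never touches the scanners, which is the whole reason for hashing first: re-uploading a file that has already been analyzed costs one pass over the bytes and a lookup.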

Here’s how I use Malcontent in my MVP:

Python
import json
import subprocess
from pathlib import Path
from typing import Any


def analyze_capabilities(filepath: Path) -> dict[str, Any]:
    path = Path(filepath).resolve()
    if not path.exists() or not path.is_file():
        raise FileNotFoundError(f"File not found: {filepath}")

    # Mount the file's directory into a throwaway container and analyze the file there.
    cmd = [
        "docker",
        "run",
        "--rm",
        "-v",
        f"{path.parent}:/scan",
        "cgr.dev/chainguard/malcontent:latest",
        "--format=json",
        "analyze",
        f"/scan/{path.name}",
    ]

    try:
        # check=True turns a non-zero exit code into CalledProcessError.
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return json.loads(result.stdout)
    except subprocess.CalledProcessError as e:
        raise RuntimeError(f"malcontent failed: {e.stderr.strip()}") from e
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON output from malcontent: {e}") from e

I’m not going to get into the whole frontend. It just talks to the API and makes things look nice.

For status updates, I use long polling instead of WebSockets. Other than that, it’s all pretty straightforward.
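If long polling is new to you: the client asks "anything changed?", and the server simply holds the request open until something does, or until a timeout hits. Here’s a minimal asyncio sketch of the server-side half. It is an illustration under my own assumptions, not the code running behind the frontend: `ScanStatus` is a hypothetical in-memory holder, and the FastAPI endpoint wiring is left out.

```python
import asyncio


class ScanStatus:
    """In-memory status holder (hypothetical; a real app would back this with the database)."""

    def __init__(self) -> None:
        self.state = "queued"
        self._changed = asyncio.Event()

    def update(self, state: str) -> None:
        """Called by the scan worker whenever the status moves on."""
        self.state = state
        self._changed.set()

    async def wait_for_change(self, known_state: str, timeout: float) -> str:
        """Return as soon as the state differs from known_state, or after timeout."""
        if self.state != known_state:
            return self.state  # the client is already behind, answer immediately
        self._changed.clear()
        try:
            await asyncio.wait_for(self._changed.wait(), timeout)
        except asyncio.TimeoutError:
            pass  # nothing happened; the client just asks again
        return self.state


async def demo() -> list[str]:
    status = ScanStatus()  # created inside the running event loop

    async def worker() -> None:
        await asyncio.sleep(0.05)
        status.update("scanning")
        await asyncio.sleep(0.05)
        status.update("done")

    task = asyncio.create_task(worker())
    seen = []
    state = status.state
    while state != "done":
        # Each loop iteration corresponds to one long-poll request from the client.
        state = await status.wait_for_change(state, timeout=1.0)
        seen.append(state)
    await task
    return seen


print(asyncio.run(demo()))
```

Compared to WebSockets, this needs nothing beyond plain HTTP requests, which keeps both the frontend and the HAProxy setup simple. The price is one open request per waiting client.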

Final Thoughts

I wanted something that could handle large files too, and so far this setup delivers, since uploads are streamed to local disk instead of being held in memory. For a production deployment, I’d recommend running it under something like Kata Containers, which is my go-to for running sketchy, untrusted workloads safely.

Always handle malicious files with caution. In this setup, you’re not executing anything, so you should mostly be safe, but remember, AV systems themselves can be exploited, so stay careful.

As for detection, I don’t think ClamAV alone is enough for solid malware protection. It’s better than nothing, but its signatures aren’t updated as frequently as I’d like. For a truly production-grade solution, I’d probably buy a personal AV product, build my own cluster and CLI tool for it, and plug that in. Most licenses let you use multiple devices, so you could easily scale to 10 workers for about €1.50 a month (just grab a license from your preferred software key site).

Of course, this probably violates license terms. I’m not a lawyer 😬

Anyway, I just wanted to show you something I built, so I built it, and now I’m showing it.

One day, this will be part of my Sandkiste tool suite. I’m also working on a post about another piece of Sandkiste I call “Data Loss Containment”, but that one’s long and technical, so it might take a while.

Love ya, thanks for reading, byeeeeeeee ❤️
