build: ✨ update local setup (#140)
Co-authored-by: Sam Washko <[email protected]>
CandiedCode and swashko authored Jan 31, 2025
1 parent 341582a commit 2226bc4
Showing 6 changed files with 167 additions and 36 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -132,4 +132,7 @@ cython_debug/

# Notebook Model Downloads
notebooks/PyTorchModels/
pytorch-model-scan-results.json
pytorch-model-scan-results.json

# Code Coverage
cov.xml
61 changes: 48 additions & 13 deletions Makefile
@@ -1,42 +1,77 @@
.DEFAULT_GOAL := help
VERSION ?= $(shell dunamai from git --style pep440 --format "{base}.dev{distance}+{commit}")

install-dev:
.PHONY: env
env: ## Display information about the current environment.
poetry env info

.PHONY: install-dev
install-dev: ## Install all dependencies including dev and test dependencies, as well as pre-commit.
poetry install --with dev --with test --extras "tensorflow h5py"
pre-commit install

install:
.PHONY: install
install: ## Install required dependencies.
poetry install

install-prod:
.PHONY: install-prod
install-prod: ## Install prod dependencies.
poetry install --with prod

install-test:
.PHONY: install-test
install-test: ## Install test dependencies.
poetry install --with test --extras "tensorflow h5py"

clean:
pip uninstall modelscan
.PHONY: clean
clean: ## Uninstall modelscan
python -m pip uninstall modelscan

.PHONY: test
test: ## Run pytests.
poetry run pytest tests/

test:
poetry run pytest
.PHONY: test-cov
test-cov: ## Run pytests with code coverage.
poetry run pytest --cov=modelscan --cov-report xml:cov.xml tests/

build:
.PHONY: build
build: ## Build the source and wheel archive.
poetry build

.PHONY: build-prod
build-prod: version
build-prod: ## Update the version and build wheel archive.
poetry build

version:
.PHONY: version
version: ## Bumps the version of the project.
echo "__version__ = '$(VERSION)'" > modelscan/_version.py
poetry version $(VERSION)

.PHONY: lint
lint: bandit mypy
lint: ## Run all the linters.

bandit:
.PHONY: bandit
bandit: ## Run SAST scanning.
poetry run bandit -c pyproject.toml -r .

mypy:
.PHONY: mypy
mypy: ## Run type checking.
poetry run mypy --ignore-missing-imports --strict --check-untyped-defs .

format:
.PHONY: format
format: ## Run black to format the code.
black .


.PHONY: help
help: ## List all targets and help information.
@grep --no-filename -E '^([a-z.A-Z_%-/]+:.*?)##' $(MAKEFILE_LIST) | sort | \
awk 'BEGIN {FS = ":.*?(## ?)"}; { \
if (length($$1) > 0) { \
printf " \033[36m%-30s\033[0m %s\n", $$1, $$2; \
} else { \
printf "%s\n", $$2; \
} \
}'
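
The `help` target above relies on the `target: ## description` comment convention. As a rough illustration of what the grep/awk pipeline extracts, here is a minimal Python sketch. The names `HELP_LINE`, `list_targets`, and `SAMPLE` are hypothetical, and the regular expression is a simplified approximation of the one in the Makefile, not a drop-in replacement.

```python
import re

# Approximation of the Makefile's grep pattern (an assumption, not an exact port):
# a target name, a colon, anything, then "##" followed by the help text.
HELP_LINE = re.compile(r"^([A-Za-z0-9_.%/-]+):.*?##\s?(.*)$")


def list_targets(makefile_text: str) -> list[tuple[str, str]]:
    """Return sorted (target, description) pairs for self-documented targets."""
    entries = []
    for line in makefile_text.splitlines():
        match = HELP_LINE.match(line)
        if match:
            entries.append((match.group(1), match.group(2).strip()))
    return sorted(entries)


# Hypothetical excerpt modeled on the targets in this commit.
SAMPLE = """\
.PHONY: test
test: ## Run pytests.
\tpoetry run pytest tests/

.PHONY: build-prod
build-prod: version
build-prod: ## Update the version and build wheel archive.
\tpoetry build
"""

for target, description in list_targets(SAMPLE):
    print(f"{target:<30} {description}")
```

Lines without a `##` comment, such as `.PHONY: test` or `build-prod: version`, are skipped, which is why a target can declare its prerequisites on one line and its help text on another.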
24 changes: 15 additions & 9 deletions README.md
@@ -7,7 +7,10 @@
[![Supported Versions](https://img.shields.io/pypi/pyversions/modelscan.svg)](https://pypi.org/project/modelscan)
[![pypi Version](https://img.shields.io/pypi/v/modelscan)](https://pypi.org/project/modelscan)
[![License: Apache 2.0](https://img.shields.io/crates/l/apa)](https://opensource.org/license/apache-2-0/)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)

# ModelScan: Protection Against Model Serialization Attacks

Machine Learning (ML) models are shared publicly over the internet, within teams and across teams. The rise of Foundation Models has resulted in public ML models being increasingly consumed for further training/fine-tuning. ML models are increasingly used to make critical decisions and power mission-critical applications.
Despite this, models are not yet scanned with the rigor of a PDF file in your inbox.

@@ -77,10 +80,10 @@ takes for your computer to process the total filesize from disk(seconds in most

ModelScan ranks the unsafe code as:

* CRITICAL
* HIGH
* MEDIUM
* LOW
- CRITICAL
- HIGH
- MEDIUM
- LOW

![ModelScan Flow Chart](/imgs/model_scan_flow_chart.png)

@@ -104,6 +107,7 @@ At present, ModelScan supports any Pickle derived format and many others:
| Classic ML Libraries (Sklearn, XGBoost etc.) | pickle.dump(), dill.dump(), joblib.dump(), cloudpickle.dump() | Pickle, Cloudpickle, Dill, Joblib | Yes |

### Installation

ModelScan is installed on your system as a Python package (Python 3.9 to 3.12 supported). As shown above, you can install
it by running this in your terminal:

@@ -119,6 +123,7 @@ modelscan = ">=0.1.1"
```

Scanners for TensorFlow or HDF5 formatted models require installation with extras:

```bash
pip install 'modelscan[ tensorflow, h5py ]'
```
@@ -129,20 +134,21 @@ ModelScan supports the following arguments via the CLI:

| Usage | Argument | Explanation |
|----------------------------------------------------------------------------------|------------------|---------------------------------------------------------|
| ```modelscan -h ``` | -h or --help | View usage help |
| ```modelscan -v ``` | -v or --version | View version information |
| ```modelscan -h``` | -h or --help | View usage help |
| ```modelscan -v``` | -v or --version | View version information |
| ```modelscan -p /path/to/model_file``` | -p or --path | Scan a locally stored model |
| ```modelscan -p /path/to/model_file --settings-file ./modelscan-settings.toml``` | --settings-file | Scan a locally stored model using custom configurations |
| ```modelscan create-settings-file``` | -l or --location | Create a configurable settings file |
| ```modelscan -r``` | -r or --reporting-format | Format of the output. Options are console, json, or custom (to be defined in settings-file). Default is console |
| ```modelscan -r reporting-format -o file-name``` | -o or --output-file | Optional file name for output report |
| ```modelscan --show-skipped``` | --show-skipped | Print a list of files that were skipped during the scan |


Remember that models are just like any other form of digital media: you should scan content from any untrusted source before use.

##### CLI Exit Codes
#### CLI Exit Codes

The CLI exit status codes are:

- `0`: Scan completed successfully, no vulnerabilities found
- `1`: Scan completed successfully, vulnerabilities found
- `2`: Scan failed, modelscan threw an error while scanning
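
The exit codes listed above can be used to gate a script or CI job. The following is a minimal sketch, assuming `modelscan` is installed and on `PATH`. The helper names `describe_exit_code` and `scan_model` are hypothetical, and only the three codes shown in this hunk are mapped; any other code is reported as unrecognized rather than guessed.

```python
import subprocess

# Meanings copied from the exit-code list above; other codes are left unmapped.
EXIT_CODE_MEANINGS = {
    0: "Scan completed successfully, no vulnerabilities found",
    1: "Scan completed successfully, vulnerabilities found",
    2: "Scan failed, modelscan threw an error while scanning",
}


def describe_exit_code(code: int) -> str:
    """Translate a modelscan exit status into a human-readable message."""
    return EXIT_CODE_MEANINGS.get(code, f"Unrecognized exit code: {code}")


def scan_model(path: str) -> int:
    """Run `modelscan -p <path>` and return its exit status."""
    # check=False: a non-zero status is an expected outcome, not an exception.
    completed = subprocess.run(["modelscan", "-p", path], check=False)
    return completed.returncode


# Example usage (requires modelscan to be installed):
#   code = scan_model("/path/to/model_file")
#   print(describe_exit_code(code))
```

Treating status `1` as a distinct, expected outcome lets a pipeline tell "the scan ran and found something" apart from "the scan itself failed".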
@@ -201,7 +207,7 @@ Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0
<http://www.apache.org/licenses/LICENSE-2.0>

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
13 changes: 7 additions & 6 deletions docs/model_serialization_attacks.md
@@ -5,7 +5,7 @@ Machine Learning(ML) models are the foundational asset in ML powered application
Models can be compromised in various ways: some are new, like adversarial machine learning methods; others are shared with traditional applications, like denial-of-service attacks. While these can be a threat to safely operating an ML-powered application, this document focuses on exposing the risk of Model Serialization Attacks.
In a Model Serialization Attack, malicious code is added to a model when it is saved; this is also called a code injection attack. When any user or system then loads the model for further training or inference, the attack code is executed immediately, often with no visible change in behavior to users. This makes the attack a powerful vector and an easy point of entry for attacking broader machine learning components.

To secure ML models, you need to understand what’s inside them and how they are stored on disk in a process called serialization.
To secure ML models, you need to understand what’s inside them and how they are stored on disk in a process called serialization.

ML models are composed of:

@@ -30,7 +30,7 @@ Before digging into how a Model Serialization Attack works and how to scan for t

## 1. Pickle Variants

**Pickle** and its variants (cloudpickle, dill, joblib) all store objects to disk in a general purpose way. These frameworks are completely ML agnostic and store Python objects as-is.
**Pickle** and its variants (cloudpickle, dill, joblib) all store objects to disk in a general purpose way. These frameworks are completely ML agnostic and store Python objects as-is.

Pickle is the de facto library for serializing ML models for the following ML frameworks:

@@ -47,15 +47,15 @@ Pickle is also used to store vectors/tensors only for following frameworks:
Pickle allows for arbitrary code execution and is highly vulnerable to code injection attacks, with a very large attack surface. The pickle documentation makes this clear with the following warning:

> **Warning:** The `pickle` module **is not secure**. Only unpickle data you trust.
>
>
>
>
> It is possible to construct malicious pickle data which will **execute
> arbitrary code during unpickling**. Never unpickle data that could have come
> from an untrusted source, or that could have been tampered with.
>
>
> Consider signing data with [hmac](https://docs.python.org/3/library/hmac.html#module-hmac) if you need to ensure that it has not
> been tampered with.
>
>
> Safer serialization formats such as [json](https://docs.python.org/3/library/json.html#module-json) may be more appropriate if
> you are processing untrusted data.
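
To make the warning concrete, the sketch below builds a pickle whose `__reduce__` hook asks the loader to call `os.system`, then lists the imports it references without ever unpickling it. This is an illustration only and is not how modelscan itself is implemented. The names `Payload` and `referenced_globals` are hypothetical, and protocol 2 is chosen as a simplifying assumption because it encodes imports with the `GLOBAL` opcode.

```python
import os
import pickle
import pickletools


class Payload:
    """Illustrative object whose pickle asks the loader to call os.system."""

    def __reduce__(self):
        # On load, pickle would call os.system("echo pwned").
        # This sketch never loads the data, so nothing is executed.
        return (os.system, ("echo pwned",))


def referenced_globals(data: bytes) -> list[str]:
    """List "module name" pairs referenced by GLOBAL opcodes, without unpickling."""
    return [
        str(arg)
        for opcode, arg, _pos in pickletools.genops(data)
        if opcode.name == "GLOBAL"
    ]


# Protocol 2 keeps the sketch short; newer protocols use STACK_GLOBAL,
# which needs extra stack tracking to recover the module and name.
malicious = pickle.dumps(Payload(), protocol=2)
print(referenced_globals(malicious))
```

On Linux or macOS the referenced module is typically reported as `posix` and on Windows as `nt`, since that is where `os.system` is actually defined.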
@@ -129,6 +129,7 @@ With the exception of pickle, these formats cannot execute arbitrary code. Howev
With an understanding of various approaches to model serialization, explore how many popular choices are vulnerable to this attack with an end to end explanation.

# End to end Attack Scenario

1. Internal attacker:
The attack complexity will vary depending on the access entrusted to an internal actor.
2. External attacker:
15 changes: 8 additions & 7 deletions docs/severity_levels.md
@@ -1,15 +1,16 @@
# modelscan Severity Levels

modelscan classifies potentially malicious code injection attacks in the following four severity levels.
modelscan classifies potentially malicious code injection attacks in the following four severity levels.
<br> </br>

- **CRITICAL:** A model file that consists of unsafe operators/globals that can execute code is classified at critical severity. These operators are:
- exec, eval, runpy, sys, open, breakpoint, os, subprocess, socket, nt, posix
- exec, eval, runpy, sys, open, breakpoint, os, subprocess, socket, nt, posix
<br> </br>
- **HIGH:** A model file that consists of unsafe operators/globals that cannot execute code but can still be exploited is classified at high severity. These operators are:
- webbrowser, httplib, request.api, Tensorflow ReadFile, Tensorflow WriteFile
- webbrowser, httplib, request.api, Tensorflow ReadFile, Tensorflow WriteFile
<br> </br>
- **MEDIUM:** A model file that consists of operators/globals that are neither supported by the parent ML library nor known to modelscan is classified at medium severity.
- Keras Lambda layer can also be used for arbitrary code execution. In general, it is not a best practice to add a Lambda layer to an ML model, since it can be exploited for code injection attacks.
- Work in Progress: Custom operators will be classified at medium severity.
- **MEDIUM:** A model file that consists of operators/globals that are neither supported by the parent ML library nor known to modelscan is classified at medium severity.
- Keras Lambda layer can also be used for arbitrary code execution. In general, it is not a best practice to add a Lambda layer to an ML model, since it can be exploited for code injection attacks.
- Work in Progress: Custom operators will be classified at medium severity.
<br> </br>
- **LOW:** At the moment no operators/globals are classified at low severity level.
- **LOW:** At the moment no operators/globals are classified at low severity level.
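
The severity rules above amount to a lookup from operator or global name to a level. The following is a minimal sketch of that idea, not modelscan's actual implementation. The names `Severity`, `CRITICAL_NAMES`, `HIGH_NAMES`, and `classify` are hypothetical, the name sets are copied from the lists above, and treating every unrecognized name as MEDIUM is a simplifying assumption that ignores operators the parent ML library legitimately supports.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"


# Names copied from the severity list above.
CRITICAL_NAMES = frozenset({
    "exec", "eval", "runpy", "sys", "open", "breakpoint",
    "os", "subprocess", "socket", "nt", "posix",
})
# ReadFile and WriteFile refer to the TensorFlow operators named above.
HIGH_NAMES = frozenset({
    "webbrowser", "httplib", "request.api", "ReadFile", "WriteFile",
})


def classify(name: str) -> Severity:
    """Map an operator/global name to a severity level (illustrative only)."""
    if name in CRITICAL_NAMES:
        return Severity.CRITICAL
    if name in HIGH_NAMES:
        return Severity.HIGH
    # Assumption: anything not on a known list is treated as unknown, hence MEDIUM.
    return Severity.MEDIUM


for candidate in ("posix", "webbrowser", "my_custom_op"):
    print(candidate, classify(candidate).value)
```

No name maps to LOW in this sketch, matching the statement above that no operators are currently classified at that level.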