capa: migrate to PyGhidra #21

akh7177 · 2025-02-22T15:02:59Z

akh7177
Feb 22, 2025

I’m Abhyuday Hegde, and I came across the “Migrate to PyGhidra” project in @mandiant/flare-gsoc on GitHub, listed for GSoC 2025. I’m interested in contributing to the migration of the capa Ghidra backend to PyGhidra and would like to continue supporting the project beyond the GSoC period.

I have experience with Python and Git/GitHub, but I’m still familiarizing myself with Ghidra and capa. I’m eager to help with tasks like porting functionality, testing, and updating documentation.

Please let me know if there are any next steps or if you’d like to discuss how I can contribute to the project.

mike-hunhoff · 2025-02-24T18:15:56Z

mike-hunhoff
Feb 24, 2025
Maintainer

Hi @akh7177 , you're in the right spot to get started. Please review our contributor guidance and reach out here if you have any questions.

0 replies

akh7177 · 2025-02-25T00:27:19Z

akh7177
Feb 25, 2025
Author

Hi @mike-hunhoff,

As I understand it, the current Ghidra backend of capa was developed using Ghidrathon because Ghidra had no official Python 3 support at the time (2023). However, recent Ghidra builds give direct access to the Ghidra APIs through the PyGhidra library. Since PyGhidra is distributed along with Ghidra, migrating the backend to PyGhidra seems like a more sustainable long-term approach.

Does my understanding align with the project's goals?

Thanks!

4 replies

mike-hunhoff Mar 3, 2025
Maintainer

Hi @akh7177 , you're exactly right! This project aims to replace Ghidrathon with Ghidra's official Python 3 support, PyGhidra. The main benefits are that users will no longer need to install third-party Python 3 support to take advantage of capa's integration with Ghidra, and that the official support may enable us to add a UI component.

akh7177 Mar 4, 2025
Author

Hello @mike-hunhoff , thanks for confirming!

I went through the capa-ghidra codebase to identify the scripts whose API calls need updating. I found that capa_explorer.py is responsible for integrating capa within Ghidra's UI. While updating this script isn't a hassle for me, I feel it might be redundant since there's an entirely separate project focused on adding a new Ghidra Explorer plugin.

Could you share your thoughts on this? I'd appreciate your input.

Thanks!

mike-hunhoff Mar 4, 2025
Maintainer

You are correct: capa_explorer.py is our current solution for UI integration. However, it simply injects capa results into existing Ghidra UI components, e.g. bookmarks, comments, etc. The primary goal of this project is to migrate the code found under https://github.com/mandiant/capa/tree/master/capa/features/extractors/ghidra, while the other project is focused on developing a completely new Ghidra UI component for displaying and interacting with capa results, similar to our UI integration for IDA Pro. Our intent is to work on these projects independently, but this is something we can discuss as part of your project proposal, e.g., maybe we should combine the two projects into a larger project. We're open to ideas and discussion 😄

akh7177 Mar 5, 2025
Author

Great ✨
That makes sense! Let's keep the projects separate for now. Once both are completed, we can revisit and discuss how best to integrate them. Looking forward to collaborating on this :)

akh7177 · 2025-03-03T18:09:02Z

akh7177
Mar 3, 2025
Author

Hello @williballenthin ,

I was re-reading the contributor guidance and noticed that the application deadline is listed as April 2nd. Is the deadline April 2nd this year as well, or is that a typo?

1 reply

mike-hunhoff Mar 3, 2025
Maintainer

Hi @akh7177 , this was a typo, good catch. I've updated the page to include GSoC's official deadline date. Thank you!

akh7177 · 2025-03-05T14:19:19Z

akh7177
Mar 5, 2025
Author

Hello @mike-hunhoff,

I’m thinking of writing unit tests to check all functions in each Ghidra feature extractor script and submitting a unit test report along with my updated code as weekly deliverables. This way, I can ensure that the updated scripts work as expected before moving on to integration tests.

Does this approach sound good to you? I’d appreciate your feedback.

Thanks!

4 replies

mike-hunhoff Mar 5, 2025
Maintainer

Testing is always great! We already have tests for the Ghidra integration (see https://github.com/mandiant/capa/blob/master/tests/test_ghidra_features.py) and our expectation would be for the tests to continue to pass as the existing code is migrated to PyGhidra. A good approach here would be to ensure, weekly, that the existing tests continue to pass as various pieces of the implementation are migrated.

akh7177 Mar 6, 2025
Author

Got it! But I believe I need to modify the test setup so that it supports both Ghidrathon and PyGhidra, detecting which backend is available and adjusting accordingly, right?

mike-hunhoff Mar 10, 2025
Maintainer

Our end goal is to remove Ghidrathon support entirely. This may require changes to how the tests are invoked, but the tests themselves should stay the same.
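
One way the invocation could change while the tests stay untouched is a small availability check that skips the Ghidra tests when PyGhidra cannot be imported. The following is only a sketch of that idea: `module_available` and `skip_reason` are hypothetical helpers, not part of capa's test suite, and it assumes the package is importable under the top-level name `pyghidra`.

```python
import importlib.util
from typing import Optional


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be found for import."""
    # find_spec() returns None for a missing top-level module instead of raising.
    return importlib.util.find_spec(name) is not None


def skip_reason(name: str = "pyghidra") -> Optional[str]:
    """Return a human-readable skip reason, or None if the module is available."""
    if module_available(name):
        return None
    return f"{name} is not installed"
```

With pytest, the result could feed a `pytest.mark.skipif(skip_reason() is not None, reason=...)` marker on the existing Ghidra tests, so the test bodies do not need to change.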

akh7177 Mar 10, 2025
Author

Yup! Got that

DanielS01ss · 2025-03-08T08:31:43Z

DanielS01ss
Mar 8, 2025

Hi @mike-hunhoff,

My name is Daniel Stanculescu, and I am a first-year master's student in Cybersecurity. I have a strong interest in cybersecurity, and I am currently working as a Junior Researcher. I would love to contribute to this project to gain more knowledge in reverse engineering and further support the organization.

I have experience with Python, as well as Git/GitHub.

I am currently studying reverse engineering because I plan to base my master's thesis on malware detection using machine learning.

For now, I will start by contributing with some small changes to the code. Please let me know if there are any other steps I need to take in order to have a chance to contribute to the Google Summer of Code 2025 program.

Also do I need to start another discussion or can I just write in here?

1 reply

mike-hunhoff Mar 10, 2025
Maintainer

Hi @DanielS01ss, please review our contributor guidance and create a new discussion for us to answer your questions and discuss further 😄 .

akh7177 · 2025-03-10T14:18:10Z

akh7177
Mar 10, 2025
Author

Hello @mike-hunhoff 👋

I spent this Sunday working on my GSoC application! Would you be able to review my draft and share your feedback?
Please let me know your email ID so I can send it over.

Thanks!

1 reply

mike-hunhoff Mar 10, 2025
Maintainer

Hi @akh7177 , awesome! Please create a Google Document containing your application and share it with [email protected].

akh7177 · 2025-03-11T14:24:07Z

akh7177
Mar 11, 2025
Author

Hi @mike-hunhoff,
Is there a strict performance benchmark for the migration, or is a slight increase in execution time acceptable?

1 reply

mike-hunhoff Mar 11, 2025
Maintainer

There is no strict performance benchmark at this time. We expect there to be some delay as PyGhidra introduces a layer of indirection, similar to that of Ghidrathon.
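
To put a number on that delay during the migration, a tiny timing helper is probably enough. This is a sketch, not an agreed benchmark: `timed` is a hypothetical name, and it measures a single call with wall-clock time only.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> tuple[T, float]:
    """Call fn() once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start
```

For example, passing `pyghidra.start` would measure the startup cost separately from feature extraction, which helps tell a one-time initialization delay apart from per-call overhead.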

akh7177 · 2025-03-19T16:41:09Z

akh7177
Mar 19, 2025
Author

Hi @mike-hunhoff ,
Is there any mini task related to this project that you would recommend I try out?

1 reply

mike-hunhoff Mar 21, 2025
Maintainer

We don't have any mini tasks related to this project. My recommendation is to experiment with writing and running small test scripts using PyGhidra and Ghidrathon to gain an understanding of how both tools operate and what will be required to migrate the code.

akh7177 · 2025-03-21T15:15:28Z

akh7177
Mar 21, 2025
Author

Hello @mike-hunhoff,

I tried migrating just the extract_os function in global_.py to PyGhidra and it works well, except for a noticeable delay when PyGhidra initializes.

Essentially, I modified the function to accept the program object, supplied via a test script, and verified that it returns the correct OS for the supplied binaries.
When I run the test script with the updated global_.py, the detected OS is printed in my Python interpreter, with everything carried out through PyGhidra.

Please let me know your opinion regarding this!! ✨

Below are the scripts that I modified.

This is the test file that I used

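# test.py
import pyghidra

# Start the Ghidra environment before touching any Ghidra APIs.
pyghidra.start()
if pyghidra.started():
    print("Pyghidra Started")

# Imported after start() so that modules which import Ghidra packages
# at import time can resolve them.
import global_

# open_program() yields a FlatProgramAPI instance for the given binary.
with pyghidra.open_program("../../../../tests/data/2bf18d0403677378adad9001b1243211.elf_") as flat_api:
    for os_feature, address in global_.extract_os(flat_api):
        print(f"Detected OS: {os_feature.value}")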

This is the modified global_.py; I have included only the modified part of the code. As a temporary solution, I created a GHIDRAIO class in this file itself.

# global_.py
import logging
import contextlib
from typing import Iterator

import capa.features.extractors.elf
import capa.features.extractors.ghidra.helpers
from capa.features.common import OS, OS_WINDOWS, Feature
from capa.features.address import NO_ADDRESS, Address

logger = logging.getLogger(__name__)


class GHIDRAIO:
    """File-like view over the program's original file bytes, used for ELF parsing."""

    def __init__(self, flat_api):
        super().__init__()
        self.offset = 0
        self.flat_api = flat_api
        self.bytes_ = self.get_bytes()

    def seek(self, offset, whence=0):
        # only absolute seeks are supported
        assert whence == 0
        self.offset = offset

    def read(self, size):
        # only read within the extracted bytes
        if self.offset + size > len(self.bytes_):
            logger.debug("Cannot read 0x%x bytes at 0x%x (out of bounds)", size, self.offset)
            return b""

        result = self.bytes_[self.offset : self.offset + size]
        self.offset += size
        return result

    def close(self):
        pass

    def get_bytes(self):
        program = self.flat_api.getCurrentProgram()
        memory = program.getMemory()
        file_bytes = memory.getAllFileBytes()[0]

        # getOriginalByte() allows for raw file parsing on the Ghidra side
        # other functions will fail as Ghidra will think that it's reading uninitialized memory
        bytes_ = [file_bytes.getOriginalByte(i) for i in range(file_bytes.getSize())]

        return capa.features.extractors.ghidra.helpers.ints_to_bytes(bytes_)


def extract_os(flat_api) -> Iterator[tuple[Feature, Address]]:
    program = flat_api.getCurrentProgram()
    format_name: str = program.getExecutableFormat()

    if "PE" in format_name:
        yield OS(OS_WINDOWS), NO_ADDRESS

    elif "ELF" in format_name:
        with contextlib.closing(GHIDRAIO(flat_api)) as f:
            os = capa.features.extractors.elf.detect_elf_os(f)

        yield OS(os), NO_ADDRESS

    else:
        logger.debug("unsupported file format: %s, will not guess OS", format_name)
        return
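
The seek/read contract of GHIDRAIO can also be exercised without Ghidra by using a stand-in backed by plain bytes. The class below is a hypothetical test double written for illustration (the name `BytesBackedIO` is an assumption and it is not part of capa); it mirrors the same offset handling and the same "return empty bytes when out of bounds" behaviour.

```python
class BytesBackedIO:
    """Hypothetical stand-in that mirrors GHIDRAIO's seek/read over plain bytes."""

    def __init__(self, data: bytes):
        self.offset = 0
        self.bytes_ = data

    def seek(self, offset: int, whence: int = 0) -> None:
        # only absolute seeks, as in GHIDRAIO
        assert whence == 0
        self.offset = offset

    def read(self, size: int) -> bytes:
        # a read that would cross the end returns nothing and leaves the offset alone
        if self.offset + size > len(self.bytes_):
            return b""
        result = self.bytes_[self.offset : self.offset + size]
        self.offset += size
        return result

    def close(self) -> None:
        pass


f = BytesBackedIO(b"\x7fELF\x02\x01")
print(f.read(4))  # the ELF magic
f.seek(1)
print(f.read(3))
```

Note that, unlike a regular Python file object, a read crossing the end of the data returns nothing at all rather than the remaining bytes.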

2 replies

mike-hunhoff Mar 21, 2025
Maintainer

Awesome, thank you for sharing your code. The challenge here is going to be running capa within PyGhidra's context manager, as discussed in mandiant/capa#2600 (comment). Ideally, we'd run capa within an environment similar to Ghidrathon where Ghidra APIs, etc. are accessible via the Python __builtins__ or similar. I'm not sure if this is possible with PyGhidra, but it's something we'll need to explore as part of this project. I'd like to see your thoughts on this in your application, including how you would recommend approaching it.
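
As a rough illustration of the "accessible via builtins" idea, the sketch below temporarily publishes objects through Python's `builtins` module for the duration of a block and restores the previous state afterwards. Nothing here is PyGhidra or Ghidrathon API: `injected_builtins` and `current_program_or_none` are hypothetical names, and whether this approach fits PyGhidra is exactly the open question.

```python
import builtins
import contextlib
from typing import Any, Iterator

_MISSING = object()


@contextlib.contextmanager
def injected_builtins(**names: Any) -> Iterator[None]:
    """Expose `names` as builtins inside the block, then restore the old state."""
    saved = {key: getattr(builtins, key, _MISSING) for key in names}
    try:
        for key, value in names.items():
            setattr(builtins, key, value)
        yield
    finally:
        for key, old in saved.items():
            if old is _MISSING:
                delattr(builtins, key)
            else:
                setattr(builtins, key, old)


def current_program_or_none() -> Any:
    """Look up a script-style `currentProgram` global, if one has been injected."""
    return getattr(builtins, "currentProgram", None)
```

Code running inside `with injected_builtins(currentProgram=program):` could then refer to `currentProgram` without receiving it as an argument, at the cost of hidden global state.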

akh7177 Mar 21, 2025
Author

Thanks for the reply! I'll make sure to include my views about this in my application 🙌
I've also run the above-mentioned migration (with a few minor changes) in the Ghidra Script Manager. Please have a look at it.

akh7177 · 2025-03-21T21:22:54Z

akh7177
Mar 21, 2025
Author

Hi @mike-hunhoff,

Based on my understanding, by running this script in the Ghidra Script Manager, I am essentially providing it with the same environment that Ghidrathon uses, giving it PyGhidra’s context with native Ghidra API access. This is something we aim to maintain throughout the implementation, rather than executing capa-ghidra from outside the Ghidra environment. Am I correct?

0 replies
