add support for analysis of source code/scripted languages #1080

Draft: wants to merge 51 commits into base: master
Changes from 1 commit
Commits (51)
bbd3f70
Added initial capa control flow for scripts in C#.
adamstorek Jun 27, 2022
8173397
Implemented some further basic TreeSitter Extractor-related concepts …
adamstorek Jun 27, 2022
428f6bc
Modified mypy config file to ignore tree-sitter's missing exports.
adamstorek Jun 28, 2022
a6d7ba2
Implemented core tree sitter engine component with C# queries that se…
adamstorek Jun 28, 2022
80bf78b
Implemented script global extraction handlers (mostly wrapping existi…
adamstorek Jun 28, 2022
cf3dc7e
Reworked format parsing to align better with the rest of capa logic.
adamstorek Jun 28, 2022
9d7f575
Implemented a large part of the C# functionality; refactored the Tree…
adamstorek Jun 29, 2022
3d4b4ec
Added function-level feature extraction.
adamstorek Jun 30, 2022
eca7ead
Bug fixes and code refactoring of the Tree Sitter extractor.
adamstorek Jun 30, 2022
5fd953f
Added tree_sitter to requirements in setup.py.
adamstorek Jun 30, 2022
1f79db9
Added tests for TreeSitterExtractorEngine initialization, new object …
adamstorek Jul 1, 2022
a58bc0b
Added more TreeSitterExtractorEngine tests for pure C#.
adamstorek Jul 1, 2022
5ddb8ba
Added last remaining tests for the TreeSitterExtractorEngine class an…
adamstorek Jul 1, 2022
31e2fb9
Reverted yielding only non-empty strings in order to stay consistent …
adamstorek Jul 5, 2022
5bf3f18
Removing functions that should not be used in tree-sitter extractor (…
adamstorek Jul 5, 2022
a4529fc
Modifying extraction of global statements to omit local function decl…
adamstorek Jul 5, 2022
d5de9a1
Added script language feature to freeze.
adamstorek Jul 5, 2022
6c10458
Added test cases for TS Extractor.
adamstorek Jul 5, 2022
9bd9824
Refactored query bindings.
adamstorek Jul 6, 2022
2594849
Added support for template parsing.
adamstorek Jul 6, 2022
619ed94
Added support for HTML parsing.
adamstorek Jul 6, 2022
5e23802
Implemented the necessary modifications to support embedded templates…
adamstorek Jul 7, 2022
5d83e8d
Added more buildings to build; minor style improvement.
adamstorek Jul 7, 2022
9570523
Further refactored the Tree-sitter queries and fixed minor template e…
adamstorek Jul 7, 2022
7c5e6e3
Refactored extractor engine tests and began adding new template tests.
adamstorek Jul 7, 2022
1e0326a
Added new tests for embedded template testing and refactored a few al…
adamstorek Jul 8, 2022
ca1939f
Bug fixes in extractor and HTML Tree-sitter engine.
adamstorek Jul 8, 2022
d7ab2db
Fixed important namespace-parsing bugs.
adamstorek Jul 11, 2022
5cfbecc
Further improvement to namespace parsing, including default namespace…
adamstorek Jul 11, 2022
26cc1bc
Added more tests and a few minor bug fixes.
adamstorek Jul 11, 2022
2a9e76f
Added language-specific integer parsing.
adamstorek Jul 12, 2022
672ca71
Fixed an important bug in FileOffsetRangeAddress comparison method.
adamstorek Jul 12, 2022
ca426ca
Added more ASPX tests.
adamstorek Jul 12, 2022
fd80277
Fixed the capa control flow to fully support capa scripts.
adamstorek Jul 12, 2022
d0c4acb
Major changes: switching imports and function names to properties, st…
adamstorek Jul 18, 2022
ad31d83
Fixed property-extraction bugs.
adamstorek Jul 19, 2022
e52a9b3
Added few more test cases.
adamstorek Jul 19, 2022
b27713b
Minor style improvements.
adamstorek Jul 19, 2022
b2df2b0
Removed deprecated parse_integer.
adamstorek Jul 19, 2022
a0379a6
Added more tests; fixed integer parsing related bugs.
adamstorek Jul 19, 2022
eeecb63
Fixing address range bug; refactoring and cleanup.
adamstorek Jul 20, 2022
cebc5e1
Incorporated more tests.
adamstorek Jul 20, 2022
d7dcc94
Added support for Python.
adamstorek Jul 26, 2022
32dc5ff
Added more python test cases; fixed a number of python bugs; further …
adamstorek Jul 29, 2022
5e85a6e
Implemented namespace aliasing; further refactored the codebase.
adamstorek Aug 2, 2022
614900f
Refactored/simplified parts of the codebase to improve readability; a…
adamstorek Aug 3, 2022
bb08181
Implemented script language auto-detection.
adamstorek Aug 3, 2022
1fd9d4a
Removed a spurious import.
adamstorek Aug 3, 2022
7ba978f
Added more test cases; moved script language feature to global featur…
adamstorek Aug 5, 2022
25cf09b
Introduced auto-detection to template-script parsing, builtins namesp…
adamstorek Aug 10, 2022
e693573
Attempted to implement the class extraction as specified last Friday …
adamstorek Aug 12, 2022
Implemented the necessary modifications to support embedded templates/html: aspx.
adamstorek committed Jul 19, 2022

commit 5e2380234f46c0117ebfff1340e160c61a8dd22e
2 changes: 1 addition & 1 deletion capa/features/extractors/script.py
@@ -26,7 +26,7 @@ def extract_format() -> Iterator[Tuple[Feature, Address]]:
     yield Format(FORMAT_SCRIPT), NO_ADDRESS


-def get_language_from_ext(path: str):
+def get_language_from_ext(path: str) -> str:
     if path.endswith((".aspx", "aspx_")):
         return LANG_TEM
     if path.endswith((".cs", ".cs_")):
98 changes: 54 additions & 44 deletions capa/features/extractors/ts/engine.py
@@ -1,14 +1,12 @@
 import re
-from typing import Dict, List, Tuple, Union, Iterator
-from collections import defaultdict
-from dataclasses import dataclass
+from typing import List, Tuple, Iterator, Optional

 from tree_sitter import Node, Tree, Parser

 import capa.features.extractors.ts.sig
 import capa.features.extractors.ts.build
 from capa.features.address import FileOffsetRangeAddress
-from capa.features.extractors.script import LANG_CS, LANG_JS
+from capa.features.extractors.script import LANG_CS, LANG_JS, LANG_TEM, LANG_HTML
 from capa.features.extractors.ts.query import (
     QueryBinding,
     HTMLQueryBinding,
@@ -21,18 +19,14 @@
 class TreeSitterBaseEngine:
     buf: bytes
     language: str
-    path: str
     query: QueryBinding
     tree: Tree

-    def __init__(self, language: str, path: str):
+    def __init__(self, language: str, buf: bytes):
         capa.features.extractors.ts.build.ts_build()
Collaborator

hm, let's find a better place for this initialization

Collaborator

a global in this file is a good place to start
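
One way to read the suggestion above is a module-level guard that runs the build step once, instead of on every engine construction. The following is only a minimal sketch under that assumption: `ensure_built`, `_built`, `_build_lock`, `build_calls`, and `BaseEngine` are hypothetical names, and the `ts_build` defined here is a stand-in for `capa.features.extractors.ts.build.ts_build`, whose real body is not shown in this diff.

```python
import threading

# Hypothetical module-level state: runs the build step at most once per process.
_build_lock = threading.Lock()
_built = False
build_calls = 0  # demo-only counter, so the effect is observable


def ts_build() -> None:
    """Stand-in for capa.features.extractors.ts.build.ts_build (body assumed)."""
    global build_calls
    build_calls += 1


def ensure_built() -> None:
    """Call ts_build() on first use only; later calls are no-ops."""
    global _built
    with _build_lock:
        if not _built:
            ts_build()
            _built = True


class BaseEngine:
    """Simplified stand-in for TreeSitterBaseEngine (parsing omitted)."""

    def __init__(self, language: str, buf: bytes):
        ensure_built()  # cheap after the first call
        self.language = language
        self.buf = buf


first = BaseEngine("c_sharp", b"class A {}")
second = BaseEngine("c_sharp", b"class B {}")
```

With this shape, constructing many engines (for example, one per embedded code section) would trigger the build only once. Calling the build function directly at import time would be a simpler variant, at the cost of doing the work even when no engine is ever created.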

         self.language = language
         self.query = QueryBindingFactory.from_language(language)
-        self.import_signatures = capa.features.extractors.ts.sig.load_import_signatures(language)
-        self.path = path
-        with open(self.path, "rb") as f:
-            self.buf = f.read()
+        self.buf = buf
         self.tree = self.parse()

     def parse(self) -> Tree:
@@ -46,19 +40,27 @@ def get_byte_range(self, node: Node) -> bytes:
     def get_range(self, node: Node) -> str:
         return self.get_byte_range(node).decode()

-    def get_address(self, node: Node):
+    def get_address(self, node: Node) -> FileOffsetRangeAddress:
         return FileOffsetRangeAddress(node.start_byte, node.end_byte)

-    def get_default_address(self):
+    def get_default_address(self) -> FileOffsetRangeAddress:
         return self.get_address(self.tree.root_node)


 class TreeSitterExtractorEngine(TreeSitterBaseEngine):
     query: ScriptQueryBinding
     import_signatures: set
+    buf_offset: int
+    namespaces: set[str]
+
+    def __init__(self, language: str, buf: bytes, buf_offset: int = 0, additional_namespaces: set[str] = None):
+        super().__init__(language, buf)
+        self.buf_offset = buf_offset
+        self.import_signatures = capa.features.extractors.ts.sig.load_import_signatures(language)
+        self.namespaces = additional_namespaces if additional_namespaces is not None else set()

-    def __init__(self, language: str, path: str):
-        super().__init__(language, path)
+    def get_address(self, node: Node) -> FileOffsetRangeAddress:
+        return FileOffsetRangeAddress(self.buf_offset + node.start_byte, self.buf_offset + node.end_byte)

     def get_new_objects(self, node: Node) -> List[Tuple[Node, str]]:
         return self.query.new_object.captures(node)
@@ -73,13 +75,13 @@ def get_new_object_ids(self, node: Node) -> Iterator[Node]:
     # TODO: move this elsewhere, does not fit this class
     def get_import_names(self, node: Node) -> Iterator[Tuple[Node, str]]:
         join_names = capa.features.extractors.ts.sig.get_name_joiner(self.language)
-        namespaces = set([self.get_range(ns_node) for ns_node, _ in self.get_namespaces()])
+        self.namespaces = self.namespaces.union(set([self.get_range(ns_node) for ns_node, _ in self.get_namespaces()]))
         for obj_node in self.get_new_object_ids(node):
             obj_name = self.get_range(obj_node)
             if obj_name in self.import_signatures:
                 yield (obj_node, obj_name)
                 continue
-            for namespace in namespaces:
+            for namespace in self.namespaces:
                 obj_name = join_names(namespace, obj_name)
                 if obj_name in self.import_signatures:
                     yield (obj_node, obj_name)
@@ -107,13 +109,13 @@ def get_function_call_ids(self, node: Node) -> Iterator[Node]:
     # TODO: move this elsewhere, does not fit this class
     def get_function_names(self, node: Node) -> Iterator[Tuple[Node, str]]:
         join_names = capa.features.extractors.ts.sig.get_name_joiner(self.language)
-        namespaces = set([self.get_range(ns_node) for ns_node, _ in self.get_namespaces()])
+        self.namespaces = self.namespaces.union(set([self.get_range(ns_node) for ns_node, _ in self.get_namespaces()]))
         for fn_node in self.get_function_call_ids(node):
             fn_name = self.get_range(fn_node)
             if fn_name in self.import_signatures:
                 yield (fn_node, fn_name)
                 continue
-            for namespace in namespaces:
+            for namespace in self.namespaces:
                 fn_name = join_names(namespace, fn_name)
                 if fn_name in self.import_signatures:
                     yield (fn_node, fn_name)
@@ -131,65 +133,73 @@ def get_global_statements(self) -> List[Tuple[Node, str]]:
         return self.query.global_statement.captures(self.tree.root_node)


-@dataclass
-class ASPXPseudoNode:
-    start_byte: int
-    end_byte: int
-
-
 class TreeSitterTemplateEngine(TreeSitterBaseEngine):
     query: TemplateQueryBinding

-    def __init__(self, language: str, path: str):
-        super().__init__(language, path)
+    def __init__(self, buf: bytes):
+        super().__init__(LANG_TEM, buf)

     def get_code_sections(self) -> List[Tuple[Node, str]]:
         return self.query.code.captures(self.tree.root_node)

+    def get_parsed_code_sections(self) -> Iterator[TreeSitterExtractorEngine]:
+        template_namespaces = set(name for _, name in self.get_template_namespaces())
+        for node, _ in self.get_code_sections():
+            yield TreeSitterExtractorEngine(
+                self.identify_language(), self.get_byte_range(node), node.start_byte, template_namespaces
+            )
+
     def get_content_sections(self) -> List[Tuple[Node, str]]:
         return self.query.content.captures(self.tree.root_node)

-    def get_template_namespaces(self) -> Iterator[ASPXPseudoNode]:
+    def identify_language(self) -> str:
         for node, _ in self.get_code_sections():
+            if self.is_c_sharp(node):
+                return LANG_CS
+        return LANG_JS
Collaborator

what if it is neither?

Author (adamstorek, Aug 3, 2022)

I believe there is no easy way to remedy this. As I understand it, the syntax of a template is determined by its templating engine. In other words, there is no easy way to detect from an unknown template which templating engine is being used (asp.net (and if so, which language), razor, ejs, erb, mako, jinja2, django, cheetah, go's html/template, etc.), not to mention that each has its own syntax: some use regular programming languages like C# to embed server logic, while others contain only very rudimentary placeholders or logic.

Here I am assuming that, at the moment, we only support EJS and C# in ASPX as embedded templates. This is because the Tree-sitter embedded-template parser can only parse EJS and ERB (and we are not interested in embedded Ruby at the moment, as far as I know). What's more, the default language for ASPX is VB, so anyone who wants to use C# needs to include an @ Page directive with a Language attribute (see: https://docs.microsoft.com/en-us/previous-versions/aspnet/k33801s3(v=vs.100), https://docs.microsoft.com/en-us/previous-versions/dotnet/netframework-4.0/ydy4x04a(v=vs.100)?redirectedfrom=MSDN, https://docs.microsoft.com/en-us/previous-versions/aspnet/fbdt8kk7(v=vs.100)).

Collaborator

Should we still assume that it's JS whenever it's not CS?
We could raise an Exception instead, or are there other safeguards in place before we get here?
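
The trade-off discussed in this thread can be sketched as follows. This is a hedged illustration rather than the PR's implementation: it assumes that a template with no `@ Page` directive is treated as EJS (JavaScript), that a directive naming C# selects C#, and that any other declared language raises. `UnsupportedTemplateLanguage`, `PAGE_LANGUAGE`, and this standalone `identify_language` are hypothetical names; the value of `LANG_JS` is assumed, while `"c_sharp"` matches the language string used in the test fixtures.

```python
import re
from typing import Iterable

LANG_CS = "c_sharp"     # matches the string passed in tests/fixtures.py
LANG_JS = "javascript"  # assumed value; the real constant is not shown in this diff


class UnsupportedTemplateLanguage(ValueError):
    """Hypothetical error for templates declaring a language we cannot parse."""


# Captures the value of the Language attribute inside an ASPX "@ Page" directive.
PAGE_LANGUAGE = re.compile(rb'@\s*Page\b[^%]*?\bLanguage\s*=\s*"([^"]+)"', re.IGNORECASE)


def identify_language(code_sections: Iterable[bytes]) -> str:
    """Return the script language for a template, given its raw code sections."""
    for section in code_sections:
        match = PAGE_LANGUAGE.search(section)
        if match is None:
            continue
        declared = match.group(1).decode("ascii", errors="replace").lower()
        if declared in ("c#", "cs", "csharp"):
            return LANG_CS
        # A directive exists but names something else (for example VB).
        raise UnsupportedTemplateLanguage(declared)
    # No Page directive at all: assume an EJS template, i.e. JavaScript.
    return LANG_JS
```

Under these assumptions, an ASPX page that declares VB fails loudly instead of being parsed as JavaScript, while directive-free EJS templates keep the current default.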


+    def get_template_namespaces(self) -> Iterator[Tuple[Node, str]]:
+        for node, _ in self.get_code_sections():
             if self.is_aspx_import_directive:
-                ns = self.get_aspx_namespace(node)
-                if ns is not None:
-                    yield ns
+                namespace = self.get_aspx_namespace(node)
+                if namespace is not None:
+                    yield node, namespace

-    def is_aspx(self, node: Node) -> bool:
-        return self.get_byte_range(node).startswith(b"@")
+    def is_c_sharp(self, node: Node) -> bool:
+        return len(re.findall(r'@ .*Page Language\s*=\s*"C#".*'.encode(), self.get_byte_range(node))) > 0

     def is_aspx_import_directive(self, node: Node) -> bool:
         return self.get_byte_range(node).startswith(b"@ Import namespace=")

-    def get_aspx_namespace(self, node: Node) -> Union[ASPXPseudoNode, None]:
+    def get_aspx_namespace(self, node: Node) -> Optional[str]:
         match = re.search(r'@ Import namespace="(.*?)"'.encode(), self.get_byte_range(node))
-        if match is None:
-            return None
-        return ASPXPseudoNode(node.start_byte + match.span()[0], node.start_byte + match.span()[1])
+        return match.group().decode() if match is not None else None


 class TreeSitterHTMLEngine(TreeSitterBaseEngine):
     query: HTMLQueryBinding
+    namespaces: set[str]

-    def __init__(self, language: str, path: str):
-        super().__init__(language, path)
+    def __init__(self, buf: bytes, additional_namespaces: set[str] = None):
+        super().__init__(LANG_HTML, buf)
+        self.namespaces = additional_namespaces if additional_namespaces is not None else set()

     def get_scripts(self) -> List[Tuple[Node, str]]:
         return self.query.script_element.captures(self.tree.root_node)

     def get_attributes(self, node: Node) -> List[Tuple[Node, str]]:
         return self.query.attribute.captures(self.tree.root_node)

-    def get_code_sections_by_language(self) -> Dict[str, List[Node]]:
-        code_sections = defaultdict(list)
+    def get_code_sections(self) -> Iterator[Node]:
         for script_node, _ in self.get_scripts():
             for attribute_node, _ in self.get_attributes(script_node):
-                script_language = self.identify_script_language(attribute_node)
-                code_sections[script_language].append(attribute_node)
-        return code_sections
+                yield attribute_node

+    def get_parsed_code_sections(self) -> Iterator[TreeSitterExtractorEngine]:
+        for node in self.get_code_sections():
+            yield TreeSitterExtractorEngine(self.identify_language(node), self.get_byte_range(node), node.start_byte)
+
-    def identify_script_language(self, node: Node) -> str:
+    def identify_language(self, node: Node) -> str:
         if self.is_server_side_c_sharp(node):
             return LANG_CS
         return LANG_JS
59 changes: 50 additions & 9 deletions capa/features/extractors/ts/extractor.py
@@ -1,37 +1,78 @@
-from typing import Tuple, Union, Iterator
+from typing import List, Tuple, Union, Iterator

+from tree_sitter import Node
+
 import capa.features.extractors.script
 import capa.features.extractors.ts.file
 import capa.features.extractors.ts.engine
 import capa.features.extractors.ts.global_
 import capa.features.extractors.ts.function
-from capa.features.address import NO_ADDRESS, Address, AbsoluteVirtualAddress
-from capa.features.extractors.ts.engine import TreeSitterExtractorEngine
+from capa.features.common import Namespace
+from capa.features.address import NO_ADDRESS, Address, AbsoluteVirtualAddress, FileOffsetRangeAddress
+from capa.features.extractors.script import LANG_TEM, LANG_HTML
+from capa.features.extractors.ts.engine import TreeSitterHTMLEngine, TreeSitterTemplateEngine, TreeSitterExtractorEngine
 from capa.features.extractors.base_extractor import Feature, BBHandle, InsnHandle, FunctionHandle, FeatureExtractor


 class TreeSitterFeatureExtractor(FeatureExtractor):
-    engine: TreeSitterExtractorEngine
+    code_sections: List[TreeSitterExtractorEngine]
+    template_namespaces: set[Tuple[Node, str]]
+    language: str

     def __init__(self, path: str):
         super().__init__()
-        self.engine = TreeSitterExtractorEngine(capa.features.extractors.script.get_language_from_ext(path), path)
+        self.path = path
+        with open(self.path, "rb") as f:
+            buf = f.read()
+
+        self.language = capa.features.extractors.script.get_language_from_ext(path)
+        if self.language == LANG_TEM:
+            self.code_sections, self.template_namespaces = self.extract_code_from_template(buf)
+        elif self.language == LANG_HTML:
+            self.code_sections = list(self.extract_code_from_html(buf))
+        else:
+            self.code_sections = [TreeSitterExtractorEngine(self.language, buf)]
+
+    def extract_code_from_template(self, buf: bytes) -> Tuple[List[TreeSitterExtractorEngine], set[Tuple[Node, str]]]:
+        template_engine = TreeSitterTemplateEngine(buf)
+        template_namespaces = set(template_engine.get_template_namespaces())
+        code_sections = list(template_engine.get_parsed_code_sections())
+
+        additional_namespaces = set(name for _, name in template_namespaces)
+        for section in template_engine.get_content_sections():
+            section_buf = template_engine.get_byte_range(section)
+            code_sections.extend(list(self.extract_code_from_html(section_buf, additional_namespaces)))
+        return code_sections, template_namespaces
+
+    def extract_code_from_html(
+        self, buf: bytes, additional_namespaces: set[str] = None
+    ) -> Iterator[TreeSitterExtractorEngine]:
+        yield from TreeSitterHTMLEngine(buf, additional_namespaces).get_parsed_code_sections()

     def get_base_address(self) -> Union[AbsoluteVirtualAddress, capa.features.address._NoAddress]:
         return NO_ADDRESS

+    def extract_template_namespaces(self) -> Iterator[Tuple[Feature, Address]]:
+        for node, name in self.template_namespaces:
+            yield Namespace(name), FileOffsetRangeAddress(node.start_byte, node.end_byte)
+
     def extract_global_features(self) -> Iterator[Tuple[Feature, Address]]:
         yield from capa.features.extractors.ts.global_.extract_features()

     def extract_file_features(self) -> Iterator[Tuple[Feature, Address]]:
-        yield from capa.features.extractors.ts.file.extract_features(self.engine)
+        if self.language == LANG_TEM:
+            yield from self.extract_template_namespaces()
+        for engine in self.code_sections:
+            yield from capa.features.extractors.ts.file.extract_features(engine)

     def get_functions(self) -> Iterator[FunctionHandle]:
-        for node, _ in self.engine.get_function_definitions():
-            yield FunctionHandle(address=self.engine.get_address(node), inner=node)
+        for engine in self.code_sections:
+            for node, _ in engine.get_function_definitions():
+                yield FunctionHandle(address=engine.get_address(node), inner=node)

     def extract_function_features(self, f: FunctionHandle) -> Iterator[Tuple[Feature, Address]]:
-        yield from capa.features.extractors.ts.function.extract_features(f, self.engine)
+        for engine in self.code_sections:
+            yield from capa.features.extractors.ts.function.extract_features(f, engine)

     def get_basic_blocks(self, f: FunctionHandle) -> Iterator[BBHandle]:
         yield from []
5 changes: 0 additions & 5 deletions capa/features/extractors/ts/file.py
@@ -8,10 +8,6 @@
 from capa.features.extractors.ts.engine import TreeSitterExtractorEngine


-def extract_file_format(engine: TreeSitterExtractorEngine) -> Iterator[Tuple[Feature, Address]]:
-    yield from capa.features.extractors.script.extract_format()
-
-
 def extract_language(engine: TreeSitterExtractorEngine) -> Iterator[Tuple[Feature, Address]]:
     yield from capa.features.extractors.script.extract_language(engine.language, engine.get_default_address())

@@ -52,7 +48,6 @@ def extract_features(engine: TreeSitterExtractorEngine) -> Iterator[Tuple[Featur


 FILE_HANDLERS = (
-    extract_file_format,
     extract_file_function_names,
     extract_file_import_names,
     extract_file_integer_literals,
6 changes: 5 additions & 1 deletion capa/features/extractors/ts/global_.py
@@ -19,4 +19,8 @@ def extract_features() -> Iterator[Tuple[Feature, Address]]:
             yield feature, addr


-GLOBAL_HANDLERS = (extract_arch, extract_os)
+def extract_file_format() -> Iterator[Tuple[Feature, Address]]:
+    yield from capa.features.extractors.script.extract_format()
+
+
+GLOBAL_HANDLERS = (extract_arch, extract_os, extract_file_format)
8 changes: 5 additions & 3 deletions tests/fixtures.py
@@ -172,10 +172,10 @@ def get_dnfile_extractor(path):


 @lru_cache(maxsize=1)
-def get_ts_extractor_engine(language, path):
+def get_ts_extractor_engine(language, buf):
     import capa.features.extractors.ts.engine

-    return capa.features.extractors.ts.engine.TreeSitterExtractorEngine(language, path)
+    return capa.features.extractors.ts.engine.TreeSitterExtractorEngine(language, buf)


 @lru_cache(maxsize=1)
@@ -963,4 +963,6 @@ def _692f_dotnetfile_extractor():

 @pytest.fixture
 def cs_f397cb_extractor_engine():
-    return get_ts_extractor_engine("c_sharp", get_data_path_by_name("cs_f397cb"))
+    with open(get_data_path_by_name("cs_f397cb"), "rb") as f:
+        buf = f.read()
+    return get_ts_extractor_engine("c_sharp", buf)
1 change: 0 additions & 1 deletion tests/test_ts.py
@@ -17,7 +17,6 @@ def do_test_ts_engine_init(engine: TreeSitterExtractorEngine):
     assert engine.language == LANG_CS
     assert isinstance(engine.query, QueryBinding)
     assert isinstance(engine.import_signatures, set) and len(engine.import_signatures) > 0
-    assert isinstance(engine.path, str) and len(engine.path) > 0
     assert isinstance(engine.buf, bytes) and len(engine.buf) > 0
     assert isinstance(engine.tree, Tree)
     assert isinstance(engine.get_default_address(), FileOffsetRangeAddress)