Basic terminology

Preface

This article is the first in the author's series of study notes on static analysis for security. The series will continue with specific topics such as taint analysis, data flow analysis, and control flow analysis, along with hands-on analysis of well-regarded open-source projects and an introduction to concrete implementation in DevSecOps.

One-sentence description

Static analysis is the process of detecting errors, vulnerabilities, security risks, and other potential issues in code, without running it, by analyzing the structure, syntax, and semantics of the source code, usually with the help of static analysis tools.

Basic terminology

Lexical Analysis

The process of decomposing source code into basic lexical units, called tokens. The component that performs this task is the lexical analyzer, or lexer.
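
As a rough illustration of what a lexer does, the sketch below splits a small arithmetic statement into tokens with Python's `re` module. The token classes (`NUMBER`, `IDENT`, `OP`) and the `tokenize` helper are assumptions made for this example, not the token set of any real compiler.

```python
import re

# Hypothetical token classes for a tiny expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Split source text into (kind, text) tokens, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for kind, text in tokenize("d = t2 - a"):
    print(kind, text)
```

Characters that match no token class raise a `SyntaxError`, which is roughly how a real lexer reports a lexical error.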

Syntax Analysis

Converts the token stream produced by lexical analysis into an abstract syntax tree (AST) or another intermediate representation for further analysis.

Semantic Analysis

Examines the source code against the rules of the language to capture its meaning and to detect possible errors, inconsistencies, and potential issues.

Abstract Syntax Tree (AST)

A data structure commonly used in programming language processing and static analysis. It represents the organization and grammatical relationships of the code as a tree.
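
To make the tree shape concrete, the sketch below uses Python's standard `ast` module to parse one statement and print its node types with indentation. The `node_types` helper is a name invented for this example, and ASTs produced by other languages and tools will use different node names.

```python
import ast

source = "d = (a + b) * c - a"
tree = ast.parse(source)


def node_types(node, depth=0):
    """Return (depth, node type name) pairs in pre-order."""
    rows = [(depth, type(node).__name__)]
    for child in ast.iter_child_nodes(node):
        rows.extend(node_types(child, depth + 1))
    return rows


# Indentation shows the parent/child relationship between nodes.
for depth, name in node_types(tree):
    print("  " * depth + name)
```

`ast.dump(tree, indent=2)` (Python 3.9 and later) gives a more detailed view that also includes identifiers and constants.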

Intermediate Representation (IR)

An intermediate form of code used in compilers and interpreters. It is an abstract representation produced after the source code has gone through syntax and semantic analysis, and it is designed to make optimization, transformation, and target code generation easier. Because an IR is largely independent of any specific hardware platform or source language, it helps compilers and interpreters share optimizations and support multiple platforms.

Three-address code

An intermediate code form in which each instruction has at most three operands (addresses): usually two source operands that take part in the operation and a third operand that stores the result.
Each instruction therefore consists of an operator, operand 1 (operand1), operand 2 (operand2), and the location that receives the result (result).

t1 = a + b
t2 = t1 * c
d = t2 - a
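
The following is a minimal sketch of how a nested expression can be lowered to three-address code, assuming a single assignment whose right-hand side uses only names, constants, and the four basic arithmetic operators. It reuses Python's `ast` module as the parser; `to_three_address` and the `t1`, `t2`, ... temporaries are hypothetical names for this example.

```python
import ast
import itertools

OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def to_three_address(statement):
    """Lower one assignment such as 'd = (a + b) * c - a' to three-address code."""
    assign = ast.parse(statement).body[0]
    temps = (f"t{i}" for i in itertools.count(1))
    code = []

    def lower(expr, target=None):
        # Names and constants are already valid operands.
        if isinstance(expr, ast.Name):
            return expr.id
        if isinstance(expr, ast.Constant):
            return repr(expr.value)
        if isinstance(expr, ast.BinOp) and type(expr.op) in OPS:
            left = lower(expr.left)
            right = lower(expr.right)
            # Inner results go to fresh temporaries, the outermost to the target.
            result = target or next(temps)
            code.append(f"{result} = {left} {OPS[type(expr.op)]} {right}")
            return result
        raise NotImplementedError(ast.dump(expr))

    target = assign.targets[0].id
    if isinstance(assign.value, ast.BinOp):
        lower(assign.value, target)
    else:
        code.append(f"{target} = {lower(assign.value)}")
    return code


for line in to_three_address("d = (a + b) * c - a"):
    print(line)
```

For the input `d = (a + b) * c - a` this produces the same three instructions as the example above.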

Static Single Assignment (SSA)

An intermediate code form in which each variable is assigned exactly once. Whenever the original program assigns to a variable again, a new version of the variable is introduced, usually named by appending a unique identifier to the original name. This property simplifies data flow analysis, optimization, and code generation.

x1 = 1
y1 = x1 + 2
x2 = 3
z1 = x2 * y1
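
The renaming step can be sketched for straight-line code as follows. This is a simplified illustration under two assumptions: the input has no branches (so no phi functions are needed to merge versions at join points), and variable names do not already end in digits. `to_ssa` is a hypothetical helper name.

```python
import re
from collections import defaultdict


def to_ssa(lines):
    """Rename straight-line assignments so every variable is assigned once."""
    version = defaultdict(int)  # latest version number of each variable

    def current(match):
        name = match.group()
        # A variable read before any assignment keeps version 0 (an input).
        return f"{name}{version[name]}"

    result = []
    for line in lines:
        target, expr = (part.strip() for part in line.split("=", 1))
        # Rename the uses first: they refer to versions defined earlier.
        new_expr = re.sub(r"\b[A-Za-z_]\w*\b", current, expr)
        version[target] += 1
        result.append(f"{target}{version[target]} = {new_expr}")
    return result


for line in to_ssa(["x = 1", "y = x + 2", "x = 3", "z = x * y"]):
    print(line)
```

The second assignment to `x` becomes `x2`, so the later use in `z` unambiguously refers to the value 3 rather than 1.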

Data Flow Graph (DFG)

Describes the data flow and data dependency relationships in a program.
Nodes represent operations or calculations in the program, such as variable definitions, assignments, and arithmetic. Edges represent the transfer of data, and each edge is directed to show which way the data flows.
Used in analyses such as constant propagation, copy propagation, and live variable analysis.
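
As a small sketch of the idea, the function below derives data-dependency edges from straight-line three-address code: an edge `(i, j, var)` means instruction `j` reads the value of `var` written by instruction `i`. The instruction format (`target = expression` strings) and the name `data_dependencies` are assumptions for this example.

```python
import re


def data_dependencies(lines):
    """Return edges (i, j, var): instruction j reads var as defined by instruction i."""
    last_def = {}  # variable -> index of the instruction that last defined it
    edges = []
    for index, line in enumerate(lines):
        target, expr = (part.strip() for part in line.split("=", 1))
        for var in sorted(set(re.findall(r"\b[A-Za-z_]\w*\b", expr))):
            if var in last_def:
                edges.append((last_def[var], index, var))
        last_def[target] = index
    return edges


code = ["t1 = a + b", "t2 = t1 * c", "d = t2 - a"]
for src, dst, var in data_dependencies(code):
    print(f"{code[src]!r} -> {code[dst]!r} via {var}")
```

Variables such as `a`, `b`, and `c` that are never defined in the listing are treated as inputs and produce no edge.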

Control Flow Graph (CFG)

Describes the control flow of a program, that is, the order in which statements execute and where execution branches.
Nodes represent basic blocks, each containing a series of statements that execute sequentially. Edges represent transfers of control, that is, jumps between basic blocks. An edge that leaves a conditional branch carries the condition under which that jump is taken.
Used to analyze execution paths, conditional branches, and loop structures.

Basic Block

A contiguous sequence of statements with a single entry point and a single exit point: control enters only at the first statement and leaves only at the last.
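
The classic way to find basic blocks is to mark "leaders" (the first instruction, every jump target, and every instruction following a jump) and cut the code at each leader. The sketch below does this for a made-up instruction format in which jump targets are instruction indices; the tuple layout and the name `build_cfg` are assumptions for illustration, and the input is assumed to be non-empty.

```python
def build_cfg(instructions):
    """Split indexed instructions into basic blocks and connect them.

    Each instruction is a tuple: ("stmt", text), ("goto", target),
    ("if", condition, target) or ("return",).
    """
    n = len(instructions)
    leaders = {0}
    for i, ins in enumerate(instructions):
        if ins[0] in ("goto", "if"):
            leaders.add(ins[-1])          # the jump target starts a block
        if ins[0] in ("goto", "if", "return") and i + 1 < n:
            leaders.add(i + 1)            # so does whatever follows a jump
    starts = sorted(leaders)
    blocks = {s: list(range(s, e)) for s, e in zip(starts, starts[1:] + [n])}

    edges = []
    for start, body in blocks.items():
        last = instructions[body[-1]]
        following = body[-1] + 1
        if last[0] == "goto":
            edges.append((start, last[1], "always"))
        elif last[0] == "if":
            edges.append((start, last[2], last[1]))
            if following < n:
                edges.append((start, following, f"not ({last[1]})"))
        elif last[0] != "return" and following < n:
            edges.append((start, following, "always"))
    return blocks, edges


# A loop that sums 0..n-1, written with explicit jumps.
LOOP = [
    ("stmt", "i = 0"),        # 0
    ("if", "i >= n", 5),      # 1
    ("stmt", "s = s + i"),    # 2
    ("stmt", "i = i + 1"),    # 3
    ("goto", 1),              # 4
    ("return",),              # 5
]
blocks, edges = build_cfg(LOOP)
print(blocks)
print(edges)
```

Blocks are keyed by the index of their first instruction, and the back edge from the loop body to the loop test is what reveals the loop structure in the graph.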

Call Graph (CG)

Describes the function call structure of a program.
Nodes represent functions, and edges represent call relationships between functions.
Used to understand direct and indirect calls between functions, trace call paths, follow the execution flow of the program, and analyze dependencies between functions.
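
A minimal call graph builder for Python source might look like the sketch below. It only resolves direct calls by plain name (for example `helper(x)`), so method calls, callbacks, and dynamic dispatch are ignored; real tools need pointer or type analysis to handle those. The sample program and the helper names are invented for this example.

```python
import ast
from collections import defaultdict

SAMPLE = '''
def read_input():
    return input()

def build_query(name):
    return "SELECT * FROM users WHERE name = '" + name + "'"

def handler():
    return run(build_query(read_input()))

def run(query):
    print(query)
'''


def build_call_graph(source):
    """Map each top-level function name to the set of names it calls directly."""
    graph = defaultdict(set)
    for func in ast.parse(source).body:
        if not isinstance(func, ast.FunctionDef):
            continue
        graph[func.name]  # make sure functions without calls appear as nodes
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                graph[func.name].add(node.func.id)
    return dict(graph)


def reachable(graph, start):
    """All functions reachable from start through direct or indirect calls."""
    seen, stack = set(), [start]
    while stack:
        for callee in graph.get(stack.pop(), ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen


graph = build_call_graph(SAMPLE)
print(graph)
print(reachable(graph, "handler"))
```

`reachable` answers the "indirect call" question: `handler` never calls `input` itself, yet `input` is reachable from it through `read_input`.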

Program Dependence Graph (PDG)

Represents the data dependencies and control dependencies of a program in a single graph.
Each node represents a statement in the program, and each edge represents a data or control dependency between two statements.

System Dependence Graph (SDG)

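In program analysis, the SDG extends the PDG from a single procedure to a whole program: it connects the PDGs of the individual procedures with edges for calls, parameter passing, and return values.
The term is also used more broadly for a graph of the dependencies between the components of a system, including software components, hardware devices, network connections, databases, etc. In that usage each node represents a component and each edge represents a dependency between components.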

Code Property Graph (CPG)

An intermediate representation of source code that merges the AST, CFG, and PDG into a single graph. It is a recent and widely used graph representation in static vulnerability analysis of source code.

Analysis Methods

Taint Analysis

Tracks how sensitive or untrusted data propagates and is used in a program in order to detect potential data leaks, injection attacks, and other security vulnerabilities.
Taint analysis can be abstracted as a triple <sources, sinks, sanitizers>:
- source: the taint source, where untrusted or confidential data is directly introduced into the system.
- sink: the taint sink, where a security-sensitive operation is performed directly (a violation of data integrity) or private data is leaked to the outside world (a violation of data confidentiality).
- sanitizer: harmless processing, meaning that after measures such as data encryption or removal of harmful operations, the propagation of the data no longer threatens the information security of the software system.
Taint analysis determines whether data introduced at a taint source can reach a taint sink without passing through harmless processing. If it cannot, the system is information-flow secure; otherwise, the system has a security issue such as private data leakage or a dangerous data operation.
Figure: simplified taint analysis process

Explicit Flow Analysis
- Analyzes how taint marks propagate along the data dependencies between variables in the program
Implicit Flow Analysis
- Analyzes how taint marks propagate along the control dependencies between variables in the program
Harmless Processing (Sanitization)
- After being processed by such a module, the data no longer carries sensitive information, or operations on the data no longer pose a threat to the system
- For example, an input validation module should be identified as a harmless processing module, as should XSS Auditor, CSRF Protect, etc.
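
The source, sink, and sanitizer roles can be sketched on a tiny straight-line program as follows. This is an illustration only: the function names `read_request`, `run_sql`, and `escape_sql` are hypothetical, the program is a hand-written list of tuples rather than parsed code, and only explicit flows (data dependencies) are tracked, not implicit flows through control dependencies.

```python
# Hypothetical policy: which functions count as sources, sinks, and sanitizers.
SOURCES = {"read_request"}
SINKS = {"run_sql"}
SANITIZERS = {"escape_sql"}

# Each entry means target = function(*args). target is None for a bare call;
# function is None for plain data flow such as a copy or concatenation.
PROGRAM = [
    ("name", "read_request", []),          # 1: source
    ("query", None, ["prefix", "name"]),   # 2: query = prefix + name
    (None, "run_sql", ["query"]),          # 3: tainted data reaches the sink
    ("safe", "escape_sql", ["name"]),      # 4: sanitizer
    ("query2", None, ["prefix", "safe"]),  # 5: built from sanitized data
    (None, "run_sql", ["query2"]),         # 6: no finding expected
]


def find_taint_flows(program):
    """Report (line, sink, argument) for sink calls that receive tainted data."""
    tainted = set()
    findings = []
    for line, (target, function, args) in enumerate(program, start=1):
        if function in SINKS:
            findings.extend((line, function, arg) for arg in args if arg in tainted)
        if target is None:
            continue
        if function in SOURCES:
            tainted.add(target)
        elif function in SANITIZERS:
            tainted.discard(target)
        elif any(arg in tainted for arg in args):
            tainted.add(target)       # explicit flow: taint follows the data
        else:
            tainted.discard(target)   # overwritten with untainted data
    return findings


print(find_taint_flows(PROGRAM))
```

Only the first `run_sql` call is reported, because the second one receives data that passed through the sanitizer.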

Symbolic Execution

Used to automatically explore the possible execution paths of a program. Concrete input values are replaced with symbolic values (Symbolic Value), the branch conditions along each path are collected as constraints over those symbols, and a constraint solver then generates concrete input values that satisfy the constraints and therefore drive execution down that path.
In short: for each path that reaches a given program point, work out which values the inputs must have.
Key Concepts and Steps
- Symbolic Input: Replace the input values of the program with symbolic values, where each symbolic value stands for a whole class of possible concrete values. For example, a symbolic variable x replaces a concrete input.
- Symbolic Execution Path: Execute the program on symbolic values and fork into different execution paths at conditional statements (such as if statements and loop conditions).
- Constraint Generation: During symbolic execution, collect the constraints along each path. For example, the condition of an if statement becomes a constraint on the path that takes the corresponding branch.
- Constraint Solving: Use a constraint solver (Constraint Solver) to solve the collected constraints and obtain concrete input values that satisfy them.
- Path Coverage and Error Detection: Exploring many paths covers different execution scenarios and helps find potential errors and vulnerabilities. For example, unreachable code or an error condition found on a path may indicate a bug in the program.
Figure: symbolic execution process
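
The steps can be sketched end to end on a toy program. In this illustration the program is written by hand as a decision tree, path constraints are plain Python expression strings, and the "solver" is a brute-force search over a small integer range standing in for a real constraint solver such as an SMT solver. All names (`PROGRAM`, `explore`, `solve`) are hypothetical.

```python
from itertools import product

# Program under analysis, as a decision tree:
#   if x > 10:
#       if y == 2 * x:
#           bug()
PROGRAM = ("if", "x > 10",
           ("if", "y == 2 * x", ("end", "bug"), ("end", "ok")),
           ("end", "ok"))


def explore(node, path=()):
    """Yield (outcome, path constraints) for every path through the tree."""
    if node[0] == "end":
        yield node[1], path
        return
    _, condition, then_branch, else_branch = node
    yield from explore(then_branch, path + (condition,))
    yield from explore(else_branch, path + (f"not ({condition})",))


def solve(constraints, domain=range(-50, 51)):
    """Brute-force stand-in for a constraint solver: return one model or None."""
    for x, y in product(domain, repeat=2):
        # eval is applied only to the fixed constraint strings defined above.
        if all(eval(c, {}, {"x": x, "y": y}) for c in constraints):
            return {"x": x, "y": y}
    return None


for outcome, constraints in explore(PROGRAM):
    print(outcome, constraints, solve(constraints))
```

Each printed line pairs a path with one concrete input that triggers it. The path ending in `bug` requires both `x > 10` and `y == 2 * x`, which random testing would be unlikely to hit by chance.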

Pointer Analysis

Infers the points-to relationships of pointer variables in a program, that is, determines the objects or addresses that each pointer variable may point to.
Helps detect potential memory safety issues such as null pointer dereferences and uses of wild (dangling) pointers.
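
One well-known family of pointer analyses is inclusion-based (often called Andersen-style): each statement contributes a subset constraint between points-to sets, and the constraints are iterated until nothing changes. The sketch below is a simplified version for four C-like statement forms. The tuple encoding and the function name `points_to` are assumptions for this example, and the text above does not say which algorithm a given tool uses.

```python
from collections import defaultdict


def points_to(statements):
    """Flow-insensitive, inclusion-based points-to analysis.

    Statements are tuples over variable names:
      ("addr", p, x)   p = &x
      ("copy", p, q)   p = q
      ("load", p, q)   p = *q
      ("store", p, q)  *p = q
    """
    pts = defaultdict(set)
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for kind, lhs, rhs in statements:
            if kind == "addr":
                updates = [(lhs, {rhs})]
            elif kind == "copy":
                updates = [(lhs, pts[rhs])]
            elif kind == "load":
                updates = [(lhs, pts[target]) for target in list(pts[rhs])]
            elif kind == "store":
                updates = [(target, pts[rhs]) for target in list(pts[lhs])]
            else:
                raise ValueError(kind)
            for var, new in updates:
                if not new <= pts[var]:
                    pts[var] |= new
                    changed = True
    return {var: targets for var, targets in pts.items() if targets}


EXAMPLE = [
    ("addr", "p", "a"),   # p = &a
    ("addr", "q", "b"),   # q = &b
    ("addr", "r", "p"),   # r = &p
    ("store", "r", "q"),  # *r = q, so p may also point to b
    ("load", "s", "r"),   # s = *r
]
print(points_to(EXAMPLE))
```

The result is a "may point to" over-approximation: `p` is reported as possibly pointing to both `a` and `b`, even though at run time it points to only one of them at any moment.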

Analysis Terms

Interprocedural Analysis

Analyzes program behavior and properties across function (procedure) boundaries. It tracks the call relationships between functions, passes information from one function to another, and analyzes the behavior of the program as a whole.

Intraprocedural Analysis

Analyzes program behavior and properties within a single function (procedure). It focuses on the data flow, control flow, and semantic structure inside the function to identify errors, vulnerabilities, and potential issues in that function.

Context-Sensitive Analysis

Distinguishes calls to the same function made from different call sites and infers program behavior and properties according to the specific calling context, which gives a more accurate picture of the program's semantics and behavior.

Context-Insensitive Analysis

Treats each call and return as a 'goto' operation and ignores the call site, the function's parameter values, etc. Information from all call sites of a function is merged, which costs precision but makes the analysis fast, so it suits quick detection of potential security vulnerabilities and preliminary risk assessment.
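
The difference between the two can be sketched with a single hypothetical function `identity(v)` that returns its argument and is called from two places. The toy analysis below tracks which abstract values ("tainted" or "clean") may reach each result variable; using the call site as the context is one common choice, assumed here for simplicity.

```python
from collections import defaultdict

# Two calls to the same hypothetical function identity(v), which returns v:
#   a = identity("tainted")   # call site 1
#   b = identity("clean")     # call site 2
CALL_SITES = [(1, "a", "tainted"), (2, "b", "clean")]


def analyze(call_sites, context_sensitive):
    """Return the abstract values that may reach each result variable."""
    param = defaultdict(set)  # context -> values that may flow into v
    for site, _, argument in call_sites:
        context = site if context_sensitive else "any"
        param[context].add(argument)
    results = {}
    for site, target, _ in call_sites:
        context = site if context_sensitive else "any"
        results[target] = set(param[context])  # identity returns its parameter
    return results


print("insensitive:", analyze(CALL_SITES, context_sensitive=False))
print("sensitive:  ", analyze(CALL_SITES, context_sensitive=True))
```

Without contexts, both `a` and `b` are reported as possibly tainted, which is a false positive for `b`. With one context per call site, the two calls are kept apart.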

Flow-Insensitive Analysis

Does not consider the execution order of statements. Statements are simply processed one after another in their textual order, from top to bottom, and the branches in the program are ignored.
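
A small sketch of what ignoring statement order means in practice: the flow-insensitive function below keeps one summary per variable for the whole program, while the second function, included only for contrast, tracks the state at each program point. The three-instruction program and both function names are made up for this example.

```python
# x = "tainted"; x = "clean"; use(x)
PROGRAM = [("assign", "x", "tainted"), ("assign", "x", "clean"), ("use", "x")]


def flow_insensitive(program):
    """One summary for the whole program: everything ever assigned to a variable."""
    values = {}
    for op, var, *rest in program:
        if op == "assign":
            values.setdefault(var, set()).add(rest[0])
    return values


def flow_sensitive(program):
    """Track state per program point: a later assignment replaces the earlier one."""
    state, at_use = {}, []
    for op, var, *rest in program:
        if op == "assign":
            state[var] = {rest[0]}
        else:
            at_use.append((var, set(state.get(var, set()))))
    return at_use


print(flow_insensitive(PROGRAM))
print(flow_sensitive(PROGRAM))
```

The flow-insensitive result says `x` may be tainted anywhere in the program, whereas the order-aware version sees that `x` has already been overwritten with clean data by the time it is used.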

Flow-Sensitive Analysis