Understanding pyc files to identify malicious programs
Keywords: Program Analysis, System Security.
Embedding extra payload in bytecodes: Before Python 3.6, an instruction in the bytecode occupies either 1 byte or 3 bytes (the opcode plus 2 bytes for the argument). Since Python 3.6, every instruction occupies two bytes, and the second byte is set to zero when the instruction takes no argument. Because the interpreter ignores that second byte for instructions without an argument, attackers can overwrite it with their own data, such as passwords or payload fragments, without affecting execution. Tools such as Stegosaurus use this dead zone to hide information.
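The dead-zone idea can be illustrated with a minimal sketch that scans two-byte instructions and reports argument bytes that are non-zero although the opcode takes no argument. The function name `nonzero_ignored_args` and the synthetic byte strings are hypothetical, and the sketch assumes the Python 3.6 to 3.10 layout, where `dis.HAVE_ARGUMENT` cleanly separates the two opcode groups; newer interpreters add inline cache entries and make that threshold less reliable.

```python
import dis


def nonzero_ignored_args(code_bytes, have_argument=dis.HAVE_ARGUMENT):
    """Return (offset, value) for each non-zero argument byte whose
    opcode takes no argument (two-byte instruction layout assumed)."""
    hits = []
    for offset in range(0, len(code_bytes) - 1, 2):
        opcode = code_bytes[offset]
        arg = code_bytes[offset + 1]
        if opcode < have_argument and arg != 0:
            hits.append((offset + 1, arg))
    return hits


# Synthetic instruction streams; opcode numbers are illustrative only and
# the threshold is pinned to 90 so the result does not depend on the
# interpreter version running this sketch.
clean = bytes([1, 0, 100, 5, 83, 0])
stego = bytes([1, ord("h"), 100, 5, 83, ord("i")])

print(nonzero_ignored_args(clean, have_argument=90))
print(nonzero_ignored_args(stego, have_argument=90))
```

A non-empty result is only a hint that something was written into the ignored bytes; it does not by itself prove that a payload is present.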
Destroying the pyc files with other instructions: Attackers can modify header fields such as the timestamp, the magic number that encodes the Python version, or the size of the source file, so that decompilers and disassemblers fail. Some attackers also replace instructions in the code object to confuse the analysis.
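As a rough illustration of which header fields can be tampered with, the sketch below parses the 16-byte header that CPython 3.7 and later write (magic number, flags, then either timestamp and source size or a source hash). The function name `parse_pyc_header` and the returned dictionary keys are assumptions made for this example; older Python versions use shorter headers that this sketch does not cover.

```python
import importlib.util
import struct


def parse_pyc_header(data):
    """Parse the 16-byte pyc header layout of CPython 3.7+."""
    if len(data) < 16:
        raise ValueError("truncated pyc header")
    magic = data[:4]
    flags, field1, field2 = struct.unpack("<III", data[4:16])
    info = {
        "magic": magic,
        # True only if the file was produced by this interpreter version.
        "magic_matches_interpreter": magic == importlib.util.MAGIC_NUMBER,
        # Every CPython magic number ends with carriage return + newline.
        "magic_well_formed": magic[2:4] == b"\r\n",
        "flags": flags,
    }
    if flags & 0x1:
        # Hash-based pyc: the last 8 bytes hold a hash of the source file.
        info["source_hash"] = data[8:16]
    else:
        info["mtime"] = field1
        info["source_size"] = field2
    return info


# A hand-built header for demonstration; the values are made up.
header = importlib.util.MAGIC_NUMBER + struct.pack("<III", 0, 1_600_000_000, 42)
print(parse_pyc_header(header))
```

A malformed magic number or an implausible timestamp or size is a cheap first signal that the header was edited before any decompiler is run.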
Obfuscation in bytecodes: Malware authors can choose from many libraries that obfuscate Python code and make it much harder to read. Advanced attackers also insert useless instructions that do not affect execution and still disassemble correctly, but that confuse analysts. In most cases, these instructions cause decompiling tools to fail.
We use four tools in the first step of malware detection. To decompile Python bytecode, we use uncompyle6 and decompyle3. uncompyle6 handles Python bytecode from versions 1.4, 2.1-2.7, and 3.0-3.8, as well as some PyPy versions. However, control-flow handling problems cause bugs on newer Python versions, so decompyle3 works better for Python 3.7 and later. To disassemble Python bytecode, we use the official dis module and python-xdis. The dis module can only disassemble bytecode produced by the same Python version as the running interpreter, whereas python-xdis handles bytecode from virtually every release of Python and some releases of PyPy. We use a different analysis method for each of the following cases.
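The same-version restriction of the dis module can be seen in a small standard-library sketch. It builds a pyc image in memory, checks the magic number, unmarshals the code object, and lists the opcode names. The helper names `make_pyc_bytes` and `load_code` are hypothetical, and the 16-byte header size assumes CPython 3.7 or later; the cross-version tools named above are not used here.

```python
import dis
import importlib.util
import marshal
import struct


def make_pyc_bytes(source, filename="<example>"):
    """Build an in-memory pyc image for the running interpreter."""
    code = compile(source, filename, "exec")
    header = importlib.util.MAGIC_NUMBER + struct.pack("<III", 0, 0, len(source))
    return header + marshal.dumps(code)


def load_code(pyc_bytes, header_size=16):
    """Return the code object, refusing bytecode from another version."""
    if pyc_bytes[:4] != importlib.util.MAGIC_NUMBER:
        raise ValueError("bytecode was produced by a different Python version")
    return marshal.loads(pyc_bytes[header_size:])


pyc = make_pyc_bytes("print('hello')\n")
code = load_code(pyc)
opnames = [ins.opname for ins in dis.get_instructions(code)]
print(opnames)
```

Because `marshal.loads` executes nothing but should still only be fed data in an isolated analysis environment, a sketch like this belongs in a sandbox when the input is a suspect file.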
First case: the attacker hides a complete payload in the fragmented redundant space of the Python bytecode file without changing the file size, so decompiling and disassembling do not help. Instead, we detect changes in the file hash and use tools that inspect the redundant space in the bytecode to determine whether new information has been added.
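The hash check in this case can be sketched as follows, assuming a trusted reference copy of the bytecode is available. The function names and the synthetic byte strings are hypothetical; the point is that the length stays the same while the digest changes, and that the differing offsets show where data was written.

```python
import hashlib


def sha256_hex(data):
    """Hex digest used to compare a suspect file with a trusted copy."""
    return hashlib.sha256(data).hexdigest()


def changed_offsets(reference, suspect):
    """List byte offsets that differ between two equal-length images."""
    if len(reference) != len(suspect):
        raise ValueError("sizes differ; this sketch covers same-size edits only")
    return [i for i, (a, b) in enumerate(zip(reference, suspect)) if a != b]


# Synthetic bytecode images; the suspect has one overwritten byte.
reference = bytes([1, 0, 100, 5, 83, 0])
suspect = bytes([1, 0x41, 100, 5, 83, 0])

print(len(reference) == len(suspect))
print(sha256_hex(reference) == sha256_hex(suspect))
print(changed_offsets(reference, suspect))
```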
Second case: attackers usually modify bytecode instructions or add extra bytecode to make the file more complicated. We first decompile the pyc file to locate the modified part, then compare it with the file compiled from the original source to determine the type of attack and flag it. If a pyc file can be neither decompiled nor disassembled, it may have been deliberately corrupted; in that case we analyze the instructions or the header according to the reported errors.
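One way to approximate the comparison against a reference build is to diff the disassembly listings of two code objects. The sketch below is a simplified stand-in that works at the instruction level rather than on decompiled source; the helper names `listing` and `bytecode_diff` are hypothetical, and both code objects are assumed to come from the running Python version.

```python
import difflib
import dis
import types


def listing(code):
    """Flatten a code object and its nested code objects into text lines."""
    lines = []
    for ins in dis.get_instructions(code):
        if isinstance(ins.argval, types.CodeType):
            # Avoid memory addresses, which differ between compilations.
            arg = "<code " + ins.argval.co_name + ">"
        else:
            arg = ins.argrepr
        lines.append(f"{code.co_name}: {ins.opname} {arg}".rstrip())
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            lines.extend(listing(const))
    return lines


def bytecode_diff(reference, suspect):
    """Return unified-diff lines; an empty list means no difference."""
    return list(difflib.unified_diff(
        listing(reference), listing(suspect),
        "reference", "suspect", lineterm=""))


original = compile("def f(x):\n    return x + 1\n", "m.py", "exec")
modified = compile("def f(x):\n    return x - 1\n", "m.py", "exec")
for line in bytecode_diff(original, modified):
    print(line)
```

The diff output points at the instructions that changed, which is the starting point for deciding what kind of modification was made.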
Third case: because obfuscation in bytecode is difficult to identify, we first try to both decompile and disassemble the bytecode. Obfuscated files disassemble successfully but decompile incorrectly. To locate the offending instructions, we analyze the opcodes in the code object to determine whether useless instructions were added or whether some instructions make the execution confusing. After these steps, we can design algorithms to replace these instructions and clean them out.
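A first pass over the opcodes might look like the sketch below, which flags two simple patterns as junk candidates: a `NOP`, and a `LOAD_CONST` immediately followed by `POP_TOP`. These two rules and the names `find_junk` and `instructions_of` are assumptions chosen for illustration, not the project's actual rule set; compilers can emit such instructions legitimately, so the result is a list of candidates to review.

```python
import dis


def instructions_of(code):
    """Convert a code object into (offset, opname) pairs."""
    return [(ins.offset, ins.opname) for ins in dis.get_instructions(code)]


def find_junk(instructions):
    """Return offsets of instructions matching the two junk patterns."""
    junk = []
    i = 0
    while i < len(instructions):
        offset, opname = instructions[i]
        if opname == "NOP":
            junk.append(offset)
        elif (opname == "LOAD_CONST"
              and i + 1 < len(instructions)
              and instructions[i + 1][1] == "POP_TOP"):
            # Pushing a constant and discarding it has no effect.
            junk.extend([offset, instructions[i + 1][0]])
            i += 1  # skip the POP_TOP that was just consumed
        i += 1
    return junk


# Synthetic instruction list so the result is independent of the
# interpreter version running this sketch.
sample = [
    (0, "LOAD_NAME"),
    (2, "NOP"),
    (4, "LOAD_CONST"),
    (6, "POP_TOP"),
    (8, "LOAD_CONST"),
    (10, "RETURN_VALUE"),
]
print(find_junk(sample))
```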
Over the past decade, an ever-increasing amount of malware has been written in interpreted languages such as Python. Its low barrier to entry, ease of use, rapid development process, and massive library collection have made Python attractive to millions of developers, and its popularity continues to grow. As a result, Python is likely to be used in more cyber-attacks, and more malicious code will be discovered in the future. Therefore, we want to detect malware and malicious code in Python source code and files.
- We analyze malicious Python code from the perspective of Python bytecode, examining both the bytecode instructions and the file format to detect malware.
- We use new algorithms and tools to automatically find the modified parts of the bytecode and distinguish the types of attack.
- To flag malicious instructions, we scan the whole code object and replace each of them with the same useless instruction, so that they can all be removed together.
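The replacement step in the last point could be sketched as overwriting each flagged instruction with `NOP`, which keeps the code length and all jump offsets unchanged. The function name `nop_out` and the synthetic byte string are hypothetical, and the sketch assumes the two-byte instruction layout of Python 3.6 and later. Physically removing the `NOP` instructions afterwards would also require recomputing jump targets and the line table, which this sketch does not attempt.

```python
import dis

# Opcode number of NOP in the running interpreter.
NOP = dis.opmap["NOP"]


def nop_out(code_bytes, offsets):
    """Overwrite the instructions at the given offsets with NOP."""
    patched = bytearray(code_bytes)
    for offset in offsets:
        patched[offset] = NOP
        patched[offset + 1] = 0  # clear the argument byte as well
    return bytes(patched)


# Synthetic instruction stream; opcode numbers are illustrative only.
raw = bytes([100, 1, 1, 0, 100, 0, 83, 0])
patched = nop_out(raw, [0, 2])
print(patched.hex())
```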