Understand pyc files for identifying malicious programs

The bytecodes compiled from Python source codes

Description:

Recent studies report hundreds of malicious code injections and malware samples in package managers. These packages and files are downloaded many times by end users and developers, which causes serious security problems. Many applications are written in interpreted languages such as Python and JavaScript and depend on a large number of packages and modules. However, there has been little discussion of malicious files in the Python ecosystem, or of techniques for detecting them, from the perspective of the file format.

In CPython, compiling source code into bytecode is a necessary intermediate step: the interpreter does not translate human-readable source directly into machine instructions, but compiles it into bytecode that the Python virtual machine then executes. The most important on-disk form of Python bytecode is the pyc file. The interpreter generates a pyc file automatically when a module is imported, and this cached file speeds up later imports of the same module. A pyc file is a compiled binary, and it is regenerated when the source file changes, which keeps the bytecode up to date for execution. Compilation removes much of the information in the source (comments, for example), so the binary does not contain it. At the same time, new information is added in a header: a magic number that identifies the Python version, the modification timestamp of the source file, and the size of the source file in bytes. This header and the format of the whole binary file are vital for detecting whether a program is malicious.
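
The header fields described above can be read with a few lines of standard-library Python. The following is a minimal sketch, assuming the 16-byte header layout that CPython has used since version 3.7 (magic number, flags, then either a timestamp and source size, or a source hash); the function name parse_pyc_header and the sample values are hypothetical and chosen only for illustration.

```python
import importlib.util
import marshal
import struct
from datetime import datetime, timezone


def parse_pyc_header(data: bytes) -> dict:
    """Decode the 16-byte header of a CPython 3.7+ pyc image (assumed layout)."""
    if len(data) < 16:
        raise ValueError("too short to be a CPython 3.7+ pyc file")
    magic = data[:4]
    (flags,) = struct.unpack("<I", data[4:8])
    info = {
        "magic": magic,
        # True only if the file was written by the interpreter running this script.
        "matches_running_python": magic == importlib.util.MAGIC_NUMBER,
        "flags": flags,
    }
    if flags & 0x1:
        # Hash-based pyc: the last 8 header bytes hold a hash of the source.
        info["source_hash"] = data[8:16]
    else:
        # Timestamp-based pyc: source modification time and source size.
        mtime, source_size = struct.unpack("<II", data[8:16])
        info["source_mtime"] = datetime.fromtimestamp(mtime, tz=timezone.utc)
        info["source_size"] = source_size
    return info


# Usage: build a tiny timestamp-based pyc image in memory and decode its header.
code = compile("x = 1\n", "demo.py", "exec")
sample = (
    importlib.util.MAGIC_NUMBER
    + struct.pack("<III", 0, 1_600_000_000, 6)  # flags, mtime, source size (made up)
    + marshal.dumps(code)
)
header = parse_pyc_header(sample)
print(header)
```

A header whose magic number, timestamp, or source size does not match the accompanying source file is one cheap signal that the pyc file deserves a closer look.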

Keywords: Program Analysis, System Security.

Malicious Cases:

  1. Embedding extra payload in bytecodes: Before Python 3.6, an instruction in the bytecode occupies either 1 byte or 3 bytes (the opcode plus 2 bytes for the argument). Since Python 3.6, every instruction occupies two bytes; if an instruction takes no argument, its second byte is set to zero. Because that second byte is ignored during execution, attackers can overwrite it with their own data, such as passwords or other information, without changing the behavior of the program. Some popular tools, like Stegosaurus, use this dead zone to hide information.

  2. Corrupting pyc files with altered headers or instructions: Attackers can modify the timestamp, the Python version, or the size of the source file in the header so that decompilers and disassemblers fail. Some attackers even replace instructions in the code object to confuse analysis.

  3. Obfuscation in bytecodes: Malware authors can choose from many libraries that obfuscate Python code and make it much harder to read. More advanced attackers also insert useless instructions to confuse developers. These instructions do not affect execution and can still be disassembled, but in most cases they cause decompiling tools to fail.
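
The dead zone in the first case can be made concrete with a short scan over raw instruction bytes. This is only a sketch, assuming the two-byte instruction format of Python 3.6 and later and using dis.HAVE_ARGUMENT as the dividing line between opcodes that ignore their argument and opcodes that use it; the function names are hypothetical, and newer interpreter versions may need additional handling.

```python
import dis


def find_dead_zone_bytes(raw: bytes, have_argument: int = dis.HAVE_ARGUMENT):
    """Return (offset, opcode, arg_byte) for every argument-less instruction
    whose ignored second byte is not zero (two-byte wordcode assumed)."""
    hits = []
    for offset in range(0, len(raw) - 1, 2):
        opcode, arg = raw[offset], raw[offset + 1]
        if opcode < have_argument and arg != 0:
            hits.append((offset, opcode, arg))
    return hits


def recover_hidden_bytes(raw: bytes, have_argument: int = dis.HAVE_ARGUMENT) -> bytes:
    """Concatenate the non-zero dead-zone bytes in instruction order."""
    return bytes(arg for _, _, arg in find_dead_zone_bytes(raw, have_argument))


# Usage on hand-made instruction bytes (not real compiler output).
NO_ARG = dis.HAVE_ARGUMENT - 1  # an opcode value below the threshold
WITH_ARG = dis.HAVE_ARGUMENT    # an opcode value that does use its argument
clean = bytes([NO_ARG, 0, WITH_ARG, 5, NO_ARG, 0])
tampered = bytes([NO_ARG, ord("h"), WITH_ARG, 5, NO_ARG, ord("i")])

print(find_dead_zone_bytes(clean))
print(recover_hidden_bytes(tampered))
```

On a real file, the same scan would be applied to the co_code attribute of every code object in the pyc file; any hit means the file carries bytes that do not affect execution and is worth flagging.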

Parse error when decompiling Python bytecodes

Methodology:

In the first step, we use four tools to detect malware. To decompile Python bytecodes, we use uncompyle6 and decompyle3. uncompyle6 handles bytecodes from Python 1.4, 2.1-2.7, and 3.0-3.8, as well as later PyPy versions. However, its control-flow handling causes bugs on higher Python versions, so decompyle3 works better for Python 3.7 and above. To disassemble Python bytecodes, we use the official dis module and python-xdis. The dis module can only disassemble bytecodes from the same Python version that is running it, whereas python-xdis handles bytecodes from virtually every Python release and some PyPy releases. We use a different analysis method for each type of case.
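
As an illustration of the disassembly step with the official dis module, the sketch below loads the code object from a pyc image and lists its instructions. It assumes the 16-byte header of CPython 3.7 and later, and it refuses files written by a different interpreter version, which is the situation where a cross-version tool is needed. The helper names are hypothetical. Note that the marshal module is not designed to be safe against maliciously constructed data, so untrusted samples should only be loaded in an isolated environment.

```python
import dis
import importlib.util
import marshal
import struct


def load_code_from_pyc(data: bytes):
    """Return the module code object stored in a CPython 3.7+ pyc image."""
    if data[:4] != importlib.util.MAGIC_NUMBER:
        # dis only understands bytecode of the interpreter that runs it.
        raise ValueError("pyc was written by a different Python version")
    # Skip the 16-byte header (assumed layout); the rest is a marshalled code object.
    # Caution: only unmarshal untrusted data inside a sandbox.
    return marshal.loads(data[16:])


def list_instructions(code):
    """Return (offset, opname, argrepr) for each instruction in one code object."""
    return [(ins.offset, ins.opname, ins.argrepr) for ins in dis.get_instructions(code)]


# Usage: build a small pyc image in memory, then load and disassemble it.
module_code = compile("x = 1\n", "demo.py", "exec")
pyc_image = (
    importlib.util.MAGIC_NUMBER
    + struct.pack("<III", 0, 0, 6)  # flags, mtime, source size (made-up values)
    + marshal.dumps(module_code)
)
loaded = load_code_from_pyc(pyc_image)
for offset, opname, argrepr in list_instructions(loaded):
    print(offset, opname, argrepr)
```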

  1. First case: attackers split the complete payload into fragments and hide them in the redundant space of the Python bytecode file without changing the file size, so decompiling and disassembling do not help. Instead, we detect changes in the file hash and use tools that inspect the redundant space in the bytecode to determine whether new information has been added.

  2. Second case: attackers usually modify bytecode instructions or add extra bytecodes to make the file more complicated. We first decompile the pyc file to locate the modified part, and then compare it with a file compiled from the original source to understand what kind of attack it is and flag it. If the pyc file can be neither decompiled nor disassembled, it may have been corrupted, and we analyze the instructions or the header according to the reported errors.

  3. Third case: because obfuscation in bytecodes is difficult to spot, we first try to decompile and disassemble the bytecodes. Obfuscated files disassemble successfully but decompile incorrectly. To find the offending instructions, we analyze the opcodes in the code object to determine whether useless instructions were added or whether some instructions make the execution flow confusing. After these steps, we can design algorithms to replace these instructions and clean them out.
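
The comparison in the second case can be sketched as a walk over the nested code objects of a reference build and a suspect build. This is an illustrative sketch rather than the project's actual algorithm: the helper names are hypothetical, both builds are assumed to come from the same Python version, and code objects are matched by dotted name, so two nested objects with identical names (for example two lambdas) would collide.

```python
import types


def collect_code_objects(code, prefix=""):
    """Map a dotted path such as '<module>.f' to every nested code object."""
    path = prefix + code.co_name
    found = {path: code}
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            found.update(collect_code_objects(const, path + "."))
    return found


def fingerprint(code):
    """Instruction bytes, names, and non-code constants of one code object."""
    plain_consts = tuple(
        repr(c) for c in code.co_consts if not isinstance(c, types.CodeType)
    )
    return (code.co_code, code.co_names, plain_consts)


def diff_code(reference, suspect):
    """Report which code objects were added, removed, or modified."""
    ref = collect_code_objects(reference)
    sus = collect_code_objects(suspect)
    report = {}
    for path in sorted(set(ref) | set(sus)):
        if path not in sus:
            report[path] = "missing"
        elif path not in ref:
            report[path] = "added"
        elif fingerprint(ref[path]) != fingerprint(sus[path]):
            report[path] = "modified"
    return report


# Usage: the suspect build has a changed body for g(); f() is untouched.
original = "def f():\n    return 1\n\ndef g():\n    return 2\n"
altered = "def f():\n    return 1\n\ndef g():\n    import os\n    return os.getcwd()\n"
report = diff_code(
    compile(original, "m.py", "exec"),
    compile(altered, "m.py", "exec"),
)
print(report)
```

Only the code objects listed in the report then need to be disassembled and inspected by hand, which narrows the search for injected or replaced instructions.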

Contribution:

Over the past decade, a growing amount of malware has been written in interpreted languages such as Python. The low barrier to entry, ease of use, rapid development process, and massive library collection have made Python attractive to millions of developers. As Python becomes more popular, it is likely to be used more in cyber-attacks, and more malicious code will be discovered in the future. We therefore want to detect malware and malicious code in Python sources and files.

  • We analyze malicious code in Python from the perspective of Python bytecodes, examining both the instructions in the bytecodes and the file format to detect malware.
  • We use new algorithms and tools to automatically find the modified parts of the bytecodes and distinguish the types of attack.
  • To flag malicious code, we scan the whole code object and replace the offending instructions with a single, uniform useless instruction, so that they can all be removed together.
Xin Liu
Ph.D. Student

I am a second-year Ph.D. student in the Department of Computer Science at the University of Virginia. I have interests and skills in software system security, program analysis, differential privacy and machine learning.