Why does the Java compiler generate strange local variables and stack map frames, and how can they be used to reliably determine variable types?
With the help of the ASM framework, I am creating a Java bytecode instrumentation tool that needs to determine, and possibly change, the types of a method's local variables. Very soon I came across a simple case where the variables and stack map frames look strange and don't give me enough information about the variables being used:
public static void test() {
    List l = new ArrayList();
    for (Object i : l) {
        int a = (int)i;
    }
}
This gives the following bytecode (from IDEA):
public static test()V
 L0
  LINENUMBER 42 L0
  NEW java/util/ArrayList
  DUP
  INVOKESPECIAL java/util/ArrayList.<init> ()V
  ASTORE 0
 L1
  LINENUMBER 43 L1
  ALOAD 0
  INVOKEINTERFACE java/util/List.iterator ()Ljava/util/Iterator;
  ASTORE 1
 L2
 FRAME APPEND [java/util/List java/util/Iterator]
  ALOAD 1
  INVOKEINTERFACE java/util/Iterator.hasNext ()Z
  IFEQ L3
  ALOAD 1
  INVOKEINTERFACE java/util/Iterator.next ()Ljava/lang/Object;
  ASTORE 2
 L4
  LINENUMBER 44 L4
  ALOAD 2
  CHECKCAST java/lang/Integer
  INVOKEVIRTUAL java/lang/Integer.intValue ()I
  ISTORE 3
 L5
  LINENUMBER 45 L5
  GOTO L2
 L3
  LINENUMBER 46 L3
 FRAME CHOP 1
  RETURN
 L6
  LOCALVARIABLE i Ljava/lang/Object; L4 L5 2
  LOCALVARIABLE l Ljava/util/List; L1 L6 0
  MAXSTACK = 2
  MAXLOCALS = 4
As can be seen, each of the four explicitly and implicitly defined variables occupies one slot, and four slots are reserved, but only two are defined, in a strange order (address 2 before address 0) and with a "hole" between them. Later, ASTORE 1 writes the list iterator into this "hole" without the type of that variable having been declared first. Only after this operation does a stack map frame appear, and it is unclear to me why only two variables are put into it, since more than two are used later. Later still, ISTORE 3 writes an int into a variable slot, again without any declaration.
At this point, it seems that I need to ignore the variable definitions completely and infer all types by interpreting the bytecode, running a simulation of the JVM stack.
I tried ASM's EXPAND_FRAMES option, but it is useless: it only changes the type of the single frame node to F_NEW, and the rest still looks exactly the same as before.
Can anyone explain why I see such strange code, and whether I have any option other than writing my own JVM interpreter?
Conclusions, based on all the answers (please correct me again if I am wrong):
The variable definitions are only used to match source variable names and types to the specific variable slots accessed on specific lines of code. They are apparently ignored by the JVM class verifier and during code execution, and they may be absent or may not match the actual bytecode.
The variable slots are treated like another stack, albeit one accessed through 32-bit word indexes, and their contents can always be overwritten with different temporary values as long as load and store instructions of matching types are used.
A stack frame node contains the list of variables allocated from the beginning of the variable frame up to the last variable that is loaded in subsequent code without being stored first. This allocation map is expected to be the same regardless of the execution path taken to reach its label. Frame nodes also contain a similar map for the operand stack. Their contents may be specified as an increment relative to the preceding stack frame node.
Variables that exist only within a linear code sequence appear in a stack frame node only if variables with a longer lifetime are allocated at higher slot addresses.
Solution
The short answer is that you do indeed need to write some kind of interpreter if you want to know the types of the stack frame elements at each code location, although most of this work has already been done. Even that is still not enough to recover the source-level types of local variables; there is no general solution for that at all.
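To make "some kind of interpreter" concrete, here is a minimal sketch that simulates only the types flowing through the straight-line part of the bytecode from the question. It does not use ASM; the class `SlotTypeSketch`, its methods, and the string-based instruction format are assumptions made up for this illustration, and the sketch supports only the handful of opcodes that appear in the question (instance calls without arguments, no branches).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a hand-rolled type simulation, not ASM's API.
class SlotTypeSketch {

    // Straight-line part of the question's bytecode (the hasNext/IFEQ branch is left out).
    static List<String> questionLoopBody() {
        return Arrays.asList(
            "NEW java/util/ArrayList",
            "DUP",
            "INVOKESPECIAL java/util/ArrayList.<init> ()V",
            "ASTORE 0",
            "ALOAD 0",
            "INVOKEINTERFACE java/util/List.iterator ()Ljava/util/Iterator;",
            "ASTORE 1",
            "ALOAD 1",
            "INVOKEINTERFACE java/util/Iterator.next ()Ljava/lang/Object;",
            "ASTORE 2",
            "ALOAD 2",
            "CHECKCAST java/lang/Integer",
            "INVOKEVIRTUAL java/lang/Integer.intValue ()I",
            "ISTORE 3");
    }

    // Simulates a branch-free instruction list and returns slot index -> inferred type.
    static Map<Integer, String> inferSlotTypes(List<String> code) {
        Deque<String> stack = new ArrayDeque<>();       // operand stack, holding type names
        Map<Integer, String> locals = new TreeMap<>();  // local variable slots
        for (String insn : code) {
            String[] parts = insn.split(" ", 2);
            String op = parts[0];
            String arg = parts.length > 1 ? parts[1] : "";
            switch (op) {
                case "NEW":
                    stack.push(arg);
                    break;
                case "DUP":
                    stack.push(stack.peek());
                    break;
                case "CHECKCAST":
                    stack.pop();
                    stack.push(arg);
                    break;
                case "ALOAD":
                case "ILOAD":
                    stack.push(locals.get(Integer.parseInt(arg)));
                    break;
                case "ASTORE":
                case "ISTORE":
                    // A store simply overwrites whatever type the slot held before.
                    locals.put(Integer.parseInt(arg), stack.pop());
                    break;
                case "INVOKESPECIAL":
                case "INVOKEVIRTUAL":
                case "INVOKEINTERFACE": {
                    if (!arg.contains("()")) {
                        throw new IllegalArgumentException("sketch handles no-arg calls only: " + insn);
                    }
                    stack.pop();  // the receiver
                    String ret = arg.substring(arg.indexOf(')') + 1);
                    if (ret.startsWith("L")) {
                        stack.push(ret.substring(1, ret.length() - 1));
                    } else if (ret.equals("I") || ret.equals("Z")) {
                        stack.push("int");  // boolean is an int on the operand stack
                    } else if (!ret.equals("V")) {
                        throw new IllegalArgumentException("unsupported return type: " + insn);
                    }
                    break;
                }
                default:
                    throw new IllegalArgumentException("unsupported instruction: " + insn);
            }
        }
        return locals;
    }

    public static void main(String[] args) {
        System.out.println(inferSlotTypes(questionLoopBody()));
    }
}
```

Under these assumptions, slot 0 is inferred as java/util/ArrayList even though the source declares it as List, which already shows that inferred types and declared types need not agree.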
As mentioned in other answers, attributes such as LocalVariableTable do help to recover the formal declarations of local variables, for example when debugging. However, they only cover the variables that exist in the source code (in fact, this is the compiler's decision), and they are not mandatory. They are also not guaranteed to be correct: a bytecode transformation tool may change the code without updating these debugging attributes, and the JVM does not care as long as you are not debugging.
As also mentioned in other answers, the StackMapTable attribute only exists to help bytecode verification, not to provide formal declarations. It describes the stack frame state at branch merge points, as far as verification requires it.
Therefore, for linear code sequences without branches, the types of local variables and operand stack entries are determined by inference only, and these inferred types are not guaranteed to match the formally declared types at all.
To illustrate the problem, the following two branch-free code sequences produce identical bytecode:
CharSequence cs;
cs = "hello";
cs = CharBuffer.allocate(20);

{
    String s = "hello";
}
{
    CharBuffer cb = CharBuffer.allocate(20);
}
It is up to the compiler whether to reuse a local variable slot for variables with disjoint scopes, but all relevant compilers do so.
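The slot reuse described above can be mimicked with a very small allocator. This is a simplified sketch, not the code of any real compiler: the class `SlotAllocatorSketch` and its methods are hypothetical, every variable is assumed to take one slot (long and double, which take two, are ignored), and the method is assumed to be static so that slot 0 is not taken by `this`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of scope-based slot allocation; not taken from javac.
class SlotAllocatorSketch {
    private int nextSlot = 0;                                 // first free slot
    private final Deque<Integer> saved = new ArrayDeque<>();  // nextSlot at each scope entry
    private final Map<String, Integer> assigned = new LinkedHashMap<>();

    void enterScope() { saved.push(nextSlot); }

    // Leaving a scope frees every slot that was allocated inside it.
    void exitScope() { nextSlot = saved.pop(); }

    int declare(String name) {
        int slot = nextSlot++;
        assigned.put(name, slot);
        return slot;
    }

    Map<String, Integer> assignments() { return assigned; }

    // The two sibling blocks from the answer: both variables end up in slot 0.
    static Map<String, Integer> twoBlocks() {
        SlotAllocatorSketch a = new SlotAllocatorSketch();
        a.enterScope(); a.declare("s"); a.exitScope();
        a.enterScope(); a.declare("cb"); a.exitScope();
        return a.assignments();
    }

    // The method from the question, including the hidden iterator of the for-each loop.
    static Map<String, Integer> questionMethod() {
        SlotAllocatorSketch a = new SlotAllocatorSketch();
        a.declare("l");
        a.enterScope();              // the for statement
        a.declare("<iterator>");     // synthetic variable, has no source name
        a.enterScope();              // the loop body
        a.declare("i");
        a.declare("a");
        a.exitScope();
        a.exitScope();
        return a.assignments();
    }

    public static void main(String[] args) {
        System.out.println(twoBlocks());
        System.out.println(questionMethod());
    }
}
```

With these assumptions, `questionMethod()` assigns slots 0 to 3 in the same order as the ASTORE 0, ASTORE 1, ASTORE 2 and ISTORE 3 instructions in the question, and `twoBlocks()` places `s` and `cb` in the same slot.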
For verification, only correctness matters. So when a value of type X is stored in a local variable slot and then read back to access a member Y.someMember, X must be assignable to Y, regardless of whether the declared type of the local variable is actually Z, a supertype of X but a subtype of Y.
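The X, Y, Z relationship can be illustrated with plain reflection. This is only an analogy for the check the verifier performs, not the verifier itself; the class `AssignabilitySketch` and the choice of LinkedHashMap, HashMap and AbstractMap as X, Z and Y are assumptions picked for the example.

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.LinkedHashMap;

// Hypothetical sketch: reflection used as a stand-in for the verifier's assignability check.
class AssignabilitySketch {

    // True if a value of type x may be used where a member of type y is accessed.
    static boolean acceptable(Class<?> x, Class<?> y) {
        return y.isAssignableFrom(x);
    }

    public static void main(String[] args) {
        Class<?> x = LinkedHashMap.class;  // type of the value actually stored in the slot
        Class<?> z = HashMap.class;        // type declared in the source; never consulted below
        Class<?> y = AbstractMap.class;    // owner of the member that is accessed

        // Only X and Y take part in the check; Z plays no role.
        System.out.println(acceptable(x, y));             // true
        System.out.println(acceptable(String.class, y));  // false
    }
}
```

The declared type Z appears in the sketch only to show that the outcome never depends on it.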
Without debugging attributes, you might be tempted to analyze the subsequent usage to guess the actual type (I think most decompilers do this). For example, the following code
CharSequence cs;
cs = "hello";
cs.charAt(0);
cs = CharBuffer.allocate(20);
cs.charAt(0);
contains two invokeinterface CharSequence.charAt instructions, which indicates that the actual type of the variable is probably CharSequence rather than String or CharBuffer. But the bytecode is still identical to that of, for example,
{
    String s = "hello";
    ((CharSequence)s).charAt(0);
}
{
    CharBuffer cb = CharBuffer.allocate(20);
    ((CharSequence)cb).charAt(0);
}
because these type casts only affect the subsequent method invocation and do not generate bytecode instructions of their own, as they are widening conversions.
Therefore, the declared source-level variable types cannot be recovered precisely from the bytecode of a linear sequence, and the stack map frame entries do not help either. Their purpose is to help verify the correctness of the subsequent code (which can be reached through different code paths), so they do not need to declare all existing elements. They only need to declare the elements that exist before the merge point and are actually used after it. Whether there are entries the verifier does not actually need (and which ones) depends on the compiler.
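To show why a merge point loses information, here is a small sketch of how two types arriving on different code paths might be combined into one frame entry. It is an illustration under assumptions, not the verifier's actual algorithm: the class `FrameMergeSketch` is hypothetical, only reference class types are handled (no primitives), and interfaces are simply collapsed to Object.

```java
import java.nio.CharBuffer;

// Hypothetical sketch: common-supertype merge for reference types at a branch merge point.
class FrameMergeSketch {

    static Class<?> merge(Class<?> a, Class<?> b) {
        if (a.isAssignableFrom(b)) return a;  // a is already general enough
        if (b.isAssignableFrom(a)) return b;  // b is already general enough
        if (a.isInterface() || b.isInterface()) return Object.class;
        // Walk up a's superclass chain until b fits as well.
        Class<?> c = a;
        while (!c.isAssignableFrom(b)) {
            c = c.getSuperclass();
        }
        return c;
    }

    public static void main(String[] args) {
        System.out.println(merge(Integer.class, Long.class).getName());
        System.out.println(merge(String.class, CharBuffer.class).getName());
        System.out.println(merge(String.class, CharSequence.class).getName());
    }
}
```

In this sketch, a slot that holds a String on one path and a CharBuffer on the other merges to java.lang.Object, so a frame entry recorded at the merge point says nothing about a source declaration such as CharSequence.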