To expand on August’s answer, I figured I should explain exactly how this is compiled and what takes up space at the binary level.
To simplify, I’ll use the following example with shorter strings. The important part is that there are some duplicate strings.
public class StringArray
{
private static final String[] stringArray = {
"AAA",
"AAB",
"AAA",
"AAC",
"AAA"
};
}
public class LongString
{
private static final String longString =
"AAA"
+ "AAB"
+ "AAA"
+ "AAC"
+ "AAA"
;
}
Now there are three important things to understand when compiling this code.
- Constant string concatenation is done at compile time. In fact, this is a special case of compile time simplification of constant expressions. You can find the exact rules for what is considered a constant expression in the Java Language Specification.
- Array brace initializers are syntactic sugar. The code is equivalent to creating an array and assigning the elements one by one. (Note that this is specific to Java bytecode. Dalvik (i.e. Android) does have special shorthand instructions for array initialization)
- Inline initializers are syntactic sugar. The code is equivalent to manually initializing the fields in the static initializer method.
Edit: One minor detail is that in the special case of static final fields initialized to a constant expression inline, the field is initialized using the ConstantValue attribute instead of in the static initializer and all uses are inlined. So in the case of LongString
, #3 will actually result in different bytecode, but since the constant pool entries for the string are the same, the filesize taken up by the strings won’t change.
Put them together, and the above code is equivalent to the following.
public class StringArray {
private static final String[] stringArray;
static {
String[] temp = new String[5];
temp[0] = "AAA";
temp[1] = "AAB";
temp[2] = "AAA";
temp[3] = "AAC";
temp[4] = "AAA";
stringArray = temp;
}
}
public class LongString {
private static final String longString;
static {
longString = "AAAAABAAAAACAAA";
}
}
Now this desugared code still shows the duplicate strings multiple times in the array example. To understand the classfile size behavior, you have to understand what it is compiled to.
When you access a constant string, the bytecode contains a load constant instruction (ldc
or ldc_w
), which is a one byte opcode followed by an index into the class’s constant pool. The constant pool is a seperate section of the classfile where a list of constants is stored. Obviously, the compiler will only store each constant once.
So the bytecode for StringArray looks something like this (stripping away a few details that aren’t relevant here). Note that there are only 3 unique strings stored in the constant pool. (Actually there are a lot more constant pool entries relating to other parts of the classfile, but they’re not important here).
.class super StringArray
.field static final private stringArray [Ljava/lang/String;
.const [1] = String 'AAA'
.const [2] = String 'AAB'
.const [3] = String 'AAC'
.method static <clinit> : ()V
iconst_5
anewarray java/lang/String
astore_0
aload_0
iconst_0
ldc [1]
aastore
aload_0
iconst_1
ldc [2]
aastore
aload_0
iconst_2
ldc [1]
aastore
aload_0
iconst_3
ldc [3]
aastore
aload_0
iconst_4
ldc [1]
aastore
aload_0
putstatic StringArray stringArray [Ljava/lang/String;
return
.end method
Whereas LongString looks something like
.class super LongString
.field static final private longString Ljava/lang/String;
.const [1] = String 'AAAAABAAAAACAAA'
.method static <clinit> : ()V
ldc [1]
putstatic LongString longString Ljava/lang/String;
return
.end method
So with the first version, duplicate strings can be stored only once, whereas with the second version, the entire string has to be stored. So is the first version always better? Not so fast. It has the advantage of not storing duplicated strings, but as you may have noticed, there is a big per element overhead in the array initialization. Which one is better depends on how long your strings are and how many duplicates you have.
P.S. At the binary level, constant strings are encoded with a modified UTF8 encoding. The upshot is that characters 1-127 are one byte each but null characters are two bytes. So you can save some space by offsetting everything 1.