On Java collection (I): HashMap
1、 Overview
HashMap is probably the implementation of the map interface we most often use. Without much more, let's take a look at the comments of the HashMap class:
The following is the class relationship of HashMap:
For HashMap, we focus on six issues:
2、 Data structure of HashMap
1. Underlying implementation
Since HashMap is called this name, its implementation must be based on hash table. I have introduced hash table in data structure and algorithm (10): hash table. In short, hash table is a data structure combining array and linked list. With the help of hash algorithm, we can quickly obtain element subscripts to achieve efficient search.
For the underlying data structure of HashMap, we need to understand two member variables and an internal class:
Draw a picture to describe:
We know that there are two ways to solve hash conflicts in hash table: open address method and separated linked list method. It is obvious that the separated linked list method is used here, which is commonly known as zipper method.
When we store a key value pair, we will obtain the hash value corresponding to the key through the hash algorithm, and use the hash value to find the subscript of the location to be stored in the bucket. Sometimes different key accountants calculate the same hash value, that is, hash collision, and the node will be connected behind the first node to form a linked list. When searching, first calculate the hash value through the key to find the linked list, and then traverse the linked list to find the key. Therefore, if the hash collision is serious, the linked list will become very long and the search efficiency will be affected.
From this perspective, if the bucket array is large, the same hash algorithm can get more positions. In other words, the probability of hash collision is smaller, but too large bucket array will waste space. Therefore, the capacity expansion algorithm mentioned later is used to dynamically adjust the capacity.
2. When will it turn into red black tree? Why?
In addition, we know that in JDK7, the underlying implementation of HashMap is only array + linked list, and in jdk8, it becomes array + linked list + red black tree.
Red black tree is a complex tree structure. Here, we simply understand it as a binary tree with binary sort tree property and certain balanced binary tree property (absolute balance is not required to avoid frequent rotation).
We know that the nodes with hash collision will form a linked list in the bucket. Looking at the tree method treeifybin (), we can find that when there are more than 8 elements in the linked list and more than or equal to 64 elements in the set, it will be transformed into a red black tree, otherwise it will only be simply expanded. This is because at the same depth, the tree can store more elements than the linked list, and can ensure good insertion, deletion and search efficiency. When the number of elements is less than 6, it will return to the linked list.
So why choose the numbers 8 and 6?
3、 Constructor for HashMap
Let's take a look at the member variables that may be involved in HashMap Construction:
1. Construction method with initial capacity and load factor
The load factor mentioned here measures the use of space in a hash table.
public HashMap(int initialCapacity,float loadFactor) {
//初始容量必须大于0
if (initialCapacity < 0)
throw new IllegalArgumentException("Illegal initial capacity: " +
initialCapacity);
//初始容量不能大于最大值
if (initialCapacity > MAXIMUM_CAPACITY)
initialCapacity = MAXIMUM_CAPACITY;
//初始化加载因子
if (loadFactor <= 0 || Float.isNaN(loadFactor))
throw new IllegalArgumentException("Illegal load factor: " +
loadFactor);
this.loadFactor = loadFactor;
this.threshold = tableSizeFor(initialCapacity);
}
The tablesizefor() method called here is a bit operation. Its function is:
In other words, the initialization capacity must be the n-th power of 2, which is directly related to the requirement of how HashMap can efficiently add elements to the collection.
For specific analysis, please refer to the static tool methods hash () and tablesizefor () (4) of HashMap source code annotation.
Then we can see that the threshold is directly given after the initial capacity processing, instead of using initialCapacity directly. The reason for this is that at the beginning, the underlying container table of the map has not been initialized, and this operation is put on the first put. Therefore, when we add elements for the first time, we will initialize the container according to the specified initial size.
2. Construction method with only initial capacity
public HashMap(int initialCapacity) {
//直接调用 HashMap(int initialCapacity,float loadFactor)构造方法
this(initialCapacity,DEFAULT_LOAD_FACTOR);
}
3. Nonparametric construction method
public HashMap() {
//全部使用默认值
this.loadFactor = DEFAULT_LOAD_FACTOR;
}
4. There is a construction method of parameters of map type
public HashMap(Map<? extends K,? extends V> m) {
this.loadFactor = DEFAULT_LOAD_FACTOR;
putMapEntries(m,false);
}
The putmapentries () method is called here. We will talk about it in detail later. Now it is simply understood as creating a new map set based on an existing map set, which is a bit similar to arrays Copyof() method.
4、 Output of HashMap
We can see from the above that after the constructor is executed, the HashMap data storage space is not really opened up, but the space will not be allocated for the table until the first put.
1. Implementation of put operation
There is a put() method in HashMap:
public V put(K key,V value) {
return putVal(hash(key),key,value,false,true);
}
Its comments are described as follows:
In short, it has two functions:
We can see that this method is actually implemented by hash () and putval ().
2. Calculate bucket container subscript
The bucket container subscript is calculated in three steps: obtain the hash value, mix the high and low bits of XOR operation to obtain a new hash, and obtain the subscript by new hash and length and operation.
Let's look at the source code of the hash () method:
static final int hash(Object key) {
int h;
return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
The hashcode () method here is a native method. The principle is to convert the memory address of the object into an integer to obtain the hash value of the object.
This method first calls a key The hashcode () method obtains the hash value of the key, and then XOR the hash value with the upper 16 bits of the hash value.
In the following putval() method, the module is taken through the following three lines of code:
//n为新桶数组长度
n = (tab = resize()).length;
//进行与运算取模
(n - 1) & hash
I saw a very vivid picture on the Internet:
Let's understand:
In order to more clearly illustrate that taking the power of length 2 helps to fully hash and avoid hash collision, here is a particularly obvious example:
When the capacity of HashMap is 16, its binary is 10000, and the binary of (n-1) is 01111. The calculation results of hash value are as follows:
In the above four cases, we can see that different hash values and (n-1) can get different values after bit operation, so that the added elements can be evenly distributed in different positions in the set to avoid hash collision.
Let's take a look at the case where the capacity of HashMap is not the nth power of 2. When the capacity is 10, the binary is 01010, and the binary of (n-1) is 01001. Add the same element to it, and the result is:
It can be seen that three different elements have gone through & operation and obtained the same result (01001), resulting in serious hash collision.
3. Add elements to bucket array
final V putVal(int hash,K key,V value,boolean onlyIfAbsent,boolean evict) {
Node<K,V>[] tab; Node<K,V> p; int n,i;
//判断键值对数组table[i]是否为空或为null,否则执行resize()进行扩容
if ((tab = table) == null || (n = tab.length) == 0)
n = (tab = resize()).length;
//判断插入位置是否为空,是就插入新节点
if ((p = tab[i = (n - 1) & hash]) == null)
tab[i] = newNode(hash,null);
else {
Node<K,V> e; K k;
//如果直接找到相同节点存在就直接覆盖
if (p.hash == hash &&
((k = p.key) == key || (key != null && key.equals(k))))
e = p;
//否则判断该链是否为红黑树
else if (p instanceof TreeNode)
e = ((TreeNode<K,V>)p).putTreeVal(this,tab,hash,value);
else {
//如果是链表,就遍历链表,并且记录遍历到的节点数
for (int binCount = 0; ; ++binCount) {
//找到尾节点
if ((e = p.next) == null) {
p.next = newNode(hash,null);
//插入后链表长度大于8就转换成红黑树
if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
treeifyBin(tab,hash);
break;
}
//如果在链表找到相同阶段就覆盖
if (e.hash == hash &&
((k = e.key) == key || (key != null && key.equals(k))))
break;
p = e;
}
}
//将新值覆盖旧值,返回旧值
if (e != null) { // existing mapping for key
V oldValue = e.value;
if (!onlyIfAbsent || oldValue == null)
e.value = value;
afterNodeAccess(e);
return oldValue;
}
}
++modCount;
//如果超过最大容量就扩容
if (++size > threshold)
resize();
afterNodeInsertion(evict);
return null;
}
5、 Get of HashMap
1. Implementation of get operation
Let's look at the comments and source code of the get () method:
public V get(Object key) {
Node<K,V> e;
return (e = getNode(hash(key),key)) == null ? null : e.value;
}
We can see that the getNode () method is actually called:
final Node<K,V> getNode(int hash,Object key) {
Node<K,V> first,e; int n; K k;
//确保table不为空,并且计算得到的下标对应table的位置上有节点
if ((tab = table) != null && (n = tab.length) > 0 &&
(first = tab[(n - 1) & hash]) != null) {
//判断第一个节点是不是要找的key
if (first.hash == hash && // always check first node
((k = first.key) == key || (key != null && key.equals(k))))
return first;
//如果第一个节点就查找链表或者红黑树
if ((e = first.next) != null) {
//红黑树上查找
if (first instanceof TreeNode)
return ((TreeNode<K,V>)first).getTreeNode(hash,key);
do {
//链表上查找
if (e.hash == hash &&
((k = e.key) == key || (key != null && key.equals(k))))
return e;
} while ((e = e.next) != null);
}
}
return null;
}
2. Why rewrite hashcode when rewriting equals?
First of all, this is not the case with the original hashcode and equals
Reviewing the above, we can see that both put () and get () have statements like this:
p.hash == hash && (key != null && key.equals(k))
When we try to add or find a key, the method will judge whether the hash value and the value are equal. Only when they are equal will we judge that the key is the key to be obtained. In other words, strictly speaking, the same key is not allowed in a HashMap.
When we use an object as a key, the uniqueness of the key can still be guaranteed according to the original hashcode and equals. However, when we override the equals method instead of the hashcode () method, the values may be equal, but the hash values may be different due to the unequal addresses, resulting in two identical keys.
Let's take an example:
We now have a class:
/**
* @Author:CreateSequence
* @Date:2020-08-14 16:15
* @Description:Student类,重写了equals方法
*/
public class Student {
String name;
Integer age;
/**
* 重写了equals方法
* @param obj
* @return
*/
@Override
public boolean equals(Object obj) {
Student student = (Student) obj;
return (this.name == student.name) && (this.age == student.age);
}
public Student(String name,Integer age) {
this.name = name;
this.age = age;
}
@Override
public String toString() {
return "Student{" +
"name='" + name + '\'' +
",age=" + age +
'}';
}
}
Then let's try:
public static void main( String[] args ) {
HashMap<Student,Integer> map = new HashMap(16);
Student student1 = new Student("小明",21);
map.put(student1,1);
Student student2 = new Student("小明",21);
System.out.println("这个key已经存在了吗?"+map.containsKey(student2));
System.out.println(map.get(student2));
}
//输出结果
这个key已经存在了吗?false
null
You can see that because the values obtained by hashcode () are different, they are regarded as different keys in the map.
After overriding the hashcode () method of the student class:
@Override
public int hashCode() {
return age;
}
The execution result becomes:
这个key已经存在了吗?true
1
It can be seen that it is necessary to rewrite both equals and hashcode.
reference resources:
6、 Capacity expansion of HashMap
1. Implementation of resize operation
Don't say much, let's go directly to the source code
final Node<K,V>[] resize() {
Node<K,V>[] oldTab = table;
//旧桶容量
int oldCap = (oldTab == null) ? 0 : oldTab.length;
//要扩容的大小
int oldThr = threshold;
int newCap,newThr = 0;
//如果桶已经配初始化过了
if (oldCap > 0) {
//如果扩容到极限大小,就不再继续扩容了
if (oldCap >= MAXIMUM_CAPACITY) {
threshold = Integer.MAX_VALUE;
return oldTab;
}
//否则就扩容到原来的两倍
else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
oldCap >= DEFAULT_INITIAL_CAPACITY)
newThr = oldThr << 1; // double threshold
}
//如果未初始化,并且指定了初始容量,则初始容量即为第一次扩容的目标大小
else if (oldThr > 0) // initial capacity was placed in threshold
newCap = oldThr;
//否则使用默认初始容量,并且根据默认初始容量和加载因子计算得到下次扩容大小
else { // zero initial threshold signifies using defaults
newCap = DEFAULT_INITIAL_CAPACITY;
newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
}
//重新下一次计算扩容大小
if (newThr == 0) {
float ft = (float)newCap * loadFactor;
newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
(int)ft : Integer.MAX_VALUE);
}
threshold = newThr;
//将旧容器复制到新容器中
@SuppressWarnings({"rawtypes","unchecked"})
Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
table = newTab;
if (oldTab != null) {
for (int j = 0; j < oldCap; ++j) {
Node<K,V> e;
//把每一个桶移动到新桶中
if ((e = oldTab[j]) != null) {
oldTab[j] = null;
if (e.next == null)
newTab[e.hash & (newCap - 1)] = e;
//判断是否为红黑树
else if (e instanceof TreeNode)
((TreeNode<K,V>)e).split(this,newTab,j,oldCap);
//否则就遍历链表
else { // preserve order
Node<K,V> loHead = null,loTail = null;
Node<K,V> hiHead = null,hiTail = null;
Node<K,V> next;
do {
next = e.next;
//哈希值与原长度进行与运算,如果多出来那一位是0,就保持原下标
if ((e.hash & oldCap) == 0) {
if (loTail == null)
loHead = e;
else
loTail.next = e;
loTail = e;
}
//如果多出来那一位是1,就移动到(原下标+原长度)对应的新位置
else {
if (hiTail == null)
hiHead = e;
else
hiTail.next = e;
hiTail = e;
}
} while ((e = next) != null);
//放回新桶的原位置
if (loTail != null) {
loTail.next = null;
newTab[j] = loHead;
}
//放回新位置
if (hiTail != null) {
hiTail.next = null;
newTab[j + oldCap] = hiHead;
}
}
}
}
}
return newTab;
}
2. Why does jdk8 not need re hashing?
We know that if the bucket array is expanded, the length of the array will change, and the subscripts calculated during put and get are different according to the length and hash. When JDK7 expands and moves the data of the old container, it will re hash to obtain the new index, which is optimized in jdk8.
Because the length of the bucket array is always a power of 2, it will be doubled after capacity expansion and converted to binary, which will be one more bit than the original. If we assume that the bucket array is n, there are:
n = 16 -> 10000; (n-1) - > 1111;
n = 32 -> 100000; (n-1) - > 11111;
n = 64 -> 1000000; (n-1) - > 111111;
n = 128 -> 10000000; (n-1) - > 1111111;
Let's give an example to verify, as shown in the following figure:
(a) When n = 16, key1 and key2 follow the binary subscript obtained by (n-1) and operation; (b) when n = 32 After capacity expansion, key1 and key2 follow the binary subscript obtained by (n-1) and operation.
We can see that key2 has entered one bit, and the extra bit is equivalent to 10000 more. To convert to decimal is to add 16 to the original basis, that is, to add the length of the original bucket array, which is reflected in the code, that is, newtab [J + oldcap] = hihead; This sentence code;
Now looking at key1, we can see that the index of key1 has not moved. Because the extra bit of key is 0, it is still 0 after the operation. The final subscript is the same as the original one.
So we can summarize:
The advantage of this is that there is no need to recalculate the hash value; Since the single digit from multiple hash values may be 0 or 1, it is possible that the elements originally on the same linked list can be moved to a new location after capacity expansion, which effectively alleviates hash collision.
7、 HashMap thread unsafe
We know that HashMap is thread unsafe, and the thread safe map collection is concurrenthashmap. In fact, the thread insecurity of HashMap is different in JDK7 and jdk8:
1. Dead cycle caused by JDK7 head insertion
In JDK7, the error occurs in the capacity expansion method transfer, and its code is as follows:
void transfer(Entry[] newTable,boolean rehash) {
int newCapacity = newTable.length;
for (Entry<K,V> e : table) {
//遍历链表,当前节点为e
while(null != e) {
//获取当前节点的下一个节点next
Entry<K,V> next = e.next;
//重新计算哈希值
if (rehash) {
e.hash = null == e.key ? 0 : hash(e.key);
}
int i = indexFor(e.hash,newCapacity
//头插法
e.next = newTable[i];
newTable[i] = e;
e = next;
}
}
}
From the code, we can see that after the expansion, the subscript of the element is recalculated, and the table element is moved and inserted into the new linked list by using the header insertion method.
for instance:
Suppose thread a and thread B expand the following set at the same time:
1. A executes first, and the time slice depletion is suspended before newtable [i] = E. at this time, e = 1, e.next = null, next = 2
2. Thread B performs array expansion. After the expansion, this is the case for thread A. at this time, next next = 1,e.next = null,next = 2:
3. Thread B suspends, and thread a continues to execute the code after newtable [i] = E. after execution, e = 2, next = 2, e.next = 1:
4. Thread a goes on the next cycle. Because e.next = 1, next = 1. The header insertion method inserts 2 into newtable [i]. After execution, e = 1, next = e.next = null:
5. Thread a executes the last cycle. At this time, e.next = 2 because e.next = newtable [i], and then newtable [i] = e, that is, 1 is inserted back to the position of newtable [i]:
At this time, the most dangerous thing happened: e = 1, e.next = 2, e.next Next = 1, indicating that 2 and 1 have formed a ring linked list:
After that, the head insertion of wireless cycle 1 and 2 will cause an dead cycle.
2. Jdk8 data coverage
DK7 also has this problem.
We know that the put () method will judge the insertion position when inserting. If two threads judge that the same position is empty, the data inserted first will be overwritten by the latter.