Java – Hibernate batch deletion vs single deletion
Edit: based on my debugging and logging records, I think this problem comes down to why "delete from table where id = x" is so much faster than "delete from table where id in (x)", where x is a single ID.
I recently tested batch deletion instead of deleting each row one by one, and noticed that batch deletion is much slower. The table has delete, update, and insert triggers, but I have also tested with the triggers disabled, and each batch deletion is still slower. Can anyone see why this is happening, or share tips on how I can debug it? From my understanding I can't really reduce the number of trigger activations, but I thought reducing the number of DELETE statements would help performance.
I have listed some relevant information below. If anything is missing, please let me know.
The batch deletion is done with the following code:
private void batchDeletion( Collection<Long> ids ) {
    StringBuilder sb = new StringBuilder();
    sb.append( "DELETE FROM ObjImpl WHERE id IN (:ids)" );

    Query sql = getSession().createQuery( sb.toString() );
    sql.setParameterList( "ids", ids );

    sql.executeUpdate();
}
The row-by-row deletion is basically a single line of code:
SessionFactory.getCurrentSession().delete(obj);
The table has two indexes, neither of which is used for the deletion. No cascading occurs.
The following is the explain analyze output for delete from table where id in (1, 2, 3);:
Delete on table  (cost=12.82..24.68 rows=3 width=6) (actual time=0.143..0.143 rows=0 loops=1)
  ->  Bitmap Heap Scan on table  (cost=12.82..24.68 rows=3 width=6) (actual time=0.138..0.138 rows=0 loops=1)
        Recheck Cond: (id = ANY ('{1,2,3}'::bigint[]))
        ->  Bitmap Index Scan on pk_table  (cost=0.00..12.82 rows=3 width=0) (actual time=0.114..0.114 rows=0 loops=1)
              Index Cond: (id = ANY ('{1,2,3}'::bigint[]))
Total runtime: 3.926 ms
Every time I reload the data for a test, I vacuum and rebuild the indexes. My test data contains 386,660 rows.
The test deletes all rows. I didn't use TRUNCATE because there is usually a selection criterion; for test purposes I made the criterion include all rows. With triggers enabled, deleting rows one by one took 193,616 ms, while batch deletion took 285,558 ms. I then disabled the triggers and got 93,793 ms for row-by-row deletion and 181,537 ms for batch deletion. The trigger sums up values and updates another table - basically bookkeeping.
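For scale, the per-row costs implied by those timings work out as follows (plain arithmetic over the numbers above; nothing here is measured anew):

```java
public class PerRowCost {
    public static void main(String[] args) {
        int rows = 386660;
        // Timings from the test run above, in milliseconds.
        double singleWithTriggers = 193616, batchWithTriggers = 285558;
        double singleNoTriggers = 93793, batchNoTriggers = 181537;

        System.out.printf("row-by-row, triggers on:  %.3f ms/row%n", singleWithTriggers / rows);
        System.out.printf("batch,      triggers on:  %.3f ms/row%n", batchWithTriggers / rows);
        System.out.printf("row-by-row, triggers off: %.3f ms/row%n", singleNoTriggers / rows);
        System.out.printf("batch,      triggers off: %.3f ms/row%n", batchNoTriggers / rows);
        // Batch deletion comes out roughly 1.5-2x slower per row in both configurations.
    }
}
```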
I've also played with smaller batch sizes (100 and 1), and they seem to perform even worse.
Edit: I turned on Hibernate logging while deleting rows one by one. It basically executes delete from table where id=?. Explain analyze:
Delete on table  (cost=0.00..8.31 rows=1 width=6) (actual time=0.042..0.042 rows=0 loops=1)
  ->  Index Scan using pk_table on table  (cost=0.00..8.31 rows=1 width=6) (actual time=0.037..0.037 rows=0 loops=1)
        Index Cond: (id = 3874904)
Total runtime: 0.130 ms
Edit: I was curious whether Postgres would do anything different if the list actually contained 10,000 IDs. Apparently not:
Delete on table  (cost=6842.01..138509.15 rows=9872 width=6) (actual time=17.170..17.170 rows=0 loops=1)
  ->  Bitmap Heap Scan on table  (cost=6842.01..138509.15 rows=9872 width=6) (actual time=17.160..17.160 rows=0 loops=1)
        Recheck Cond: (id = ANY ('{NUMBERS 1 THROUGH 10,000}'::bigint[]))
        ->  Bitmap Index Scan on pk_table  (cost=0.00..6839.54 rows=9872 width=0) (actual time=17.139..17.139 rows=0 loops=1)
              Index Cond: (id = ANY ('{NUMBERS 1 THROUGH 10,000}'::bigint[]))
Total runtime: 17.391 ms
Edit: Based on the explain analyze output above, I pulled some records from the actual deletion run. Below are the logs for the two row-by-row variants.
Here are some individual deletions:
2013-03-14 13:09:25,424: delete from table where id=?
2013-03-14 13:09:25,424: delete from table where id=?
And here is the other variant of single deletion (the list contains only one item):
2013-03-14 13:49:59,858: delete from table where id in (?)
2013-03-14 13:50:01,460: delete from table where id in (?)
2013-03-14 13:50:03,040: delete from table where id in (?)
2013-03-14 13:50:04,544: delete from table where id in (?)
2013-03-14 13:50:06,125: delete from table where id in (?)
2013-03-14 13:50:07,707: delete from table where id in (?)
2013-03-14 13:50:09,275: delete from table where id in (?)
2013-03-14 13:50:10,833: delete from table where id in (?)
2013-03-14 13:50:12,369: delete from table where id in (?)
2013-03-14 13:50:13,873: delete from table where id in (?)
Both use IDs that exist in the table, and they should be sequential.
Explain analyze of delete from table where id = 3774887;:
Delete on table  (cost=0.00..8.31 rows=1 width=6) (actual time=0.097..0.097 rows=0 loops=1)
  ->  Index Scan using pk_table on table  (cost=0.00..8.31 rows=1 width=6) (actual time=0.055..0.058 rows=1 loops=1)
        Index Cond: (id = 3774887)
Total runtime: 0.162 ms
Explain analyze of delete from table where id in (3774887);:
Delete on table  (cost=0.00..8.31 rows=1 width=6) (actual time=0.279..0.279 rows=0 loops=1)
  ->  Index Scan using pk_table on table  (cost=0.00..8.31 rows=1 width=6) (actual time=0.210..0.213 rows=1 loops=1)
        Index Cond: (id = 3774887)
Total runtime: 0.452 ms
Is 0.162 ms vs 0.452 ms considered a significant difference?
Edit:
I set the batch size to 50,000, and Hibernate didn't like that idea:
java.lang.StackOverflowError
    at org.hibernate.hql.ast.util.NodeTraverser.visitDepthFirst(NodeTraverser.java:40)
    at org.hibernate.hql.ast.util.NodeTraverser.visitDepthFirst(NodeTraverser.java:41)
    at org.hibernate.hql.ast.util.NodeTraverser.visitDepthFirst(NodeTraverser.java:42)
    ....
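A common workaround for oversized IN lists is to split the ID collection into fixed-size chunks before building the query, so Hibernate's HQL parser never sees tens of thousands of parameters at once. Here is a minimal, self-contained sketch of such a chunking helper (the chunk size of 4 is an arbitrary illustration, not a recommendation):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class IdChunker {
    // Splits a collection of IDs into sublists of at most chunkSize elements,
    // so each generated "DELETE ... WHERE id IN (:ids)" stays small.
    public static List<List<Long>> chunk(Collection<Long> ids, int chunkSize) {
        List<List<Long>> chunks = new ArrayList<>();
        List<Long> current = new ArrayList<>(chunkSize);
        for (Long id : ids) {
            current.add(id);
            if (current.size() == chunkSize) {
                chunks.add(current);
                current = new ArrayList<>(chunkSize);
            }
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long i = 1; i <= 10; i++) ids.add(i);
        List<List<Long>> chunks = chunk(ids, 4);
        System.out.println(chunks.size());          // 3 chunks: 4 + 4 + 2
        System.out.println(chunks.get(2).size());   // last chunk has 2 IDs
    }
}
```

Each sublist would then be passed to the batchDeletion method from the question in its own executeUpdate call.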
Solution
Well, the first thing to note is that the SQL must be converted into a plan somehow, and your explain results show that the planning logic is fundamentally different for the IN (vals) construct.
WHERE id = 1;
is converted into a simple equality filter, while
WHERE id IN (1);
is converted into an array match:
WHERE id = ANY(ARRAY[1]);
Obviously the planner is not smart enough to notice that these are mathematically identical when the array has only one member. So what it is doing is planning for an array of any size, which is why you get the bitmap index scan nested under a bitmap heap scan instead of a plain index scan.
What's interesting here is not just that it is slower, but that performance holds up much better as the list grows. With one member in the IN () clause it is about 40 times slower, while with 10,000 members it is only about 170 times slower in total, which also means the 10,000-member version is far faster per ID than 10,000 separate single-ID deletes.
So what is happening here is that the planner is choosing a plan that performs better when a large number of IDs are checked, but worse when only a few are.
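Plugging the EXPLAIN ANALYZE runtimes quoted in the question into some plain arithmetic illustrates that trade-off (the exact multipliers depend on the run; these are just the reported figures):

```java
public class PlanCostComparison {
    public static void main(String[] args) {
        // Total runtimes reported by EXPLAIN ANALYZE in the question, in milliseconds.
        double eqSingle = 0.162;    // WHERE id = 3774887
        double inSingle = 0.452;    // WHERE id IN (3774887)
        double inTenK   = 17.391;   // WHERE id IN (1 .. 10000)

        // For a single ID, the array-matching plan carries a fixed overhead ...
        System.out.printf("single-ID penalty: %.1fx%n", inSingle / eqSingle);

        // ... but amortized over 10,000 IDs it costs almost nothing per row,
        // far less than 10,000 separate equality deletes would.
        System.out.printf("per-ID cost in 10k batch: %.4f ms%n", inTenK / 10000);
        System.out.printf("10,000 equality deletes:  %.0f ms%n", eqSingle * 10000);
    }
}
```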