Java – query two related tables (joins)
This is the first table in hive – it contains information about the items we are buying
CREATE EXTERNAL TABLE IF NOT EXISTS Table1 (This is the MAIN table through which comparisons need to be made) ( ITEM_ID BIGINT,CREATED_TIME STRING,BUYER_ID BIGINT )
This is the data in the first table above
**ITEM_ID** **CREATED_TIME** **BUYER_ID** 220003038067 2012-06-21 1015826235 300003861266 2012-06-21 1015826235 140002997245 2012-06-14 1015826235 200002448035 2012-06-08 1015826235 260003553381 2012-06-07 1015826235
This is the second table in hive – it also contains information about the items we are buying
CREATE EXTERNAL TABLE IF NOT EXISTS Table2 ( USER_ID BIGINT,PURCHASED_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>> )
This is the data in the above table –
**USER_ID** **PURCHASED_ITEM** 1015826235 [{"product_id":220003038067,"timestamps":"1340321132000"},{"product_id":300003861266,"timestamps":"1340271857000"},{"product_id":140002997245,"timestamps":"1339694926000"},{"product_id":200002448035,"timestamps":"1339172659000"},{"product_id":260003553381,"timestamps":"1339072514000"}]
I reduced the data to only one buyer_ ID (user_id) to make the problem easy to understand
Problem statement-
I need to compare table2 with table1, which means I need to look at the user of table2_ Buyer ID and table1_ The ID (because they are all the same) is matched, and then it is purchased in Table 2_ Item, it is product_ The array of ID (the same as item_id) and timesamps (the same as created_time) should be the same as that in Table 1 for this specific user_ Item ID (buyer_id)_ ID and created_ The time is the same, and sometimes they (representing purchased_item and item_id, created_time) may be different, or some products are missing in Table 2_ ID and timestamp are compared in Table 1
This means that product in Table 2_ The counts of ID and timestamps should be the same as those in Table 1 for a specific buyer_ Item of ID (user_id)_ ID and created_ The time count should be the same and the content should be the same If they are not the same or are missing entries from table 2, the results need to be printed, and this specific item is missing from table 2_ ID and created_ Time, or after comparison with table 1, product_ ID and timestamp are different
So, for example, table 1 is currently the buyer_ ID 1015826235 I have 5 items_ ID and 5 created_ Time, so in Table 2, I should have the same user of a row_ 5 products with ID (buyer_id)_ The ID and five timestamps are exactly the same as table 1 If it is different or the input is missing, I need to print to show that this is the missing result, or the data is wrong
So just to make it clearer –
PURCHASED_ Item is a struct array in Table 2, which contains two things product_ ID and timestamps
If user_ ID and buyer_ If the ID matches, the product in Table 2_ The ID should be the same as the item in Table 1_ ID matches, the timestamps in Table 2 should match the created in Table 1_ Time match
to update
Hiveql SQL query problem: –
Q 1) Find all USER_ID from Table2 whose PRODUCT_ID or TIMESTAMP are not same with ITEM_ID or CREATED_TIME after comparing with Table1 on BUYER_ID.
Query the first question I wrote Is the query correct?
A 1) select Table2.user_id from Table2 where Table1.user_id = Table2.buyer_id and (Table1.item_id <> Table2.product_id or UNIX_TIMESTAMP(Table1.created_time) <> Table2.timestamps) Q 2) Find the `BUYER_ID(USER_ID)` and as well as those `ITEM_ID` and `CREATED_TIME` which are missing from `Table2` after comparing from `Table1` on `BUYER_ID`. A 2) Not sure.
Solution
I suggest you not for your created_ Time and timestamp use the "string" data type because it is difficult Instead of using date or timestamp
For your question: I think the big problem here is to use strings alone!
I am an Oracle user, but there should be something like this in hive:
To_date({string},{Format})
As you used
UNIX_TIMESTAMP({string})
Another thing: when you have a structure, you will have an address field like this: table2 PURCHASED_ ITEM [{address}]. product_ ID instead of table2 product_ ID, this is unknown
Another suggestion:
Trunc({Date},{Format ex: 'SS' for sseconds})
When your created_ Time and your time_ Stamp is not exactly the same time tick (it may be the difference of 0.001 seconds because of the insertion time difference. If you insert now or sysdate for each), you'd better truncate the date to seconds or milliseconds or whatever you think is better
One more thing: use NVL () or convert the null value here, because if you have such a problem, you can also have null values in your table, which will cause your query problems, and the NVL () function will be converted to null
I hope this will help