Java – add an index column to an existing Spark DataFrame

I use Java to run Spark 1.5. I need to attach an ID / index column to an existing DataFrame, for example:

+---------+--------+
|  surname|    name|
+---------+--------+
|    Green|    Jake|
| Anderson|  Thomas|
| Corleone| Michael|
|    Marsh|   Randy|
|  Montana|    Tony|
|    Green|   Julia|
|Brenneman|    Eady|
|   Durden|   Tyler|
| Corleone|    Vito|
|   Madiro|     Mat|
+---------+--------+

I want an index attached to each row, ranging from 1 to the number of records in the table. The index order does not matter, but every row must get a unique ID / index. It can be done by converting to an RDD, attaching the index to each row, and converting back to a DataFrame with a modified StructType (a rough Java sketch of that route is shown after the expected result below). However, if I understand correctly, this operation consumes a lot of resources on the conversions, and there must be another way. The result must be as follows:

+---------+--------+---+
|  surname|    name| id|
+---------+--------+---+
|    Green|    Jake|  3|
| Anderson|  Thomas|  5|
| Corleone| Michael|  2|
|    Marsh|   Randy| 10|
|  Montana|    Tony|  7|
|    Green|   Julia|  1|
|Brenneman|    Eady|  8|
|   Durden|   Tyler|  9|
| Corleone|    Vito|  4|
|   Madiro|     Mat|  6|
+---------+--------+---+
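
For reference, the RDD route described above might look roughly like this in Java. This is only a sketch: it assumes Java 8 lambdas, the Spark 1.x DataFrame / SQLContext API, and variables named sqlContext and originalDF.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// zipWithIndex pairs every Row with a 0-based long index
JavaRDD<Row> indexedRows = originalDF.javaRDD()
        .zipWithIndex()
        .map(tuple -> {
            Row row = tuple._1();
            Object[] values = new Object[row.length() + 1];
            for (int i = 0; i < row.length(); i++) {
                values[i] = row.get(i);
            }
            values[row.length()] = tuple._2() + 1; // shift the index to 1..N
            return RowFactory.create(values);
        });

// extend the original schema with the new "id" column
List<StructField> fields = new ArrayList<>(Arrays.asList(originalDF.schema().fields()));
fields.add(DataTypes.createStructField("id", DataTypes.LongType, false));
StructType schemaWithId = DataTypes.createStructType(fields);

DataFrame withIndexDF = sqlContext.createDataFrame(indexedRows, schemaWithId);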

Thank you.

Solution

I know this question is old by now, but you can do this:

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

w = Window.orderBy("myColumn")
withIndexDF = originalDF.withColumn("index", row_number().over(w))

> myColumn: any existing column of the DataFrame.
> originalDF: the original DataFrame, without the index column.
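
Since the question asks for Java, the same idea translates roughly to the Java API as follows. This is a sketch: window functions on Spark 1.5 require a HiveContext-backed DataFrame, and in that version the function may be exposed as functions.rowNumber() rather than functions.row_number().

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.functions;

// any existing column works for ordering, since the index order does not matter
WindowSpec w = Window.orderBy("surname");

// row_number() produces a consecutive 1..N index over the window
DataFrame withIndexDF = originalDF.withColumn("id", functions.row_number().over(w));

Note that a window with no partitionBy moves all rows into a single partition, so this is not free on very large tables either.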
