Java – add an index column to an existing Spark DataFrame
I use Java to run Spark 1.5. I need to attach an ID / index column to an existing DataFrame, for example:
+---------+--------+
|  surname|    name|
+---------+--------+
|    Green|    Jake|
| Anderson|  Thomas|
| Corleone| Michael|
|    Marsh|   Randy|
|  Montana|    Tony|
|    Green|   Julia|
|Brenneman|    Eady|
|   Durden|   Tyler|
| Corleone|    Vito|
|   Madiro|     Mat|
+---------+--------+
I want an index attached to each row, in the range from 1 to the number of records in the table. The order of the indexes does not matter, but every row must have a unique ID / index. It can be done by converting to an RDD, attaching the index to each row, and converting back to a DataFrame with a modified StructType (a rough Java sketch of this is included below). However, if I understand correctly, this round trip will consume a lot of resources for the conversion, and there must be another way. The result must be as follows:
+---------+--------+---+
|  surname|    name| id|
+---------+--------+---+
|    Green|    Jake|  3|
| Anderson|  Thomas|  5|
| Corleone| Michael|  2|
|    Marsh|   Randy| 10|
|  Montana|    Tony|  7|
|    Green|   Julia|  1|
|Brenneman|    Eady|  8|
|   Durden|   Tyler|  9|
| Corleone|    Vito|  4|
|   Madiro|     Mat|  6|
+---------+--------+---+
Thank you.
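For reference, the RDD-based approach I mean looks roughly like this in Java. It is only a sketch: df and sqlContext stand for the existing DataFrame and SQLContext.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Zip every row with a unique index (zipWithIndex starts at 0, so add 1)
// and append that index as an extra value at the end of each row.
JavaRDD<Row> indexedRows = df.javaRDD()
        .zipWithIndex()
        .map(tuple -> {
            Row row = tuple._1();
            Object[] values = new Object[row.length() + 1];
            for (int i = 0; i < row.length(); i++) {
                values[i] = row.get(i);
            }
            values[row.length()] = tuple._2() + 1L;
            return RowFactory.create(values);
        });

// Extend the original schema with a LongType "id" field.
List<StructField> fields = new ArrayList<>(Arrays.asList(df.schema().fields()));
fields.add(DataTypes.createStructField("id", DataTypes.LongType, false));
StructType schemaWithId = DataTypes.createStructType(fields);

// Rebuild the DataFrame from the indexed rows and the extended schema.
DataFrame withIdDF = sqlContext.createDataFrame(indexedRows, schemaWithId);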
Solution
I know this question has been around for a while, but you can do this (the snippet below is PySpark):
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

w = Window.orderBy("myColumn")
withIndexDF = originalDF.withColumn("index", row_number().over(w))
myColumn: any specific column in the DataFrame.
originalDF: the original DataFrame without the index column.
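Since the question is about Java, a roughly equivalent Java sketch could look like the following. I am assuming the DataFrame is called df; note that in Spark 1.5 the function is named rowNumber() (later releases renamed it to row_number()), and, if I remember correctly, window functions in 1.5 also require the DataFrame to come from a HiveContext.

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.rowNumber;

// Order the window by any existing column; row numbers are assigned 1..N.
// Without a partitionBy this pulls all rows into a single partition,
// which is fine for small tables but expensive for large ones.
WindowSpec w = Window.orderBy("surname");

// "df" is the original DataFrame (placeholder name); "id" is the new column.
DataFrame withIndexDF = df.withColumn("id", rowNumber().over(w));
withIndexDF.show();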