Apache Spark: PySpark, apply a function to a column value if a condition is met

Given a PySpark DataFrame such as:

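import pandas as pd
from pyspark.sql import SparkSession

# Reuse the active session, or start one if none exists
spark = SparkSession.builder.getOrCreate()

ls = [
    ['1', 2],
    ['2', 7],
    ['1', 3],
    ['2', -6],
    ['1', 3],
    ['1', 5],
    ['1', 4],
    ['2', 7]
]
df = spark.createDataFrame(pd.DataFrame(ls, columns=['col1', 'col2']))
df.show()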

+----+-----+
|col1| col2|
+----+-----+
|   1|    2|
|   2|    7|
|   1|    3|
|   2|   -6|
|   1|    3|
|   1|    5|
|   1|    4|
|   2|    7|
+----+-----+

How can I apply a function to the col2 values where col1 == '1' and store the result in a new column?
For example, the function is:

def f(x):
    return x**2

The result should look like this:

+----+-----+-----+
|col1| col2|    y|
+----+-----+-----+
|   1|    2|    4|
|   2|    7| null|
|   1|    3|    9|
|   2|   -6| null|
|   1|    3|    9|
|   1|    5|   25|
|   1|    4|   16|
|   2|    7| null|
+----+-----+-----+

I tried defining a separate function and using df.withColumn(y).when(condition, function), but it does not work.

So how can this be done?

I hope this helps:

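from pyspark.sql.functions import when
from pyspark.sql.types import IntegerType

def myFun(x):
    # x is a Column; ** yields a double, so cast back to integer
    return (x**2).cast(IntegerType())

# col1 holds strings, so compare against "1"; rows that do not match get null
df2 = df.withColumn("y", when(df.col1 == "1", myFun(df.col2)).otherwise(None))

df2.show()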

+----+----+----+
|col1|col2|   y|
+----+----+----+
|   1|   2|   4|
|   2|   7|null|
|   1|   3|   9|
|   2|  -6|null|
|   1|   3|   9|
|   1|   5|  25|
|   1|   4|  16|
|   2|   7|null|
+----+----+----+
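
If you want to double-check the expected values without a Spark session, the conditional logic can be sketched in plain Python. This is only an illustration under my own assumptions: `apply_if` is a made-up helper, not part of PySpark, and Python's None stands in for Spark's null.

```python
# Plain-Python sketch of the row-wise logic; no Spark required.
# `apply_if` is a hypothetical helper, not a PySpark function.

def apply_if(rows, predicate, func):
    """Append func(col2) to each row where predicate(col1) holds, else None."""
    return [
        (col1, col2, func(col2) if predicate(col1) else None)
        for col1, col2 in rows
    ]

# Same data as the question's DataFrame
ls = [
    ('1', 2), ('2', 7), ('1', 3), ('2', -6),
    ('1', 3), ('1', 5), ('1', 4), ('2', 7),
]

# Square col2 only where col1 == '1'
result = apply_if(ls, lambda c: c == '1', lambda x: x**2)
for row in result:
    print(row)
```

The third element of each tuple should line up with the y column in the table printed by df2.show().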