Get the distinct elements of each group by another field on a Spark 1.6 DataFrame
I am trying to group by date in a Spark DataFrame and, for each group, count the unique values of one column:
```
test.json
{"name":"Yin","address":1111111,"date":20151122045510}
{"name":"Yin","address":1111111,"date":20151122045501}
{"name":"Yln","address":1111111,"date":20151122045500}
{"name":"Yun","address":1111112,"date":20151122065832}
{"name":"Yan","address":1111113,"date":20160101003221}
{"name":"Yin","address":1111111,"date":20160703045231}
{"name":"Yin","address":1111114,"date":20150419134543}
{"name":"Yen","address":1111115,"date":20151123174302}
```
And the code:
```python
import pyspark.sql.functions as func
from pyspark.sql.types import TimestampType
from datetime import datetime

df_y = sqlContext.read.json("/user/test.json")
udf_dt = func.udf(lambda x: datetime.strptime(x, '%Y%m%d%H%M%S'), TimestampType())
df = df_y.withColumn('datetime', udf_dt(df_y.date))
df_g = df_y.groupby(func.hour(df_y.date))
df_g.count().distinct().show()
```
The result with pyspark is:
```
df_y.groupby(df_y.name).count().distinct().show()
+----+-----+
|name|count|
+----+-----+
| Yan|    1|
| Yun|    1|
| Yin|    4|
| Yen|    1|
| Yln|    1|
+----+-----+
```
What I expect is something like this with pandas:
```
df = df_y.toPandas()
df.groupby('name').address.nunique()

Out[51]:
name
Yan    1
Yen    1
Yin    2
Yln    1
Yun    1
```
How can I get the unique elements of each group by another field, like address?
There is a way to do this count of the distinct elements of each group using the `countDistinct` function:
```python
import pyspark.sql.functions as func
from pyspark.sql.types import TimestampType
from datetime import datetime

df_y = sqlContext.read.json("/user/test.json")
udf_dt = func.udf(lambda x: datetime.strptime(x, '%Y%m%d%H%M%S'), TimestampType())
df = df_y.withColumn('datetime', udf_dt(df_y.date))
df_g = df_y.groupby(func.hour(df_y.date))
df_y.groupby(df_y.name).agg(func.countDistinct('address')).show()

+----+--------------+
|name|count(address)|
+----+--------------+
| Yan|             1|
| Yun|             1|
| Yin|             2|
| Yen|             1|
| Yln|             1|
+----+--------------+
```
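If you need the distinct elements themselves rather than just their count (which is what the question title asks for), a minimal sketch using `collect_set`, available in `pyspark.sql.functions` as of Spark 1.6, applied to the same `df_y`:

```python
import pyspark.sql.functions as func

# collect_set aggregates the distinct values of a column into an array per
# group, so each name ends up paired with its set of unique addresses.
df_y.groupby(df_y.name) \
    .agg(func.collect_set('address').alias('addresses')) \
    .show()
```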
The documentation is available [here](https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html#countDistinct(org.apache.spark.sql.Column,org.apache.spark.sql.Column...)).
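As the linked signature `countDistinct(Column, Column...)` suggests, the function also accepts several columns at once. A hedged sketch, assuming the same `df_y` as above, that counts distinct `(address, date)` pairs per name:

```python
import pyspark.sql.functions as func

# With multiple columns, each unique combination of values counts once,
# so this counts distinct (address, date) pairs within every name group.
df_y.groupby(df_y.name) \
    .agg(func.countDistinct('address', 'date').alias('distinct_pairs')) \
    .show()
```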
A concise and direct answer for grouping by the field "_c1" and counting the number of distinct values in the field "_c2":
```python
import pyspark.sql.functions as F

dg = df.groupBy("_c1").agg(F.countDistinct("_c2"))
```
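Note that the aggregate column gets an auto-generated name (it shows up as `count(address)` in the output above); a hypothetical variant using `alias` to give it a readable one:

```python
import pyspark.sql.functions as F

# alias renames the auto-generated aggregate column to something readable.
dg = df.groupBy("_c1").agg(F.countDistinct("_c2").alias("distinct_c2"))
dg.show()
```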