Adding a new column to a data frame using SparkR

Continuing to explore the Land Registry's open data set using SparkR, I wanted to see which road in the UK has had the most property sales over the last 20 years.

To recap, this is what the data frame looks like:


 ./spark-1.5.0-bin-hadoop2.6/bin/sparkR --packages com.databricks:spark-csv_2.11:1.2.0

> sales <- read.df(sqlContext, "pp-complete.csv", "com.databricks.spark.csv", header="false")

> head(sales)
                                       C0     C1               C2       C3 C4 C5
1 {0C7ADEF5-878D-4066-B785-0000003ED74A} 163000 2003-02-21 00:00  UB5 4PJ  T  N
2 {35F67271-ABD4-40DA-AB09-00000085B9D3} 247500 2005-07-15 00:00 TA19 9DD  D  N
3 {B20B1C74-E8E1-4137-AB3E-0000011DF342} 320000 2010-09-10 00:00   W4 1DZ  F  N
4 {7D6B0915-C56B-4275-AF9B-00000156BCE7} 104000 1997-08-27 00:00 NE61 2BH  D  N
5 {47B60101-B64C-413D-8F60-000002F1692D} 147995 2003-05-02 00:00 PE33 0RU  D  N
6 {51F797CA-7BEB-4958-821F-000003E464AE} 110000 2013-03-22 00:00 NR35 2SF  T  N
  C6  C7 C8              C9         C10         C11
1  F 106     READING ROAD    NORTHOLT    NORTHOLT
2  F  58     ADAMS MEADOW   ILMINSTER   ILMINSTER
3  L  58    WHELLOCK ROAD                  LONDON
4  F  17         WESTGATE     MORPETH     MORPETH
5  F   4    MASON GARDENS  WEST WINCH KING'S LYNN
6  F   5  WILD FLOWER WAY DITCHINGHAM      BUNGAY
                           C12            C13 C14
1                       EALING GREATER LONDON   A
2               SOUTH SOMERSET       SOMERSET   A
3                       EALING GREATER LONDON   A
4               CASTLE MORPETH NORTHUMBERLAND   A
5 KING'S LYNN AND WEST NORFOLK        NORFOLK   A
6                 SOUTH NORFOLK        NORFOLK   A

This document explains what's stored in each field, and for this particular query we're interested in fields C9-C12. The plan is to group the data frame by those fields and then sort by frequency in descending order.

When grouping by multiple fields, the easiest approach tends to be to create a new field that concatenates them all and then group by that.

I started with the following:


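A minimal sketch of what that kind of first attempt looks like, assuming base R's paste is pointed straight at the SparkR columns:

> # paste() wants local character vectors, so handing it SparkR Column objects doesn't build a new column
> sales$address <- paste(sales$C9, sales$C10, sales$C11, sales$C12, sep = ", ")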

Not so successful! Next I tried something more primitive:


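A sketch of the more primitive route, assuming it leans on the + operator that SparkR defines on columns:

> # Spark treats + as numeric addition, so the string columns get cast and come back null
> sales$address <- sales$C9 + sales$C10 + sales$C11 + sales$C12
> head(select(sales, sales$address))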

At least that compiled, but every address came out as "NA", which isn't what we want. After a bit of searching I realised there's a concat function that can be used for this task:


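A sketch of the concat version (concat_ws is the variant to reach for if you want a separator such as ", " between the parts):

> # concat builds a proper string Column out of the address fields
> sales$address <- concat(sales$C9, sales$C10, sales$C11, sales$C12)
> head(select(sales, sales$address))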

That's more like it! Now let's see which streets have sold the most properties:


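A sketch of that query using SparkR's groupBy, summarize and arrange; the count column name is just a label chosen here:

> # count the sales per concatenated address and show the ten busiest
> byAddress <- summarize(groupBy(sales, sales$address), count = n(sales$address))
> head(arrange(byAddress, desc(byAddress$count)), 10)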

Next we'll dig into the data a bit further, but that's for another post.