Spark's cogroup and join Operators

The cogroup operator is used fairly rarely, while join is used much more often; both relate two pair RDDs by key. The code below walks through the difference, starting with these two RDDs:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

SparkConf conf = new SparkConf()
        .setAppName("co")
        .setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);

// Both lists contain duplicate keys ("hello" appears several times).
List<Tuple2<String, Integer>> words1 = Arrays.asList(
        new Tuple2<>("hello", 3),
        new Tuple2<>("hello", 2),
        new Tuple2<>("world", 7),
        new Tuple2<>("hello", 12),
        new Tuple2<>("you", 9)
);

List<Tuple2<String, Integer>> words2 = Arrays.asList(
        new Tuple2<>("hello", 21),
        new Tuple2<>("world", 24),
        new Tuple2<>("hello", 25),
        new Tuple2<>("you", 28)
);

JavaPairRDD<String, Integer> words1RDD = sc.parallelizePairs(words1);
JavaPairRDD<String, Integer> words2RDD = sc.parallelizePairs(words2);

Note that both words1RDD and words2RDD contain duplicate keys. Now let's apply cogroup and join to the same pair of RDDs and compare the results, starting with cogroup:

int count = 1;

// cogroup: for each key, collect ALL of its values from words1RDD and
// words2RDD into two separate Iterables.
JavaPairRDD<String, Tuple2<Iterable<Integer>, Iterable<Integer>>> cogroupRDD = words1RDD.cogroup(words2RDD);
List<Tuple2<String, Tuple2<Iterable<Integer>, Iterable<Integer>>>> cogroupResult = cogroupRDD.collect();
for (Tuple2<String, Tuple2<Iterable<Integer>, Iterable<Integer>>> t : cogroupResult) {
    String word = t._1;
    Iterable<Integer> word1Counts = t._2._1;
    Iterable<Integer> word2Counts = t._2._2;

    String countInfo = "";
    for (Integer c1 : word1Counts) {
        countInfo += c1 + "(words1RDD),";
    }

    for (Integer c2 : word2Counts) {
        countInfo += c2 + "(words2RDD),";
    }

    System.out.println(String.format("Element %s: %s -> %s", count, word, countInfo));

    count++;
}

The output is:

Element 1: you -> 9(words1RDD),28(words2RDD),
Element 2: hello -> 3(words1RDD),2(words1RDD),12(words1RDD),21(words2RDD),25(words2RDD),
Element 3: world -> 7(words1RDD),24(words2RDD),
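As a side note (a sketch of my own, not part of the original post): because cogroup hands you all of a key's values at once, per-key aggregations across both RDDs fall out naturally. For example, summing every count for each key could look like this; it reuses the cogroupRDD built above, and the name totals is introduced here purely for illustration.

// A minimal sketch: sum all counts per key across both source RDDs.
JavaPairRDD<String, Integer> totals = cogroupRDD.mapValues(pair -> {
    int sum = 0;
    for (Integer c : pair._1) sum += c;   // counts from words1RDD
    for (Integer c : pair._2) sum += c;   // counts from words2RDD
    return sum;
});
// totals.collect() should give (you,37), (hello,63), (world,31) -- order may vary.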

Now the same two RDDs with join:

// Reset the counter so the join output is also numbered from 1.
count = 1;

// join: for each matching key, emit one record per combination of a value
// from words1RDD and a value from words2RDD (an inner join).
JavaPairRDD<String, Tuple2<Integer, Integer>> joinedRDD = words1RDD.join(words2RDD);
List<Tuple2<String, Tuple2<Integer, Integer>>> joinedResult = joinedRDD.collect();
for (Tuple2<String, Tuple2<Integer, Integer>> t : joinedResult) {
    System.out.println(String.format("Element %s: %s -> %s(words1RDD),%s(words2RDD)", count, t._1, t._2._1, t._2._2));
    count++;
}

The output is:

Element 1: you -> 9(words1RDD),28(words2RDD)
Element 2: hello -> 3(words1RDD),21(words2RDD)
Element 3: hello -> 3(words1RDD),25(words2RDD)
Element 4: hello -> 2(words1RDD),21(words2RDD)
Element 5: hello -> 2(words1RDD),25(words2RDD)
Element 6: hello -> 12(words1RDD),21(words2RDD)
Element 7: hello -> 12(words1RDD),25(words2RDD)
Element 8: world -> 7(words1RDD),24(words2RDD)

In other words, cogroup aggregates by key: each key appears exactly once in the result, paired with the full set of values from each source RDD. join performs no such aggregation: it emits one record for every combination of matching values, which is why "hello" (3 values in words1RDD x 2 values in words2RDD) produces 6 join records but only a single cogroup record.
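To make that relationship concrete, here is a small driver-side sketch (my own addition, not from the original post) that reuses the cogroupResult list collected above: expanding each key's two Iterables into every value combination reproduces exactly the records that join emits.

// Expanding the cogroup result by hand: the per-key cross product of the two
// Iterables is precisely what join returns. For "hello" (3 x 2 values) this
// prints 6 lines, matching the 6 "hello" records in the join output above.
for (Tuple2<String, Tuple2<Iterable<Integer>, Iterable<Integer>>> t : cogroupResult) {
    for (Integer v1 : t._2._1) {
        for (Integer v2 : t._2._2) {
            System.out.println(t._1 + " -> " + v1 + "(words1RDD)," + v2 + "(words2RDD)");
        }
    }
}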
