I was also looking for information on the pros and cons of using Python or Scala for programming against Spark. I needed this information before deciding whether to use Python or Scala as the programming language for a larger component which would ultimately rely on Spark for its computations. I, too, could not really find much to go by, so eventually I decided to do some experimentation myself, and I am sharing my findings here so that others might use them and save themselves the trouble of repeating the effort.

I have a rather large log of website visitors in compressed tab-separated files, with each line consisting of five fields (_, timestamp, url, _, visitor). For each visitor I wish to extract the visit that is (so far) that visitor's latest, and then take the temporally first among those. So a simple sequential algorithm would be to traverse the logs, keep a latest-accessed time for each visitor, and then return the temporally first of those timestamps.

I wrote both a Scala and a Python version of an algorithm that should be able to do this in parallel, and then timed their performance on the same Spark cluster. Here is the Python version:

data = sc.textFile("s3://idtargeting-logs/hive-sh/my-track/")
cleaned = data.map(lambda l: l.split("\t")).filter(lambda l: len(l) == 5 and all(l))
transformed = cleaned.map(lambda l: (l[4], (l[1], l[2])))  # key by visitor, value (timestamp, url); indices follow the field layout above
filtered = transformed.filter(lambda l: "45" in l[0])  # restrict to visitor ids containing "45"; which field the test applies to is an assumption
latest = filtered.reduceByKey(lambda v1, v2: v2 if v1 < v2 else v1)  # keep the later (timestamp, url) pair per visitor
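Spark evaluates these transformations lazily, so nothing is actually computed until an action runs, and any timing has to wrap an action. The measurement harness is not shown above; a minimal sketch of how the Python version could be timed might look like this (the collect-and-min step is my assumption about how the final answer is extracted):

import time

start = time.time()
result = latest.collect()  # action: forces evaluation of the whole pipeline
elapsed = time.time() - start
print("collected %d visitors in %.1f s" % (len(result), elapsed))

# the temporally first of the per-visitor latest timestamps
first_of_latest = min(result, key=lambda kv: kv[1])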
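For comparison, the simple sequential algorithm described earlier (one pass over the lines, a per-visitor latest-timestamp table, then the minimum) is only a few lines of plain Python. This is a sketch under the same five-field layout; the input file name is hypothetical:

import gzip

latest_by_visitor = {}  # visitor -> latest (timestamp, url) seen so far
with gzip.open("visits.tsv.gz", "rt") as f:  # hypothetical input file
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 5 and all(fields):
            _, timestamp, url, _, visitor = fields
            value = (timestamp, url)
            if visitor not in latest_by_visitor or latest_by_visitor[visitor] < value:
                latest_by_visitor[visitor] = value

# the temporally first of the per-visitor latest timestamps
print(min(latest_by_visitor.values()))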