Garren's [Big] Data Blog

Month: April 2015

Split file by keys

Posted by Garren on 2015/04/02

Files sometimes arrive (whether via Hadoop or other processes) as big globs of data with inter-related parts. I often want to process these globs concurrently, but a dilemma unfolds quickly. I could a) write the code to process it serially and be done with it in 1 hour or b) write code…


Categories: Default
