Apache Spark

Software Details:
Version: 1.3.1 (updated)
Upload Date: 12 May 15
Distribution Type: Freeware
Downloads: 45

Rating: 5.0/5 (Total Votes: 1)

Spark was designed to improve processing speeds for data analysis and manipulation programs.

It is written in Java and Scala and provides features not found in other systems, largely because those features are specialized and of little use outside data-processing applications.

What is new in this release:

  • The core API now supports multi-level aggregation trees to help speed up expensive reduce operations.
  • Improved error reporting has been added for certain gotcha operations.
  • Spark's Jetty dependency is now shaded to help avoid conflicts with user programs.
  • Spark now supports SSL encryption for some communication endpoints.
  • Realtime GC metrics and record counts have been added to the UI.
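
A multi-level aggregation tree combines partial results in log-depth rounds instead of pulling every partition's result into a single final reduce. This is a minimal plain-Python sketch of the idea (an illustration, not Spark's actual implementation):

```python
from functools import reduce

def tree_reduce(values, op, fanout=2):
    """Reduce a sequence in rounds, combining `fanout` partial results
    at a time, mimicking a multi-level aggregation tree."""
    level = list(values)
    while len(level) > 1:
        # Each round merges adjacent groups of `fanout` partial results.
        level = [reduce(op, level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]

# Same result as a flat reduce, but built in ~log_fanout(n) rounds,
# so no single step has to combine all n partial results.
total = tree_reduce(range(16), lambda a, b: a + b)
```

With a fanout of 2 and 16 partitions, the final combine sees only 2 inputs instead of 16, which is the win for expensive reduce operations.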

What is new in version 1.2.1:

  • PySpark's sort operator now supports external spilling for large datasets.
  • PySpark now supports broadcast variables larger than 2GB.
  • Spark adds a job-level progress page in the Spark UI, a stable API for progress reporting, and dynamic updating of output metrics as jobs complete.
  • Spark now has support for reading binary files for images and other binary formats.
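
External spilling means sorting data in memory-sized chunks, writing each sorted run to disk, and merge-reading the runs back. A toy plain-Python sketch of the pattern (illustrative only, not PySpark's code):

```python
import heapq
import os
import pickle
import tempfile

def _spill(sorted_chunk):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        pickle.dump(sorted_chunk, f)
    return path

def _read_run(path):
    """Stream one sorted run back from disk."""
    with open(path, "rb") as f:
        yield from pickle.load(f)

def external_sort(items, chunk_size=1000):
    """Sort data larger than memory: sort chunk_size pieces in memory,
    spill each sorted run to disk, then k-way merge the runs."""
    runs, chunk = [], []
    for x in items:
        chunk.append(x)
        if len(chunk) >= chunk_size:
            runs.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_spill(sorted(chunk)))
    merged = list(heapq.merge(*[_read_run(p) for p in runs]))
    for path in runs:
        os.remove(path)
    return merged
```

Only `chunk_size` items are ever sorted in memory at once; `heapq.merge` streams the on-disk runs back in order.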

What is new in version 1.0.0:

  • This release expands Spark's standard libraries, introducing a new SQL package (Spark SQL) that lets users integrate SQL queries into existing Spark workflows.
  • MLlib, Spark's machine learning library, is expanded with sparse vector support and several new algorithms.
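
Sparse vector support stores only the nonzero entries as parallel index and value arrays. The class below is a hedged plain-Python analogue of that layout (not MLlib's actual API):

```python
class SparseVector:
    """Sparse vector stored as parallel (index, value) arrays.
    Illustrative analogue of the sparse-vector layout; not MLlib code."""

    def __init__(self, size, indices, values):
        self.size = size              # logical dimensionality
        self.indices = list(indices)  # positions of nonzero entries
        self.values = list(values)    # the nonzero entries themselves

    def dot(self, other):
        # Only overlapping nonzero indices contribute to the product.
        lookup = dict(zip(other.indices, other.values))
        return sum(v * lookup.get(i, 0.0)
                   for i, v in zip(self.indices, self.values))

a = SparseVector(1000, [0, 42, 999], [1.0, 2.0, 3.0])
b = SparseVector(1000, [42, 500], [4.0, 5.0])
a.dot(b)  # only index 42 overlaps: 2.0 * 4.0 = 8.0
```

For a 1000-dimensional vector with three nonzeros, this stores six numbers instead of a thousand, which is why sparse support matters for feature-heavy machine learning workloads.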

What is new in version 0.9.1:

  • Fixed hash collision bug in external spilling
  • Fixed conflict with Spark's log4j for users relying on other logging backends
  • Fixed GraphX missing from Spark assembly JAR in Maven builds
  • Fixed silent failures due to map output status exceeding Akka frame size
  • Removed Spark's unnecessary direct dependency on ASM
  • Removed metrics-ganglia from default build due to LGPL license conflict
  • Fixed bug in distribution tarball not containing spark assembly jar

What is new in version 0.8.0:

  • Development has moved to the Apache Software Foundation as an incubator project.

What is new in version 0.7.3:

  • Python performance: Spark now spawns Python worker VMs faster when the JVM has a large heap size, speeding up the Python API.
  • Mesos fixes: JARs added to your job will now be on the classpath when deserializing task results in Mesos.
  • Error reporting: Better error reporting for non-serializable exceptions and overly large task results.
  • Examples: Added an example of stateful stream processing with updateStateByKey.
  • Build: Spark Streaming no longer depends on the Twitter4J repo, which should allow it to build in China.
  • Bug fixes in foldByKey, streaming count, statistics methods, documentation, and web UI.
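
Stateful stream processing with updateStateByKey folds each batch's new values for a key into a running per-key state. The per-key update loop can be sketched in plain Python like this (hypothetical helper names, not Spark Streaming's API):

```python
def update_state(batches, update_fn):
    """Apply an updateStateByKey-style function over successive batches
    of (key, value) pairs, carrying per-key state between batches."""
    state = {}
    for batch in batches:
        # Group this batch's values by key, as the streaming engine would.
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        # update_fn sees the new values plus the previous state (or None).
        for key, new_values in grouped.items():
            state[key] = update_fn(new_values, state.get(key))
    return state

def running_count(new_values, prev):
    """The classic example: keep a running count per key."""
    return (prev or 0) + len(new_values)

counts = update_state(
    [[("a", 1), ("b", 1)], [("a", 1), ("a", 1)]],
    running_count,
)
# counts == {"a": 3, "b": 1}
```

The state survives across batches, which is what distinguishes this from a per-batch reduce.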

What is new in version 0.7.2:

  • Scala version updated to 2.9.3.
  • Several improvements to Bagel, including performance fixes and a configurable storage level.
  • New API methods: subtractByKey, foldByKey, mapWith, filterWith, foreachPartition, and others.
  • A new metrics reporting interface, SparkListener, to collect information about each computation stage: task lengths, bytes shuffled, etc.
  • Several new examples using the Java API, including K-means and computing pi.
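
Of the new API methods, foldByKey combines the values for each key with an associative function, starting each key's accumulator at a zero value. A plain-Python sketch of its semantics (illustrative, not Spark's distributed implementation):

```python
def fold_by_key(pairs, zero, op):
    """Fold values per key, starting each key's accumulator at `zero`,
    mirroring the semantics of Spark's foldByKey."""
    acc = {}
    for key, value in pairs:
        acc[key] = op(acc.get(key, zero), value)
    return acc

fold_by_key([("a", 1), ("b", 2), ("a", 3)], 0, lambda x, y: x + y)
# {"a": 4, "b": 2}
```

In Spark the `zero` may be applied once per partition, so it must be an identity for `op` (0 for addition, 1 for multiplication) to give consistent results.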

What is new in version 0.7.0:

  • Spark 0.7 adds a Python API called PySpark.
  • Spark jobs now launch a web dashboard for monitoring the memory usage of each distributed dataset (RDD) in the program.
  • Spark can now be built using Maven in addition to SBT.

What is new in version 0.6.1:

  • Fixed overly aggressive message timeouts that could cause workers to disconnect from the cluster.
  • Fixed a bug in the standalone deploy mode that did not expose hostnames to scheduler, affecting HDFS locality.
  • Improved connection reuse in shuffle, which can greatly speed up small shuffles.
  • Fixed some potential deadlocks in the block manager.
  • Fixed a bug getting IDs of failed hosts from Mesos.
  • Several EC2 script improvements, like better handling of spot instances.
  • Made the local IP address that Spark binds to customizable.
  • Support for Hadoop 2 distributions.
  • Support for locating Scala on Debian distributions.

What is new in version 0.6.0:

  • Simpler deployment.
  • Spark's documentation has been expanded with a new quick start guide, additional deployment instructions, configuration guide, tuning guide, and improved Scaladoc API documentation.
  • A new communication manager using asynchronous Java NIO lets shuffle operations run faster, especially when sending large amounts of data or when jobs have many tasks.
  • A new storage manager supports per-dataset storage level settings (e.g. whether to keep the dataset in memory, deserialized, on disk, etc., or even replicated across nodes).
  • Enhanced debugging.
