SPARK-9999: Dataset API on top of Catalyst/DataFrame


Details

    • Type: Story
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      The RDD API is very flexible, but as a result its execution is harder to optimize in some cases. The DataFrame API, on the other hand, is much easier to optimize, but it lacks some of the nice perks of the RDD API (e.g., UDFs are harder to use, and it lacks strong typing in Scala/Java).

      The goal of Spark Datasets is to provide an API that allows users to easily express transformations on domain objects, while also providing the performance and robustness advantages of the Spark SQL execution engine.
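
      A minimal sketch of what this looks like in practice (not part of the original description; the Person class, column values, and app name are illustrative assumptions): lambdas operate on typed domain objects, while Catalyst still plans and optimizes the query.

      {code:scala}
      import org.apache.spark.sql.SparkSession

      case class Person(name: String, age: Long)

      object DatasetSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("dataset-sketch").master("local[*]").getOrCreate()
          import spark.implicits._ // encoders for case classes, tuples, and primitives

          // A Dataset keeps the static type Person, unlike a DataFrame (a Dataset of Rows).
          val people = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()

          // Typed lambdas on domain objects; the plan is still optimized by the SQL engine.
          val adults = people.filter(_.age >= 18).map(_.name.toUpperCase)

          adults.show()
          spark.stop()
        }
      }
      {code}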

      Requirements

      • Fast - In most cases, the performance of Datasets should be equal to or better than working with RDDs. Encoders should be as fast or faster than Kryo and Java serialization, and unnecessary conversion should be avoided.
      • Typesafe - Similar to RDDs, objects and the functions that operate on those objects should provide compile-time safety where possible. When converting from data whose schema is not known at compile time (for example, data read from an external source such as JSON), the conversion should fail fast if there is a schema mismatch (see the sketch after this list).
      • Support for a variety of object models - Default encoders should be provided for a variety of object models: primitive types, case classes, tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard conventions, such as Avro SpecificRecords, should also work out of the box.
      • Java Compatible - Datasets should provide a single API that works in both Scala and Java. Where possible, shared types like Array will be used in the API. Where that is not possible, overloaded functions should be provided for both languages. Scala concepts, such as ClassTags, should not be required in the user-facing API.
      • Interoperates with DataFrames - Users should be able to seamlessly transition between Datasets and DataFrames without writing conversion boilerplate. When the names in the input schema line up with the fields of the given class, no extra mapping should be necessary. Libraries like MLlib should not need to provide separate interfaces for accepting DataFrames and Datasets as input (see the sketch after this list).
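
      The sketch below is a hedged illustration of the typesafety and interoperability requirements (the Event case class, EventBean class, and column names are assumptions, not from the design doc): as[T] lines the runtime schema up with the target class and fails during analysis on a mismatch, default encoders cover case classes and tuples, explicit encoders cover JavaBeans and opaque types, and a Dataset round-trips to a DataFrame without boilerplate.

      {code:scala}
      import scala.beans.BeanProperty
      import org.apache.spark.sql.{Encoders, SparkSession}

      case class Event(id: Long, kind: String)

      // JavaBean-style class, used to show the explicit bean encoder.
      class EventBean {
        @BeanProperty var id: Long = _
        @BeanProperty var kind: String = _
      }

      object EncoderSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("encoder-sketch").master("local[*]").getOrCreate()
          import spark.implicits._

          // Untyped data (a DataFrame); the schema is only known at runtime.
          val raw = Seq((1L, "click"), (2L, "view")).toDF("id", "kind")

          // as[Event] matches columns against the case class fields and
          // fails fast during analysis if they do not line up.
          val events = raw.as[Event]

          // Default encoders cover primitives, case classes, and tuples...
          val pairs = events.map(e => (e.kind, 1L)) // Dataset[(String, Long)]

          // ...while explicit encoders handle JavaBeans and opaque types.
          val beanEncoder = Encoders.bean(classOf[EventBean])
          val kryoEncoder = Encoders.kryo[EventBean]

          // Seamless interop: back to a DataFrame, and to a Dataset again.
          val df  = events.toDF()
          val ds2 = df.as[Event]

          pairs.show()
          spark.stop()
        }
      }
      {code}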

      For a detailed outline of the complete proposed API: marmbrus/dataset-api
      For an initial discussion of the design considerations in this API: design doc

      The initial version of the Dataset API was merged in Spark 1.6. However, it will take a few more releases to flesh everything out.

      Attachments

        Issue Links

          1. Initial code generated encoder for product types (Sub-task, Resolved, Michael Armbrust)
          2. Initial code generated construction of Product classes from InternalRow (Sub-task, Resolved, Michael Armbrust)
          3. Initial API Draft (Sub-task, Resolved, Michael Armbrust)
          4. add encoder/decoder for external row (Sub-task, Resolved, Wenchen Fan)
          5. Java API support & test cases (Sub-task, Resolved, Wenchen Fan)
          6. Implement cogroup (Sub-task, Resolved, Wenchen Fan)
          7. Support for joining two datasets, returning a tuple of objects (Sub-task, Resolved, Michael Armbrust)
          8. GroupedIterator's hasNext is not idempotent (Sub-task, Resolved, Unassigned)
          9. groupBy on column expressions (Sub-task, Resolved, Michael Armbrust)
          10. Typed-safe aggregations (Sub-task, Resolved, Michael Armbrust)
          11. add map/flatMap to GroupedDataset (Sub-task, Resolved, Wenchen Fan)
          12. User facing api for typed aggregation (Sub-task, Resolved, Michael Armbrust)
          13. Improve toString Function (Sub-task, Resolved, Michael Armbrust)
          14. add java test for typed aggregate (Sub-task, Resolved, Wenchen Fan)
          15. Support as on Classes defined in the REPL (Sub-task, Resolved, Michael Armbrust)
          16. add reduce to GroupedDataset (Sub-task, Resolved, Wenchen Fan)
          17. support typed aggregate in project list (Sub-task, Resolved, Wenchen Fan)
          18. split ExpressionEncoder into FlatEncoder and ProductEncoder (Sub-task, Resolved, Wenchen Fan)
          19. org.apache.spark.sql.AnalysisException: Can't extract value from a#12 (Sub-task, Resolved, Wenchen Fan)
          20. collect, first, and take should use encoders for serialization (Sub-task, Resolved, Reynold Xin)
          21. Kryo-based encoder for opaque types (Sub-task, Resolved, Reynold Xin)
          22. Dataset self join returns incorrect result (Sub-task, Resolved, Wenchen Fan)
          23. Java-based encoder for opaque types (Sub-task, Resolved, Reynold Xin)
          24. nice error message for missing encoder (Sub-task, Resolved, Wenchen Fan)
          25. Add Java tests for Kryo/Java encoders (Sub-task, Resolved, Reynold Xin)
          26. add type cast if the real type is different but compatible with encoder schema (Sub-task, Resolved, Wenchen Fan)
          27. Incorrect results are returned when using null (Sub-task, Resolved, Wenchen Fan)
          28. support typed aggregate for complex buffer schema (Sub-task, Resolved, Wenchen Fan)
          29. fix `nullable` of encoder schema (Sub-task, Resolved, Wenchen Fan)
          30. fix encoder life cycle for CoGroup (Sub-task, Resolved, Wenchen Fan)
          31. Encoder for JavaBeans / POJOs (Sub-task, Resolved, Wenchen Fan)
          32. [SQL] Adding joinType into joinWith (Sub-task, Closed, Unassigned)
          33. Add missing APIs in Dataset (Sub-task, Resolved, Xiao Li)
          34. [SQL] Support Persist/Cache and Unpersist in Dataset APIs (Sub-task, Resolved, Xiao Li)
          35. refactor MapObjects to make it less hacky (Sub-task, Resolved, Wenchen Fan)
          36. WrapOption should not have type constraint for child (Sub-task, Resolved, Apache Spark)
          37. throw exception if the number of fields does not line up for Tuple encoder (Sub-task, Resolved, Wenchen Fan)
          38. use true as default value for propagateNull in NewInstance (Sub-task, Resolved, Apache Spark)
          39. Add BINARY to Encoders (Sub-task, Resolved, Apache Spark)
          40. Eliminate serialization for back to back operations (Sub-task, Resolved, Michael Armbrust)
          41. Move encoder definition into Aggregator interface (Sub-task, Resolved, Reynold Xin)
          42. Explicit APIs in Scala for specifying encoders (Sub-task, Resolved, Reynold Xin)
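
          Several of the sub-tasks above cover user-facing typed operations: typed aggregation (10, 12, 41) and joins that return a tuple of objects (7). A hedged sketch against the Spark 2.0 API is shown below; the Purchase/User classes, column names, and data are illustrative assumptions, not taken from the sub-tasks.

          {code:scala}
          import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
          import org.apache.spark.sql.expressions.Aggregator

          case class Purchase(user: String, amount: Long)
          case class User(name: String, country: String)

          // A typed aggregation: sums purchase amounts per group.
          object SumAmount extends Aggregator[Purchase, Long, Long] {
            def zero: Long = 0L
            def reduce(acc: Long, p: Purchase): Long = acc + p.amount
            def merge(a: Long, b: Long): Long = a + b
            def finish(acc: Long): Long = acc
            // The buffer and output encoders live on the Aggregator itself (sub-task 41).
            def bufferEncoder: Encoder[Long] = Encoders.scalaLong
            def outputEncoder: Encoder[Long] = Encoders.scalaLong
          }

          object TypedOpsSketch {
            def main(args: Array[String]): Unit = {
              val spark = SparkSession.builder().appName("typed-ops").master("local[*]").getOrCreate()
              import spark.implicits._

              val purchases = Seq(Purchase("alice", 10), Purchase("alice", 5), Purchase("bob", 7)).toDS()
              val users     = Seq(User("alice", "NO"), User("bob", "US")).toDS()

              // Typed aggregation: group by a function of the object, then apply the Aggregator.
              val totals = purchases.groupByKey(_.user).agg(SumAmount.toColumn) // Dataset[(String, Long)]

              // joinWith keeps both sides as objects and returns a Dataset of pairs.
              val joined = purchases.joinWith(users, purchases("user") === users("name")) // Dataset[(Purchase, User)]

              totals.show()
              joined.show()
              spark.stop()
            }
          }
          {code}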

          Activity

            People

              Assignee: Michael Armbrust (marmbrus)
              Reporter: Reynold Xin (rxin)
              Votes: 4
              Watchers: 65
