In this article, we will show you the easiest way to load SAP data into Hadoop and query it through Spark SQL.

Spark SQL

The Spark team launched Spark SQL as a Spark component for structured data processing in 2014. Users can run SQL queries, read data from Hive, or use it as a means to create Spark Datasets and DataFrames. Since the release of Spark 2.1.0, it has also been possible to query streaming data sources the same way as static ones using Structured Streaming, a stream processing engine built on Spark SQL. Users can even combine streaming and static sources in one query.

Spark SQL on Hive

As mentioned before, Spark SQL can read data from Hive. Spark SQL relies on the Thrift server, so first of all we have to make sure that the Thrift server is running on every node where Hive is installed. If you don’t have the Thrift server installed, follow this guide to get it up and running. Once we are done, we will be able to query all data in the Hive metastore through Spark SQL.

Load SAP data to Hive

As we have shown in a previous article, VirtDB can extract SAP data to Hive. Let’s try this with one of the biggest tables in SAP. We will use the Mass Data Extractor module of Data Unfolder, since it is better suited for bigger datasets. To open the Mass Data Extractor, run transaction “/n/vdb/mde” in SAP.

On the MDE screen we can type in the name of the Table/View we want to extract, along with filtering criteria if needed. Right now we want the whole BSEG table on Hive, so we leave the filtering options empty.

Below the table selection, we can choose the target system we’d like to export SAP data to. This time we are exporting to Hive, so select Hive as the target system and specify the folder and database name.

Once we are done, save these settings as a variant; we will use it to schedule the extraction job.

After everything is set up, all we need to do is define an SAP background job that runs the data extraction using the variant we just created.

This time we will schedule an immediate export of the BSEG table, but the scheduling options are practically limitless. More information on scheduling SAP extracts can be found in this article.

Query SAP data through Spark in Tableau

After the BSEG extract job has executed successfully, let’s test whether we can see our SAP table extract in Tableau through the Spark SQL interface. Open Tableau, connect to the Spark SQL server, and browse for our data. As you can see below, there is a table and a view created from the BSEG table we extracted. The table uses the SAP technical field names as column headers, while the view uses the more descriptive field labels known from the SAP GUI.

If you are interested in a 30-day trial, register for a demo here: