Push data from SAP to Hadoop Hive
What do you think about an SAP data mart in Hadoop Hive? In cases where, for whatever reason, SAP HANA (or BW, or BW on HANA) is not an option for your analytics, you can still offload your SAP data to big data systems such as Hadoop Hive. The next release of VirtDB Data Unfolder integrates Hadoop and Hive as target systems. Here is a sneak preview of pushing data from SAP to Hive tables in only a few clicks.
Architecture: SAP ERP – Hadoop data integration in Data Unfolder
Since Data Unfolder has a modular architecture, the SAP data extraction part works the same way for all target systems: Data Unfolder extracts the data and compresses it. Our Data Distributor Engine then converts the data to a Hive-compatible CSV file, uploads it to a Hadoop cluster through WebHDFS, and creates the Hive tables and views through an ODBC connection.
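To make the WebHDFS upload step more concrete, here is a minimal sketch of how the REST URL for creating a file on HDFS can be built. It is not VirtDB's implementation: the function name, host, port, path, and user below are all assumptions chosen for illustration. The only part taken from WebHDFS itself is the URL shape, `/webhdfs/v1/<path>?op=CREATE`, which is sent with an HTTP PUT.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(host, port, hdfs_path, user, overwrite=True):
    """Build the WebHDFS URL for creating a file (to be sent as HTTP PUT).

    host, port, hdfs_path and user are placeholders; use your cluster's values.
    """
    query = urlencode({
        "op": "CREATE",
        "overwrite": "true" if overwrite else "false",
        "user.name": user,
    })
    # quote() keeps "/" intact and percent-encodes characters such as spaces.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Hypothetical values, for illustration only.
print(webhdfs_create_url("namenode.example.com", 9870, "/data/sap/ksb1.csv", "etl"))
# prints http://namenode.example.com:9870/webhdfs/v1/data/sap/ksb1.csv?op=CREATE&overwrite=true&user.name=etl
```

With WebHDFS, the first PUT to the NameNode typically answers with a redirect to a DataNode, and the file content is sent in a second PUT to that redirected location.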
Step by step: Pushing SAP transaction data to Hadoop
As a first step of integrating your operational data from SAP into a Hadoop data lake for analytics, go to your SAP ERP client and execute an ABAP report (or query, view, etc.). In our example we will use transaction KSB1 (Actual Cost Line Items for Cost Centers) as a data source, but it works in pretty much the same way with any standard or custom SAP transaction.
When you have VirtDB installed on your SAP ERP system and you have the appropriate privileges, you can switch to VirtDB administrator mode and set up the report as a datasource for VirtDB services.
Next, when you execute KSB1, you will see a new button above the grid. This button opens the VirtDB data source settings.
Click the button and set the target system to Hadoop Hive by choosing the Hive UI component in the middle.
Please note: Access to your target Hive (Hadoop) system should already be set up by a VirtDB administrator. For this use case we accessed an AWS-hosted Cloudera cluster with 4 nodes. After setting up the data source, you can schedule a data extraction job to Hive by selecting the “Schedule extraction” option from the VirtDB menu.
In the pop-up window in SAP GUI, fill in the details of your Hadoop target, such as the HDFS folder name, the Hive database name, and the table name you want to use. Then select the upload (ingestion) mode: either a full load (overwriting existing data) or an append that adds the extracted data to what is already there.
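The two upload modes map naturally onto HiveQL's `LOAD DATA` statement, where the optional `OVERWRITE` keyword replaces the table's existing data. The sketch below shows that mapping. The function name, the mode labels `"full"` and `"append"`, and the example path and table are assumptions for illustration; the statements VirtDB actually issues are not shown in this post.

```python
def load_statement(hdfs_path, database, table, mode):
    """Return a HiveQL LOAD DATA statement for the chosen upload mode.

    mode: "full" overwrites existing data, "append" adds to it.
    Both labels are hypothetical names used only in this sketch.
    """
    if mode not in ("full", "append"):
        raise ValueError(f"unknown upload mode: {mode!r}")
    overwrite = "OVERWRITE " if mode == "full" else ""
    return f"LOAD DATA INPATH '{hdfs_path}' {overwrite}INTO TABLE {database}.{table}"


# Hypothetical folder, database and table names.
print(load_statement("/data/sap/ksb1/ksb1.csv", "sap", "ksb1", "full"))
print(load_statement("/data/sap/ksb1/ksb1.csv", "sap", "ksb1", "append"))
```

Note that `LOAD DATA INPATH` moves the file from its HDFS location into the table's storage directory, so the uploaded CSV does not remain in the original folder afterwards.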
Clicking OK will bring you to the regular SAP Job Scheduler interface, where you can set the required load frequencies, job start conditions, etc. For a single execution, just select “Immediate”.
If you have enabled the VirtDB authorization workflow in SAP, the job will not run until a VirtDB administrator approves it, which helps avoid security breaches or performance problems caused by badly scheduled jobs. The workflow functionality will be elaborated in a future use-case entry on this site.
Once the job is executed by SAP’s job engine, you may want to check out the status in the standard SAP Job overview transaction.
To monitor all scheduled VirtDB jobs, you can use the detailed log table. It can be used as a regular data source, so VirtDB jobs can be monitored from a dashboard in Tableau or other BI tools. Pressing the “Job log” button in the SAP job monitor gives you a historical overview of what happened:
The logs explain the process in detail:
– The SAP job extracted the ABAP report’s data to a compressed VirtDB file, which was uploaded to a network share
– VirtDB Data Distributor Engine (a .Net component) converted the VirtDB file to a Hive compatible CSV
– Data Distributor Engine connected to HDFS through WebHDFS client
– uploaded the CSV to HDFS
– connected to Hive through ODBC
– created the Hive table for the data (using the field types / meta info from SAP)
– created a view on the Hive table that uses meaningful descriptions as field names (using metadata from SAP)
– and finally loaded the SAP data from the CSV on HDFS into the Hive table
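The table and view creation steps from the log can be sketched as follows. This is an illustration under stated assumptions, not VirtDB's actual code: the field list, the SAP-to-Hive type mapping, the table and view names, and both function names are hypothetical. The sketch only shows the idea of deriving column types from SAP metadata and exposing the field descriptions as column names in a view.

```python
# Hypothetical mapping from SAP dictionary types to Hive column types.
SAP_TO_HIVE = {"CHAR": "STRING", "DATS": "STRING", "CURR": "DECIMAL(15,2)"}

# Illustrative metadata: (technical name, SAP type, description).
# Not read from a real system.
FIELDS = [
    ("KOSTL", "CHAR", "Cost Center"),
    ("BUDAT", "DATS", "Posting Date"),
    ("WKGBTR", "CURR", "Value in Controlling Area Currency"),
]


def create_table_sql(table, fields):
    """Build a CREATE TABLE statement with types derived from SAP metadata."""
    cols = ", ".join(
        f"`{name.lower()}` {SAP_TO_HIVE[sap_type]}" for name, sap_type, _ in fields
    )
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE"
    )


def create_view_sql(view, table, fields):
    """Build a CREATE VIEW statement that aliases columns to descriptions."""
    cols = ", ".join(f"`{name.lower()}` AS `{desc}`" for name, _, desc in fields)
    return f"CREATE VIEW IF NOT EXISTS {view} AS SELECT {cols} FROM {table}"


print(create_table_sql("sap.ksb1", FIELDS))
print(create_view_sql("sap.ksb1_v", "sap.ksb1", FIELDS))
```

Keeping the technical field names on the table and the descriptions on the view means scripts can rely on stable column names, while BI users see readable labels.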
From this point, you can query the SAP data from the Hive engine by using the HUE console or any Hive-connected BI tool, such as Tableau.
With VirtDB’s HDFS connectivity, using your SAP data in other Hadoop-related technologies such as Spark or Drill becomes easy as well. In future releases, Apache / AWS / Azure platforms will also be integrated.
Life is too short! Why wait for SAP data?