Sqoop is a utility for transferring data between SQL-based relational data stores and Hadoop. Its main operations are importing data into Hadoop from supported relational data sources and exporting data back to them. Sqoop uses connectors as extensions to connect to data stores, and currently supports connectors for importing/exporting data from popular databases such as Oracle, MySQL, PostgreSQL, SQL Server, Teradata, and DB2. While Sqoop can do more, its general use includes the following:
- Import data from a database (sqoop import or sqoop-import command)
- Export data to a database (sqoop export or sqoop-export command)
- Create a Hive table using the table structure of a similar table in a database
Sqoop uses MapReduce in the background to launch jobs that read/write data using the DBInputFormat class and process it in parallel with multiple mappers. Both import and export support the -m option, which lets the user specify the number of mappers Sqoop should launch for the parallel operation. During an import, either the --table option or the --query option can be used, but not both. Here is a summary of some options and their expected outcome:
| Command | What it does |
|---|---|
| `sqoop import --connect jdbc:mysql://server/db32 --username user_1 --password pass2 --table src_table_name --target-dir /hdfs/dir/name --as-textfile` | Imports data from a MySQL database named db32 using the credentials user_1/pass2. Reads the table src_table_name and places the data in the HDFS directory /hdfs/dir/name as text files. |
| `sqoop import --connect jdbc:mysql://server/db32 --username user_1 --password pass2 --table src_table_name --columns "col1,col2,col3,col4" --append --target-dir /hdfs/dir/name --as-sequencefile` | Imports data with the same settings as above, but reads only the columns specified in the list, appends the data to an already existing HDFS directory, and saves it as SequenceFiles. |
| `sqoop import …` (query-based import; see the sketch after this table) | Imports data from MySQL but uses a query to select the data instead of a table, and runs 6 mapper jobs (instead of the default 4) to read the data in parallel, using the values of the UID column to build 6 queries that run in parallel. For example, if the UID column has values from 1 to 60,000, the $CONDITIONS placeholder is replaced to create ranges such as 1 to 10,000, 10,001 to 20,000, and so on. If the table has no primary key column, either use -m 1 or supply a --split-by column. |
| `sqoop create-hive-table --connect jdbc:mysql://server/db32 --username user_3 --password pass4 --table src_table_name --fields-terminated-by ','` | Creates a Hive table with the same name and schema as the source table. The destination data files will be saved with fields separated by a comma; if the destination table already exists, an error is thrown. |
| `sqoop import --connect jdbc:mysql://server/db32 --username user_3 --password pass4 --table src_table_name --hive-import` | Imports data from the source database, creates a table in Hive, and then loads the data into the Hive table. If the Hive table already exists and contains data, the new data is appended. |
| `sqoop export --connect jdbc:mysql://server/db32 --username user_3 --password pass4 --table db_table_name --export-dir /read/from/dir` | Exports data from HDFS to MySQL into db_table_name. The table must already exist in the database; Sqoop appends data to it. |
| `sqoop export --connect jdbc:mysql://server/db32 --username user_3 --password pass4 --table db_table_name --update-key UID --export-dir /read/from/dir --input-fields-terminated-by ','` | Exports data from HDFS to the database, telling Sqoop that the fields in the input data are separated by commas; if the value of UID is already found in the database table, the row is updated instead of inserted (with the default update mode, rows whose UID is not found are skipped). |
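The command for the query-based import above is not shown in full, so here is a minimal sketch of what it could look like, reusing the connection details, target directory, and UID split column from the surrounding examples; the SELECT list and table name are illustrative assumptions. With a --query import, the WHERE clause must contain the literal $CONDITIONS token, which Sqoop replaces in each mapper with a range predicate on the --split-by column.

```
# Hedged sketch of a query-based parallel import.
# The query text, column names, and paths are illustrative assumptions.
sqoop import \
  --connect jdbc:mysql://server/db32 \
  --username user_1 \
  --password pass2 \
  --query 'SELECT UID, col1, col2 FROM src_table_name WHERE $CONDITIONS' \
  --split-by UID \
  -m 6 \
  --target-dir /hdfs/dir/name
```

The query is wrapped in single quotes so the shell does not try to expand $CONDITIONS. With -m 6 and UID values running from 1 to 60,000, each mapper gets its own bounded copy of the query, roughly WHERE UID >= 1 AND UID < 10001, WHERE UID >= 10001 AND UID < 20001, and so on, so the six mappers read disjoint slices of the table in parallel.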
Happy Sqooping 🙂