MaheshT.com

January 22, 2011

Using unix sort and join command

Filed under: Linux — MaheshT @ 9:00 pm

Matching data from two different files with one or more common keys can be done without loading files into the database by
using the sort and join command from the Unix.  Join is the command which joins data from different files but before you join the two files
it needs to be in sorted order, that’s where we going to use sort.

Here is the example  with two files having data delimited by the pipe
Data from employee.dat with key employee_id, Name, and department_id delimited by the pipe
Employee.dat
———————
4|John | 10
1|Monica|4
12|Louis|5
20|Peter|2
21|David|3
13|Barbara|6

Department file with deparment_id and its name delimited by the pipe.
Dept.dat
————
1|HR
2| Manufacturing
3| Engineering
4 |Marketing
5| Sales
6| Information technology
7| Security

Our aim is to put the department name in front of each person. As deparment_id is common in both files, we going to join by department_id. As mentioned above, the field we going to use for joining should be in sorted order. Department file it is already in sorted but in Employee.dat is not, so we going to sort it first.
Here is command for sorting Employee.dat file

sort -t “|” -k3 Employee.dat > Employee_sort.dat

Here -t “|” indicates that the fields are delimited by the pipe
-k3 for sorting file by the third field which is deparment_id
> Employee_sort.dat for sending output to Employee_sort.dat

Content of Employee_sort.dat file
20|Peter|2
21|David|3
1|Monica|4
12|Louis|5
13|Barbara|6
4|John | 7

Now we have both files are in sorted order by deparment_id, we can use the join command to put employee and department name

join -t “|” -1 3 -2 1 Employee_sort.dat Dept.dat

-t “| ” indicated files are delimited by the pipe
-1 3 for the third column of file 1 i.e deparment_id from Employee_sort.dat file
-2 1 for the first column of file 2 i.e. deparment_id from Dept.dat file

By using the above command, we get the following output.

2|20|Peter| Manufacturing
3|21|David| Engineering
4|1|Monica| Marketing
5|12|Louis| Sales
6|13|Barbara| Information technology

If you want to get everything from file 2  and corresponding entries in file 1

You can also use -a and -v options
try following commands

join -t “|” -1 3 -2 1 -v2 Employee_sort.dat Dept.dat
join -t “|” -1 3 -2 1 -a2 Employee_sort.dat Dept.dat

Powered by WordPress