October 5, 2007

Manipulating a text file in bash

Here I post the script I wrote in bash to help a guy working in biology. He needed to modify entries inside a pretty long text file. He is used to doing it by hand, but that is far too tedious. As this is the first time I have done something like this, the script is surely inefficient, slow, inelegant and full of workarounds. Any suggestions will be much appreciated. The file is a .gb file, coming from a huge biological database. It is divided into many similar repeated parts, like this:

LOCUS ABC_123 923 bp DNA linear ENV 20-MAY-2007
DEFINITION Uncultured bacterium clone P4T_321 16S ribosomal RNA gene, partial sequence.
ACCESSION XYZ123051
VERSION EF552051.1 GI:146575907
KEYWORDS ENV.
SOURCE uncultured bacterium
ORGANISM uncultured bacterium
Bacteria; environmental samples.
1 gctttgcaag tcgggtgttg aaatccccag gcttaacctg ggaactgcat tcgagactgc
61 attgctagag tatgggagag ggaagtggaa tttcaggtgt agcggtgaaa tgcgtagata
121 tctgaaggaa catcagtggc gaaagcgact tcctgggcca atactgacgc tcatgtgcga
181 aggcgtgggg agcaaacagg attagatacc ctggtagtcc acgccataaa cgatgagaac
241 tggatgtcgg gagggtctgc ctctcggtgt cgtagctaac gcgttaagtt ctccgcctgg
(...)

LOCUS XYZ_345 764 bp DNA linear ENV 20-MAY-2007
DEFINITION Uncultured bacterium clone P4T_200 16S ribosomal RNA gene, partial sequence.
ACCESSION XY123050
VERSION XY12350.1 GI:146575906
KEYWORDS ENV.
SOURCE uncultured bacterium
ORGANISM uncultured bacterium
Bacteria; environmental samples.
REFERENCE 1 (bases 1 to 764)
(...)


What he needed was to replace the word after LOCUS with the word that follows "clone" on the next line (probably the name of a bacterium), and to do that for every occurrence of the pattern in the file. For example, the first line would become:

LOCUS P4T_321 923 bp DNA linear ENV 20-MAY-2007
DEFINITION Uncultured bacterium clone P4T_321 16S ribosomal RNA gene, partial
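The piece of information that has to move is the word right after "clone". As a minimal sketch, assuming the DEFINITION line always contains the word "clone" followed by the name, it could be extracted like this (clone_name is a hypothetical helper, not part of the script in this post):

```bash
#!/bin/bash

# Hypothetical helper: print the word that follows "clone" on a line.
# Prints nothing if the line has no "clone" followed by another word.
clone_name() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "clone") { print $(i + 1); exit } }' <<< "$1"
}

def_line='DEFINITION Uncultured bacterium clone P4T_321 16S ribosomal RNA gene, partial sequence.'
clone_name "$def_line"
```

Searching for "clone" instead of counting fields keeps the helper working even if the number of words before "clone" changes.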


The idea is to find all the lines where "LOCUS" appears and write their line numbers to a file. These numbers are then used by awk and sed to substitute the fields. The substitution is done in a for loop: first the corrected line is written to stdout (sed -n '...s/.../p'), then the lines down to the next line containing "LOCUS" are printed. The last occurrence of "LOCUS" caused problems because no further "LOCUS" follows it, so I had to introduce the variable m (the line number of the next "LOCUS") and fall back on the length of the input file when m is empty.
This is the script:

#! /bin/bash

# Syntax:
# ./chtitle input_file > output_file

# Replaces, in the input_file, every title after the field "LOCUS"
# with the one found one line below, after the word "clone", in the
# field "DEFINITION" (like P4T_321).
# In the script, "$1" is the input_file to change.

# finds every line containing "LOCUS" in file $1 and stores the line numbers:
grep -n "LOCUS" "$1" | cut -d: -f1 > qqq.temp

# totlines = total number of lines in $1 
totlines=$(awk 'END{print NR}' $1)

k=1;

# moves "i" in qqq.temp, where num. of lines with "LOCUS" are stored:
for i in $(cat qqq.temp); 
do  
    # "title" = string to move, in the line after "LOCUS":
    title=$(awk 'NR=='$i'+1{print $5}' $1); 

    # substitutes the word after "LOCUS" with "title", and prints the line:
    sed -n ''$i's/\(LOCUS\) [  ]* \([a-zA-Z0-9_]*\)/\1       '$title'/p' $1;

    # m = in qqq.temp, the numb. of the next line with "LOCUS":
    m=$(awk 'NR=='$k'+1{print $0}' qqq.temp);

    # len = how many characters form m? if 0 => it's the last:
    len=${#m};

    # if length(m) = 0 => give m the value "n":
    if [ "$len" -eq 0 ]; then m=n; fi 

    # echo ==== i=$i m=$m lenm=$len ===;
 
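    # prints all the lines down to the next "LOCUS"; for the last block
    # (m = n) it prints up to and including the last line of the file:
    awk -v ii="$i" -v mm="$m" -v tot="$totlines" 'BEGIN{ if (mm == "n") mm = tot + 1 }
    NR > ii + 0 && NR < mm + 0 {print $0}' "$1";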

    k=$((k+1));
done

rm qqq.temp
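
The script above rereads the input file several times for every "LOCUS" line. As a rough sketch of an alternative, and not what the script above does, the same substitution can be made in a single pass by holding back each LOCUS line until the following line has been read. It assumes, like the script, that the clone name sits on the line right after LOCUS; relabel_locus is a hypothetical name.

```bash
#!/bin/bash

# Sketch of a single-pass variant: no temporary file, one read of the input.
# Assumption: the line right after each LOCUS line holds "clone <name>".
relabel_locus() {
    awk '
        # Print the held LOCUS line, if there is one.
        function emit() { if (have) { print held; have = 0 } }

        # Hold back every LOCUS line until the next line has been seen.
        $1 == "LOCUS" { emit(); held = $0; have = 1; next }

        # First line after LOCUS: look for the word following "clone".
        have {
            for (i = 1; i < NF; i++)
                if ($i == "clone") {
                    # Same layout as the sed command: LOCUS, 7 spaces, name.
                    sub(/^LOCUS[ \t]+[^ \t]+/, "LOCUS       " $(i + 1), held)
                    break
                }
            emit()
        }

        { print }

        # A LOCUS line at the very end of the file is printed unchanged.
        END { emit() }
    ' "$1"
}

# Usage, mirroring the original syntax:
# relabel_locus input_file > output_file
```

If no "clone" is found on the line after LOCUS, the LOCUS line is printed as it was.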


Update: an incredible improvement from Doc:
"I am sure any perl user could do better, but I gave it a try anyway. If the files are well behaved (that is, there is always one and only one line containing "clone" after a line containing "locus" before a new "locus" is encountered), I think this works (otherwise, let me know!):
$ tac original_file | awk '{if ($4 == "clone") {tmp=$5}; if ($1 == "LOCUS"){$2=tmp};print $0}' | tac > modified_file

"
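
Doc's trick works because the file is read bottom-up: the "clone" line is seen before the LOCUS line it belongs to, so the name is already stored when LOCUS arrives. Below is a small sketch that wraps the same pipeline in a function. The names reverse_lines and swap_titles are hypothetical, and the fallback to tail -r for systems without tac is my assumption, not part of Doc's suggestion.

```bash
#!/bin/bash

# Reverse the order of lines; tac is a GNU tool, tail -r is the BSD equivalent.
reverse_lines() {
    if command -v tac >/dev/null 2>&1; then
        tac "$@"
    else
        tail -r "$@"
    fi
}

# Same logic as the one-liner: remember the clone name while reading
# bottom-up, then put it in the second field of the LOCUS line.
swap_titles() {
    reverse_lines "$1" |
        awk '$4 == "clone" { tmp = $5 } $1 == "LOCUS" { $2 = tmp } { print }' |
        reverse_lines
}

# Usage:
# swap_titles original_file > modified_file
```

One thing to be aware of: assigning to $2 makes awk rebuild the line with single spaces between fields, so any column alignment on the LOCUS line is lost.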

2 comments:

Anonymous said...

Pedo, I swear I try... but I just can't manage it.
I do realize that this blog has a different purpose than telling stories to friends... But you weren't offended by my earlier comments, were you?
It was not my intention to belittle your online and intellectual efforts.
Kisses on the butt, as always. For life.
Masia

franx said...

Yeah, this stuff is mainly for sharing computing solutions on the net. It's called open source, and I really like it. Then every now and then I post things I get hooked on... I'm not a writer, and as for "today I did this" blogs, I like reading them if they are by friends, I hate them if they are by strangers, and writing them is not my thing. But hey, what are you waiting for?