1001 Miglia: Manipulating text file in bash

Here I post the script I wrote in bash to help a guy working in biology. He needed to modify stuff inside a pretty long text file. He is used to doing it by hand, but that is too hard. As this is the first time I have done something like this, the script is surely inefficient, slow, inelegant and full of workarounds. Any suggestions will be much appreciated. The file is a .gb file, coming from a huge biological database. It is divided into many similar repeated parts, like this:

LOCUS       ABC_123                 923 bp    DNA     linear   ENV 20-MAY-2007 

DEFINITION  Uncultured bacterium clone P4T_321 16S ribosomal RNA gene, partial sequence.

ACCESSION   XYZ123051

VERSION     EF552051.1  GI:146575907

KEYWORDS    ENV.

SOURCE      uncultured bacterium

ORGANISM  uncultured bacterium

Bacteria; environmental samples.

1 gctttgcaag tcgggtgttg aaatccccag gcttaacctg ggaactgcat tcgagactgc

61 attgctagag tatgggagag ggaagtggaa tttcaggtgt agcggtgaaa tgcgtagata

121 tctgaaggaa  catcagtggc gaaagcgact tcctgggcca atactgacgc tcatgtgcga

181 aggcgtgggg agcaaacagg attagatacc ctggtagtcc acgccataaa cgatgagaac

241 tggatgtcgg gagggtctgc ctctcggtgt cgtagctaac gcgttaagtt ctccgcctgg

(...)

LOCUS       XYZ_345                 764 bp    DNA     linear   ENV 20-MAY-2007

DEFINITION  Uncultured bacterium clone P4T_200 16S ribosomal RNA gene, partial sequence.

ACCESSION   XY123050

VERSION     XY12350.1  GI:146575906

KEYWORDS    ENV.

SOURCE      uncultured bacterium

ORGANISM  uncultured bacterium

Bacteria; environmental samples.

REFERENCE   1  (bases 1 to 764)

(...)

What he needed was to replace the word after LOCUS with the word that follows "clone" on the next line (probably the name of a bacterium), and to do that for every occurrence of the pattern in the file. For example, the first line would change to:

LOCUS       P4T_321                 923 bp    DNA     linear   ENV 20-MAY-2007 

DEFINITION  Uncultured bacterium clone P4T_321 16S ribosomal RNA gene, partial 

The idea is to find all the lines where "LOCUS" appears and write their line numbers to a file. These numbers are then used by awk and sed to substitute the fields. The substitution is done in a for loop: first the corrected line is written to stdout (sed -n '...s/.../p'), then the lines down to the next line with "LOCUS" are printed. The last occurrence of the pattern "LOCUS" caused problems, so I had to introduce the variable m (the line number of the next "LOCUS") and fall back to the length of the input file when there is no next one.
This is the script:

#! /bin/bash

# Syntax:
# ./chtitle input_file > output_file

# Replaces, in the input_file, every title after the field "LOCUS"
# with the one (like P4T_321) found one line below in the
# field "DEFINITION".
# In the script, "$1" is the input_file to change.

# finds all lines starting with "LOCUS" in file $1, and prints their line numbers:
grep -n "^LOCUS" "$1" | cut -d: -f1 > qqq.temp

# totlines = total number of lines in $1 
totlines=$(awk 'END{print NR}' "$1")

k=1;

# moves "i" in qqq.temp, where num. of lines with "LOCUS" are stored:
for i in $(cat qqq.temp); 
do  
    # "title" = string to move, in the line after "LOCUS":
    title=$(awk -v n="$i" 'NR==n+1{print $5}' "$1");

    # substitutes the word after "LOCUS" with "title", and prints the line:
    sed -n "${i}s/\(LOCUS\)  *[a-zA-Z0-9_]*/\1       ${title}/p" "$1";

    # m = in qqq.temp, the number of the next line with "LOCUS":
    m=$(awk -v n="$k" 'NR==n+1{print $0}' qqq.temp);

    # if m is empty this is the last "LOCUS": print down to the end of
    # the file, last line included:
    if [ -z "$m" ]; then m=$((totlines+1)); fi

    # echo ==== i=$i m=$m ===;

    # prints all the lines down to the next "LOCUS"
    awk -v ii="$i" -v mm="$m" 'NR>ii && NR<mm {print $0}' "$1";

    # arithmetic increment (k=$k+1 would build the string "1+1")
    k=$((k+1));
done

rm qqq.temp
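
For comparison, here is a minimal sketch of the same substitution done in a single pass, with no temporary file and no rescanning of the input for every record. It is only an illustration under the same assumption as the script above, namely that the line with "clone" comes right after the "LOCUS" line. The function name chtitle_onepass is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): rewrites the
# LOCUS name in one pass. Assumes the "clone" line directly follows LOCUS.
chtitle_onepass() {
    awk '
        /^LOCUS/ {
            # Two LOCUS lines in a row: flush the first one unchanged.
            if (pending) print held
            # Hold the LOCUS line back until the next line has been read.
            held = $0; pending = 1
            next
        }
        pending {
            # Take the word that follows "clone" on the line after LOCUS.
            name = ""
            for (f = 1; f < NF; f++)
                if ($f == "clone") { name = $(f + 1); break }
            # Replace only the name; the rest of the held line is untouched.
            if (name != "")
                sub(/^LOCUS +[A-Za-z0-9_]+/, "LOCUS       " name, held)
            print held
            pending = 0
        }
        { print }
        END { if (pending) print held }
    ' "$1"
}
```

It would be called like the script: chtitle_onepass input_file > output_file. If no "clone" word is found on the following line, the LOCUS line is printed unchanged.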

Update: incredible improvement from Doc:
"I am sure any Perl user could do better, but I gave it a try anyway. If the files are well behaved (that is, there is always one and only one line containing "clone" after a line containing "LOCUS" and before the next "LOCUS"), I think this works (otherwise, let me know!):"

$ tac original_file | awk '{if ($4 == "clone") {tmp=$5}; if ($1 == "LOCUS"){$2=tmp};print $0}' | tac > modified_file
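
One detail worth knowing about the one-liner: assigning to $2 makes awk rebuild the whole line with single spaces between fields, so the column layout of the LOCUS line is lost. The sketch below is a possible variant (my assumption, not Doc's code) that keeps the tac trick but edits the line text with sub(), which leaves the spacing after the name alone. The wrapper name rename_locus_tac is made up for this example, and it assumes tac is available (GNU coreutils).

```shell
#!/bin/bash
# Hypothetical wrapper (not from the original post): same bottom-up idea
# as the tac one-liner, but the LOCUS line keeps its column spacing.
rename_locus_tac() {
    tac "$1" | awk '
        # Reading bottom-up, the "clone" line is seen before its LOCUS line.
        $4 == "clone" { tmp = $5 }
        # sub() edits the text of $0 directly, so fields are not re-joined.
        $1 == "LOCUS" && tmp != "" {
            sub(/^LOCUS +[^ ]+/, "LOCUS       " tmp)
            tmp = ""    # do not reuse the name for a record without "clone"
        }
        { print }
    ' | tac
}
```

Usage would mirror the original: rename_locus_tac original_file > modified_file.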

2 comments:

Anonymous said...: Pedo, I swear I try... but I just can't manage it.
Actually, I realize this blog has a different purpose than telling stories to friends... But you weren't offended by my earlier comments, were you?
I didn't mean to belittle your online and intellectual efforts.
Kisses on the ass, as always. For life.
Masia; October 11, 2007 at 2:16 AM
franx said...: Yeah, this stuff is mainly for sharing computing solutions online. It's called open source, and I love it like crazy. Then every now and then I post things I get hooked on... I'm not a writer, and as for "today I did this" blogs, I like reading them when they are by friends, I hate them when they are by strangers, and writing them is not my thing. But hey, what are you waiting for?; October 11, 2007 at 7:23 PM

October 5, 2007
