#*******************************************************
# Developed by Alexei Sharov (sharoval@mail.nih.gov), 
# National Institute on Aging (NIA/NIH).
# 
# This software is provided "AS IS".  NIA makes no warranties, express
# or implied, including no representation or warranty with respect to
# the performance of the software and derivatives or their safety,
# effectiveness, or commercial viability.  NIA does not warrant the
# merchantability or fitness of the software and derivatives for any
# particular purpose, or that they may be exploited without infringing
# the copyrights, patent rights or property rights of others. NIA shall
# not be liable for any claim, demand or action for any loss, harm,
# illness or other damage or injury arising from access to or use of the
# software or associated information, including without limitation any
# direct, indirect, incidental, exemplary, special or consequential
# damages.
# 
# This software program may not be sold, leased, transferred, exported
# or otherwise disclaimed to anyone, in whole or in part, without the
# prior written consent of NIA.
#*******************************************************

CisFinder is designed to find over-represented DNA motifs in ChIP-chip or 
ChIP-seq data and to scan sequences for motif matches.
Disclaimer: these programs may crash or malfunction without generating an error 
message if you use wrong syntax or an inappropriate input file format.

1. INSTALLATION

PC/Windows: Download the executable file from http://lgsun.grc.nia.nih.gov/cisfinder/download.html
Other OS: Download the code from the same site and compile it using gcc with -lm option

2. Program "patternFind"

Function: generates position frequency matrices (PFM) for motifs over-represented in the
test DNA sequence compared to the control sequence

Syntax:
patternFind -i inputFasta -o outputFile [-c controlFile, -pos positionFile,
-ratio minEnrichmentRatio, -FDR maxFDR, -maxlen maxSequenceLength, -len motifLength,
-strand strandOption, -score scoreOption, -userep, -getrep, -one, -cg, -brief,
-n numberOfMotifs]

Comments:
(a) If controlFile is not specified, then random sequence generated using 3rd order
Markov chain is used as a control.
(b) positionFile - stores position of each motif match in the test sequence file.
It can be later used for motif clustering on the basis of their co-occurrence.
(c) maxFDR = maximal False Discovery Rate (FDR) threshold. The program generates at least 
100 motifs even if they are not significant; additional motifs are included only 
if they are significant (i.e. FDR < FDR threshold)
(d) motifLength = 8 as a default. Possible values are 6, 8, and 10 only.
(e) minEnrichmentRatio = minimum enrichment ratio, default = 1.5
(f) strandOption: 0 = search both strands, 1 = search positive strand, 2 = search
positive strand and use negative strand as a control.
(g) scoreOption:
0 = use z-score (z) for motif over-representation for ordering motifs in the output file
1 = use z*(ratio-1) for sorting motifs, where ratio = over-representation ratio.
2 = use z*info for sorting motifs, where info = information content.
3 = use z*(ratio-1)*info
4 = use z*(1-selfsim) for sorting motifs, where selfsim = self-similarity of motif.
5 = use z*(ratio-1)*(1-selfsim)
6 = use z*info*(1-selfsim)
7 = use z*(ratio-1)*info*(1-selfsim)
(h) userep: use repeats in sequence (lower-case in sequence)
(i) getrep: generate repeat output file (motifs over-represented in repeats).
(j) one: consider not more than 1 motif occurrence per sequence
(k) cg: adjust motif abundance to C/G and CpG occurrence in test and control sequences
(l) brief: do not generate PFM.
(m) numberOfMotifs = maximum number of motifs to be generated (default = 500)
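
The eight scoreOption values listed above read like a 3-bit combination of optional
factors (1 adds ratio-1, 2 adds info, 4 adds 1-selfsim). The Python sketch below
reproduces the listed formulas under that reading. The function name motif_score is
hypothetical and the code is not taken from the CisFinder source.

```python
def motif_score(z, ratio, info, selfsim, score_option=0):
    """Combine the z-score with the optional factors listed for -score (0..7)."""
    if not 0 <= score_option <= 7:
        raise ValueError("scoreOption must be between 0 and 7")
    score = z
    if score_option & 1:   # options 1, 3, 5, 7 multiply by (ratio - 1)
        score *= (ratio - 1.0)
    if score_option & 2:   # options 2, 3, 6, 7 multiply by information content
        score *= info
    if score_option & 4:   # options 4, 5, 6, 7 multiply by (1 - selfsim)
        score *= (1.0 - selfsim)
    return score
```

For example, motif_score(z, ratio, info, selfsim, 3) evaluates z*(ratio-1)*info.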

3. Program "patternCluster"

Function: clusters motifs with position frequency matrices (PFM) 
based on PFM similarity and/or motif co-occurrence in the test sequence.
Similarity is measured by correlation of position-weight matrices (PWM) which
are log-transformed PFMs.

Syntax:
patternCluster -i inputMotifFile -o outputFile [-pos positionFile, 
-match matchThreshold, -n numberOfMotifs, -repeat maxRepeatEnrichment, -posonly]

Comments:
(a) positionFile - stores position of each motif match in the test sequence file.
It is used for motif clustering on the basis of their co-occurrence.
If positionFile is not specified, then clustering is done exclusively on the basis
of similarity of motifs, otherwise co-occurrence method is used at least for
clustering of motifs with high level of self-similarity (>0.5 on average). However
if "posonly" option is used, then clustering is based exclusively on co-occurrence
of motifs.
(b) posonly = motifs are clustered exclusively based on co-occurrence.
(c) matchThreshold = minimum similarity (correlation of PWMs) between motifs;
default value = 0.75.
(d) numberOfMotifs can be used to limit the number of input motifs
(e) maxRepeatEnrichment (ratio threshold) can be used to filter out motifs with 
high over-representation in repeats.

4. Program "patternCompare"

Function: compare motifs between 2 files based on PFM similarity.
Similarity is measured by correlation of position-weight matrices (PWM) which
are log-transformed PFMs.

Syntax:
patternCompare -i1 motifFile1.txt -i2 motifFile2 -o outputFile [-match matchThreshold]

Comments:
(a) matchThreshold = minimum similarity (correlation of PWMs) between motifs;
default value = 0.75.
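
The similarity measure used by patternCluster and patternCompare (correlation of
log-transformed PFMs) can be illustrated with a short Python sketch. The pseudocount,
the natural logarithm, and the restriction to equal-length motifs compared without
shifting or reverse-complementing are simplifying assumptions of this sketch, and the
function names are hypothetical; the programs themselves may handle these details
differently.

```python
import math

def pfm_to_pwm(pfm, pseudocount=1.0):
    """Log-transform a PFM (rows of A, C, G, T counts) into a PWM."""
    pwm = []
    for row in pfm:
        total = sum(row) + 4 * pseudocount
        pwm.append([math.log((x + pseudocount) / total) for x in row])
    return pwm

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def pwm_similarity(pfm1, pfm2):
    """Correlation of two PWMs of the same length, position by position."""
    if len(pfm1) != len(pfm2):
        raise ValueError("this sketch only compares motifs of equal length")
    flat1 = [v for row in pfm_to_pwm(pfm1) for v in row]
    flat2 = [v for row in pfm_to_pwm(pfm2) for v in row]
    return pearson(flat1, flat2)
```

Two motifs would then be treated as matching when pwm_similarity(a, b) is at least
matchThreshold (0.75 by default).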

5.  Program "patternTest"

Function: improve motifs using resampling method

Syntax:
patternTest -i motifFile -f fastaFile -o outputFile [-c controlFile, 
-prog progressFile, -maxlen maxSequenceLength, -strand strandOption, 
-n numberOfMotifs, -method resampleMethod, -iter numIterations, 
-siter numSubiterations, -fp falsePositives, -userep, -one]

Comments:
(a) If controlFile is not specified, then random sequence generated using 3rd order
Markov chain is used as a control.
(b) progressFile keeps records for each iteration and subiteration.
(c) strandOption: 0 = search both strands, 1 = search positive strand, 2 = search
positive strand and use negative strand as a control.
(d) numberOfMotifs can be used to limit the number of input motifs (default=100)
(e) resampleMethod: 1=regression method (default), 2=simple resampling, 3=difference method.
(f) numIterations = number of main iterations when a new set of motif matches 
is generated (default = 3)
(g) numSubiterations = number of sub-iterations which process the same set of motif
matches, and only the match score changes (default = 3).
(h) userep: use repeats in sequence (lower-case in sequence)
(i) one: consider not more than 1 motif occurrence per sequence (with highest 
match score)
(j) falsePositives = number of expected false positives per 10000 bp in the
control sequence (default = 5).
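
When no controlFile is given, patternFind and patternTest use a random sequence
generated with a 3rd order Markov chain. The Python sketch below shows one generic
way to build such a control. It assumes that the chain is trained on the test
sequences themselves and that unseen contexts fall back to uniform sampling; both
points, and the function names, are assumptions of this sketch rather than documented
behavior.

```python
import random
from collections import Counter, defaultdict

def train_markov(seqs, order=3):
    """Count which nucleotide follows each 3-nucleotide context."""
    counts = defaultdict(Counter)
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - order):
            context, nxt = seq[i:i + order], seq[i + order]
            if set(context + nxt) <= set("ACGT"):   # skip N and other symbols
                counts[context][nxt] += 1
    return counts

def generate(counts, length, order=3, rng=None):
    """Generate a random sequence of the given length from the trained chain."""
    if not counts:
        raise ValueError("no training data")
    rng = rng or random.Random()
    out = list(rng.choice(sorted(counts)))   # start from a random observed context
    while len(out) < length:
        options = counts.get("".join(out[-order:]))
        if not options:                      # unseen context: uniform fallback
            out.append(rng.choice("ACGT"))
            continue
        bases = sorted(options)
        weights = [options[b] for b in bases]
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])
```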

6. Program "patternScan"

Function: finds sites in a sequence that match to motifs specified by PFM

Syntax:
patternScan -i motifFile -f fastaFile -o outputFile [-cons conservationFile, 
-maxlen maxSequenceLength, -strand strandOption, -n numberOfMotifs, 
-fp falsePositives, -thresh matchThreshold, -userep, -one]

Comments:
(a) conservation file = file with evolutionary conservation scores (see format below)
(b) strandOption: 0 = search both strands, 1 = search forward strand.
(c) numberOfMotifs can be used to limit the number of input motifs (default=100)
(d) falsePositives = number of expected false positives per 10000 bp in a random
sequence (default = 5 if matrix-specific thresholds are not supplied).
(e) matchThreshold = addition to matrix-specific thresholds. For example if
it is equal to 0.7, then all match thresholds are incremented by 0.7 and
search becomes more stringent.
(f) userep: use repeats in sequence (lower-case in sequence)
(g) one: consider not more than 1 motif occurrence per sequence (with highest 
match score)
(h) Matrix-specific match thresholds are automatically generated by the
patternTest program, but they can be modified, added, or removed manually.
The header for the threshold field should be "Threshold" (see file formats).

7. Program "patternDistrib"

Function: summarizes motif matches found by "patternScan": their frequency distribution
along sequences and their abundance in each sequence

Syntax:
patternDistrib -i scanResults -f frequencyOutput -a abundanceOutput [-int intervalForFreq]

Comments:
(a) scanResults = file generated by "patternScan" program
(b) frequencyOutput = shows the frequency distribution of binding sites along 
sequences aligned by their starting position.
(c) abundanceOutput = shows the number of motif matches in each sequence. It can be used
for classifying sequences based on the composition of binding motifs.
(d) intervalForFreq = interval (bp) used for calculation of frequencyOutput 
file (default = 100bp).
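
The two summaries can be illustrated with a short Python sketch that bins match start
positions by interval and counts matches per sequence. The input is a list of
(MotifName, SeqName, Start) tuples; the function name and the returned data structures
are hypothetical and do not reflect the actual layout of the output files.

```python
from collections import Counter, defaultdict

def summarize_hits(rows, interval=100):
    """Return (frequency, abundance) tables from (motif, seq_name, start) rows."""
    frequency = defaultdict(Counter)   # motif -> {interval start (bp): count}
    abundance = defaultdict(Counter)   # sequence name -> {motif: count}
    for motif, seq_name, start in rows:
        frequency[motif][(start // interval) * interval] += 1
        abundance[seq_name][motif] += 1
    return frequency, abundance
```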

8. File formats

All files used in CisFinder are in plain text format and use tabs for delimiting
fields within one line. In total, there are 4 main types of files in CisFinder: sequences, motifs,
search results, and repeats. Motifs can be submitted as PFMs or as degenerate
consensus sequences, which are converted to the PFM form. These 4 types of files
may have 2 optional lines of annotation that start with the keywords "Parameters:"
and "Headers:". Parameters may specify the origin of the file, e.g., algorithm
parameters for derivation, whereas headers specify the columns of tab-delimited lines.
An additional 3 kinds of files can be associated with a sequence file:
genomic coordinates, attributes (e.g., gene names), and conservation scores.

Sequence file has a fasta format: sequence name is preceded by ">" sign
and the following lines contain the sequence. Example:
>PET070528
AACCCAAAGTATGATATGCTATGATAGATAACCAAAAGGTAATATTATGAAATTTTTATCAACTATAATTATATAACTTG
AAACTGTTTCCTAAATCCGCCCTAGAGCTTACACAAAGCTGAGGGAAGTTTGCTGGAAAGTTCAGGCTGAGTGGGATGTT
>PET070400
TACTATTGGCGCTTCAATCAGTATTCGTCTTTTATAATACAATAATGCTATTTTGGATAAGTAAGTTTCTATTCAAGGAC
ACGTGTGGGCAGCTGTAACACTAATAATGTCCCATAAATAAGCGAGCAGAGCACATACTGCTGAGACAGACATGTAAGAA
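
A minimal Python sketch for reading such a file is shown below. The function name
read_fasta is hypothetical. Letter case is preserved because lower-case letters mark
repeats (see the "userep" option).

```python
def read_fasta(lines):
    """Return {sequence name: sequence} from an iterable of FASTA lines."""
    chunks = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)   # sequence may span several lines
    return {key: "".join(parts) for key, parts in chunks.items()}
```

With a file on disk this could be called as read_fasta(open("input.fa")), where
input.fa is a placeholder file name.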

To extract DNA sequences use RSAT (http://rsat.scmbb.ulb.ac.be/rsat)
or our PERL script "extract_genome_seq.pl" at http://lgsun.grc.nia.nih.gov/cisfinder/download.html.
We recommend using 200-bp sequences centered at the expected binding site of the TF.
ChIP-chip usually has a lower spatial resolution, thus the size of sequences can be
increased to 300 or even 400 bp.

Motif file is formatted as follows. The first line starts with a ">" sign
followed by motif name. The same line may contain additional tab-delimited fields:

Pattern 	(=consensus),
PatternRev 	(=reverse consensus),
Threshold 	(=threshold score),
Nmembers 	(=number of member motifs in motif cluster),
Freq 	(=number of motif matches in the test sequence),
Ratio 	(=enrichment ratio of motif matches in the test sequence),
Info 	(=information content of motif PFM),
Score 	(=motif score used for ordering),
FDR 	(=False Discovery Rate),
Repeat 	(=enrichment ratio of this motif in repeat sequences),
Palindrome 	(=1 if palindrome, and =0 otherwise),
Method 	(=method of motif clustering: 0 for similarity, and 1 for co-occurrence),
Species 	(=taxonomy of organism).

The PFM is formatted in 5 columns: position number (starting from 0), followed by frequency
of nucleotides A, C, G, and T respectively. There is an empty line at the end of each
matrix. Example of motif file:

Parameters:	matchThresh=0.8500	nucleotideOrder=A,C,G,T
Headers:	Name	Pattern	PatternRev	Freq	Ratio	Info	Score	p	FDR	Palindrome	Nmembers
>SOX9	CCWTTGTT	KAACAAWG	12813	4.5857	9.407	176.0950	0.0000	0.0000	0	151
0	2	72	13	13
1	3	72	2	23
2	46	1	2	51
3	1	2	1	96
4	1	1	1	96
5	2	4	93	1
6	3	2	2	93
7	6	26	13	55

>OCT	HATGCWAA	ATTWGCAT	404	3.1375	10.055	18.8350	0.0000	0.0000	0	4
0	93	3	3	2
1	1	1	2	96
2	1	1	97	1
3	3	76	10	11
4	61	3	1	36
5	80	2	17	2
6	96	1	2	1
7	3	15	14	69
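
The layout above can be read with a few lines of Python. This is a sketch that assumes
the "Headers:" line, when present, names the fields of each ">" line in order; the
function name read_motifs and the "pfm" key are hypothetical.

```python
def read_motifs(lines):
    """Parse motif file lines into a list of dicts with a 'pfm' entry each."""
    headers = None
    motifs = []
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("Parameters:"):
            continue
        if line.startswith("Headers:"):
            headers = line.split("\t")[1:]
            continue
        if line.startswith(">"):
            fields = line[1:].split("\t")
            current = dict(zip(headers or ["Name"], fields))
            current["pfm"] = []
            motifs.append(current)
        elif not line.strip():
            current = None                    # an empty line ends the matrix
        elif current is not None:
            parts = line.split()
            # first column is the position number, then A, C, G, T frequencies
            current["pfm"].append([float(x) for x in parts[1:5]])
    return motifs
```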

If motifs are uploaded as a pattern (consensus), then the file is formatted as follows:

Parameters:	matchThresh=0.8500	nucleotideOrder=A,C,G,T
Headers:	Name	Pattern	TFgroup	Info
>MIT_001NRF1	RCGCANGCGY	NRF1	16
>MIT_002MYC	CACGTG	MYC	12
>MIT_003ELK	SCGGAAGY	ELK	14
>MIT_005NFY	GATTGGY	NFY	13
>MIT_006SP1	GGGCGGR	SP1	13
>MIT_007AP1	TGANTCA	AP1	12
>MIT_009ATF	TGAYRTCA	ATF	14
>MIT_010YY1	GCCATNTTG	YY1	16
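
A degenerate consensus can be expanded into a PFM using the standard IUPAC nucleotide
codes. The Python sketch below splits a total of 100 equally among the nucleotides
allowed at each position; the equal split and the total of 100 are assumptions of
this sketch, and the conversion used by CisFinder itself may weight nucleotides
differently.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_pfm(pattern, total=100.0):
    """Return rows of A, C, G, T frequencies for a degenerate consensus."""
    pfm = []
    for code in pattern.upper():
        allowed = IUPAC[code]                 # KeyError for non-IUPAC symbols
        share = total / len(allowed)
        pfm.append([share if base in allowed else 0.0 for base in "ACGT"])
    return pfm
```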

Search results are formatted as a tab-delimited text file with the following
columns: MotifName, SeqName (sequence name), Strand, Len (motif length), Start (starting position counted from 0),
Score (matching score), Sequence (matching sequence), Conservation (evolutionary conservation score, from 0 to 100).
The "Parameters:" line should specify the name of the sequence file (file_fasta) that was searched
and the name of the motif file that contains the PFMs (file_motif). Here is an example of search results:

Parameters:	file_motif=public-motif_pluripotent	file_fasta=public-P300_binding.fa
Headers:	MotifName	SeqName	Strand	Len	Start	Score	Sequence	Conservation
STAT3	Chen2008P300000002-900-1100	-	9	119	4.3932	TTCCCGGAA	40
TEF	Chen2008P300000005-900-1100	-	8	87	3.3398	AGGAATGC	0
NANOG	Chen2008P300000004-900-1100	+	9	116	3.5202	CCACTTCCT	1
KLF	Chen2008P300000004-900-1100	+	8	96	3.7154	CTCCACCC	1
KLF4	Chen2008P300000004-900-1100	+	9	109	4.0565	GCCACACCC	1
SOX9	Chen2008P300000003-900-1100	-	8	60	3.7513	CCATTGTT	28
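
Search results in this layout can be loaded into typed records with a short Python
sketch. The function name read_hits is hypothetical, and the fallback column order
used when no "Headers:" line is present is taken from the column list above.

```python
DEFAULT_HEADERS = ["MotifName", "SeqName", "Strand", "Len", "Start",
                   "Score", "Sequence", "Conservation"]

def read_hits(lines):
    """Parse search-result lines into a list of dicts with numeric fields."""
    headers = None
    hits = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("Parameters:"):
            continue
        if line.startswith("Headers:"):
            headers = line.split("\t")[1:]
            continue
        record = dict(zip(headers or DEFAULT_HEADERS, line.split("\t")))
        for key in ("Len", "Start", "Conservation"):
            if key in record:
                record[key] = int(record[key])
        if "Score" in record:
            record["Score"] = float(record["Score"])
        hits.append(record)
    return hits
```

For instance, [h for h in read_hits(lines) if h["Conservation"] >= 20] keeps only
matches located in conserved regions.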

Repeat file is formatted as a list of repeat motifs, each described with 2 lines.
The first line has the following tab-delimited fields:
Index for gap structure (shown at the left in Fig. 5)
Index for the 8-mer word (nucleotides A,C,G,T are encoded as 0,1,2,3; then the index is
calculated as c1+c2*4+c3*16+c4*64+..., where ci is the code of position i)
Enrichment ratio in comparison to non-repeats
Motif total length
Repeat name (optional)

The second line shows the full PFM (all lines concatenated), space- or tab-delimited.
Repeat file can be generated by the "patternFind" program (-getrep option).
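
The index of the 8-mer word follows directly from the formula above
(c1+c2*4+c3*16+c4*64+...). The Python sketch below encodes a word and decodes an
index back; the function names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def word_index(word):
    """Index = c1 + c2*4 + c3*16 + ..., where ci is the code of position i."""
    return sum(CODE[base] * 4 ** i for i, base in enumerate(word.upper()))

def index_to_word(index, length=8):
    """Inverse of word_index for a word of the given length."""
    letters = []
    for _ in range(length):
        letters.append("ACGT"[index % 4])   # lowest base-4 digit is position 1
        index //= 4
    return "".join(letters)
```

Note that the first position of the word is the least significant digit, so
"CAAAAAAA" has index 1 and "ACAAAAAA" has index 4.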

Sequence conservation file has a fasta format; sequence name is preceded by ">" 
sign and the following lines contain conservation scores for the sequence, comma separated.
Conservation score ranges from 0 to 100 and can be downloaded from UCSC web site (there it
ranges from 0 to 1, hence it needs to be multiplied by 100).
The first line in the file should specify parameters: interval and genome. Normally we use
interval=5, which means that each conservation score corresponds to 5 bp of the sequence. If
interval is not specified, it is assumed to be 1. Here is an example of a conservation file:
Parameters:	interval=5	genome=mm9
>Chen2008P300000006
2,2,2,4,7,8,11,12,11,7,8,7,5,4,3,2,2,3,3,2,1,0,1,1,1,3,3,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,1,0,1,
3,1,1,3,4,2,3,4,4,3,0,1,0,1,1,0,0,10,27,36,47,72,64,48,27,1,2,16,22,25,35,40,11,4,4,3,4,5,6,0,0,5,10,11,4,1,0,3,0,2,
42,55,20,0,0,0,0,2,1,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,0,1,1,0,0,1,1,0,0,0,1,1,0,0,0,0,0,
0,2,2,0,0,7,21,24,17,5,2,1,1,3,5,7,4,0,0,0,1,3,4,4,3,11,12,12,9,4,1,0,1,0,0,0,0,1,2,2,1,2,1,0,0,0,0,1,0,2,
5,8,10,3,0,1,1,2,0,0,0,4,0,0,0,0,1,3,6,3,1,0,0,0,0,0,1,2,4,6,5,2,2,4,4,13,26,45,57,63,66,66,62,55,41,19,10,5,3,2,
2,5,8,7,4,0,0,0,0,0,0,0,0,2,4,4,4,3,3,1,1,0,1,4,3,1,2,6,7,5,3,7,13,20,28,32,34,39,40,38,32,19,9,1,0,1,2,1,1,0,
1,4,2,0,0,6,6,2,1,0,0,0,0,0,1,2,3,0,0,1,4,4,2,1,0,0,0,1,2,0,1,0,0,0,0,0,1,0,1,1,1,1,0,1,12,20,25,26,23,15,
11,11,21,28,30,28,23,24,27,30,33,35,38,40,39,38,38,34,25,15,9,6,4,4,3,4,8,8,9,12,14,12,6,2,1,0,0,0,0,1,3,3,4,9,12,13,14,9,3,1,
>Chen2008P300000007
0,0,1,3,4,4,3,2,1,1,2,2,3,5,4,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,2,12,23,28,30,34,37,35,29,22,16,15,17,16,10,6,3,2,
4,6,7,8,8,8,6,4,2,4,5,4,1,1,2,2,3,5,6,4,1,2,1,1,2,2,2,1,1,3,6,7,8,10,12,10,8,11,18,22,21,17,11,9,9,7,2,1,0,1,
0,0,0,1,3,3,0,1,18,21,14,0,0,0,0,0,0,0,0,0,1,9,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,16,58,97,57,1,11,40,39,87,92,41,24,2,30,28,1,15,99,92,96,97,99,100,100,98,100,95,100,100,100,100,84,87,100,100,100,100,100,100,100,100,100,
100,100,100,100,100,100,100,100,100,99,98,99,100,97,99,100,100,99,99,100,100,98,100,78,88,100,100,100,100,100,99,77,16,49,99,98,99,100,100,99,93,19,2,0,4,5,12,46,100,100,
100,100,100,44,2,3,15,96,100,100,100,87,92,98,99,99,33,0,3,17,61,81,88,95,100,100,100,100,98,100,100,100,99,21,81,89,98,100,100,100,99,100,99,98,100,100,100,100,99,100,
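
A conservation file in this layout can be read, and a score looked up for a given
sequence position, with the Python sketch below. The function names are hypothetical,
and positions are assumed to be counted from 0 as in the search results.

```python
def read_conservation(lines):
    """Return (interval, {sequence name: [scores]}) from conservation file lines."""
    interval = 1                       # default when interval is not specified
    scores = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Parameters:"):
            for field in line.split("\t")[1:]:
                key, _, value = field.partition("=")
                if key == "interval":
                    interval = int(value)
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            scores[name] = []
        elif name is not None:
            # lines may end with a trailing comma, so empty items are skipped
            scores[name].extend(int(x) for x in line.split(",") if x.strip())
    return interval, scores

def conservation_at(seq_scores, interval, position):
    """Score of the interval covering a 0-based position, or None if out of range."""
    index = position // interval
    return seq_scores[index] if 0 <= index < len(seq_scores) else None
```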

End of readme.txt