This page links to some of the resources and tools used in the case studies in Section C of the book. We regret that license restrictions prevent us from making some corpora and packages available on this site; for these, readers are advised to consult the Corpus Survey. The site also provides resources that are not used in the book but are useful in corpus-based research. Each case study in the book demonstrates only one corpus exploration package, so demonstrations of how to undertake the case studies with other packages will be added to this page. The page is updated regularly with new resources.


Scott Songlin Piao's Multilingual Corpus Tools (requires Java) (474 k)

Tool for preprocessing the Longman Learners' Corpus (requires PERL) (case study 3)

Tool for computing factor scores (requires PERL) (case study 5) (1 k)

Search algorithms for use in MF/MD (case study 5)

File-based search patterns for use with the search algorithms (case study 5) (14 k)

Wordlist based on the BNC World Edition (for use with WordSmith 3 in case study 5) (1,630 k)

Wordlist based on FLOB (for use with WordSmith 3 in case study 5) (126 k)

Wordlist for academic prose (for use with WordSmith 3 in case study 5) (171 k)

Wordlist for speech (for use with WordSmith 3 in case study 5) (141 k)

Wordlist for conversation (for use with WordSmith 3 in case study 5) (89 k)

Wordlists (in batches) for three genres (for use with WordSmith 3 in case study 5) (132 k)

Key keywords for three genres (for use with WordSmith 3 in case study 5) (117 k)

Tagged CPSA Corpus (comm797 and comr797) (case study 5; note that there are two copies of comr797, one in a single file and one split into halves) (692 k)

Tagged Frown J category (case study 5) (537 k)

The Chinese-English Parallel Corpus of Public Health (case study 6) (197 k)

The Comparable Chinese Corpus of Public Health (case study 6) (82 k)

The Weekly Corpus of Chinese (case study 6) (269 k)

Academia Sinica Balanced Corpus of Modern Chinese (External link)

Peking University Chinese Corpus (External link, web interface in Chinese)

Peking University Babel Chinese-English Parallel Corpus (External link, web interface in Chinese)

Xiamen University corpora (External link, web interface in Chinese)

Beijing Language and Culture University corpus (External link, web interface in Chinese)

Search Algorithms

These search algorithms extract the 58 linguistic features from CLAWS-tagged corpora (C7 tagset) for use in a multi-feature/multi-dimensional (MF/MD) analysis. A detailed discussion of the functions of these linguistic features can be found in Biber (1988: 211-245). File-based search patterns can be downloaded below. After downloading, extract the compressed text files into c:\wsmith. The algorithms are designed for use with WordSmith Tools version 3.

After starting WordSmith, go to ‘Settings – Tags’ and activate ‘Tags to ignore’ (<*>). This makes the program ignore everything enclosed in angle brackets (metadata, comments, etc.) in the corpus files. Copy and paste the search patterns listed below into the text box ‘Search word or phrase’. Where a context is specified for an algorithm, adjust ‘Context words & Context search horizons’ (left and right) accordingly.
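As a rough illustration of what the ‘Tags to ignore’ setting achieves, the Python sketch below removes everything enclosed in angle brackets before any searching takes place. It is not WordSmith's own code: the function name strip_markup and the sample line are invented for illustration, and the sketch assumes that markup never contains a nested angle bracket.

```python
import re

# Assumption: markup is any run of characters between '<' and '>' with no
# nested angle brackets, which is how the <*> setting is read here.
MARKUP = re.compile(r"<[^<>]*>")


def strip_markup(text: str) -> str:
    """Remove <...> elements and collapse the whitespace they leave behind."""
    without_tags = MARKUP.sub(" ", text)
    return " ".join(without_tags.split())


# Invented sample line in word_TAG format with sentence markup around it.
sample = "<s n=1> It_PPH1 's_VBZ fine_JJ ._. </s>"
print(strip_markup(sample))  # It_PPH1 's_VBZ fine_JJ ._.
```

Only the word_TAG tokens survive, so the search patterns below are never matched against metadata or comments.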

Factor 1 (28 linguistic features):

(1) private verbs: c:\wsmith\privatev.txt

(2) THAT deletion: c:\wsmith\thatdel1.txt – c:\wsmith\thatdel8.txt

(3) contraction: *'*

Context 1L 2R =~*_GE/~"_"/~*_NP*/~*_NN*/~*_MC*/~*_RA/~*_UH*/~*_FO/~'_"

(4) present tense verbs: c:\wsmith\present.txt

(5) 2nd person pronouns: *_PPY/your_APPGE/yourself_PPX1/yourselves_PPX2/yours_PPGE

(6) DO as pro-verb: *_VD*

Context 0L 4R =~*_XX/~*_PPY/~*_PP?S*/~*_V?I

(7) analytic negation: *_XX

(8) demonstrative pronouns: this_DD1/that_DD1/these_DD2/those_DD2

Context 0L 3R=~*_NN*/~*_NP*/~*_PN1

(9) general emphatics: c:\wsmith\emphatic.txt

(10) 1st person pronouns: *_PPI*/my_APPGE/our_APPGE/myself_PPX1/ourselves_PPX2/mine_PPGE/ours_PPGE

(11) pronoun IT: it_PPH1

(12) BE as main verb: *_VB*

Context 0L 3R =*_D*/*_A*/*_NNB/*_I*/*_J*/~*_V?G/~*_V?N

(13) causative subordination: because_CS

(14) discourse markers:

a) well_* context 1L 0R = ~AS_*/~FEEL*_V*/~FELT_V*;

b) now_*/anyway*_*/anyhow_*

Context 2L 0R =?_?/AND_*/BUT_*/*_UH/~*_V*/~RIGHT_*

(15) indefinite pronouns: none_PN/*_PN1

(16) general hedges: c:\wsmith\hedge.txt

(17) amplifiers: c:\wsmith\amplify.txt

(18) sentence relatives: ,_, which_DDQ

(19) WH questions: ?_? WHAT_DDQ/?_? *_RRQ

Context 0L 3R =*_VD*/*_VB*/*_VH*/*_VM*

(20) possibility modals: can_VM/ca_VM/could_VM/may*_VM/might_VM

(21) non-phrasal coordination:

a) ,_, AND_CC IT_P*/,_, AND_CC SO_*/,_, AND_CC THEN_*/,_, AND_CC YOU_PPY*


c) ,_, AND_CC TH*_DD1/,_, AND_CC TH*_DD2/,_, AND_CC *_PP?S*

(22) WH clauses: c:\wsmith\pps.txt context 0L 3R= *_DDQ/~?_?/~*_I*

(23) final prepositions: *_I* context 0L 2R=?_?/~(_(

(24) other nouns: *_NN*/*_NP*/*_ND1

Context 0L 0R = ~*TION*_N*/~*MENT*_N*/~*NESS*_N*/~*ITY_N*/~*ITIES_N*

(25) word length: (WordSmith wordlist function: average word length)

(26) prepositions: *_I*

(27) type/token ratio: (WordSmith wordlist function: standardized type/token ratio)

(28) attributive adjectives: *_JJ *_NN*/*_JJ *_JJ
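Features (25) and (27) are read off WordSmith's wordlist statistics rather than searched for. For readers who want to see what the two figures measure, here is a minimal Python sketch. It assumes the usual description of the standardized type/token ratio: the ratio is computed for each consecutive chunk of n running words (1,000 is the customary chunk size), any incomplete final chunk is dropped, and the chunk ratios are averaged. The function names and the sample sentence are invented, and the input is assumed to be a list of plain words with tags and markup already removed.

```python
def average_word_length(words):
    """Mean number of characters per running word."""
    return sum(len(w) for w in words) / len(words)


def standardized_ttr(words, chunk_size=1000):
    """Mean type/token ratio (as a percentage) over consecutive full chunks."""
    ratios = []
    # Step through the text one full chunk at a time; a short tail is ignored.
    for start in range(0, len(words) - chunk_size + 1, chunk_size):
        chunk = [w.lower() for w in words[start:start + chunk_size]]
        ratios.append(100 * len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)


# Invented toy text; a tiny chunk size is used only to keep the example short.
words = "the cat sat on the mat and the dog sat too".split()
print(round(average_word_length(words), 2))   # 2.91
print(standardized_ttr(words, chunk_size=5))  # 90.0
```

Averaging over fixed-size chunks keeps the ratio comparable across texts of different lengths, which a plain type/token ratio is not.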

Factor 2 (6 linguistic features):

(29) past tense verbs: *_V?D*

(30) 3rd person pronouns: c:\wsmith\3persprn.txt

(31) perfect aspect verbs: c:\wsmith\perf_asp.txt

(32) public verbs: c:\wsmith\publicv.txt

(33) synthetic negation: no_AT/neither_*/nor_*

(34) present participial clauses: ,_, *_V?G *_I*/,_, *_V?G *_D*/,_, *_V?G *_P*/,_, *_V?G *_R*

Context 3L 0R= ~*_VB*
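The single-word patterns in these lists can be tried outside WordSmith with a few lines of Python. The sketch below rests on three assumptions about how the patterns are read: ‘/’ separates alternatives, ‘*’ stands for any run of characters, and ‘?’ stands for exactly one character, with matching done regardless of case. The function names and the sample sentence are invented, and Python's fnmatch module stands in for WordSmith's own matching.

```python
from fnmatch import fnmatchcase


def matches(token: str, pattern: str) -> bool:
    """True if a word_TAG token matches any '/'-separated alternative."""
    token = token.lower()
    alternatives = [alt.strip().lower() for alt in pattern.split("/")]
    return any(fnmatchcase(token, alt) for alt in alternatives if alt)


def count_hits(tokens, pattern):
    """Number of tokens that match the pattern."""
    return sum(matches(tok, pattern) for tok in tokens)


# Invented, hand-tagged sample in word_TAG format.
tokens = ("She_PPHS1 gave_VVD no_AT answer_NN1 ,_, "
          "nor_CC did_VDD I_PPIS1 ._.").split()

print(count_hits(tokens, "*_V?D*"))                 # (29) past tense verbs: 2
print(count_hits(tokens, "no_AT/neither_*/nor_*"))  # (33) synthetic negation: 2
```

Multi-word patterns such as (34) would need to be matched against sequences of tokens rather than single tokens, which this sketch does not attempt.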

Factor 3 (7 linguistic features):

(35) WH relative clauses: *_NN* *_PNQ*/WHICH*_DDQ*/WHOSE_DDQGE

Context 1L 0R= ~ASK*_V*/~TELL*_V*/~TOLD_V*/~*_I*/~?_?

(36) pied piping constructions: *_NN* *_PNQ*/WHICH*_DDQ*/WHOSE_DDQGE

Context 1L 0R =*_I*

(37) phrasal coordination: *_R* and_CC *_R*/*_J* and_CC *_J*/*_V* and_CC *_V*/*_N* and_CC *_N*

(38) nominalizations: *tion_N*/*tions_N*/*ment_N*/*ments_N*/*ness_N*/*nesses_N*/*ity_N*/*ities_N*

(39) time adverbials: *_RT*

(40) place adverbials: *_RL*

(41) other adverbs: *_R* minus all totals of hedges, amplifiers, downtoners, place adverbials and time adverbials
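Feature (41) is a derived count rather than a search result. The arithmetic can be sketched in Python as follows; the function name and all of the frequencies in the example are invented purely to show the subtraction.

```python
def other_adverbs(all_adverbs, hedges, amplifiers, downtoners, place, time):
    """Total *_R* hits minus the adverb classes that are counted separately."""
    counted_elsewhere = hedges + amplifiers + downtoners + place + time
    return all_adverbs - counted_elsewhere


# Invented frequencies for a single text.
print(other_adverbs(all_adverbs=120, hedges=4, amplifiers=10,
                    downtoners=6, place=15, time=20))  # 65
```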

Factor 4 (6 linguistic features):

(42) infinitives: to_TO *_V?I/to_TO *_R* *_V?I/to_TO *_R* *_R* *_V?I

(43) prediction modals: will_VM/wo_VM/shall_VM/sha_VM/'ll_VM/would_VM/'d_VM

(44) suasive verbs: c:\wsmith\suasivev.txt

(45) conditional subordination: if_CS/unless_CS

(46) necessity modals: ought_VM*/should_VM/must_VM

(47) split auxiliaries: c:\wsmith\splitaux.txt

Factor 5 (6 linguistic features):

(48) conjuncts: c:\wsmith\conjunct.txt

(49) agentless passives: c:\wsmith\agtlspsv.txt

Context 0L 6R=~by_II

(50) past participial clauses: ?_? *_V?N *_I*/?_? *_V?N *_R*

(51) BY-passives: c:\wsmith\by_psv.txt

Context 0L 6R=by_II

(52) past participial WHIZ deletions: c:\wsmith\whizdel.txt

Context 2L 0R= ~GET*_V*/~GOT_V*/~*_VH*

(53) other adverbial subordinators: c:\wsmith\otheradv.txt

Factor 6 (4 linguistic features):

(54) THAT clauses as verb complements: *_V* that_CST

(55) demonstratives: THESE_DD2/THOSE_DD2/THIS_DD1/THAT_DD1

Context 0L 3R= *_NN*/*_NP*/*_PN1

(56) THAT relative clauses: *_NN* THAT_CST

Context 0L 4R= *_AT*/*_D*/*_NP*/*_PP*/*_N*2*

(57) THAT clauses as adjective complements: *_JJ that_CST

Context 1L 0R= ~so_*
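Several algorithms pair a search pattern with a context and a horizon, for example ‘0L 3R’ for no words to the left and three to the right. The Python sketch below shows one plausible reading of these settings and is not a description of WordSmith's internals. It assumes that an unmarked context pattern must occur somewhere inside the horizon, that a pattern prefixed with ‘~’ must not occur there, and that ‘/’, ‘*’ and ‘?’ behave as alternation, any run of characters and any single character. The function names and the sample text are invented, only single-word search patterns are handled, and the window is allowed to cross sentence boundaries for simplicity.

```python
from fnmatch import fnmatchcase


def matches(token, alternatives):
    """True if the word_TAG token matches any wildcard alternative."""
    token = token.lower()
    return any(fnmatchcase(token, alt.lower()) for alt in alternatives)


def search(tokens, node, context, left, right):
    """Return the indices of node hits whose context window qualifies."""
    node_alts = [a.strip() for a in node.split("/") if a.strip()]
    ctx = [a.strip() for a in context.split("/") if a.strip()]
    required = [a for a in ctx if not a.startswith("~")]
    excluded = [a[1:] for a in ctx if a.startswith("~")]
    hits = []
    for i, tok in enumerate(tokens):
        if not matches(tok, node_alts):
            continue
        # Words inside the horizon, excluding the node word itself.
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        if required and not any(matches(w, required) for w in window):
            continue
        if any(matches(w, excluded) for w in window):
            continue
        hits.append(i)
    return hits


# Invented, hand-tagged sample in word_TAG format.
tokens = ("This_DD1 is_VBZ odd_JJ ._. I_PPIS1 like_VV0 "
          "this_DD1 old_JJ book_NN1 ._.").split()
node = "this_DD1/that_DD1/these_DD2/those_DD2"

# (55) demonstratives: a noun or PN1 must follow within three words.
print(search(tokens, node, "*_NN*/*_NP*/*_PN1", left=0, right=3))     # [6]
# (8) demonstrative pronouns: no noun or PN1 may follow within three words.
print(search(tokens, node, "~*_NN*/~*_NP*/~*_PN1", left=0, right=3))  # [0]
```

The two calls use the same search pattern and horizon; only the polarity of the context differs, which is what separates feature (55) from feature (8).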


Factor 7 (1 linguistic feature):

(58) SEEM/APPEAR: seem*_V*/appear*_V*


