ccc dedb5e01c5 add README.md 3 years ago
.ipynb_checkpoints d03252ce59 convert to .py file 3 years ago
stoplist d03252ce59 convert to .py file 3 years ago
README.md dedb5e01c5 add README.md 3 years ago
cn_stopwords.txt 970333fe6c first commit 3 years ago
customized_stopwords.pickle d03252ce59 convert to .py file 3 years ago
customized_stopwords.txt d03252ce59 convert to .py file 3 years ago
dict.txt.big d03252ce59 convert to .py file 3 years ago
gnews_keyword_extraction.ipynb d03252ce59 convert to .py file 3 years ago
gnews_keyword_extraction.py d03252ce59 convert to .py file 3 years ago
jieba_add_word.txt 970333fe6c first commit 3 years ago
jieba_add_word_kw_with_weighting.txt 970333fe6c first commit 3 years ago
predict_doc.txt d03252ce59 convert to .py file 3 years ago
renewhouse_list.pickle 970333fe6c first commit 3 years ago
requirements.txt dedb5e01c5 add README.md 3 years ago
tag_list.csv d03252ce59 convert to .py file 3 years ago

README.md

Gnews Keyword Extraction

Extract keywords from target-domain news stored in the database (gnews.gnews_detail).

First, a Transformer model encodes each news article into a semantic vector, and HDBSCAN clusters those vectors. Next, the text in predict_doc.txt is used to predict which cluster the target domain falls in. Finally, unsupervised algorithms such as RAKE, TF-IDF, TextRank, and MultipartiteRank extract keywords from the news in that cluster, and the results are stored in tag_list.csv.
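
As a rough illustration of the last stage, here is a minimal TF-IDF sketch over the documents of one cluster. It is not the code in gnews_keyword_extraction.py: the function name `tfidf_keywords`, the sample tokens, and the smoothed IDF formula are assumptions made for this example, and the documents are assumed to be tokenized already. The other algorithms named above (RAKE, TextRank, MultipartiteRank) are not shown.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=5):
    """Rank terms in a list of tokenized documents by summed TF-IDF.

    Hypothetical helper for illustration only, not taken from the repository.
    """
    n_docs = len(docs)

    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = Counter()
    for tokens in docs:
        if not tokens:
            continue
        tf = Counter(tokens)
        for term, count in tf.items():
            # Smoothed IDF (an assumed variant) so that a term present in
            # every document still gets a positive weight.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            scores[term] += (count / len(tokens)) * idf

    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


# Made-up sample tokens standing in for the news of one cluster.
cluster_docs = [
    ["urban", "renewal", "project", "approved"],
    ["urban", "renewal", "housing", "prices"],
    ["housing", "prices", "rise"],
]
keywords = tfidf_keywords(cluster_docs, top_k=3)
```

Summing the per-document scores favors terms that are both frequent and spread across the cluster; a real pipeline would likely also filter stopwords before scoring.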

Install

pip install -r requirements.txt

Note: See requirements.txt for more details.

Usage

Step 1

Copy the news content of the target domain and paste it into predict_doc.txt.

Step 2

python gnews_keyword_extraction.py --topK 80 --target_domain_range 1
  • --topK (optional): Number of top keywords to return. (Default: 80)
  • --target_domain_range (optional): Candidate range of clusters for the target domain. 0 means only the single target cluster is used; 1 means three clusters in total are used, the target cluster plus the adjacent ones on either side. (Default: 1)
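
The two options above could be declared with `argparse` roughly as follows. This is a sketch based only on the flag names and defaults listed here; the function name `build_parser` and the help strings are assumptions, and the parser in gnews_keyword_extraction.py may be written differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Extract keywords from target-domain news."
    )
    parser.add_argument(
        "--topK",
        type=int,
        default=80,
        help="Number of top keywords to return (default: 80).",
    )
    parser.add_argument(
        "--target_domain_range",
        type=int,
        default=1,
        help="Candidate range of clusters for the target domain (default: 1).",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--topK", "50"])
```

Here `args.topK` is 50 and `args.target_domain_range` keeps its default of 1, since both flags are optional.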

Step 3

Read the extracted keywords from tag_list.csv.
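
One way to load the result in Python is sketched below. The column layout of tag_list.csv is not documented here, so the hypothetical helper `load_rows` returns each row as a plain list of strings without assuming any header; UTF-8 encoding is also an assumption.

```python
import csv


def load_rows(path, encoding="utf-8"):
    """Return the non-empty rows of a CSV file as lists of strings.

    Hypothetical helper: no header or column meaning is assumed.
    """
    with open(path, newline="", encoding=encoding) as handle:
        return [row for row in csv.reader(handle) if row]
```

Calling `load_rows("tag_list.csv")` after Step 2 would then give the rows to inspect or post-process.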