Extract keywords from target-domain news stored in the DB (gnews.gnews_detail).
First, we use a Transformer model to get a semantic vector for each news article and cluster the vectors with HDBSCAN. Then we use predict_doc.txt to predict which cluster the target domain falls in. Finally, we use unsupervised algorithms such as RAKE, TF-IDF, TextRank, and MultipartiteRank to extract the keywords of the news in this cluster and store them in tag_list.csv.
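To illustrate the last step, here is a minimal sketch of TF-IDF keyword ranking, one of the unsupervised methods named above. It is not the code in gnews_keyword_extraction.py: the function name `tfidf_keywords`, the sample documents, and the choice to sum scores across documents are assumptions made for illustration, and the input is assumed to be already tokenized (for Chinese news, for example with jieba).

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=5):
    """Rank the terms of a tokenized corpus by their summed TF-IDF score.

    docs is a list of token lists, e.g. the news articles of one cluster.
    Hypothetical helper, not taken from the repository.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))

    scores = Counter()
    for doc in docs:
        if not doc:
            continue  # skip empty documents to avoid dividing by zero
        counts = Counter(doc)
        for term, count in counts.items():
            tf = count / len(doc)              # relative term frequency
            idf = math.log(n_docs / df[term])  # 0 when the term is in every doc
            scores[term] += tf * idf
    return [term for term, _ in scores.most_common(top_k)]


if __name__ == "__main__":
    # Made-up, pre-tokenized sample documents.
    sample_docs = [
        ["housing", "renewal", "policy", "housing"],
        ["housing", "market", "price"],
        ["weather", "rain", "housing"],
    ]
    print(tfidf_keywords(sample_docs, top_k=4))
```

A term that appears in every document of the cluster gets an IDF of zero, so it never outranks terms that are specific to a few articles.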
pip install -r requirements.txt
Note: See requirements.txt for more details.
Copy the news content of the target domain and paste it into predict_doc.txt.
python gnews_keyword_extraction.py --topK 80 --target_domain_range 1
--topK (optional): Get the top K keywords. (Default: 80)
--target_domain_range (optional): Candidate range of target domain. 0 means only one target cluster is used, 1 means a total of three clusters are used from the side. (Default: 1)
Get keywords from tag_list.csv.
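The two command-line options described above could be handled roughly as in the sketch below. The flag names and defaults come from this README; everything else is an assumption, including the helper `select_clusters`, the idea that clusters are ranked by distance to the predict_doc.txt vector, and the rule that a range of r selects 2r + 1 clusters (which matches the documented cases 0 and 1 only).

```python
import argparse


def build_parser():
    """Build a parser mirroring the two documented options."""
    parser = argparse.ArgumentParser(
        description="Extract keywords from target-domain news."
    )
    parser.add_argument("--topK", type=int, default=80,
                        help="Number of top keywords to return.")
    parser.add_argument("--target_domain_range", type=int, default=1,
                        help="0 = target cluster only, 1 = three clusters.")
    return parser


def select_clusters(distances, target_domain_range):
    """Return the ids of the clusters closest to the target document.

    distances maps cluster id -> distance from the target document vector.
    Assumption: a range of r keeps the 2 * r + 1 nearest clusters.
    """
    n_clusters = 2 * target_domain_range + 1
    ranked = sorted(distances, key=lambda cluster_id: distances[cluster_id])
    return ranked[:n_clusters]


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Made-up distances for four clusters, for illustration only.
    example_distances = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.3}
    print(args.topK, select_clusters(example_distances, args.target_domain_range))
```

Both options fall back to their documented defaults when omitted, so running the script without arguments is equivalent to passing --topK 80 --target_domain_range 1.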