Extract keywords from target-domain news stored in the database (gnews.gnews_detail).
First, we use a Transformer model to obtain a semantic vector for each news article and cluster the vectors with HDBSCAN. Next, we use the content of predict_doc.txt to predict which cluster the target domain falls into. Finally, we use unsupervised algorithms such as RAKE, TF-IDF, TextRank, and MultipartiteRank to extract keywords from the news in that cluster and store them in tag_list.csv.
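To make the keyword-extraction step more concrete, here is a minimal sketch of TF-IDF ranking over the documents of one cluster. It is not the code in gnews_keyword_extraction.py: the function name `tfidf_keywords`, the summed-score ranking, and the smoothed IDF formula are assumptions made for illustration, and the documents are assumed to be tokenized already (for Chinese text, for example with jieba).

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=5):
    """Rank terms across pre-tokenized documents by summed TF-IDF score.

    Hypothetical helper for illustration; `docs` is a list of token lists.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        if not doc:
            continue  # skip empty documents to avoid dividing by zero
        tf = Counter(doc)
        for term, count in tf.items():
            # Smoothed IDF (an assumed variant): rare terms weigh more,
            # but terms present in every document keep a weight of 1.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            scores[term] += (count / len(doc)) * idf
    return [term for term, _ in scores.most_common(top_k)]


# Toy cluster of three tokenized news items (made-up data).
cluster_docs = [
    ["housing", "loan", "rate", "housing"],
    ["housing", "renewal", "urban", "renewal"],
    ["stock", "market", "rate"],
]
top_terms = tfidf_keywords(cluster_docs, top_k=2)
```

In a real run the token lists would come from the news in the predicted cluster, and `top_k` would play the role of the `--topK` option.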
pip install -r requirements.txt
Note: See requirements.txt for more details.
Copy the news content of the target domain and paste it into predict_doc.txt.
python gnews_keyword_extraction.py --topK 80 --target_domain_range 1
--topK (optional): Number of top keywords to return. (Default: 80)
--target_domain_range (optional): Candidate range of the target domain. 0 means only the single target cluster is used; 1 means three clusters in total are used, adding the clusters on either side of it. (Default: 1)
Get the keywords from tag_list.csv.
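The two options above could be declared with `argparse` roughly as follows. This is a sketch only: the flag names and defaults come from the description above, while the `int` types, the help strings, and the `build_parser` function name are assumptions and may differ from the actual script.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Extract keywords from target-domain news."
    )
    # Both flags are optional; defaults follow the README (80 and 1).
    parser.add_argument("--topK", type=int, default=80,
                        help="number of top keywords to return")
    parser.add_argument("--target_domain_range", type=int, default=1,
                        help="0: target cluster only; 1: three clusters in total")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--topK", "50", "--target_domain_range", "0"])
```

Calling `build_parser().parse_args([])` falls back to the documented defaults.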