4 years ago · dedb5e01c5
--- a/README.md
+++ b/README.md
@@ -0,0 +1,28 @@
 
				+# Gnews Keyword Extraction
			
 
				+**Extract keywords from target-domain news stored in the database (`gnews.gnews_detail`).**
			
 
				+
			
 
				+First, we use a Transformer model to encode each news article as a semantic vector, and cluster the vectors with HDBSCAN.
			
 
				+Then we use the text in [predict_doc.txt](predict_doc.txt) to predict which cluster the target domain most likely belongs to.
			
 
				+Finally, we use unsupervised algorithms such as RAKE, TF-IDF, TextRank, and MultipartiteRank to extract keywords from the news in this cluster and store them in [tag_list.csv](tag_list.csv).
			
 
				+
			
 
				+
			
 
				+## Install
			
 
				+```bash
			
 
				+pip install -r requirements.txt
			
 
				+```
			
 
				+Note: See [requirements.txt](requirements.txt) for more details.
			
 
				+
			
 
				+## Usage
			
 
				+### Step 1
			
 
				+
			
 
				+Copy the news content of the target domain and paste it into [predict_doc.txt](predict_doc.txt).
			
 
				+
			
 
				+### Step 2
			
 
				+```bash
			
 
				+python gnews_keyword_extraction.py --topK 80 --target_domain_range 1
			
 
				+```
			
 
				+* `--topK` *(optional)*: Number of top-ranked keywords to return. (Default: `80`)
			
 
				+* `--target_domain_range` *(optional)*: Candidate range around the predicted target cluster. `0` uses only the target cluster; `1` uses three clusters in total (the target cluster plus the ones beside it). (Default: `1`)
			
 
				+
			
 
				+### Step 3
			
 
				+Read the extracted keywords from [tag_list.csv](tag_list.csv).
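
The command-line interface described in the README above could be wired up roughly as follows. This is a minimal sketch, not the contents of `gnews_keyword_extraction.py`: only the two flags and their defaults come from the README. The helper names `build_parser`, `select_clusters`, and `top_k_keywords` are hypothetical, and the rule that a range of `r` keeps the `2 * r + 1` best-matching clusters is an assumption inferred from the "`0` means one cluster, `1` means three clusters" description.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two flags documented in the README."""
    parser = argparse.ArgumentParser(description="Gnews keyword extraction (sketch)")
    parser.add_argument("--topK", type=int, default=80,
                        help="number of top-ranked keywords to keep")
    parser.add_argument("--target_domain_range", type=int, default=1,
                        help="0 = target cluster only, 1 = three clusters in total")
    return parser


def select_clusters(ranked_cluster_ids, target_domain_range):
    """Pick the candidate clusters for keyword extraction.

    Assumption: `ranked_cluster_ids` is ordered best match first, and a
    range of r keeps 2 * r + 1 clusters (r=0 -> 1 cluster, r=1 -> 3).
    """
    if target_domain_range < 0:
        raise ValueError("target_domain_range must be non-negative")
    return list(ranked_cluster_ids[: 2 * target_domain_range + 1])


def top_k_keywords(scores, top_k):
    """Return the top_k keywords from a {keyword: score} mapping, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [keyword for keyword, _ in ranked[:top_k]]


if __name__ == "__main__":
    # An explicit argument list stands in for the real command line here.
    args = build_parser().parse_args(["--topK", "80", "--target_domain_range", "1"])
    clusters = select_clusters([7, 2, 5, 9], args.target_domain_range)
    print(clusters, args.topK)
```

In a real script, `parse_args()` would be called without arguments so that it reads `sys.argv`, and the scores passed to `top_k_keywords` would come from whichever extractor (RAKE, TF-IDF, TextRank, or MultipartiteRank) is in use.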
			
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,13 +1,13 @@
 
				-OpenCC==1.1.1.post1
			
 
				-sentence_transformers==2.0.0
			
 
				-tqdm==4.61.2
			
 
				+dataset==1.5.0
			
 
				 pandas==1.2.4
			
 
				-pke==1.8.1
			
 
				+paddlepaddle==2.1.1
			
 
				+sentence_transformers==2.0.0
			
 
				 jieba==0.42.1
			
 
				-nltk==3.6.2
			
 
				-dataset==1.5.0
			
 
				 PyMySQL==1.0.2
			
 
				+pke==1.8.1
			
 
				+OpenCC==1.1.1.post1
			
 
				+nltk==3.6.2
			
 
				 torch==1.8.1+cu111
			
 
				-paddlepaddle==2.1.1
			
 
				+tqdm==4.61.2
			
 
				 hdbscan==0.8.27
			
 
				 paddle==1.0.2