ccc 3 years ago
parent
commit
dedb5e01c5
2 changed files, 35 additions and 7 deletions
  1. README.md (+28 -0)
  2. requirements.txt (+7 -7)

+ 28 - 0
README.md

@@ -0,0 +1,28 @@
+# Gnews Keyword Extraction
+**Extract keywords from target-domain news stored in the DB (gnews.gnews_detail).**
+
+First, we use a Transformer model to obtain a semantic vector for each news article, then cluster the vectors with HDBSCAN.
+Next, we use [predict_doc.txt](predict_doc.txt) to predict which cluster the target domain most likely falls in.
+Finally, we apply unsupervised algorithms such as RAKE, TF-IDF, TextRank, and MultipartiteRank to extract keywords from the news in that cluster and store them in [tag_list.csv](predict_doc.txt).
+
+
+## Install
+```bash
+pip install -r requirements.txt
+```
+Note: See [requirements.txt](requirements.txt) for more details.
+
+## Usage
+### Step 1
+
+Copy the news content of the target domain and paste it into [predict_doc.txt](predict_doc.txt).
+
+### Step 2
+```bash
+python gnews_keyword_extraction.py --topK 80 --target_domain_range 1
+```
+* `--topK` *(optional)*: Number of top-ranked keywords to return. (Default: `80`)
+* `--target_domain_range` *(optional)*: Candidate range around the target domain. `0` uses only the single target cluster; `1` uses three clusters in total, adding the clusters beside the target one. (Default: `1`)
+
+### Step 3
+Get keywords from [tag_list.csv](predict_doc.txt).
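
Note on the two options documented in Step 2: the commit does not include `gnews_keyword_extraction.py` itself, so the following is only a minimal sketch of how the flags and defaults could be parsed and how a range of `0` or `1` could map to one or three clusters. The helper `candidate_clusters`, the idea of an ordered cluster list, and the clipping at the list edges are all assumptions made for illustration, not the script's confirmed behavior.

```python
import argparse


def build_parser():
    """Parser mirroring the two flags and defaults documented in the README."""
    parser = argparse.ArgumentParser(description="Sketch of the documented CLI options")
    parser.add_argument("--topK", type=int, default=80,
                        help="number of top keywords to return")
    parser.add_argument("--target_domain_range", type=int, default=1,
                        help="0 = target cluster only, 1 = target cluster plus its neighbours")
    return parser


def candidate_clusters(ordered_clusters, target_index, target_domain_range):
    """Hypothetical helper: pick the target cluster and `target_domain_range`
    clusters on each side of it, clipped to the bounds of the list.

    Assumes the clusters are already arranged so that adjacent entries are
    the 'side' clusters the README refers to.
    """
    start = max(0, target_index - target_domain_range)
    stop = min(len(ordered_clusters), target_index + target_domain_range + 1)
    return ordered_clusters[start:stop]


if __name__ == "__main__":
    args = build_parser().parse_args()
    clusters = [10, 11, 12, 13, 14]  # made-up cluster labels
    print(candidate_clusters(clusters, 2, args.target_domain_range))
```

Under these assumptions a range of `r` selects at most `2 * r + 1` clusters, which matches the one-cluster and three-cluster cases the README describes.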

+ 7 - 7
requirements.txt

@@ -1,13 +1,13 @@
-OpenCC==1.1.1.post1
-sentence_transformers==2.0.0
-tqdm==4.61.2
+dataset==1.5.0
 pandas==1.2.4
-pke==1.8.1
+paddlepaddle==2.1.1
+sentence_transformers==2.0.0
 jieba==0.42.1
-nltk==3.6.2
-dataset==1.5.0
 PyMySQL==1.0.2
+pke==1.8.1
+OpenCC==1.1.1.post1
+nltk==3.6.2
 torch==1.8.1+cu111
-paddlepaddle==2.1.1
+tqdm==4.61.2
 hdbscan==0.8.27
 paddle==1.0.2
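
Because this hunk only reorders the pinned packages, it may help to confirm that an environment still matches the pins after `pip install -r requirements.txt`. The sketch below is not part of the commit; `parse_pins` and `find_mismatches` are hypothetical helper names, and it assumes every relevant line uses the `name==version` form seen above.

```python
from importlib import metadata


def parse_pins(text):
    """Parse `name==version` lines into a dict, skipping blanks and comments."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (wanted, installed)} for packages that are missing
    (installed is None) or installed at a different version."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    with open("requirements.txt", encoding="utf-8") as handle:
        for name, (wanted, installed) in find_mismatches(parse_pins(handle.read())).items():
            print(f"{name}: wanted {wanted}, found {installed}")
```

The comparison is a plain string match, so a pin such as `torch==1.8.1+cu111` only passes when the installed build reports the same local version suffix.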