Mean F1 score (cross-validation, k = 5) with different feature engineering methods

| Model | F1_weighted (%) |
|---|---|
| LinearSVC (baseline) | 60.8 |
| CNN | 50 |
| DistilBERT Multilingual | 50.1 |
| BiLSTM | 57 |
| ULMFiT | 57 |
| BERT-Base Multilingual | 59.5 |
| BERTimbau | 63.6 |

Precision, recall, and F1 score by Section

| Section | Precision | Recall | F1 |
|---|---|---|---|
| A | 0.7923 | 0.7000 | 0.7433 |
| H | 0.6805 | 0.7248 | 0.7019 |
| C | 0.6515 | 0.7427 | 0.6941 |
| E | 0.5981 | 0.5923 | 0.5952 |
| F | 0.5446 | 0.5393 | 0.5419 |
| B | 0.5291 | 0.5388 | 0.5339 |
| G | 0.4903 | 0.4633 | 0.4764 |
| D | 0.5098 | 0.4041 | 0.4509 |
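
The F1 column is the harmonic mean of precision and recall, so each row can be re-derived from its other two values. The following is a minimal sketch of that check in Python, using the figures from the table above; the function name `f1_from_pr` and the tolerance are illustrative choices, not part of the original study.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per IPC section, copied from the table.
rows = {
    "A": (0.7923, 0.7, 0.7433),
    "H": (0.6805, 0.7248, 0.7019),
    "C": (0.6515, 0.7427, 0.6941),
    "E": (0.5981, 0.5923, 0.5952),
    "F": (0.5446, 0.5393, 0.5419),
    "B": (0.5291, 0.5388, 0.5339),
    "G": (0.4903, 0.4633, 0.4764),
    "D": (0.5098, 0.4041, 0.4509),
}

for section, (p, r, reported) in rows.items():
    computed = f1_from_pr(p, r)
    # The table values are rounded to four decimals, so allow a small tolerance.
    assert abs(computed - reported) < 5e-4, section
    print(f"{section}: computed F1 = {computed:.4f}, reported = {reported:.4f}")
```

Every row passes within the stated tolerance, which suggests the three columns are internally consistent.
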

Patent classification related studies

| Authors | Feature Engineering | Algorithm | Patent text used | Language | Dataset size | Number of classes |
|---|---|---|---|---|---|---|
| (Trappey et al., 2006) | Key-phrase frequency based on TF-IDF | Neural Networks | full document | English | 300 training | 9 |
| (Derieux et al., 2010) | Term extraction and semantic relations | SVM | full document | English, German, French | 985 training | 630 |
| (Trappey et al., 2013) | Key-phrase frequency based on TF-IDF | Ontology-Based Neural Network | full document | English | 333 training | 23 |
| (Zhang, 2014) | - | SVM | - | English | 5000 | 5 |
| (Wu et al., 2016) | SOM, KPCA | SVM | full document | English | 60,000 | 7 |
| (Li et al., 2018) | Skip-gram | CNN | title and abstract | English | 742,097 training; 1,350 test | 637 |
| (Risch & Krestel, 2019) | Domain-specific FastText word embeddings | Bi-directional GRU | title and abstract | English | ~1.7M training | 637 |
| (Abdelgawad et al., 2020) | GloVe, Word2Vec, FastText | Hierarchical SVM and CNN with BOHB (Bayesian Optimization and Hyperband) | title, abstract, description, and claims | English | 75,000 training | 451 |
| (Lee & Hsiang, 2020) | - | BERT-Base | claims | English | 1,950,247 training | 632 |

F1 score on the test set

| Model | F1_weighted (%) |
|---|---|
| LinearSVC (baseline) | 60.8 |
| CNN | 50 |
| DistilBERT Multilingual | 50.1 |
| BiLSTM | 57 |
| ULMFiT | 57 |
| BERT-Base Multilingual | 59.5 |
| BERTimbau | 63.6 |
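
The column header `F1_weighted` is not defined in the table itself. Assuming it denotes the support-weighted average of per-class F1 scores (the convention used by scikit-learn's `average="weighted"` option), the metric can be sketched in plain Python as follows. The function name `weighted_f1` and the toy labels are hypothetical and are not data from the study.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, weighted by each class's true count."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += f1 * n_true / total
    return score


# Hypothetical toy labels (IPC sections), for illustration only.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "C", "C"]
print(f"{100 * weighted_f1(y_true, y_pred):.1f}")
```

Because each class contributes in proportion to its number of true examples, frequent sections influence this score more than rare ones.
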

IPC Areas of Technology

| Section | Description |
|---|---|
| A | Human Necessities |
| B | Performing Operations; Transporting |
| C | Chemistry; Metallurgy |
| D | Textiles; Paper |
| E | Fixed Constructions |
| F | Mechanical Engineering; Lighting; Heating; Weapons; Blasting; Engines or Pumps |
| G | Physics |
| H | Electricity |

Features used in the analysis

| Feature | Description |
|---|---|
| id | Internal patent identifier |
| Title | Descriptive name of the patent |
| Claims | The legal scope of the invention, including its limitations and field of application |
| Abstract | A brief description of the invention presented in the patent |
| Section | IPC 1st level classification code |
| Class | IPC 2nd level classification code |
| Subclass | IPC 3rd level classification code |
| Main group | IPC 4th level classification code |
| Subgroup | IPC 5th level classification code |
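
The five IPC levels listed above are nested, so all of them can be derived from one full IPC symbol. The sketch below splits a symbol such as `A61K 31/12` into those levels. The regular expression is a simplified assumption about how symbols are written, and `split_ipc` is a hypothetical helper; the table does not say how the dataset stores these codes.

```python
import re

# Simplified pattern for a full IPC symbol such as "A61K 31/12".
# This is an assumption: real data may use other spacing or extra indicators.
IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def split_ipc(symbol: str) -> dict:
    """Split a full IPC symbol into its five hierarchical levels."""
    match = IPC_PATTERN.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognised IPC symbol: {symbol!r}")
    section, cls, subcls, group, subgroup = match.groups()
    subclass = f"{section}{cls}{subcls}"
    return {
        "Section": section,                             # 1st level, e.g. "A"
        "Class": f"{section}{cls}",                     # 2nd level, e.g. "A61"
        "Subclass": subclass,                           # 3rd level, e.g. "A61K"
        "Main group": f"{subclass} {group}/00",         # 4th level
        "Subgroup": f"{subclass} {group}/{subgroup}",   # 5th level
    }


for level, code in split_ipc("A61K 31/12").items():
    print(f"{level}: {code}")
```

A classifier that predicts only the Section, as in the results reported by Section, needs just the first character of the symbol.
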