No such file or directory: './data/text_dict' #2

Open
GabrielLin opened this issue Apr 11, 2021 · 8 comments

Comments

@GabrielLin

When I run generate_dataset.py, the following error is shown. I am afraid there is nothing in the data folder. Could you help? Thanks.

Reading text vocabulary from './data/text_dict'...
Traceback (most recent call last):
  File "generate_dataset.py", line 301, in <module>
    dicts['text'] = initVocabulary('text', None, './data/text_dict', 50000, ' ', False)
  File "generate_dataset.py", line 283, in initVocabulary
    vocab.loadFile(vocabFile)  
  File "generate_dataset.py", line 73, in loadFile
    for line in open(filename):
FileNotFoundError: [Errno 2] No such file or directory: './data/text_dict'
@abhigupta768
Owner

You first need to build the text_dict, authors_dict, and pvs_dict files.

Change lines 300-312 in generate_dataset.py to the following:

dicts['text'] = initVocabulary('text', ['./data/abstract_train.p', './title_train.p'], None, 50000, ' ', False)
dicts['authors'] = initVocabulary('authors', './data/authors_train.p', None, 20000, ' ', False)

saveVocabulary('text', dicts['text'], './data/text_dict')
saveVocabulary('authors', dicts['authors'], './data/authors_dict')

pvs=p.load(open("./pv_train.p","rb"))
unique_pvs=np.unique(np.array(pvs))
dicts['pvs']=Dict()
for pv in unique_pvs:
    dicts['pvs'].add(pv)
dicts['pvs'] = dicts['pvs'].prune(300)
saveVocabulary('pvs', dicts['pvs'], './data/pvs_dict')

After you have built these files once, you can revert the code back to the original.
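
For reference, a minimal sketch of the reverted block once the dict files exist (the 'text' line is the one shown in your traceback; the other two are assumed to follow the same pattern):

dicts['text'] = initVocabulary('text', None, './data/text_dict', 50000, ' ', False)
dicts['authors'] = initVocabulary('authors', None, './data/authors_dict', 20000, ' ', False)
dicts['pvs'] = initVocabulary('pvs', None, './data/pvs_dict', 50000, ' ', False)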

Please let me know if you have any questions. Thanks!

@GabrielLin
Author

I followed your instructions and the script has been running for more than two days with no output. I will try running from the very beginning again. It might take a little time. Thanks.

@GabrielLin
Author

Sorry for the late reply. I have tried many times, but the modified script just keeps running and never finishes.

@abhigupta768
Owner

abhigupta768 commented May 17, 2021

Hey, sorry for the delayed reply. Can you provide some sample data that you are giving as input to the scripts?

If you can provide some sample data, I will look into it over the weekend and find the issue.

Thanks!

@GabrielLin
Author

GabrielLin commented May 18, 2021

Thank you. Here are the steps I followed.

My Readme

This repository contains code for the Modular-Hierarchical Attention Based Scholarly Venue Recommender System using Deep Learning

Tested on Ubuntu 16.04.4 LTS

Ref Repo

Updated to https://github.com/abhigupta768/publication-venue-recommender/tree/530702eb0552aafb8f8517b329579610e1a7aa81

Ref Paper

Pradhan, T., Gupta, A., & Pal, S. (2020). HASVRec: A modularized hierarchical attention-based scholarly venue recommender system. Knowledge-Based Systems, 204, 106181. doi:10.1016/j.knosys.2020.106181

Dependencies

Python Environment

conda create -c conda-forge -n py36gpvr python=3.6
conda activate py36gpvr

conda install -c pytorch pytorch==1.7.1 torchvision==0.8.2 cudatoolkit==10.2.89 cudnn==7.6.5

conda install -c conda-forge nltk==3.6.1
conda install -c conda-forge pandas==1.1.5


Data

Download AMiner-Paper.rar from https://lfs.aminer.cn/lab-datasets/aminerdataset/AMiner-Paper.rar;
I converted the above file into zip format and stored it elsewhere;

Unzip the archive at the project root

unzip AMiner-Paper.zip

Extract

python extract_data.py

Copy all .p files to the ./data folder

cp *.p data/

Generate the PyTorch-compatible dataset

nohup python -u generate_dataset_init.py > generate_dataset_init.log 2>&1 &
tailf generate_dataset_init.log

@GabrielLin
Author

In the step above, generate_dataset_init.py is the file I modified following your suggestions. Its content is:

# -*- coding: utf-8 -*-
"""pubrec-generate-dataset.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1yowkK19M7YGLZTbjX7g4dTEaQphvp0a_
"""

# Commented out IPython magic to ensure Python compatibility.
# The following two lines are commented by me
# from google.colab import drive
# drive.mount('/gdrive')
# %cd /gdrive

# Commented out IPython magic to ensure Python compatibility.
# %cd My\ Drive/pubrec

import torch
import torch.utils.data as torch_data

import time
import csv
import json as js
import os
import codecs
import pickle as p

import numpy as np

import nltk
nltk.download('punkt')
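# Note: nltk.download() needs network access; behind a restrictive proxy this
# call can hang silently (see the proxy issue mentioned later in this thread).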

PAD = 0
UNK = 1
BOS = 2
EOS = 3

PAD_WORD = '<blank>' 
UNK_WORD = 'UNK'
BOS_WORD = '<s>'
EOS_WORD = '</s>'
SPA_WORD = ' '

# Recursively flatten nested iterables (e.g. lists of index lists) into a flat stream of values.
def flatten(l):
    for el in l:
        if hasattr(el, "__iter__"):
            for sub in flatten(el):
                yield sub
        else:
            yield el

class Dict(object):
    def __init__(self, data=None, lower=False):
        self.idxToLabel = {}
        self.labelToIdx = {}
        self.frequencies = {}
        self.lower = lower
        # Special entries will not be pruned.
        self.special = [] 

        if data is not None:
            if type(data) == str:
                self.loadFile(data)
            else:
                self.addSpecials(data)

    def size(self):
        return len(self.idxToLabel)

    # Load entries from a file.
    def loadFile(self, filename):
        for line in open(filename):
            fields = line.split()
            label = ' '.join(fields[:-1])
            idx = int(fields[-1])
            self.add(label, idx)

    # Write entries to a file.
    def writeFile(self, filename):
        with open(filename, 'w') as file:
            for i in range(self.size()):
                label = self.idxToLabel[i]
                file.write('%s %d\n' % (label, i))

    def loadDict(self, idxToLabel):
        for i in range(len(idxToLabel)):
            label = idxToLabel[i]
            self.add(label, i)

    def lookup(self, key, default=None):
        key = key.lower() if self.lower else key
        try:
            return self.labelToIdx[key]
        except KeyError:
            return default

    def getLabel(self, idx, default=None):
        try:
            return self.idxToLabel[idx]
        except KeyError:
            return default

    # Mark this `label` and `idx` as special (i.e. will not be pruned).
    def addSpecial(self, label, idx=None):
        idx = self.add(label, idx)
        self.special += [idx]

    # Mark all labels in `labels` as specials (i.e. will not be pruned).
    def addSpecials(self, labels):
        for label in labels:
            self.addSpecial(label)

    # Add `label` in the dictionary. Use `idx` as its index if given.
    def add(self, label, idx=None):
        label = label.lower() if self.lower else label
        if idx is not None:
            self.idxToLabel[idx] = label
            self.labelToIdx[label] = idx
        else:
            if label in self.labelToIdx:
                idx = self.labelToIdx[label]
            else:
                idx = len(self.idxToLabel)
                self.idxToLabel[idx] = label
                self.labelToIdx[label] = idx

        if idx not in self.frequencies:
            self.frequencies[idx] = 1
        else:
            self.frequencies[idx] += 1

        return idx

    # Return a new dictionary with the `size` most frequent entries.
    def prune(self, size):
        if size >= self.size():
            return self

        # Only keep the `size` most frequent entries.
        freq = torch.Tensor(
                [self.frequencies[i] for i in range(len(self.frequencies))])
        _, idx = torch.sort(freq, 0, True)
        newDict = Dict()
        newDict.lower = self.lower

        # Add special entries in all cases.
        for i in self.special:
            newDict.addSpecial(self.idxToLabel[i])

        for i in idx[:size]:
            newDict.add(self.idxToLabel[i.item()])

        return newDict

    # Convert `labels` to indices. Use `unkWord` if not found.
    # Optionally insert `bosWord` at the beginning and `eosWord` at the end.
    def convertToIdx(self, labels, unkWord, bosWord=None, eosWord=None):
        vec = []

        if bosWord is not None:
            vec += [self.lookup(bosWord)]

        unk = self.lookup(unkWord)
        vec += [self.lookup(label, default=unk) for label in labels]

        if eosWord is not None:
            vec += [self.lookup(eosWord)]

        vec = [x for x in flatten(vec)]

        return torch.LongTensor(vec)

    # Convert `idx` to labels. If index `stop` is reached, convert it and return.
    def convertToLabels(self, idx, stop):
        labels = []

        for i in idx:
            if i == stop:
                break
            labels += [self.getLabel(i)]

        return labels

class AttrDict(dict):

    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self

def read_config(path):
    return AttrDict(js.load(open(path, 'r')))


def format_time(t):
    return time.strftime("%Y-%m-%d-%H:%M:%S", t)


def logging(file):
    def write_log(s):
        print(s, end='')
        with open(file, 'a') as f:
            f.write(s)
    return write_log


def logging_csv(file):
    def write_csv(s):
        with open(file, 'a', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(s)
    return write_csv

class dataset(torch_data.Dataset):

    def __init__(self, text_data, label_data):
        self.text_data = text_data
        self.label_data = label_data
  
    def __getitem__(self, index):
        return [torch.from_numpy(x[index]).type(torch.FloatTensor) for x in self.text_data],\
               torch.from_numpy(self.label_data[index]).type(torch.FloatTensor)

    def __len__(self):
        return len(self.label_data)
       

def get_loader(dataset, batch_size, shuffle, num_workers):

    data_loader = torch.utils.data.DataLoader(dataset=dataset,
                                              batch_size=batch_size,
                                              shuffle=shuffle,
                                              num_workers=num_workers)
    return data_loader

def makeVocabulary(filename, size, sep=' ', char=False):

    vocab = Dict([PAD_WORD, UNK_WORD], lower=True)
    if char:
        vocab.addSpecial(SPA_WORD)

    lengths = []

    if type(filename) == list:
        for _filename in filename:
            data = p.load(open(_filename,"rb"))
            for sent in data:
                for word in sent.strip().split(sep):
                    lengths.append(len(word))
                    if char:
                        for ch in word.strip():
                            vocab.add(ch)
                    else:
                        vocab.add(word.strip())
    else:
        data = p.load(open(filename,"rb"))
        for sent in data:
            for word in sent.strip().split(sep):
                lengths.append(len(word))
                if char:
                    for ch in word.strip():
                        vocab.add(ch)
                else:
                    vocab.add(word.strip())
    print('max: %d, min: %d, avg: %.2f' % (max(lengths), min(lengths), sum(lengths)/len(lengths)))

    originalSize = vocab.size()
    vocab = vocab.prune(size)  
    print('Created dictionary of size %d (pruned from %d)' %
          (vocab.size(), originalSize))

    return vocab

def initVocabulary(name, dataFile, vocabFile, vocabSize, sep=' ', char=False):

    vocab = None
    if vocabFile is not None:
        # If given, load existing word dictionary.
        print('Reading ' + name + ' vocabulary from \'' + vocabFile + '\'...')
        vocab = Dict()
        vocab.loadFile(vocabFile)  
        print('Loaded ' + str(vocab.size()) + ' ' + name + ' words')

    if vocab is None:
        # If a dictionary is still missing, generate it.
        print('Building ' + name + ' vocabulary...')
        genWordVocab = makeVocabulary(dataFile, vocabSize, sep=sep, char=char)  
        vocab = genWordVocab

    return vocab


def saveVocabulary(name, vocab, file):

    print('Saving ' + name + ' vocabulary to \'' + file + '\'...')
    vocab.writeFile(file)

dicts = {}
dicts['text'] = initVocabulary('text', ['./data/abstract_train.p', './title_train.p'], None, 50000, ' ', False)
dicts['authors'] = initVocabulary('authors', './data/authors_train.p', None, 20000, ' ', False)

saveVocabulary('text', dicts['text'], './data/text_dict')
saveVocabulary('authors', dicts['authors'], './data/authors_dict')

pvs=p.load(open("./pv_train.p","rb"))
unique_pvs=np.unique(np.array(pvs))
dicts['pvs']=Dict()
for pv in unique_pvs:
    dicts['pvs'].add(pv)
dicts['pvs'] = dicts['pvs'].prune(300)
saveVocabulary('pvs', dicts['pvs'], './data/pvs_dict')

dicts['pvs'] = initVocabulary('pvs', None, './data/pvs_dict', 50000, ' ', False)

abstract = {'text_file': './abstract_train.p', 'text_dict': dicts['text'], 'doc_len': 8, 'text_len': 20}
title = {'text_file': './title_train.p', 'text_dict': dicts['text'], 'text_len': 20}
authors = {'text_file': './authors_train.p', 'text_dict': dicts['authors'], 'text_len': 7}
pv = {'pv_file': './pv_train.p', 'pv_dict': dicts['pvs']}

abstract_val = {'text_file': './abstract_val.p', 'text_dict': dicts['text'], 'doc_len': 8, 'text_len': 20}
title_val = {'text_file': './title_val.p', 'text_dict': dicts['text'], 'text_len': 20}
authors_val = {'text_file': './authors_val.p', 'text_dict': dicts['authors'], 'text_len': 7}
pv_val = {'pv_file': './pv_val.p', 'pv_dict': dicts['pvs']}

abstract_test = {'text_file': './abstract_test.p', 'text_dict': dicts['text'], 'doc_len': 8, 'text_len': 20}
title_test = {'text_file': './title_test.p', 'text_dict': dicts['text'], 'text_len': 20}
authors_test = {'text_file': './authors_test.p', 'text_dict': dicts['authors'], 'text_len': 7}
pv_test = {'pv_file': './pv_test.p', 'pv_dict': dicts['pvs']}

def make_data(abstract, title, authors, pv):

    text_data = []
    text_data.append(make_abstract_data(abstract['text_file'], abstract['text_dict'], abstract['doc_len'], abstract['text_len']))
    text_data.append(make_title_data(title['text_file'], title['text_dict'], title['text_len']))
    text_data.append(make_author_data(authors['text_file'], authors['text_dict'], authors['text_len'], sep=';'))
    pv_data = make_pv_data(pv['pv_file'], pv['pv_dict'])

    return dataset(text_data, pv_data)

# Each abstract becomes a (doc_length x text_length) matrix of word indices (unknown words map to index 1, i.e. UNK).
def make_abstract_data(text_file, text_dict, doc_length, text_length, sep=' '):
    result = []
    data=p.load(open(text_file,"rb"))
    for line in data:
        temp = np.zeros((doc_length, text_length))
        sents=nltk.sent_tokenize(line)
        for i in range(len(sents)):
            if i < doc_length:
                words = nltk.word_tokenize(sents[i].strip())
                for j in range(len(words)):
                    if j < text_length:
                        temp[i, j] = text_dict.lookup(words[j].lower(), 1)
        result.append(temp)
    return result

def make_title_data(text_file, text_dict, text_length, sep=' '):
    result = []
    data=p.load(open(text_file,"rb"))
    for line in data:
        temp = np.zeros(text_length)
        words = nltk.word_tokenize(line.strip())
        for i in range(len(words)):
            if i < text_length:
                temp[i] = text_dict.lookup(words[i].lower(), 1)
        result.append(temp)
    return result

def make_author_data(text_file, text_dict, text_length, sep=' '):
    result = []
    data=p.load(open(text_file,"rb"))
    for line in data:
        temp = np.zeros(text_length)
        words = line.strip().split(sep)
        for i in range(len(words)):
            if i < text_length:
                temp[i] = text_dict.lookup(words[i].lower(), 1)
        result.append(temp)
    return result

# Each venue label becomes a one-hot vector over the pvs dictionary.
def make_pv_data(pv_file, pv_dict):
    result = []
    length = len(pv_dict.idxToLabel)
    data=p.load(open(pv_file,"rb"))
    for line in data:
        temp = np.zeros(length)
        temp[pv_dict.lookup(str(line), 1)] = 1
        result.append(temp)
    return result

train = make_data(abstract, title, authors, pv)
val = make_data(abstract_val, title_val, authors_val, pv_val)
test = make_data(abstract_test, title_test, authors_test, pv_test)

data = {'train': train, 'val': val, 'test': test}
torch.save(data, './data/final_data_3')
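# The saved splits can later be reloaded with torch.load('./data/final_data_3').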

# added by me
print('DONE.')


@Sakura718

Sorry for the late reply. I have tried many times, but the modified script just keeps running and never finishes.

Hello, I have encountered the same problem as you. Although it has been a while, I would like to ask if you have solved this problem and how it was resolved. Thank you!

@Sakura718


Oh, I have already solved it. It was caused by the external network proxy. Thank you!
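
For anyone else hitting the same hang, a minimal sketch of pointing nltk at a proxy before the punkt download, assuming nltk's set_proxy helper and a placeholder proxy URL (replace it with your own):

import nltk

# Hypothetical proxy URL; replace with your actual proxy address.
nltk.set_proxy('http://your-proxy-host:8080')
nltk.download('punkt')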
