• 0
Votes
name

My Python script is exiting with OOO memory errors when processing a large file. (Works fine for small subset of records ~10K)

I'm processing 2 files:

  • companies.csv (File size 19MB) ~43K records
  • competitor_companies.csv (File size 427MB) ~4.5M records

In file 1, I have a field called uuid.

I need to compare:

  1. If company_name is the same in file1 and file2, copy uuid field from file 1 into competitor_companies dataframe.

  2. If website is the same in file1 and file2 , copy uuid field from file 1 into competitor_companies dataframe.

When I process the files in my server (~30 GB RAM), script gets stuck in this line:

logging.info('Matching TLD.')
match_tld = competitor_companies.tld.isin(companies.tld)

Then script stops and I see this line in /var/log/syslog:

Out of memory: Kill process 177106 (company_generat) score 923 or sacrifice child

Python code:

def MatchCompanies(
    companies: pandas.Dataframe,
    competitor_companies: pandas.Dataframe) -> Optional[Sequence[str]]:
  """Find Competitor companies in companies dataframe and generate a new list.

  Args:
    companies: A dataframe with company information from CSV file.
    competitor_companies: A dataframe with Competitor information from CSV file.

  Returns:
    A sequence of matched companies and their UUID.

  Raises:
    ValueError: No companies found.
  """

  if _IsEmpty(companies):
    raise ValueError('No companies found')
  # Clean up empty fields.
  companies = companies.fillna('')
  logging.info('Found: %d records.', len(competitor_companies))
  competitor_companies = competitor_companies.fillna('')
  # Create a column to define if we found a match or not.
  competitor_companies['match'] = False
  # Add Top Level Domain (tld) column to compare matching companies.
  companies.rename(columns={'website': 'tld'}, inplace=True)
  logging.info('Cleaning up company name.')
  companies.company_name = companies.company_name.apply(_NormalizeText)
  competitor_companies.company_name = competitor_companies.company_name.apply(
      _NormalizeText)
  # Create a new column since AppAnnie already contains TLD in company_url.
  competitor_companies.rename(columns={'company_url': 'tld'}, inplace=True)
  logging.info('Matching TLD.')
  match_tld = competitor_companies.tld.isin(companies.tld)
  logging.info('Matching Company Name.')
  match_company_name = competitor_companies.company_name.isin(
      companies.company_name)
  # Updates match column if TLD or company_name or similar companies matches.
  competitor_companies['match'] = match_tld | match_company_name
  # Extracts UUID for TLD matches.
  logging.info('Extracting UUID')
  merge_tld = competitor_companies.merge(
      companies[['tld', 'uuid']], on='tld', how='left')
  # Extracts UUID for company name matches.
  merge_company_name = competitor_companies.merge(
      companies[['company_name', 'uuid']], on='company_name', how='left')
  # Combines dataframes.
  competitor_companies['uuid'] = merge_tld['uuid'].combine_first(
      merge_company_name['uuid'])
  match_companies = len(competitor_companies[competitor_companies['match']])
  total_companies = len(competitor_companies)
  logging.info('Results found: %d out of %d', match_companies, total_companies)
  competitor_companies.drop('match', axis=1, inplace=True)
  competitor_companies.rename(columns={'tld': 'company_url'}, inplace=True)
  return competitor_companies

This is how I read the files:

def LoadDataSet(filename: str) -> pandas.Dataframe:
  """Reads CSV file where company information is stored.

  Header information exists in CSV file.

  Args:
    filename: Source CSV file. Header is present in file.

  Returns:
    A pandas dataframe with company information.

  Raises:
     FileError: Unable to read filename.
  """
  with open(filename) as input_file:
    data = input_file.read()
    dataframe = pandas.read_csv(
        io.BytesIO(data), header=0, low_memory=False, memory_map=True)
    return dataframe.where((pandas.notnull(dataframe)), None)

Looking for suggestions how to improve my code?

Top command results when running:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                             
190875 myuser   20   0 4000944   2.5g 107532 R 100.7   8.5   5:01.93 company_generat   

Why don't you use pd.merge directly?

You could create two dataframes, one for company_name matching, the second for websitematching and then left merge competitor_companies on each of these dataframes.

# Create 2 matching tables
c_website = companies[['uuid', 'website']].rename(columns={'uuid': 'uuid_from_website'})
c_name = companies[['uuid', 'company_name']].rename(columns={'uuid': 'uuid_from_name'})

# Merge on each of these tables
result = competitor_companies
.merge(c_website, how='left', on='website')
.merge(c_name, how='left', on='company_name')

You then need to reconciliate those two values, for instance giving priority to uuid_from_name:

result['uuid'] = np.where(res.uuid_from_name.notnull(), res.uuid_from_name, res.uuid_from_website)
del result['uuid_from_name']
del result['uuid_from_website']

It should be much faster than using pd.Series.isin.

  • 1
Reply Report