Cleaning Special Characters from Product Text Files

Ever load a product page on your site and see strange characters in your product's text description? Typically this is from copying and pasting the product's text from an MS Word document. It's a subtle problem that can make your pages look unprofessional to your customers.

The clean_prod_text script was written to clean out any special characters that have been inadvertently added to the product text files of a PDG Commerce or Shop installation. It looks in each file for the offending characters and translates them as appropriate. Bullets, quotes and ellipses are all converted to HTML entities that display correctly on the page; any other non-ASCII characters are replaced with a space.
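
As an illustration of that translation, the sketch below applies a few of the same byte-to-entity replacements to one sample line. It is written for Python 3 and works on bytes, assuming the pasted text is in the Windows-1252 encoding that MS Word typically produces. The `ENTITIES` table and the `to_entities` name are made up for this example and are not part of the script itself.

```python
# A few Windows-1252 bytes that MS Word produces, and the HTML entity for each.
# (Illustrative subset only; the full script below handles more characters.)
ENTITIES = {
    b"\x85": b"&hellip;",  # ellipsis
    b"\x92": b"&rsquo;",   # right single quote / apostrophe
    b"\x93": b"&ldquo;",   # left double quote
    b"\x94": b"&rdquo;",   # right double quote
    b"\x95": b"&bull;",    # bullet
}


def to_entities(data):
    """Replace each byte listed in ENTITIES with its HTML entity."""
    for raw, entity in ENTITIES.items():
        data = data.replace(raw, entity)
    return data


sample = b"\x95 It\x92s \x93new\x94\x85"
print(to_entities(sample).decode("ascii"))
# prints: &bull; It&rsquo;s &ldquo;new&rdquo;&hellip;
```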

You can run the script once, or you can set it up to run on a regular schedule using your web server's cron service.

 

  1. Copy the script below into Notepad or your favorite plain-text editor (that is, not MS Word)
  2. Save it as a text file named clean_prod_text.py
  3. Upload / FTP the file into your server's cgi-bin folder
  4. Make sure the user and group of the file are appropriate to your server
  5. Make sure the file's permissions are set to 755 or rwxr-xr-x
  6. The script assumes that your product text files are in the standard location: cgi-bin/PDG_Commerce/ProdText/
  7. Back up (make a copy of) your ProdText folder, just in case.
  8. You can now run the script from the web server, from the command line or from your server's cron system.
    1. The easiest way to test would be to run it from the web browser; if you have a lot of products with a lot of text, then this may take a while.
    2. To run from the web browser, you would use something like the following: http://yourdomain.com/cgi-bin/clean_prod_text.py
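
Steps 5 and 7 can also be done from Python before the first run. The following is a minimal sketch, not part of the original script: the helper names `backup_prod_text` and `make_executable` and the timestamped `.bak-` suffix are assumptions made for illustration, and the default paths are the standard location from step 6 and the file name from step 2.

```python
import os
import shutil
import stat
import time


def backup_prod_text(src="PDG_Commerce/ProdText", stamp=None):
    """Copy the ProdText folder to a timestamped sibling and return its path.

    Hypothetical helper: the name and the '.bak-<timestamp>' suffix are
    illustrative choices, not part of PDG Commerce.
    """
    src = src.rstrip("/\\")
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    dest = "%s.bak-%s" % (src, stamp)
    shutil.copytree(src, dest)  # raises an error if dest already exists
    return dest


def make_executable(script_path="clean_prod_text.py"):
    """Set permissions to 755 (rwxr-xr-x) and return the resulting mode."""
    os.chmod(script_path, 0o755)
    return stat.S_IMODE(os.stat(script_path).st_mode)
```

Run both from the cgi-bin folder so the relative paths resolve; to restore, delete the damaged ProdText folder and rename the backup copy.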

 

#!/usr/bin/python
#
# clean_prod_text.py
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Clean Product Text -- a Python script to remove special characters,
# commonly added when pasting text from a MS Word document, into a
# product's Specific Product Text using PDG Administrator
#
# usage: install in your web server's cgi-bin folder. Make sure the script's 
# permissions are correct for your installation. You can then run the script 
# from the URL, from the command line or from a cron script.
#
# The script assumes that your product text files are in the standard 
# location: cgi-bin/PDG_Commerce/ProdText/
#
# From the URL: http://domain.com/cgi-bin/clean_prod_text.py
# From the command line: python clean_prod_text.py
# From cron: wget http://domain.com/cgi-bin/clean_prod_text.py
#
# Running the script will replace the known special characters in the text
# files with HTML entities, and any remaining ones with whitespace. You may
# edit the script to also replace carriage returns and other special characters.
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Developed by Hen's Teeth Network for the PDG Commerce Community
#
# This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 
# Unported License. To view a copy of this license, visit 
# http://creativecommons.org/licenses/by-sa/3.0/ 
# or send a letter to Creative Commons, 444 Castro Street, Suite 900, 
# Mountain View, California, 94041, USA.
#
# The code to iterate over and replace text in files -- primarily the code to 
# create and use temp files -- is from Thomas Watnedal's answer to the following
# Stack Overflow question, which is also licensed under a Creative Commons
# Attribution-ShareAlike license.
# http://stackoverflow.com/questions/39086/search-and-replace-a-line-in-a-file-in-python
# http://stackoverflow.com/users/4059/thomas-watnedal
#
# Version 1.0 - October 2011
#


import os
import re
import sys
from tempfile import mkstemp
from shutil import move
from os import remove, close

def cleanText(text):
    # Replace non-ASCII characters with printable ASCII. 
    # Use HTML entities when possible
    if text is None:
        return ''

    
    text = re.sub(r'\x85', '&hellip;', text)  # replace ellipsis
    text = re.sub(r'\x91', '&lsquo;', text)   # replace left single quote
    text = re.sub(r'\x92', '&rsquo;', text)   # replace right single quote
    text = re.sub(r'\x93', '&ldquo;', text)   # replace left double quote
    text = re.sub(r'\x94', '&rdquo;', text)   # replace right double quote
    text = re.sub(r'\x95', '&bull;', text)    # replace bullet
    text = re.sub(r'\x96', '-', text)         # replace en dash with a hyphen
    text = re.sub(r'\x99', '&trade;', text)   # replace TM
    text = re.sub(r'\xae', '&reg;', text)     # replace (R)
    text = re.sub(r'\xb0', '&deg;', text)     # replace degree symbol
    text = re.sub(r'\xba', '&deg;', text)     # replace masculine ordinal, often typed as a degree symbol

    # Do you want to keep new lines / carriage returns? These are generally 
    # okay and useful for readability
    #text = re.sub(r'[\n\r]+', ' ', text)     # remove embedded \n and \r

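    # This is a hard-core line that strips everything else: the remaining
    # control characters and high-bit bytes. Tab, \n and \r are excluded from
    # the pattern so that the line breaks kept above actually survive.
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x80-\xff]', ' ', text)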

    return text

    
print("Content-type: text/plain")
print("")
print("clean_prod_text")

prod_text_path = 'PDG_Commerce/ProdText/'
listing = os.listdir(prod_text_path)

for infile in listing:

    file_path = os.path.join(prod_text_path, infile)

    # Skip sub-folders and anything else that is not a regular file
    if not os.path.isfile(file_path):
        continue

    print("current file is: " + infile)

    # Create temp file
    fh, tmp_path = mkstemp()
    close(fh)  # only the path is needed; the file is reopened below
    new_file = open(tmp_path, 'w')
    old_file = open(file_path)

    for line in old_file:
        new_file.write(cleanText(line))

    # Close both files
    new_file.close()
    old_file.close()

    # Remove original file
    remove(file_path)

    # Move new file
    move(tmp_path, file_path)
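
Before letting the script rewrite anything, it can help to see which files actually contain Word's special characters. The sketch below is a read-only companion, not part of the original script: it is written for Python 3, reads each file as bytes, and reports the high-bit byte values it finds. The names `find_special_bytes` and `scan_folder` are made up for this example, and the default folder is the standard location the script assumes.

```python
import os
import re

# Bytes in the 0x80-0xff range, the same high-bit range the script strips.
HIGH_BYTES = re.compile(rb"[\x80-\xff]")


def find_special_bytes(data):
    """Return the sorted, de-duplicated high-bit byte values found in data."""
    return sorted({match.group()[0] for match in HIGH_BYTES.finditer(data)})


def scan_folder(folder="PDG_Commerce/ProdText"):
    """Map each file name in folder to the high-bit bytes it contains.

    Files with nothing to clean are left out of the result, and nothing
    on disk is modified.
    """
    report = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # skip sub-folders
        with open(path, "rb") as handle:
            found = find_special_bytes(handle.read())
        if found:
            report[name] = found
    return report
```

Calling `scan_folder()` from the cgi-bin folder returns a dictionary keyed by file name. A file containing curly double quotes, for example, would be listed with the values 147 and 148 (0x93 and 0x94). An empty dictionary means there is nothing for clean_prod_text.py to fix.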
 

