Newsgroups: fj.comp.texhax
Path: galaxy.trc.rwcp.or.jp!jaist-news!cs.titech!nirvana.cs.titech!wnoc-tyo-news!aist-nara!ie.u-ryukyu.ac.jp!hakata!kyu-cs!lab!kazama
From: kazama@square.ntt.jp (Kazuhiro Kazama)
Subject: Re: Postscript to text or TeX file?
In-Reply-To: hiroshi@cgate.hipecs.hokudai.ac.jp's message of 23 Mar 1994 00:38:19 JST
Message-ID: <ic5.1tgc4v@lab.ntt.jp>
Sender: news@lab.ntt.jp
Organization: NTT Basic Research Laboratories, Kanagawa, Japan.
References: <2mn3db$qhf@nameserv.sys.hokudai.ac.jp>
Date: 24 Mar 1994 09:04:23 GMT
Lines: 66
Xref: galaxy.trc.rwcp.or.jp fj.comp.texhax:4715
X-originally-archived-at: http://galaxy.rwcp.or.jp/text/cgi-bin/newsarticle2?ng=fj.comp.texhax&nb=4715&hd=a
X-reformat-date: Mon, 18 Oct 2004 15:18:22 +0900
X-reformat-comment: Tabs were expanded into 4 column tabstops by the Galaxy's archiver. See http://katsu.watanabe.name/ancientfj/galaxy-format.html for more info.

This is Kazama at NTT Basic Research Laboratories.

Subject: 3.10 How can I convert PostScript to ASCII? 

    In general, when you say ``I want to convert PostScript to ASCII'' 
    what you really mean is ``I want to convert MacWrite (which makes 
    PostScript output) to ASCII'' or ``I want to convert somebody's TeX 
    document (which I have in PostScript) to ASCII''. 

    Unfortunately, programs like these (if they're smart) do a lot of 
    fancy stuff like kerning, which means that where they would 
    normally execute the PostScript command for 

  
      ``print water fountain''
  
    instead they execute the PostScript commands for 

  
      ``print wat''      (move a little to get the spacing *just* right)
      ``print er''       (move a little to get the spacing *just* right)
      ``print foun''     (move a little to get the spacing *just* right)
      ``print tain''     (move a little to get the spacing *just* right)
  
    So if I write a program to look through a PostScript file for 
    strings, like ps2ascii.pl, it can't tell where the words really 
    end. Here my program would see four strings 

  
      ``wat'' ``er'' ``foun'' ``tain''
  
    And it doesn't see any difference between the spacing between 
    ``foun'' and ``tain'' (not a word break) and the spacing between 
    ``er'' and ``foun'' (a real word break). 
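
    As a rough illustration of the problem (a sketch in Python, not 
    the actual ps2ascii.pl code), a naive string extractor might look 
    like the following. The PostScript fragment and the name 
    extract_strings are made up for this example. 

```python
import re

# Hypothetical PostScript fragment in the style of kerned formatter output.
# The small rmoveto values stand in for kerning; 3.1 is the real word gap.
SAMPLE_PS = """\
72 700 moveto (wat) show
0.3 0 rmoveto (er) show
3.1 0 rmoveto (foun) show
0.2 0 rmoveto (tain) show
"""


def extract_strings(ps_text):
    """Return the contents of every simple (...) string literal.

    This ignores nested parentheses, backslash escapes and hex strings,
    all of which real PostScript allows.
    """
    return re.findall(r"\(([^()\\]*)\)", ps_text)


fragments = extract_strings(SAMPLE_PS)
print(fragments)
print("".join(fragments))   # no word breaks at all
print(" ".join(fragments))  # a break after every fragment
```

    Neither way of joining the fragments gives back ``water 
    fountain'', because the information about which gap is a real 
    word break lives in the movement commands, not in the strings. 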

    The problem is that PostScript for text formatting is usually 
    machine generated by a text formatter. A PostScript generator 
    like dvips might have a special command like ``boop'' that 
    differentiates between a real word break and a fake one. But 
    every text formatter that generates PostScript has its own name 
    for the ``boop'' command. 

    So you really want a ``PostScript to ASCII converter for dvips 
    output''. 

    The only general solution I can see would be to redefine the show 
    operator to print out the currentpoint for every letter being 
    printed, like gs2asc, and then make up an ASCII page based on this 
    by sticking ASCII characters where they go in a two-dimensional 
    array. That would convert PostScript to ASCII ``formatted''. 
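
    The second half of that idea, turning recorded positions into an 
    ASCII page, can be sketched in Python as follows. This is not how 
    gs2asc is implemented; it assumes something else has already 
    produced (x, y, character) records in points, and the function 
    name render_page and the cell sizes are assumptions made for the 
    example. 

```python
def render_page(glyphs, char_width=6.0, line_height=12.0, page_height=792.0):
    """Place (x, y, char) records on a character grid and return text lines.

    x and y are in PostScript points with the origin at the bottom left.
    char_width and line_height are assumed cell sizes, not measured values.
    """
    cells = {}
    for x, y, ch in glyphs:
        col = int(round(x / char_width))
        # PostScript y grows upward, text rows grow downward.
        row = int(round((page_height - y) / line_height))
        cells[(row, col)] = ch
    if not cells:
        return []
    n_rows = max(r for r, _ in cells) + 1
    n_cols = max(c for _, c in cells) + 1
    grid = [[" "] * n_cols for _ in range(n_rows)]
    for (row, col), ch in cells.items():
        grid[row][col] = ch
    return ["".join(line).rstrip() for line in grid]


# Hypothetical input: one record per printed letter, none for the space,
# with a half-point nudge on every other glyph to imitate kerning.
text = "water fountain"
glyphs = [(i * 6.0 + (0.5 if i % 2 else 0.0), 780.0, ch)
          for i, ch in enumerate(text) if ch != " "]
lines = render_page(glyphs)
print(lines[-1])
```

    With these inputs the word gap shows up as an empty cell in the 
    grid. With proportional fonts a fixed cell width is only an 
    approximation, so letters can collide or leave spurious gaps. 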

    But even that wouldn't solve the problem, because special bitmap 
    fonts and standard fonts like Symbol don't always print a ``P'' 
    when you say the letter ``P''. Sometimes they print the Greek Pi 
    symbol or a chess piece or a Zapf Dingbat. 
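
    A converter would therefore need a per-font lookup before it 
    emits anything. The Python sketch below shows the idea with a 
    deliberately tiny, partial table for the Symbol font; the names 
    SYMBOL_NAMES and to_ascii and the ``<Pi>'' output convention are 
    assumptions made for this example. 

```python
# Partial, illustrative table: character code as typed -> glyph name
# in the Symbol font. A real table would cover the whole encoding.
SYMBOL_NAMES = {"P": "Pi", "p": "pi", "a": "alpha", "b": "beta"}


def to_ascii(text, font):
    """Return an ASCII rendering of text as it would print in the given font.

    Fonts other than Symbol are assumed to print the letters unchanged.
    Symbol codes missing from the partial table come out as '?'.
    """
    if font != "Symbol":
        return text
    return "".join(
        "<%s>" % SYMBOL_NAMES[c] if c in SYMBOL_NAMES else "?"
        for c in text
    )


print(to_ascii("P", "Times-Roman"))
print(to_ascii("P", "Symbol"))
```

    Bitmap fonts with no standard encoding have no such table to 
    consult, which is why this only helps for known fonts. 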

    Use ps2a, ps2ascii, ps2txt, ps2ascii.ps or ps2ascii.pl. 

$B!!(BTeX$B$N>l9g$@$H$9$k$H!";DG0$J$,$i%(%s%3!<%G%#%s%0$,0c$&$s$G!"=PNO$OB?(B
$B>/$*$+$7$/$J$k$H;W$$$^$9!#(B

$B!!$b$7(Bdvi$B$,$"$l$P(Bdvi2tex$B$H$$$&$N$b$"$k$s$G$9$,!"$"$l$O8x3+$5$l$?$N$+$J!)(B
--
$BIw4V(B $B0lMN(B (kazama@square.ntt.jp)
NTT$B4pAC8&5f=j(B $B>pJs2J3X8&5fIt(B
$BJ,;6%3%s%T%e!<%F%#%s%086M}8&5f%0%k!<%W(B
