Commit bb03b548 authored by jfschaefer

remove scripts (moved to https://github.com/slatex/lmhtools)

parent f4a7c169
SMGLOM Scripts
===
This folder contains three scripts for analyzing *smglom* based on the *.tex* files in the repositories:
* `smglom_harvest.py` collects information about modules, symbols, verbalizations, ...
* `smglom_debug.py` looks for inconsistencies in the data and prints them (e.g. verbalizations for non-existent symbols)
* `smglom_stats.py` prints statistics about *smglom*
The scripts do not parse *TeX* 'properly'.
Instead, they use regular expressions, which means that the parsing is very limited
and error-prone.
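To give a rough idea of the approach, here is a deliberately simplified sketch (illustrative only; the actual patterns live in `smglom_harvest.py` and are more involved, handling optional arguments, for instance):
```python
import re

# Illustrative only -- NOT the pattern the scripts actually use.
# It only finds plain \defi{...} occurrences and breaks on nested braces,
# optional arguments, or commented-out code.
re_defi_sketch = re.compile(r"\\defi\s*\{([^}]+)\}")

tex = r"A \defi{structure} consists of one or more \defi{component}s."
print(re_defi_sketch.findall(tex))  # ['structure', 'component']
```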
### Requirements
The scripts require at least Python 3.6 and have only been run on Unix systems.
No special libraries should be necessary.
The scripts are run on a local folder that contains the required repositories
from [https://gl.mathhub.info/smglom](https://gl.mathhub.info/smglom).
Note that the scripts do not update (`pull`) the repositories automatically.
### smglom_harvest.py
This script contains the code for collecting data.
The script can be run directly with one of the following commands:
* `repo`: Lists all repositories found.
* `defi`: Lists all the verbalizations found.
* `trefi`: Lists all the `trefi`s found.
* `symi`: Lists all the symbol declarations/definitions found.
* `sigfile`: Lists all the signature files found.
* `langfile`: Lists all the language files found.
For example, the following command (where `../..` is the folder containing all the repositories):
```bash
./smglom_harvest.py defi ../..
```
prints lines like the following:
```
../../mv/source/piecewise.de.tex at 3:28: piecewise?defined-piecewise de "st"uckweise definiert"
../../mv/source/structure.en.tex at 3:9: structure?structure en "mathematical structure"
../../mv/source/structure.en.tex at 4:7: structure?component en "component"
```
The verbosity can be changed with a command-line option (e.g. `-v1`) to reduce the number of errors
shown during the data gathering.
For more information run
```bash
./smglom_harvest.py --help
```
### smglom_debug.py
This script uses the code from `smglom_harvest.py` to gather data and then checks for
inconsistencies.
Depending on the verbosity, more or fewer types of errors are displayed.
Other issues that are not strictly errors can be shown with extra command-line options:
* `-ma`: Show missing alignments.
* `-im`: Show missing verbalizations in all existing language files.
* `-mv`: Print all missing verbalizations for the languages specified after this argument,
including cases where a language file is missing for a module entirely.
Example arguments are `-mv en de` or `-mv all`.
* `-e`: Emacs mode (different formatting of file paths; the output can be opened directly in Emacs).
Example call:
```bash
./smglom_debug.py -mv -v2 ../..
```
`-v2` specifies the verbosity.
The output contains general errors like:
```
Verbalization 'multiset' provided multiple times:
../../sets/source/multiset.en.tex at 4:4
../../sets/source/multiset.en.tex at 4:73
../../sets/source/multiset.en.tex at 8:96
```
as well as missing verbalizations, which were requested with the `-mv` option:
```
../../mv/source/defeq.en.tex: Missing verbalizations for the following symbols: defequiv, eqdef
../../mv/source/mv.de.tex: Missing verbalizations for the following symbols: biimpl, conj, disj, exis, exisS, foral, foralS, imply, negate, nexis, nexisS, uexis, uexisS
```
Note that several directories can be passed to the script.
For more information run
```bash
./smglom_debug.py --help
```
### smglom_stats.py
This script uses the code from `smglom_harvest.py` to gather data and then prints some statistics.
Example call:
```bash
./smglom_stats.py -v0 ../..
```
Note that several directories can be passed to the script.
`-v0` sets the verbosity to 0, which suppresses errors during data gathering.
Note that errors can skew the statistics. For example, the percentages for each language
indicate what percentage of symbols have a verbalization in that language (ignoring symbols marked `noverb` for that language).
This can exceed 100% if there are many verbalizations for symbols
that are not declared in any signature file: such verbalizations are counted even though the corresponding symbols are not.
For more information run
```bash
./smglom_stats.py --help
```
### Developer notes
The data collection code is in `smglom_harvest.py`.
For simple scripts (e.g. for generating other statistics)
that do not require changes to the data collection,
this code can easily be imported and used.
Consider the following snippet to get you started:
```python
import smglom_harvest as harvest
PATH = "../.." # directory containing the repositories
VERBOSITY = 1
gatherer = harvest.DataGatherer()
logger = harvest.SimpleLogger(VERBOSITY)
harvest.gather_data_for_all_repos(PATH, harvest.HarvestContext(logger, gatherer))
print(gatherer.defis) # list of dictionaries, each containing the data for one defi
print(gatherer.repos)
print(gatherer.symis)
print(gatherer.trefis)
print(gatherer.sigfiles)
print(gatherer.langfiles)
```
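As a further example (building on the snippet above), the collected entries can be aggregated directly; the `"lang"` key used below is the same one `smglom_stats.py` relies on:
```python
from collections import Counter

import smglom_harvest as harvest

PATH = "../.."  # directory containing the repositories
gatherer = harvest.DataGatherer()
logger = harvest.SimpleLogger(1)
harvest.gather_data_for_all_repos(PATH, harvest.HarvestContext(logger, gatherer))

# Count the collected verbalizations (defis) per language.
per_lang = Counter(defi["lang"] for defi in gatherer.defis)
for lang, count in sorted(per_lang.items()):
    print(f"{lang}: {count} verbalizations")
```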
For questions and bug reports, feel free to reach out to [Jan Frederik Schaefer](https://kwarc.info/people/jfschaefer/).
import smglom_harvest as harvest
import sys

PATH = sys.argv[1]
VERBOSITY = 1

gatherer = harvest.DataGatherer()
logger = harvest.SimpleLogger(VERBOSITY)
harvest.gather_data_for_all_repos(PATH, harvest.HarvestContext(logger, gatherer))

# print all symbol declarations whose parameters mention "gfc"
for symi in gatherer.symis:
    if "gfc" in symi["params"]:
        print("I found a symi with gfc:")
        print("    ", symi)

# print all verbalizations whose parameters mention "gfa"
for defi in gatherer.defis:
    if "gfa" in defi["params"]:
        print("I found a defi with gfa:")
        print("    ", defi)
#!/usr/bin/env python3

"""
Script for fixing the repository dependencies in META-INF/MANIFEST.MF
"""

import os
import re

import smglom_harvest as harvest

TOKEN_MHINPUTREF = -1
TOKEN_MHGRAPHICS = -2

re_mhinputref = re.compile(
    r"\\n?mhinputref\s*"
    r"(?:\[(?P<params>[^\]]*)\])?\s*"        # parameter
    r"\{(?P<arg>" + harvest.re_arg + r")\}"  # arg
)

re_mhgraphics = re.compile(
    r"\\mhgraphics\s*"
    r"(?:\[(?P<params>[^\]]*)\])?\s*"        # parameter
    r"\{(?P<arg>" + harvest.re_arg + r")\}"  # arg
)

REGEXES = [
    (harvest.re_guse, harvest.TOKEN_GUSE),
    (harvest.re_gimport, harvest.TOKEN_GIMPORT),
    (harvest.re_importmhmodule, harvest.TOKEN_IMPORTMHMODULE),
    (harvest.re_usemhmodule, harvest.TOKEN_USEMHMODULE),
    (re_mhinputref, TOKEN_MHINPUTREF),
    (re_mhgraphics, TOKEN_MHGRAPHICS),
]
def gather_repos(path, REPOS):
    with open(path, "r") as fp:
        string = harvest.preprocess_string(fp.read())
    tokens = harvest.parse(string, REGEXES)
    for (match, token_type) in tokens:
        if token_type in [harvest.TOKEN_GUSE, harvest.TOKEN_GIMPORT, TOKEN_MHINPUTREF]:
            # repo is optional argument
            repo = match.group("params")
            if repo and repo not in REPOS.keys():
                REPOS[repo] = f"{path}:{harvest.get_file_pos_str(string, match.start())}: {match.group(0)}"
        elif token_type in [harvest.TOKEN_IMPORTMHMODULE, harvest.TOKEN_USEMHMODULE, TOKEN_MHGRAPHICS]:
            params = harvest.get_params(match.group("params"))
            key = "repos"
            if token_type == TOKEN_MHGRAPHICS:
                key = "mhrepos"
            if key in params.keys():
                repo = params[key]
                if repo and repo not in REPOS.keys():
                    REPOS[repo] = f"{path}:{harvest.get_file_pos_str(string, match.start())}: {match.group(0)}"
        else:
            assert False
def get_olddeps(line):
    line = line[len("dependencies:"):]
    while line and line[0] == " ":
        line = line[1:]
    sep = re.compile(r",\s*")
    return sep.split(line)
def adjust_manifest(dir_path, REPOS):
    new_manifest = ""
    found_deps = False
    new_line = "dependencies: " + ",".join(REPOS.keys())
    with open(os.path.join(dir_path, "../META-INF/MANIFEST.MF"), "r") as fp:
        for line in fp:
            if line.startswith("dependencies: "):
                if found_deps:
                    print("ERROR: Multiple entries for dependencies found in manifest")
                    return
                old_entries = set(get_olddeps(line[:-1]))
                new_entries = set(REPOS.keys())
                if old_entries == new_entries:
                    print("The dependencies are already up-to-date")
                    return
                if new_entries - old_entries:
                    print("Adding the following dependencies:", ",".join(list(new_entries - old_entries)))
                    print()
                if old_entries - new_entries:
                    print("Removing the following dependencies:", repr(old_entries - new_entries))
                    print()
                print("old " + line[:-1])
                print("new " + new_line)
                new_manifest += new_line + "\n"
                found_deps = True
            else:
                new_manifest += line
    if not found_deps:
        print()
        print("No entry for dependencies found in " + os.path.join(dir_path, "../META-INF/MANIFEST.MF"))
        print("Appending the following entry:")
        print(new_line)
        new_manifest += new_line + "\n"

    print()
    i = input("Do you want to apply these changes? (enter 'y' to confirm): ")
    if i == 'y':
        with open(os.path.join(dir_path, "../META-INF/MANIFEST.MF"), "w") as fp:
            fp.write(new_manifest)
        print("Dependencies successfully updated")
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Script for fixing repo dependencies in META-INF/MANIFEST.MF",
                                     epilog="Example call: repo_dependencies.py -v0 ../../sets")
    parser.add_argument("-v", "--verbosity", type=int, default=1, choices=range(4), help="the verbosity (default: 1)")
    parser.add_argument("DIRECTORY", nargs="+", help="git repo or higher-level directory whose dependencies should be fixed")
    args = parser.parse_args()

    if args.verbosity >= 2:
        print("GATHERING DATA\n")
    logger = harvest.SimpleLogger(args.verbosity)

    # determine mathhub folder
    mathhub_repo = os.path.abspath(args.DIRECTORY[0])
    while not mathhub_repo.endswith("MathHub"):
        new = os.path.split(mathhub_repo)[0]
        if new == mathhub_repo:
            raise Exception("Failed to infer MathHub directory")
        mathhub_repo = new

    for directory in args.DIRECTORY:
        if not os.path.isdir(os.path.join(directory, ".git")):  # TODO: Is there a better way?
            raise Exception("'" + directory + "' doesn't appear to be a git repository")
        REPOS = {}  # repo name to evidence
        dir_path = os.path.join(directory, "source")
        for root, dirs, files in os.walk(dir_path):
            for file_name in files:
                if file_name.endswith(".tex"):
                    gather_repos(os.path.join(root, file_name), REPOS)
        for repo in REPOS.keys():
            print("I found this dependency:", repo)
            print("Evidence:", REPOS[repo])
            print()
        to_ignore = None
        for repo in REPOS.keys():
            rp = os.path.abspath(os.path.join(dir_path, "../../..", repo))
            if not os.path.isdir(rp):
                print("WARNING: I didn't find the directory " + rp)
            if directory.endswith(repo):
                print("WARNING: It appears that you self-reference the repo:")
                print("    " + REPOS[repo])
                print("    -> I'm going to ignore this entry")
                to_ignore = repo
        if to_ignore:
            del REPOS[to_ignore]
        print()
        print()
        adjust_manifest(dir_path, REPOS)
#!/usr/bin/env python3

"""
Can be used to create statistics about smglom.

This script analyzes the data collected with smglom_harvest.py.
A verbosity level can be set to change what kinds of errors
are displayed during data collection.

TODO: CREATE TABLE DATA INDEPENDENTLY OF PRESENTATION
"""

import smglom_harvest as harvest
import os


def partition(entries, key):
    result = {}
    for entry in entries:
        k = key(entry)
        if k not in result:
            result[k] = []
        result[k].append(entry)
    return result


def unique_list(l):
    return sorted(list(set(l)))


def frac2str(a, b):
    if b == 0:
        return f"{'n/a':>9}"
    s = "%.1f" % (100 * a / b)
    return f"{s+'%':>9}"
def print_stats(gatherer):
    repos = unique_list([e["repo"] for e in gatherer.sigfiles + gatherer.langfiles + gatherer.modules])
    langs = unique_list([e["lang"] for e in gatherer.langfiles])
    sigf_part = partition(gatherer.sigfiles, lambda e: e["repo"])
    langf_part = partition(gatherer.langfiles, lambda e: e["repo"])
    symi_part = partition(gatherer.symis, lambda e: e["repo"])
    defi_part = partition(gatherer.defis, lambda e: (e["repo"], e["lang"]))
    trefi_part = partition(gatherer.trefis, lambda e: e["repo"])

    print(f"{'repo':20}{'modules':>9}{'aligned':>9}{'symbols':>9}{'aligned':>9}{'trefis':>9}"
          + "".join([f"{lang:>9}" for lang in langs]) + f"{'views':>9}")
    print("-" * (20 + 9 + 9 + 9 + 9 + 9 + 9 + 9 * len(langs)))
    for repo in repos:
        suffix = ""
        aligned_symbols = 0
        symbols = 0
        if repo in symi_part:
            symbols = len(set([(e["mod_name"], e["name"]) for e in symi_part[repo]]))
            aligned_symbols = len(set([(e["mod_name"], e["name"]) for e in symi_part[repo] if e["align"] and e["align"] != "noalign"]))
        for lang in langs:
            if (repo, lang) not in defi_part:
                verbs = 0
            else:
                verbs = len(set([(e["mod_name"], e["name"]) for e in defi_part[(repo, lang)]]))
            if repo in symi_part:
                symbols_withverb = len(set([(e["mod_name"], e["name"]) for e in symi_part[repo] if e["noverb"] != "all" and lang not in e["noverb"]]))
            else:
                symbols_withverb = 0
            suffix += frac2str(verbs, symbols_withverb)
        modsigs = 0
        gviewsigs = 0
        aligned_modsigs = 0
        if repo in sigf_part:
            modsigs = len([e for e in sigf_part[repo] if e['type'] == 'modsig'])
            aligned_modsigs = len([e for e in sigf_part[repo] if e['type'] == 'modsig' and e['align'] and e['align'] != "noalign"])
            gviewsigs = len([e for e in sigf_part[repo] if e['type'] == 'gviewsig'])
        trefis = 0
        if repo in trefi_part:
            trefis = len(trefi_part[repo])
        print(f"{repo:20}" +
              f"{modsigs:9}" + frac2str(aligned_modsigs, modsigs) +
              f"{symbols:9}" + frac2str(aligned_symbols, symbols) +
              f"{trefis:9}" +
              suffix +
              f"{gviewsigs:9}")
    print("-" * (20 + 9 + 9 + 9 + 9 + 9 + 9 + 9 * len(langs)))

    suffix = ""
    symbols = len(set([(e["mod_name"], e["name"]) for e in gatherer.symis]))
    aligned_symbols = len(set([(e["mod_name"], e["name"]) for e in gatherer.symis if e["align"] and e["align"] != "noalign"]))
    for lang in langs:
        verbs = len(set([(e["mod_name"], e["name"]) for e in gatherer.defis if e["lang"] == lang]))
        symbols_withverb = len(set([(e["mod_name"], e["name"]) for e in gatherer.symis if e["noverb"] != "all" and lang not in e["noverb"]]))
        suffix += frac2str(verbs, symbols_withverb)
    modsigs = len([e for e in gatherer.sigfiles if e['type'] == 'modsig'])
    aligned_modsigs = len([e for e in gatherer.sigfiles if e['type'] == 'modsig' and e["align"] and e["align"] != "noalign"])
    print(f"{'TOTAL':20}" +
          f"{modsigs:9}" + frac2str(aligned_modsigs, modsigs) +
          f"{symbols:9}" + frac2str(aligned_symbols, symbols) +
          f"{len(gatherer.trefis):9}" +
          suffix +
          f"{len([e for e in gatherer.sigfiles if e['type']=='gviewsig']):9}")
def create_csv(gatherer):
    repos = unique_list([e["repo"] for e in gatherer.sigfiles + gatherer.langfiles + gatherer.modules])
    langs = unique_list([e["lang"] for e in gatherer.langfiles])
    sigf_part = partition(gatherer.sigfiles, lambda e: e["repo"])
    langf_part = partition(gatherer.langfiles, lambda e: e["repo"])
    symi_part = partition(gatherer.symis, lambda e: e["repo"])
    defi_part = partition(gatherer.defis, lambda e: (e["repo"], e["lang"]))
    trefi_part = partition(gatherer.trefis, lambda e: e["repo"])

    with open("stats.csv", "w") as fp:
        fp.write("repo, modules, modules aligned, symbols, symbols aligned, total trefis, "
                 + ", ".join([f"coverage {l}" for l in langs]) + ", "
                 + ", ".join([f"synonymity {l}" for l in langs]) + ", views\n")
        for repo in repos:
            coverages = []
            synonymity = []
            aligned_symbols = 0
            symbols = 0
            if repo in symi_part:
                symbols = len(set([(e["mod_name"], e["name"]) for e in symi_part[repo]]))
                aligned_symbols = len(set([(e["mod_name"], e["name"]) for e in symi_part[repo] if e["align"] and e["align"] != "noalign"]))
            for lang in langs:
                if (repo, lang) not in defi_part:
                    verbs = 0
                else:
                    verbs = len(set([(e["mod_name"], e["name"]) for e in defi_part[(repo, lang)]]))
                    verb_syns = len(set([(e["mod_name"], e["name"], e["string"]) for e in defi_part[(repo, lang)]]))
                if repo in symi_part:
                    symbols_withverb = len(set([(e["mod_name"], e["name"]) for e in symi_part[repo] if e["noverb"] != "all" and lang not in e["noverb"]]))
                else:
                    symbols_withverb = 0
                coverages += [str(verbs / symbols_withverb) if symbols_withverb > 0 else "n/a"]
                synonymity += [str(verb_syns / verbs) if verbs > 0 else "n/a"]
            modsigs = 0
            gviewsigs = 0
            aligned_modsigs = 0
            if repo in sigf_part:
                modsigs = len([e for e in sigf_part[repo] if e['type'] == 'modsig'])
                aligned_modsigs = len([e for e in sigf_part[repo] if e['type'] == 'modsig' and e['align'] and e['align'] != "noalign"])
                gviewsigs = len([e for e in sigf_part[repo] if e['type'] == 'gviewsig'])
            trefis = 0
            if repo in trefi_part:
                trefis = len(trefi_part[repo])
            fp.write(f"{repo}, {modsigs}, {aligned_modsigs / modsigs if modsigs else 'n/a'}, "
                     f"{symbols}, {aligned_symbols / symbols if symbols else 'n/a'}, "
                     f"{trefis}, {', '.join(coverages)}, {', '.join(synonymity)}, {gviewsigs}\n")

        symbols = len(set([(e["mod_name"], e["name"]) for e in gatherer.symis]))
        aligned_symbols = len(set([(e["mod_name"], e["name"]) for e in gatherer.symis if e["align"] and e["align"] != "noalign"]))
        coverages = []
        synonymity = []
        for lang in langs:
            verbs = len(set([(e["mod_name"], e["name"]) for e in gatherer.defis if e["lang"] == lang]))
            verb_syns = len(set([(e["mod_name"], e["name"], e["string"]) for e in gatherer.defis if e["lang"] == lang]))
            symbols_withverb = len(set([(e["mod_name"], e["name"]) for e in gatherer.symis if e["noverb"] != "all" and lang not in e["noverb"]]))
            coverages += [str(verbs / symbols_withverb) if symbols_withverb > 0 else "n/a"]
            synonymity += [str(verb_syns / verbs) if verbs > 0 else "n/a"]
        modsigs = len([e for e in gatherer.sigfiles if e['type'] == 'modsig'])
        aligned_modsigs = len([e for e in gatherer.sigfiles if e['type'] == 'modsig' and e["align"] and e["align"] != "noalign"])
        fp.write(f"TOTAL, {modsigs}, {aligned_modsigs / modsigs if modsigs else 'n/a'}, "
                 f"{symbols}, {aligned_symbols / symbols if symbols else 'n/a'}, "
                 f"{len(gatherer.trefis)}, {', '.join(coverages)}, {', '.join(synonymity)}, "
                 f"{len([e for e in gatherer.sigfiles if e['type']=='gviewsig'])}\n")
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Script for printing SMGloM statistics",
                                     epilog="Example call: smglom_stats.py -v0 ../..")
    parser.add_argument("-v", "--verbosity", type=int, default=1, choices=range(4), help="the verbosity (default: 1)")
    parser.add_argument("-c", "--csv", action="store_true", help="generate a CSV table")
    parser.add_argument("DIRECTORY", nargs="+", help="git repo or higher level directory for which statistics are generated")
    args = parser.parse_args()

    if args.verbosity >= 2:
        print("GATHERING DATA\n")
    logger = harvest.SimpleLogger(args.verbosity)

    # determine mathhub folder
    mathhub_dir = os.path.abspath(args.DIRECTORY[0])
    while not mathhub_dir.endswith("MathHub"):
        new = os.path.split(mathhub_dir)[0]
        if new == mathhub_dir:
            raise Exception("Failed to infer MathHub directory")
        mathhub_dir = new

    ctx = harvest.HarvestContext(logger, harvest.DataGatherer(), mathhub_dir)
    for directory in args.DIRECTORY:
        harvest.gather_data_for_all_repos(directory, ctx)

    if args.verbosity >= 2 or ctx.something_was_logged:
        print("\n\nSTATISTICS\n")
    print_stats(ctx.gatherer)

    if args.csv:
        create_csv(ctx.gatherer)
        print("\n\nCreated stats.csv")