Cucat Featurization base #486

tanmoyio · 2023-05-15T15:41:17Z

Starter script

import pandas as pd
import cudf
import graphistry

df = pd.read_csv('https://gist.githubusercontent.com/silkspace/c7b50d0c03dc59f63c48d68d696958ff/raw/31d918267f86f8252d42d2e9597ba6fc03fcdac2/redteam_50k.csv', index_col=0)
red_team = pd.read_csv('https://gist.githubusercontent.com/silkspace/5cf5a94b9ac4b4ffe38904f20d93edb1/raw/888dabd86f88ea747cf9ff5f6c44725e21536465/redteam_labels.csv', index_col=0)
df['feats'] = df.src_computer + ' ' + df.dst_computer + ' ' + df.auth_type + ' ' + df.logontype
tdf = pd.concat([red_team.reset_index(), df.reset_index()])
tdf['node'] = range(len(tdf))


g = graphistry.nodes(tdf)
g1 = g.umap(X=['feats'], feature_engine='cu_cat')
print(g1._node_features)
g2 = g.umap(X=['feats'], feature_engine='dirty_cat')
print(g2._node_features)

silkspace · 2023-05-16T16:22:23Z

graphistry/feature_utils.py

    if y is None:
        return df
    remove_cols = []
    if y is None:
        pass
-    elif isinstance(y, pd.DataFrame):
+    elif isinstance(y, pd.DataFrame) or isinstance(y, cudf.DataFrame):


great catch

@dcolinmorgan or (cudf is not None and isinstance(y, cudf.DataFrame))

maybe same problem elsewhere?

If cudf is None, I think the current code would throw an exception
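A minimal sketch of the guard suggested above, assuming the optional module arrives as `None` when its lazy import failed. The helper name `is_optional_instance` and its parameters are hypothetical and not part of pygraphistry:

```python
def is_optional_instance(obj, optional_module, class_name):
    """Return True only when the optional module is loaded and obj is an instance of its class.

    optional_module is None when the lazy import failed, so an attribute access
    such as optional_module.DataFrame is never attempted in that case.
    """
    if optional_module is None:
        return False
    cls = getattr(optional_module, class_name, None)
    return isinstance(cls, type) and isinstance(obj, cls)


# Hypothetical use in the branch under review (y, pd and cudf come from the caller):
# elif isinstance(y, pd.DataFrame) or is_optional_instance(y, cudf, "DataFrame"):
```

Routing every optional-type check through one helper would also cover the "same problem elsewhere" case, since no call site touches an attribute of a missing module.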

lmeyerov · 2023-06-15T00:49:20Z

@tanmoyio can we close this, or is it still live and in need of review?

dcolinmorgan · 2023-07-19T02:38:39Z

For cu_cat itself (DT3 branch), I have worked out the dynamic memory handling so it works flexibly on both T4 and A100 GPUs.
I also worked out datetime passthrough. However, this needs to bypass the cudf DataFrame conversion in cu_cat AND pygraphistry so that g.plotter infers datetime correctly and can provide the time series box
-- currently I accomplish this in a hacky way by binding the datetime column to the embeddings after transforming but before plotting, thus avoiding the cudf requirement.
I have now refactored the code to require only the gapencoder and tablevectorizer files/functions (DT4 branch, forked from DT3).

lmeyerov · 2023-07-19T07:01:10Z

Awesome - is the plan to start landing, or more first?

And would it make sense to start reviewing any PRs? If they form a sequence, can you stack them and point out the order so it's clear?

dcolinmorgan · 2023-07-21T03:45:17Z

Landing would be wonderful -- before the end of July is my dream.

DT4 is the latest cu_cat PR branch; it passes many pytests and works as expected in every demo I've done in the last few months.

lmeyerov · 2023-07-23T00:34:38Z

ok @tanmoyio can you help double-check the tests, take it for a test drive, and land it first in cu_cat and then here?

After, can you help add to main graphistry (https://github.com/graphistry/graphistry/blob/master/compose/dockerfiles/base/05-nvidia.Dockerfile) ? I think we should keep default-off for now, and should test that it's truly default off -- that existence doesn't (yet) trigger it to be used, only explicit use.
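The default-off requirement can be sketched as an engine resolver in which the automatic path never returns the GPU engine. The function name, the `installed` parameter, and the fallback order are assumptions for illustration only; they are not the resolver pygraphistry actually ships:

```python
VALID_ENGINES = ("auto", "pandas", "dirty_cat", "cu_cat")


def resolve_feature_engine(requested="auto", installed=()):
    """Pick a featurization engine; the GPU engine is chosen only on explicit request.

    requested: one of VALID_ENGINES (engine names follow this thread)
    installed: names of optional packages found in the environment
    """
    if requested not in VALID_ENGINES:
        raise ValueError(f"unknown feature_engine: {requested!r}")
    if requested == "auto":
        # Default-off: "cu_cat" is deliberately absent from the auto path,
        # so merely having it installed never selects it.
        return "dirty_cat" if "dirty_cat" in installed else "pandas"
    if requested in ("dirty_cat", "cu_cat") and requested not in installed:
        raise ImportError(f"feature_engine={requested!r} requested but not installed")
    return requested
```

A test for "truly default off" would then assert that the automatic path yields a CPU engine even when `cu_cat` is listed as installed.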

dcolinmorgan · 2023-07-26T10:04:09Z

The test-full-ai test (L395) seems to be getting hung up because only 1 of 3 features is exactly reproduced.

When first discussing this with @silkspace, this is exactly what we realized approximate estimation would likely return; the user must make sure the features make sense, just like with dirty_cat.
We likely need to test a few estimations so that 2/3 of the features are always reproduced across several runs, rather than requiring one case of 3/3 reproduction.
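A rough sketch of such a tolerance-based check, assuming each estimation yields a list of feature names. The helper names, the default 2/3 threshold, and the sample feature names in the usage note are illustrative and not taken from the test suite:

```python
def overlap_fraction(expected, observed):
    """Fraction of the expected feature names that also appear in observed."""
    expected = set(expected)
    if not expected:
        return 1.0
    return len(expected & set(observed)) / len(expected)


def mostly_reproduced(expected, runs, min_fraction=2 / 3):
    """True if every run reproduces at least min_fraction of the expected features."""
    return all(overlap_fraction(expected, run) >= min_fraction for run in runs)
```

For example, with expected features `["a", "b", "c"]`, runs of `["a", "b", "x"]` and `["a", "b", "c"]` both pass, while a run of `["a", "y", "z"]` fails.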

lmeyerov · 2023-07-28T13:54:20Z

graphistry/embed_utils.py

-#         return False, object
+def check_cudf():
+    try:
+        import cudf


Who calls this on import?

And can this be a) a cached call that b) checks the module path instead of doing an import?
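One way to read this request, sketched with the standard library only: `importlib.util.find_spec` locates a top-level package without executing it, and `functools.lru_cache` memoizes the answer. The function name `has_package` is hypothetical:

```python
import importlib.util
from functools import lru_cache


@lru_cache(maxsize=None)
def has_package(name):
    """True if a top-level package can be located, without importing it.

    Intended for top-level names only (for example "cudf"); for dotted names
    find_spec imports the parent package, which would defeat the purpose.
    """
    return importlib.util.find_spec(name) is not None
```

Because the result is cached, repeated availability checks cost a dictionary lookup, and `import graphistry` would not pay the import cost of the GPU stack.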

It's only test_embed_utils#L14 that calls check_cudf; I swapped it out for lazy_cudf_import from umap_utils

oh, so that shouldn't be the issue, right? test/* shouldn't get imported by import graphistry..

Sorry, no, I wasn't clear: test_embed_utils is the only OTHER place lazy_cudf_import was present. It was used in embed_utils, where it imported cudf to check the df dtype, but I have swapped it out for a check via getmodule, e.g. if 'cudf' in str(getmodule(self._nodes)): , so I believe the problem is solved -- tuna looks much better
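The import-free type check described here can be sketched as follows. This variant compares the top-level package name of the object's class instead of searching the module's string form, which also contains the file path for file-backed modules; the helper name `comes_from_package` is hypothetical:

```python
def comes_from_package(obj, package):
    """True if obj's class is defined under the given top-level package.

    Only the class's __module__ string is inspected, so the package itself
    is never imported by this check.
    """
    top_level = type(obj).__module__.split(".")[0]
    return top_level == package


# Hypothetical use, mirroring the check quoted above:
# if comes_from_package(self._nodes, "cudf"): ...
```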

lmeyerov · 2023-07-29T06:19:28Z

@silkspace regarding Cucat Featurization base #486 (comment), you may understand this better?

graphistry/feature_utils.py

graphistry/embed_utils.py

lmeyerov · 2023-08-02T16:15:13Z

graphistry/feature_utils.py

@@ -62,7 +72,7 @@
    SentenceTransformer = Any
    SuperVectorizer = Any
    GapEncoder = Any
-    SimilarityEncoder = Any
+    # SimilarityEncoder = Any


remove all these dead lines

lmeyerov · 2023-08-02T16:18:00Z

graphistry/feature_utils.py

+            X = np.round(X, decimals=keep_n_decimals)  #  type: ignore  # noqa
+        X = pd.DataFrame(X, columns=columns, index=index)
+    else:
+        X = transformer.fit_transform(X.to_numpy())


how do we know if the transformer is cpu vs gpu? it seems to always assume cpu here, but if X is cudf and transformer is gpu, can't we keep X on gpu?

a sometimes-ok soln would be checking transformer for being from cuml or maybe cu_cat, but that seems non-generalizable

wow this nearly gave me a heart attack -- good thoughts i will work with... but this is an artifact, no .to_numpy needed
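The "sometimes-ok" approach mentioned above could look roughly like this. The package list, both function names, and the reliance on a `to_pandas()` method (which cudf DataFrames provide) are assumptions for illustration, and the check does not generalize to GPU transformers from other packages:

```python
GPU_PACKAGES = ("cuml", "cu_cat")  # assumption: packages whose transformers accept GPU frames


def is_gpu_transformer(transformer):
    """Guess GPU capability from the top-level package of the transformer's class."""
    return type(transformer).__module__.split(".")[0] in GPU_PACKAGES


def fit_transform_on_native_device(transformer, X):
    """Keep X where it is for GPU transformers; copy to host only for CPU ones."""
    if is_gpu_transformer(transformer):
        return transformer.fit_transform(X)
    # Frames exposing to_pandas() are treated as GPU frames and copied to host;
    # anything else is assumed to live on the host already.
    to_pandas = getattr(X, "to_pandas", None)
    X_host = to_pandas() if callable(to_pandas) else X
    return transformer.fit_transform(X_host)
```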

lmeyerov · 2023-08-02T16:24:18Z

graphistry/feature_utils.py

+
+
+def make_safe_gpu_dataframes(X, y, engine):
+    has_cudf_dependancy_, _, cudf = lazy_import_has_dependancy_cu_cat()


Add assert cudf is not None ?

also probably good to switch from lazy_import...cu_cat to a cudf one

ok -- after the if statement seems best here again like other assert you mentioned

lmeyerov · 2023-08-02T16:26:52Z

graphistry/feature_utils.py

        yc = y.columns
        xc = df.columns
        for c in yc:
            if c in xc:
                remove_cols.append(c)
-    elif isinstance(y, pd.Series):
+    elif isinstance(y, pd.Series) or isinstance(y, cudf.Series):


handle non-cu_cat import returning None for cudf

lmeyerov · 2023-08-02T16:28:49Z

graphistry/feature_utils.py

+        X = transformer.fit_transform(X.to_numpy())
+        if keep_n_decimals:
+            X = np.round(X, decimals=keep_n_decimals)  #  type: ignore  # noqa
+        _, _, cudf = lazy_import_has_dependancy_cu_cat()


assert cudf is not None

good practice to assert even after if statement check? good to know, yay learning

It's useful when you don't want the 'if' but can imagine future changes or misuses accidentally getting the assumption wrong
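A small sketch of the pattern being discussed, assuming a lazy-import helper that returns a triple whose last element is the module or `None`. Both function names are hypothetical, and the triple only mirrors the shape seen in the diff above:

```python
import importlib


def lazy_import(name):
    """Return (ok, message, module); module is None when the import fails."""
    try:
        return True, "ok", importlib.import_module(name)
    except ImportError as exc:
        return False, str(exc), None


def require_module(name):
    """Return the module, asserting the assumption that it is available."""
    ok, message, module = lazy_import(name)
    # Earlier control flow may already imply the dependency exists; the assert
    # documents that assumption and fails loudly if a later change breaks it.
    assert module is not None, f"{name} is required here but unavailable: {message}"
    return module
```

Note that assertions are skipped when Python runs with `-O`, so this guards against programming mistakes rather than replacing user-facing error handling.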

graphistry/feature_utils.py

setup.py

umap match transpose index
type-spec concat
type-spec concat
dc for comp_cluster dirty_cat as default, cc passes most tests ;)
source cu_cat from pypi
source cu_cat from pypi
remove cc tests, tested for in dc place
remove cc tests, tested for in dc place
init 1dc > 2cc
init 1dc > 2cc
use constants throughout
revert from constants
revert from constants
init 1dc > 2cc
better dc default
better dc default

tanmoyio added 3 commits May 15, 2023 21:04

cucat feat support

cf07249

cudf test env var added for test_feature_utils.py

d73a2db

some import fixes

382e18b

silkspace reviewed May 16, 2023

passthru DT encode/umap, add back for timebar

44200ac

lint

777afd4

lmeyerov assigned dcolinmorgan and tanmoyio Jul 23, 2023

This was referenced Jul 24, 2023

Cudf #445

Closed

include cuCat in ai deps #444

Closed

[BUG] lazy loading regression #481

Open

dcolinmorgan added 3 commits July 26, 2023 18:12

updated cu-cat version for optional install

c1bc6f1

type check without loading cudf, via getmodule

48e4017

ok we still need the check_cudf def

6b0b52b

lmeyerov reviewed Jul 28, 2023

swap lazy import defs

e4b0c0a

dcolinmorgan reviewed Aug 2, 2023

graphistry/feature_utils.py (resolved)

lmeyerov reviewed Aug 2, 2023

graphistry/embed_utils.py (outdated, resolved)

lmeyerov reviewed Aug 2, 2023
