feat: Kickoff Transformation implementation #5130
Conversation
```python
def get_feature_transformation(self) -> Optional[Transformation]:
    if not self.udf:
        return None
    if self.mode == TransformationMode.pandas or self.mode == "pandas":
```
Probably can just do a dictionary mapping
sure will add that
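A minimal sketch of the dictionary-mapping idea. `PandasTransformation` and `PythonTransformation` here are simplified stand-ins for illustration, not Feast's real classes, and the free-function signature is adapted from the method quoted above:

```python
from enum import Enum
from typing import Optional


class TransformationMode(Enum):
    PYTHON = "python"
    PANDAS = "pandas"


# Stub stand-ins for the real transformation classes.
class PandasTransformation:
    def __init__(self, udf, udf_string=""):
        self.udf, self.udf_string = udf, udf_string


class PythonTransformation:
    def __init__(self, udf, udf_string=""):
        self.udf, self.udf_string = udf, udf_string


# One table replaces the chain of if/elif mode checks.
_MODE_TO_TRANSFORMATION = {
    TransformationMode.PANDAS: PandasTransformation,
    TransformationMode.PYTHON: PythonTransformation,
}


def get_feature_transformation(udf, mode):
    if not udf:
        return None
    # Accept both the enum and its raw string value.
    if isinstance(mode, str):
        mode = TransformationMode(mode)
    cls = _MODE_TO_TRANSFORMATION.get(mode)
    return cls(udf=udf) if cls is not None else None
```

Adding a new mode then becomes a one-line dictionary entry rather than another branch.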
```python
udf_string: str = "",
tags: Optional[Dict[str, str]] = None,
description: str = "",
owner: str = "",
```
Probably can add the singleton parameter too
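Something like the following, perhaps. The `singleton` parameter name and its default are assumptions (mirroring how the other optional metadata parameters are declared); this is a trimmed sketch, not the actual constructor:

```python
from typing import Dict, Optional


class Transformation:
    """Sketch only: shows where a `singleton` flag could slot into the
    keyword-only constructor quoted above."""

    def __init__(
        self,
        *,
        name: str,
        udf_string: str = "",
        tags: Optional[Dict[str, str]] = None,
        description: str = "",
        owner: str = "",
        singleton: bool = False,  # assumed name/default
    ):
        self.name = name
        self.udf_string = udf_string
        self.tags = tags or {}
        self.description = description
        self.owner = owner
        self.singleton = singleton
```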
```python
class TransformationMode(Enum):
    PYTHON = "python"
```
```python
self,
*,
name: str,
mode: Union[TransformationMode, str],
```
```python
class TransformationMode(Enum):
    PYTHON = "python"
    PANDAS = "pandas"
    spark = "spark"
```
Suggested change:

```diff
-    spark = "spark"
+    SPARK = "spark"
```
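For what it's worth, renaming the member doesn't affect value-based lookup, since `Enum` resolves by value rather than member name. A quick sketch (trimmed to the modes quoted in this diff):

```python
from enum import Enum


class TransformationMode(Enum):
    PYTHON = "python"
    PANDAS = "pandas"
    SPARK = "spark"  # renamed from `spark` per the suggestion; value unchanged


# Lookup by the lowercase string the user passes still works:
mode = TransformationMode("spark")
assert mode is TransformationMode.SPARK
```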
Do you think we can scope one type of batch transformation as our MVP? If so, which one would you feel most comfortable doing?

Yep, maybe Spark SQL. That should be easy to implement, and we can get it live soon.

LFG!
FYI @HaoXuAI a very selfish goal I have is to use a Spark batch transformation to handle scaling embedding documents. So in an ideal world we could do something like:

```python
def create_embeddings(partitionData):
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    for row in partitionData:
        document = str(row.document)
        inputs = tokenizer(document, padding=True, truncation=True, return_tensors="pt", max_length=512)
        result = model(**inputs)
        embeddings = result.last_hidden_state[:, 0, :].cpu().detach().numpy()
        lst = embeddings.flatten().tolist()
        yield [row.id, lst, "", "{}", None]
```

And

```python
embeddings = dataset_df.rdd.mapPartitions(create_embeddings)
```

Used in a batch feature view and an on demand feature view (on write) to support offline and online processing of docs for RAG. The key thing here is that we'd be able to really scale RAG offline embedding and make the transition seamless to online. This example is from the Pinecone docs here: https://docs.pinecone.io/integrations/databricks#3-create-the-vector-embeddings
This seems to be a really good example of the BatchFeatureView + transformation |
Force-pushed from e5cb88b to e925038.
PR diverged; cherry-picked at #5181.
What this PR does / why we need it:
Created a Transformation interface. It still works with the current pandas_transformation, python_transformation, etc.
The next step is to refactor the BatchMaterializationEngine so it works for both Materialization and Transformation.
Which issue(s) this PR fixes:
#4584
#4277 (comment)
#4696
Misc